



                            No-U-Turn Sampler
                   based on M. D. Hoffman and A. Gelman (2011)


                                    Marco Banterle

                       Presented at the "Bayes in Paris" reading group


                                    11 / 04 / 2013



Outline




              1    Hamiltonian MCMC

              2    NUTS

              3    ε tuning

              4    Numerical Results

              5    Conclusion



Outline


     1   Hamiltonian MCMC
           Hamiltonian Monte Carlo
           Other useful tools

     2   NUTS

     3   ε tuning

     4   Numerical Results

     5   Conclusion



Hamiltonian dynamic

         Hamiltonian MC techniques have a nice foundation in physics:
       they describe the total energy of a system composed of a frictionless
                     particle sliding on a (hyper-)surface.

     Position q ∈ R^d and momentum p ∈ R^d are necessary to define the
      total energy H(q, p), which is usually formalized as the sum of a
      potential energy term U(q) and a kinetic energy term K(p).

                       Hamiltonian dynamics are characterized by

                        ∂q/∂t = ∂H/∂p,        ∂p/∂t = −∂H/∂q

                   which describe how the system changes through time.



Properties of the Hamiltonian


         • Reversibility : the mapping from (q, p)_t to (q, p)_{t+s} is 1-to-1

         • Conservation of the Hamiltonian : ∂H/∂t = 0
                   → total energy is conserved

         • Volume preservation : applying the map resulting from
            time-shifting to a region R of the (q, p)-space does not change
            the volume of the (projected) region

         • Symplecticness : a stronger condition than volume
            preservation.



What about MCMC?




                              H(q, p) = U(q) + K(p)

        We can easily interpret U(q) as minus the log target density for
       the variable of interest q, while p will be introduced artificially.



What about MCMC?


               H(q, p) = U(q) + K(p) → − log (π(q|y )) − log(f (p))

                   Let’s examine its properties under a statistical lens:
         • Reversibility : MCMC updates that use these dynamics leave
            the desired distribution invariant
         • Conservation of the Hamiltonian : Metropolis updates
            using Hamiltonians are always accepted
         • Volume preservation : we don’t need to account for any
            change in volume in the acceptance probability for Metropolis
            updates (no need to compute the determinant of the Jacobian
            matrix for the mapping)



the Leapfrog


     The solutions of Hamilton's equations are not always1 available in
     closed form, hence the need for time discretization with some small
     step-size ε.

      A numerical integrator which serves our purpose is the leapfrog.
     Leapfrog Integrator
     For j = 1, . . . , L
         1    p_{t+ε/2} = p_t − (ε/2) ∂U/∂q (q_t)
         2    q_{t+ε} = q_t + ε ∂K/∂p (p_{t+ε/2})
         3    p_{t+ε} = p_{t+ε/2} − (ε/2) ∂U/∂q (q_{t+ε})
              t = t + ε


         1
             unless the Hamiltonian is a quadratic form
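
A minimal sketch of the integrator in Python/NumPy (not from the slides), assuming the common choice K(p) = (1/2) p^T p, i.e. M = I, so that ∂K/∂p = p; grad_U is a user-supplied gradient of U:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Run L leapfrog steps of size eps; assumes K(p) = 0.5 * p @ p (M = I)."""
    q, p = q.copy(), p.copy()
    for _ in range(L):
        p = p - 0.5 * eps * grad_U(q)   # half step for the momentum
        q = q + eps * p                 # full step for the position (dK/dp = p)
        p = p - 0.5 * eps * grad_U(q)   # half step for the momentum
    return q, p
```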



A few more words

         The kinetic energy is usually (for simplicity) assumed to be
        K(p) = (1/2) p^T M^{-1} p, which corresponds to p|q ≡ p ∼ N (0, M).
       This finally implies that in the leapfrog ∂K/∂p (p_{t+ε/2}) = M^{-1} p_{t+ε/2}.

       Usually M, for which little guidance exists, is taken to be diagonal
                    and often equal to the identity matrix.

          The leapfrog method preserves volume exactly and, due to its
        symmetry, it is also reversible by simply negating p, applying the
          same number L of steps again, and then negating p again2 .
           It does not, however, conserve the total energy, and thus the
         deterministic move to (q', p') will be accepted with probability

                           min{1, exp(−H(q', p') + H(q, p))}

         2
             negating ε serves the same purpose



HMC algorithm

     We now have all the elements to construct an MCMC method
     based on the Hamiltonian dynamics:
     HMC
     Given q0 , ε, L, M
     For i = 1, . . . , N
         1   p ∼ N (0, M)
         2   Set qi ← qi−1 ,   q' ← qi−1 ,   p' ← p
         3   for j = 1, . . . , L
                   update (q', p') through the leapfrog
         4   Set qi ← q' with probability

                         min{1, exp(−H(q', p') + H(qi−1 , p))}
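
Putting the pieces together, a minimal HMC sketch (again assuming M = I and reusing the leapfrog function above; U and grad_U are user-supplied callables):

```python
import numpy as np

def hmc(q0, U, grad_U, eps, L, n_iter, rng=np.random.default_rng()):
    """Basic HMC with identity mass matrix; returns the chain of positions."""
    chain = [np.asarray(q0, dtype=float)]
    for _ in range(n_iter):
        q = chain[-1]
        p = rng.standard_normal(q.shape)                 # step 1: p ~ N(0, I)
        q_new, p_new = leapfrog(q, p, grad_U, eps, L)    # steps 2-3
        # step 4: accept with probability min(1, exp(-H(q', p') + H(q, p)))
        dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p @ p)
        chain.append(q_new if np.log(rng.uniform()) < -dH else q)
    return np.array(chain)
```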



HMC benefits

     HMC makes use of gradient information to move across the space,
     and so its typical trajectories do not resemble random walks.
     Moreover, the error in the Hamiltonian stays bounded3 and hence
     we also get a high acceptance rate.




         3
             even if theoretical justifications for this are missing



HMC benefits


     Random walk:
         • requires changes proposed with magnitude comparable to the
            standard deviation in the most constrained direction (square root
            of the smallest eigenvalue of the covariance matrix)
         • reaching an almost-independent state needs a number of
            iterations mostly determined by how long it takes to explore
            the least constrained direction
         • proposals have no tendency to move consistently in the same
            direction.
     For HMC the proposal moves according to the gradient for L steps,
     even if ε is still constrained by the smallest eigenvalue.



Connected difficulties



     Performance depends strongly on choosing suitable values for ε
     and L.
         • ε too large → inaccurate simulation & high reject rate
         • ε too small → wasted computational power (small steps)
         • L too small → random walk behavior and slow mixing
         • L too large → trajectories retrace their steps → U-TURN

                         even worse: PERIODICITY!!



Periodicity

     Ergodicity of HMC may fail if a produced trajectory of length Lε
     matches an exact period of the dynamics for some state.
     Example
          Consider q ∼ N (0, 1); then H(q, p) = q 2 /2 + p 2 /2 and the
                      resulting Hamiltonian dynamics are

                         dq/dt = p,        dp/dt = −q

                      and hence have the exact solution

                   q(Lε) = r cos(a + Lε),      p(Lε) = −r sin(a + Lε)

                                 for some real a and r .



Periodicity

     Example
                        q ∼ N (0, 1), H(q, p) = q 2 /2 + p 2 /2

                         dq/dt = p,        dp/dt = −q

                   q(Lε) = r cos(a + Lε),     p(Lε) = −r sin(a + Lε)

     If we choose a trajectory length such that Lε = 2π, at the end of the
     iteration we will return to the starting point!

         • Lε near the period makes HMC ergodic but practically useless
         • interactions between variables prevent exact periodicities, but
            near periodicities might still slow HMC considerably!
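
A tiny numerical illustration of the point above (reusing the leapfrog sketch from the HMC section, with grad_U(q) = q for the standard normal): choosing Lε ≈ 2π brings the end point essentially back to the start point.

```python
import numpy as np

# U(q) = q^2 / 2 has period 2*pi, so a trajectory of length L*eps ≈ 2*pi
# (almost) retraces itself back to where it started.
q0, p0 = np.array([1.5]), np.array([-0.3])
eps = 0.01
L = int(round(2 * np.pi / eps))
qL, pL = leapfrog(q0, p0, lambda q: q, eps, L)
print(q0, p0, "->", qL, pL)   # end point ≈ start point
```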



Other useful tools - Windowed HMC

     The leapfrog introduces "random" errors in H at each step, and
     hence the acceptance probability may have a high variability.
     Smoothing out these oscillations (over windows of states) could
     lead to higher acceptance rates.
     Windows map
            We map (q, p) → [(q0 , p0 ), . . . , (q_{W−1} , p_{W−1} )] by writing

                      (q, p) = (q_s , p_s ),   s ∼ U(0, W − 1)

       and deterministically recover the other states. This window has
                              probability density

        P([(q0 , p0 ), . . . , (q_{W−1} , p_{W−1} )]) = (1/W) Σ_{i=0}^{W−1} P(q_i , p_i )



Windowed HMC
     Similarly, we perform L − W + 1 leapfrog steps starting from
     (q_{W−1} , p_{W−1} ), up to (q_L , p_L ), and then accept the window
     [(q_{L−W+1} , −p_{L−W+1} ), . . . , (q_L , −p_L )] with probability

           min{ 1, Σ_{i=L−W+1}^{L} P(q_i , p_i ) / Σ_{i=0}^{W−1} P(q_i , p_i ) }




     and finally select (q_a , −p_a ) with probability

                  P(q_a , p_a ) / Σ_{i ∈ accepted window} P(q_i , p_i )



Slice Sampler

     Slice Sampling idea
           Sampling from f (x) is equivalent to uniform sampling from
                        SG(f ) = {(x, y ) | 0 ≤ y ≤ f (x)},
                               the subgraph of f .

     We simply make use of this by sampling from f via an auxiliary-
     variable Gibbs sampler:
         • y |x ∼ U(0, f (x))
         • x|y ∼ U(S_y ), where S_y = {x | y ≤ f (x)} is the slice
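
A minimal sketch of this auxiliary-variable Gibbs scheme for a univariate target f; here the slice S_y is sampled by naive rejection on a fixed interval [lo, hi], a shortcut of mine (Neal (2003) uses stepping-out and shrinkage instead):

```python
import numpy as np

def slice_sample(f, x0, n_iter, lo=-10.0, hi=10.0, rng=np.random.default_rng()):
    """Gibbs sampling on the subgraph {(x, y): 0 <= y <= f(x)}."""
    x, out = x0, []
    for _ in range(n_iter):
        y = rng.uniform(0.0, f(x))        # y | x ~ U(0, f(x))
        while True:                       # x | y ~ U on S_y = {x: f(x) >= y}
            x_prop = rng.uniform(lo, hi)
            if f(x_prop) >= y:
                x = x_prop
                break
        out.append(x)
    return np.array(out)

# e.g. samples = slice_sample(lambda x: np.exp(-0.5 * x**2), 0.0, 1000)
```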



Outline


     1   Hamiltonian MCMC

     2   NUTS
           Idea
           Flow of the simplest NUTS
           A more efficient implementation

     3   ε tuning

     4   Numerical Results

     5   Conclusion



What we (may) have and want to avoid




                     Figure: opacity and size grow with the number of leapfrog steps made;
                             the start and end points are in black.



Main idea




     Define a criterion that helps us avoid these U-turns, stopping when
     we have simulated "long enough": the instantaneous distance gain

      C (q, q') = ∂/∂t [ (1/2) (q' − q)^T (q' − q) ] = (q' − q)^T ∂/∂t (q' − q) = (q' − q)^T p'

       Simulating until C (q, q') < 0 leads to a non-reversible Markov chain,
                     so the authors devised a different scheme.
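
In code, the criterion is a single inner product; a small helper (names are mine) flags a U-turn as soon as C(q, q') becomes negative:

```python
import numpy as np

def making_u_turn(q_start, q_end, p_end):
    """True when the trajectory stops gaining distance from its start,
    i.e. when C(q, q') = (q' - q)^T p' < 0."""
    return np.dot(q_end - q_start, p_end) < 0.0
```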



Main Idea

         • NUTS augments the model with a slice variable u
         • Add a finite set C of candidates for the update
         • C ⊆ B deterministically chosen,
                   with B the set of all leapfrog steps
     (At a high level) B is built by doubling and checking C (q, q') on
     sub-trees



High level NUTS steps



         1   resample momentum p ∼ N (0, I )

         2   sample u|q, p ∼ U[0, exp(−H(qt , p))]

         3   generate the proposal from p(B, C|qt , p, u, ε)

         4   sample (qt+1 , p) ∼ T (·|qt , p, C)


        T (·|qt , p, C) is s.t. it leaves the uniform distribution over C invariant
       (C contains (q', p') s.t. u ≤ exp(−H(q', p')) and satisfies reversibility)
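
A schematic of one such iteration in Python (assuming M = I; the helper build_candidate_set, which stands in for the doubling procedure of the next slides, is purely hypothetical here):

```python
import numpy as np

def nuts_step(q, U, build_candidate_set, eps, rng=np.random.default_rng()):
    """One (naive) NUTS transition: resample p, draw the slice variable u,
    build the candidate set C by doubling, then draw the next state from C."""
    p = rng.standard_normal(q.shape)          # 1. p ~ N(0, I)
    H = U(q) + 0.5 * p @ p
    u = rng.uniform(0.0, np.exp(-H))          # 2. u | q, p ~ U[0, exp(-H)]
    C = build_candidate_set(q, p, u, eps)     # 3. proposal set from p(B, C | ...)
    q_new, _ = C[rng.integers(len(C))]        # 4. uniform draw from C
    return q_new
```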



Justify T (q', p'|qt , p, C)

     Conditions on p(B, C|q, p, u, ε):
            C.1: Elements in C are chosen in a volume-preserving way
            C.2: p((q, p) ∈ C|q, p, u, ε) = 1
            C.3: p(u ≤ exp(−H(q', p'))|(q', p') ∈ C) = 1
            C.4: If (q, p) ∈ C and (q', p') ∈ C then

                         p(B, C|q, p, u, ε) = p(B, C|q', p', u, ε)


     From these conditions:

       p(q, p|u, B, C, ε) ∝ p(B, C|q, p, u, ε) p(q, p|u)
                          ∝ p(B, C|q, p, u, ε) I{u ≤ exp(−H(q,p))}      (C.1)
                          ∝ I_C (q, p)                                  (C.2, C.3 & C.4)



p(B, C|q, p, u, ε) - Building B by doubling

        Build B by repeatedly doubling a binary tree with (q, p) leaves.
         • Choose a random "direction in time" νj ∼ U({−1, 1})
         • Take 2^j leapfrog steps of size νj ε from
                     (q − , p − ) I{−1} (νj ) + (q + , p + ) I{+1} (νj )
         • Continue until a stopping rule is met

        Given the start (q, p) and ε : 2^j equi-probable height-j trees.
     Reconstructing a particular height-j tree from any leaf has prob 2^{−j}
                       Possible stopping rules, at height j:
         • for one of the 2^j − 1 subtrees

                  (q + − q − )^T p − < 0      or     (q + − q − )^T p + < 0

         • the tree includes a leaf s.t. log u + H(q, p) > ∆max
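
A sketch of these two checks (names are mine; the divergence check uses the deck's convention H = U + K, and ∆max = 1000 as in the paper):

```python
import numpy as np

DELTA_MAX = 1000.0

def subtree_u_turn(q_minus, p_minus, q_plus, p_plus):
    """U-turn check on the two ends of a (sub)tree."""
    dq = q_plus - q_minus
    return np.dot(dq, p_minus) < 0.0 or np.dot(dq, p_plus) < 0.0

def leaf_diverged(log_u, H):
    """Leaf check: the simulation error got too large relative to the slice."""
    return log_u + H > DELTA_MAX
```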



Satisfying Detailed-Balance



            We’ve defined p(B|q, p, u, ε) → deterministically select C

                           Remember the conditions:
            C.1: Elements in C chosen in a volume-preserving way
            C.2: p((q, p) ∈ C|q, p, u, ε) = 1
            C.3: p(u ≤ exp(−H(q', p'))|(q', p') ∈ C) = 1
            C.4: If (q, p) ∈ C and (q', p') ∈ C then

                         p(B, C|q, p, u, ε) = p(B, C|q', p', u, ε)



Satisfying Detailed-Balance



            We’ve defined p(B|q, p, u, ε) → deterministically select C

                               Remember the conditions:
            C.1:      Satisfied because of the leapfrog
            C.2:      Satisfied if C includes the initial state
            C.3:      Satisfied if we exclude points outside the slice u
            C.4: As long as C given B is deterministic

                   p(B, C|q, p, u, ε) = 2^{−j}          or   p(B, C|q, p, u, ε) = 0



Satisfying Detailed-Balance



            We’ve defined p(B|q, p, u, ε) → deterministically select C

                           Remember the conditions:
            C.1:   Satisfied because of the leapfrog
            C.2:   Satisfied if C includes the initial state
            C.3:   Satisfied if we exclude points outside the slice u
            C.4:   Satisfied if we exclude states that couldn’t generate B

                   Finally we select (q’,p’) at random from C



Efficient NUTS




     The computational cost per leapfrog step is comparable with HMC's
     (just 2^{j+1} − 2 more inner products).
     However:
         • it requires storing 2^j position-momentum states
         • long jumps are not guaranteed
         • time is wasted if a stopping criterion is met during a doubling



Efficient NUTS




     The authors address these issues as follows:
         • time is wasted if a stopping criterion is met during a doubling

     As soon as a stopping rule is met, break out of the loop.



Efficient NUTS

     The authors address these issues as follows:
         • it requires storing 2^j position-momentum states
         • long jumps are not guaranteed
     Consider the kernel:

     T (w'|w, C) =
         I[w' ∈ C_new] / |C_new|,                                                 if |C_new| > |C_old|
         (I[w' ∈ C_new] / |C_new|) (|C_new| / |C_old|)
             + (1 − |C_new| / |C_old|) I[w' = w],                                 if |C_new| ≤ |C_old|

     where w is short for (q, p), and C_new and C_old are respectively the set of states
     introduced during the final doubling and the older elements already in C.

     Iteratively applying the above after every doubling moreover requires storing
     only O(j) positions (and sizes) rather than O(2^j ).
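
A sketch of applying this kernel after one doubling (names are mine): propose a uniform element of C_new and move to it with probability min(1, |C_new| / |C_old|), otherwise keep the current state.

```python
import numpy as np

def progressive_update(w_current, n_old, C_new, rng=np.random.default_rng()):
    """Apply T(. | w, C): move to a uniform element of C_new with
    probability min(1, |C_new| / |C_old|), otherwise stay at w_current."""
    if not C_new:
        return w_current
    w_prop = C_new[rng.integers(len(C_new))]
    if rng.uniform() < min(1.0, len(C_new) / max(n_old, 1)):
        return w_prop
    return w_current
```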



Outline


     1   Hamiltonian MCMC

     2   NUTS

     3   ε tuning
            Adaptive MCMC
            Random ε

     4   Numerical Results

     5   Conclusion



Adaptive MCMC


                          Classic adaptive MCMC idea:
             stochastic optimization on the parameter, with vanishing
                                   adaptation:

                               θt+1 ← θt − ηt Ht

                 where ηt → 0 at the "correct" pace and
          Ht = δ − αt defines a characteristic of interest of the chain.
     Example
        Consider the case of a random-walk MH with x'|xt ∼ N (xt , θt ),
      adapted on the acceptance rate (αt ) in order to reach the desired
                             optimum δ = 0.234



Double Averaging

     This idea has, however, an intrinsic flaw:
         • the diminishing step sizes ηt give more weight to the early
              iterations
     The authors rely instead on the following scheme3 :

        ε_{t+1} = µ − (√t / γ) (1 / (t + t0)) Σ_{i=1}^{t} H_i ,      ε̃_{t+1} = ηt ε_{t+1} + (1 − ηt ) ε̃_t



                   adaptation                                    sampling
               γ : shrinkage towards µ                           ηt = t^{−k}
               t0 : stabilizes the early iterations              k ∈ (0.5, 1]


         3
             introduced by Nesterov (2009) for stochastic convex optimization
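
A sketch of the dual-averaging update, written on the log-ε scale as in Hoffman & Gelman (2011); γ = 0.05, t0 = 10 and k = 0.75 are the paper's defaults, while µ here is a placeholder of mine (the paper uses µ = log(10 ε1)):

```python
import numpy as np

def dual_averaging_eps(alphas, delta=0.65, mu=0.0, gamma=0.05, t0=10.0, k=0.75):
    """Adapt the step size from the per-iteration acceptance statistics `alphas`."""
    H_bar, log_eps_bar = 0.0, 0.0
    for t, alpha in enumerate(alphas, start=1):
        H_bar += (delta - alpha - H_bar) / (t + t0)       # running average of H_t
        log_eps = mu - np.sqrt(t) / gamma * H_bar         # adaptation
        eta = t ** (-k)
        log_eps_bar = eta * log_eps + (1 - eta) * log_eps_bar   # averaging
    return np.exp(log_eps_bar)
```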



Select the criterion




     The authors advise using Ht = δ − αt with δ = 0.65 and

              αt^{HMC}  = min{ 1, P(q', −p') / P(qt , −pt ) }

              αt^{NUTS} = (1 / |Bt^{final}|)  Σ_{(q,p) ∈ Bt^{final}}  min{ 1, P(q, −p) / P(qt−1 , −pt ) }



Alternative: random ε


     Periodicity is still a problem, and the "optimal" ε may differ in
     different regions of the target! (e.g. mixtures)
     A solution might be to randomize ε around some ε0
     Example


                      ε ∼ E(ε0 )     ε ∼ U(c1 ε0 , c2 ε0 )

                              c1 < 1,     c2 > 1
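
A tiny sketch of such a randomization (names are mine; the E(ε0) case is read here as an exponential with mean ε0):

```python
import numpy as np

def jitter_eps(eps0, c1=0.8, c2=1.2, how="uniform", rng=np.random.default_rng()):
    """Draw a randomized step size around eps0 for the next trajectory."""
    if how == "uniform":
        return rng.uniform(c1 * eps0, c2 * eps0)   # eps ~ U(c1*eps0, c2*eps0)
    return rng.exponential(eps0)                   # eps ~ Exp with mean eps0
```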



Alternative: random ε




     Periodicity is still a problem, and the "optimal" ε may differ in
     different regions of the target! (e.g. mixtures)
     A solution might be to randomize ε around some ε0

     Helpful especially for HMC; care is needed for NUTS, as L is
     usually inversely proportional to ε

     What was said about adaptation can be combined with this by using ε0 = εt



Outline


     1   Hamiltonian MCMC

     2   NUTS

     3   ε tuning

     4   Numerical Results
           Paper’s examples
           Multivariate Normal

     5   Conclusion



Numerical Examples
     "It is at least as good as an 'optimally' tuned HMC."
     Tested on:
         1   (MVN) : 250-d Normal whose precision matrix was sampled
             from a Wishart with identity scale matrix and 250 df
         2   (LR) : (24+1)-dim regression coefficient distribution on the
             German credit data with weak priors
         3   (HLR) : same dataset, with exponentially distributed variance
             hyperparameters and two-way interactions included →
             (300+1)*2-d predictors
         4   (SV) : 3000 days of returns from the S&P 500 index → 3001-d
             target following this scheme:

                     τ ∼ E(100);    ν ∼ E(100);    s1 ∼ E(100);
                     log si ∼ N (log si−1 , τ^{−1});    (log yi − log yi−1) / si ∼ tν .



ε and L in NUTS




        Also, dual averaging usually does a good job of coercing the
                     statistic Ht to its desired value.

     Most of the trajectory lengths are integer powers of two, indicating
     that the U-turn criterion is usually satisfied only after a doubling is
          completed, which is desirable since it means that we only
     occasionally have to throw out entire half-trajectories to satisfy
     detailed balance.



Multivariate Normal
     Highly correlated, 30-dim Multivariate Normal




       To get the full advantage of HMC, the trajectory has to be long
         enough (but not much longer) that in the least constrained
           direction the end-point is distant from the start point.



Multivariate Normal
     Highly correlated, 30-dim Multivariate Normal




       The trajectory reverses direction before the least constrained
                          direction has been explored.
       This can produce a U-turn when the trajectory is much shorter
                                than is optimal.



Multivariate Normal
     Highly correlated, 30-dim Multivariate Normal




               Fortunately, too-short trajectories are cheap relative to
                              long-enough trajectories!
     ...but this does not help in avoiding random-walk behavior.



Multivariate Normal
     One way to overcome the problem could be to check for a U-turn
     only in the least constrained direction.




            But usually such information is not known a priori.
         Maybe after a trial run to discover the order of magnitude of the
                                    variables?
     Someone said periodicity?



Outline


     1   Hamiltonian MCMC

     2   NUTS

     3   ε tuning

     4   Numerical Results

     5   Conclusion
           Improvements
           Discussion



Non-Uniform weighting system
     Qin and Liu (2001) propose a weighting procedure similar to Neal's:
       • Generate the whole trajectory from (q0 , p0 ) = (q, p) to (qL , pL )
       • Select a point in the "accept window"
         [(q_{L−W+1} , p_{L−W+1} ), . . . , (qL , pL )], say (q_{L−k} , p_{L−k} )
       • equivalent to selecting k ∼ U(0, W )



Non-Uniform weighting system
     Qin and Liu (2001) propose a weighting procedure similar to Neal's:
         • Generate the whole trajectory from (q0 , p0 ) to (qL , pL )
         • Select a point (q_{L−k} , p_{L−k} )
         • Take k steps backward from (q0 , p0 ) to (q_{−k} , p_{−k} )
         • Reject window : [(q_{−k} , p_{−k} ), . . . , (q_{W−k−1} , p_{W−k−1} )]



Non-Uniform weighting system

     Qin and Liu (2001) propose a weighting procedure similar to Neal's:
         • Generate the whole trajectory from (q0 , p0 ) to (qL , pL )
         • Select a point (q_{L−k} , p_{L−k} )
         • Take k steps backward from (q0 , p0 ) to (q_{−k} , p_{−k} )
         • Accept (q_{L−k} , p_{L−k} ) with probability

              min{ 1, Σ_{i=1}^{W} w_i P(q_{L−W+i} , −p_{L−W+i} )
                          / Σ_{i=1}^{W} w_i P(q_{i−k−1} , −p_{i−k−1} ) }



Non-Uniform weighting system


     Qin and Liu (2001) propose a weighting procedure similar to Neal's:
         • Accept (q_{L−k} , p_{L−k} ) with probability

              min{ 1, Σ_{i=1}^{W} w_i P(q_{L−W+i} , −p_{L−W+i} )
                          / Σ_{i=1}^{W} w_i P(q_{i−k−1} , −p_{i−k−1} ) }

         • With uniform weights wi = 1/W ∀i
         • Other weights may favor states further from the start!



Riemann Manifold HMC



         • In statistical modeling the parameter space is a manifold

         • d(p(y |θ), p(y |θ + δθ)) can be defined as δθ^T G (θ) δθ, where
            G (θ) is the expected Fisher information matrix [Rao (1945)]
         • G (θ) defines a position-specific Riemann metric.


       Girolami & Calderhead (2011) used this to tune K(p) using local
            2nd -order derivative information encoded in M = G (q)



Riemann Manifold HMC




       Girolami & Calderhead (2011) used this to tune K(p) using local
            2nd -order derivative information encoded in M = G (q)

        The deterministic proposal is guided not only by the gradient of
         the target density but also exploits local geometric structure.
        Possible optimality : paths produced by the solution of the
         Hamiltonian equations follow the geodesics on the manifold.



Riemann Manifold HMC


       Girolami & Calderhead (2011) used this to tune K(p) using local
            2nd -order derivative information encoded in M = G (q)


                                    Practically:
               p|q ∼ N (0, G (q)), resolving the scaling issues in HMC
         • tuning of ε is less critical
         • non-separable Hamiltonian

              H(q, p) = U(q) + (1/2) log((2π)^D |G (q)|) + (1/2) p^T G (q)^{−1} p

         • need for an (expensive) implicit integrator
            (generalized leapfrog + fixed-point iterations)



RMHMC & Related
     "In some cases, the computational overhead for solving implicit
     equations undermines RMHMC's benefits." [Lan et al. (2012)]

                       Lagrangian Dynamics (RMLMC)
         Replaces momentum p (mass × velocity) in HMC by velocity v
                   volume correction through the Jacobian
         • semi-implicit integrator for "Hamiltonian" dynamics
         • different, fully explicit integrator for Lagrangian dynamics
                   two extra matrix inversions to update v

                            Adaptively Updated (AUHMC)
                  Replace G (q) with M(q, q') = (1/2) [G (q) + G (q')]
         • fully explicit leapfrog integrator (M(q, q') is constant along the
            trajectory)
         • less locally adaptive (especially for long trajectories)
         • is it really reversible?



Split Hamiltonian


     Variations on HMC obtained by using discretizations of
     Hamiltonian dynamics that split H into

                   H(q, p) = H1 (q, p) + H2 (q, p) + · · · + HK (q, p)

     This may allow much of the movement to be done at low cost.
         • write U(q) as the minus-log of a Gaussian plus a second term
                   quadratic forms allow for explicit solutions
         • H1 and its gradient can be evaluated quickly, with only a slowly-
            varying H2 requiring costly computations, e.g. by splitting the data



split-wRMLNUTS & co.



         • NUTS itself provides a nice automatic tuning of HMC
         • very few citations of their work!
         • integration with RMHMC (just out!) may overcome ε-related
            problems and provide better criteria for the stopping rule!

       HMC may not be the definitive sampler, but it is definitely as
                        useful as it is little known.
       It seems the direction taken may finally overcome the difficulties
                      connected with its use and spread.




                             Betancourt (2013b)
         • RMHMC has smoother movements on the surface, but..
         • .. (q' − q)^T p' no longer has a meaning

       The paper addresses this issue by generalizing the stopping rule to
                          other mass matrices M .



References I



     Introduction:
         • M. D. Hoffman & A. Gelman (2011) - The No-U-Turn
            Sampler
         • R. M. Neal (2011) - MCMC using Hamiltonian dynamics, in
            Handbook of Markov Chain Monte Carlo
         • Z. S. Qin and J. S. Liu (2001) - Multipoint Metropolis
            method with application to hybrid Monte Carlo
         • R. M. Neal (2003) - Slice Sampling



References II

     Further Readings:
         • M. Girolami, B. Calderhead (2011) - Riemann manifold
            Langevin and Hamiltonian Monte Carlo methods
         • Z. Wang, S. Mohamed, N. de Freitas (2013) - Adaptive
            Hamiltonian and Riemann Manifold Monte Carlo Samplers
         • S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami (2012) -
            Lagrangian Dynamical Monte Carlo
         • M. Burda, JM. Maheu (2013) - Bayesian Adaptively Updated
            Hamiltonian Monte Carlo with an Application to
            High-Dimensional BEKK GARCH Models
         • Michael Betancourt (2013) - Generalizing the No-U-Turn
            Sampler to Riemannian Manifolds

More Related Content

What's hot

A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...
A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...
A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...Jin-Hwa Kim
 
Hidden Markov Model - The Most Probable Path
Hidden Markov Model - The Most Probable PathHidden Markov Model - The Most Probable Path
Hidden Markov Model - The Most Probable PathLê Hòa
 
PRML輪読#9
PRML輪読#9PRML輪読#9
PRML輪読#9matsuolab
 
第3回NIPS読み会・関西発表資料
第3回NIPS読み会・関西発表資料第3回NIPS読み会・関西発表資料
第3回NIPS読み会・関西発表資料Takato Horii
 
統計的学習の基礎 第2章後半
統計的学習の基礎 第2章後半統計的学習の基礎 第2章後半
統計的学習の基礎 第2章後半Prunus 1350
 
ノンパラメトリックベイズを用いた逆強化学習
ノンパラメトリックベイズを用いた逆強化学習ノンパラメトリックベイズを用いた逆強化学習
ノンパラメトリックベイズを用いた逆強化学習Shota Ishikawa
 
データ解析のための統計モデリング入門 6.5章 後半
データ解析のための統計モデリング入門 6.5章 後半データ解析のための統計モデリング入門 6.5章 後半
データ解析のための統計モデリング入門 6.5章 後半Yurie Oka
 
Visualization using tSNE
Visualization using tSNEVisualization using tSNE
Visualization using tSNEYan Xu
 
Variational AutoEncoder
Variational AutoEncoderVariational AutoEncoder
Variational AutoEncoderKazuki Nitta
 
PRML輪読#10
PRML輪読#10PRML輪読#10
PRML輪読#10matsuolab
 
ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定Akira Masuda
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNEDavid Khosid
 
パターン認識と機械学習 §6.2 カーネル関数の構成
パターン認識と機械学習 §6.2 カーネル関数の構成パターン認識と機械学習 §6.2 カーネル関数の構成
パターン認識と機械学習 §6.2 カーネル関数の構成Prunus 1350
 
PRML輪読#11
PRML輪読#11PRML輪読#11
PRML輪読#11matsuolab
 
PRML読書会1スライド(公開用)
PRML読書会1スライド(公開用)PRML読書会1スライド(公開用)
PRML読書会1スライド(公開用)tetsuro ito
 
PRML輪読#6
PRML輪読#6PRML輪読#6
PRML輪読#6matsuolab
 
20190721 gaussian process
20190721 gaussian process20190721 gaussian process
20190721 gaussian processYoichi Tokita
 
Sliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデルSliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデルohken
 
確率的バンディット問題
確率的バンディット問題確率的バンディット問題
確率的バンディット問題jkomiyama
 

What's hot (20)

A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...
A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...
A summary of Categorical Reparameterization with Gumbel-Softmax by Jang et al...
 
Hidden Markov Model - The Most Probable Path
Hidden Markov Model - The Most Probable PathHidden Markov Model - The Most Probable Path
Hidden Markov Model - The Most Probable Path
 
PRML輪読#9
PRML輪読#9PRML輪読#9
PRML輪読#9
 
第3回NIPS読み会・関西発表資料
第3回NIPS読み会・関西発表資料第3回NIPS読み会・関西発表資料
第3回NIPS読み会・関西発表資料
 
統計的学習の基礎 第2章後半
統計的学習の基礎 第2章後半統計的学習の基礎 第2章後半
統計的学習の基礎 第2章後半
 
ノンパラメトリックベイズを用いた逆強化学習
ノンパラメトリックベイズを用いた逆強化学習ノンパラメトリックベイズを用いた逆強化学習
ノンパラメトリックベイズを用いた逆強化学習
 
データ解析のための統計モデリング入門 6.5章 後半
データ解析のための統計モデリング入門 6.5章 後半データ解析のための統計モデリング入門 6.5章 後半
データ解析のための統計モデリング入門 6.5章 後半
 
Visualization using tSNE
Visualization using tSNEVisualization using tSNE
Visualization using tSNE
 
Variational AutoEncoder
Variational AutoEncoderVariational AutoEncoder
Variational AutoEncoder
 
PRML輪読#10
PRML輪読#10PRML輪読#10
PRML輪読#10
 
ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNE
 
1次式とノルムで構成された最適化問題とその双対問題
1次式とノルムで構成された最適化問題とその双対問題1次式とノルムで構成された最適化問題とその双対問題
1次式とノルムで構成された最適化問題とその双対問題
 
パターン認識と機械学習 §6.2 カーネル関数の構成
パターン認識と機械学習 §6.2 カーネル関数の構成パターン認識と機械学習 §6.2 カーネル関数の構成
パターン認識と機械学習 §6.2 カーネル関数の構成
 
PRML輪読#11
PRML輪読#11PRML輪読#11
PRML輪読#11
 
PRML読書会1スライド(公開用)
PRML読書会1スライド(公開用)PRML読書会1スライド(公開用)
PRML読書会1スライド(公開用)
 
PRML輪読#6
PRML輪読#6PRML輪読#6
PRML輪読#6
 
20190721 gaussian process
20190721 gaussian process20190721 gaussian process
20190721 gaussian process
 
Sliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデルSliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデル
 
確率的バンディット問題
確率的バンディット問題確率的バンディット問題
確率的バンディット問題
 

Viewers also liked

6 Maheene Par, U-Turn Sarkar
6 Maheene Par, U-Turn Sarkar6 Maheene Par, U-Turn Sarkar
6 Maheene Par, U-Turn Sarkarinc_2015
 
Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)
Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)
Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)Kenta Sato
 
強化学習その2
強化学習その2強化学習その2
強化学習その2nishio
 
トランザクションの設計と進化
トランザクションの設計と進化トランザクションの設計と進化
トランザクションの設計と進化Kumazaki Hiroki
 

Viewers also liked (6)

6 Maheene Par, U-Turn Sarkar
6 Maheene Par, U-Turn Sarkar6 Maheene Par, U-Turn Sarkar
6 Maheene Par, U-Turn Sarkar
 
HMC and NUTS
HMC and NUTSHMC and NUTS
HMC and NUTS
 
Traffic Rules
Traffic RulesTraffic Rules
Traffic Rules
 
Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)
Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)
Juliaで学ぶ Hamiltonian Monte Carlo (NUTS 入り)
 
強化学習その2
強化学習その2強化学習その2
強化学習その2
 
トランザクションの設計と進化
トランザクションの設計と進化トランザクションの設計と進化
トランザクションの設計と進化
 

Similar to no U-turn sampler, a discussion of Hoffman & Gelman NUTS algorithm

Unbiased MCMC with couplings
Unbiased MCMC with couplingsUnbiased MCMC with couplings
Unbiased MCMC with couplingsPierre Jacob
 
Variational Bayes: A Gentle Introduction
Variational Bayes: A Gentle IntroductionVariational Bayes: A Gentle Introduction
Variational Bayes: A Gentle IntroductionFlavio Morelli
 
Markov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing themMarkov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing themPierre Jacob
 
Unbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte CarloUnbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte CarloJeremyHeng10
 
Introduction to hmc
Introduction to hmcIntroduction to hmc
Introduction to hmcKota Takeda
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferenceChristian Robert
 
Meeting w3 chapter 2 part 1
Meeting w3   chapter 2 part 1Meeting w3   chapter 2 part 1
Meeting w3 chapter 2 part 1mkazree
 
Meeting w3 chapter 2 part 1
Meeting w3   chapter 2 part 1Meeting w3   chapter 2 part 1
Meeting w3 chapter 2 part 1Hattori Sidek
 
legendre transformatio.pptx
legendre transformatio.pptxlegendre transformatio.pptx
legendre transformatio.pptxMohsan10
 
Statistical approach to quantum field theory
Statistical approach to quantum field theoryStatistical approach to quantum field theory
Statistical approach to quantum field theorySpringer
 
Unbiased Markov chain Monte Carlo methods
Unbiased Markov chain Monte Carlo methods Unbiased Markov chain Monte Carlo methods
Unbiased Markov chain Monte Carlo methods Pierre Jacob
 

Similar to no U-turn sampler, a discussion of Hoffman & Gelman NUTS algorithm (20)

Unbiased MCMC with couplings
Unbiased MCMC with couplingsUnbiased MCMC with couplings
Unbiased MCMC with couplings
 
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Variational Bayes: A Gentle Introduction
Variational Bayes: A Gentle IntroductionVariational Bayes: A Gentle Introduction
Variational Bayes: A Gentle Introduction
 
LieGroup
LieGroupLieGroup
LieGroup
 
Shanghai tutorial
Shanghai tutorialShanghai tutorial
Shanghai tutorial
 
Markov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing themMarkov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing them
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Unbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte CarloUnbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte Carlo
 
Introduction to hmc
Introduction to hmcIntroduction to hmc
Introduction to hmc
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conference
 
Meeting w3 chapter 2 part 1
Meeting w3   chapter 2 part 1Meeting w3   chapter 2 part 1
Meeting w3 chapter 2 part 1
 
Meeting w3 chapter 2 part 1
Meeting w3   chapter 2 part 1Meeting w3   chapter 2 part 1
Meeting w3 chapter 2 part 1
 
Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility
 
Pres_Zurich14
Pres_Zurich14Pres_Zurich14
Pres_Zurich14
 
intro
introintro
intro
 
legendre transformatio.pptx
legendre transformatio.pptxlegendre transformatio.pptx
legendre transformatio.pptx
 
Statistical approach to quantum field theory
Statistical approach to quantum field theoryStatistical approach to quantum field theory
Statistical approach to quantum field theory
 
Unbiased Markov chain Monte Carlo methods
Unbiased Markov chain Monte Carlo methods Unbiased Markov chain Monte Carlo methods
Unbiased Markov chain Monte Carlo methods
 

More from Christian Robert

Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceChristian Robert
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinChristian Robert
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?Christian Robert
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Christian Robert
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Christian Robert
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihoodChristian Robert
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)Christian Robert
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerChristian Robert
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Christian Robert
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussionChristian Robert
 

More from Christian Robert (20)

Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 


no U-turn sampler, a discussion of Hoffman & Gelman NUTS algorithm

  • 1. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion No - U - Turn Sampler based on M.D.Hoffmann and A.Gelman (2011) Marco Banterle Presented at the ”Bayes in Paris” reading group 11 / 04 / 2013
  • 2. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Outline 1 Hamiltonian MCMC 2 NUTS 3 ε tuning 4 Numerical Results 5 Conclusion
  • 3. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Outline 1 Hamiltonian MCMC Hamiltonian Monte Carlo Other useful tools 2 NUTS 3 ε tuning 4 Numerical Results 5 Conclusion
  • 4. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Hamiltonian dynamic Hamiltonian MC techniques have a nice foundation in physics Describe the total energy of a system composed by a frictionless particle sliding on a (hyper-)surface Position q ∈ Rd and momentum p ∈ Rd are necessary to define the total energy H(q, p), which is usually formalized through the sum of a potential energy term U(q) and a kinetic energy term K(p). Hamiltonian dynamics are characterized by ∂q ∂H ∂p ∂H = , =− ∂t ∂p ∂t ∂q that describe how the system changes though time.
  • 5. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Properties of the Hamiltonian • Reversibility : the mapping from (q, p)t to (q, p)t+s is 1-to-1 • Conservation of the Hamiltonian : ∂H = 0 ∂t total energy is conserved • Volume preservation : applying the map resulting from time-shifting to a region R of the (q, p)-space do not change the volume of the (projected) region • Symplecticness : stronger condition than volume preservation.
  • 6. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion What about MCMC? H(q, p) = U(q) + K(p) We can easily interpret U(q) as minus the log target density for the variable of interest q, while p will be introduced artificially.
  • 7. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion What about MCMC? H(q, p) = U(q) + K(p) → − log (π(q|y )) − log(f (p)) Let’s examine its properties under a statistical lens: • Reversibility : MCMC updates that use these dynamics leave the desired distribution invariant • Conservation of the Hamiltonian : Metropolis updates using Hamiltonians are always accepted • Volume preservation : we don’t need to account for any change in volume in the acceptance probability for Metropolis updates (no need to compute the determinant of the Jacobian matrix for the mapping)
  • 8. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion the Leapfrog Hamilton's equations are not always explicitly solvable (unless H is a quadratic form), hence the need for time discretization with some small step-size ε. A numerical integrator that serves our purposes is the Leapfrog Integrator: for j = 1, . . . , L
    1. p_{t+ε/2} = p_t − (ε/2) ∂U/∂q (q_t)
    2. q_{t+ε} = q_t + ε ∂K/∂p (p_{t+ε/2})
    3. p_{t+ε} = p_{t+ε/2} − (ε/2) ∂U/∂q (q_{t+ε})
    then set t = t + ε.
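As an illustration (not from the slides), a minimal Python sketch of one leapfrog trajectory; grad_U(q) is assumed to return ∂U/∂q, and K(p) = ½ pᵀM⁻¹p so that ∂K/∂p = M⁻¹p:

    import numpy as np

    def leapfrog(q, p, grad_U, eps, L, M_inv=None):
        # One trajectory of L leapfrog steps of size eps.
        if M_inv is None:
            M_inv = np.eye(len(q))
        q = np.array(q, dtype=float)
        p = np.array(p, dtype=float)
        p -= 0.5 * eps * grad_U(q)          # initial half step for the momentum
        for _ in range(L - 1):
            q += eps * (M_inv @ p)          # full step for the position
            p -= eps * grad_U(q)            # full step for the momentum
        q += eps * (M_inv @ p)              # last full step for the position
        p -= 0.5 * eps * grad_U(q)          # final half step for the momentum
        return q, p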
  • 9. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion A few more words The kinetic energy is usually (for simplicity) assumed to be K(p) = ½ pᵀ M⁻¹ p, which corresponds to p|q ≡ p ∼ N(0, M). This finally implies that in the leapfrog ∂K/∂p (p_{t+ε/2}) = M⁻¹ p_{t+ε/2}. Usually M, for which little guidance exists, is taken to be diagonal and often equal to the identity matrix. The leapfrog method preserves volume exactly and, due to its symmetry, it is also reversible by simply negating p, applying the same number L of steps again, and then negating p again (negating ε serves the same purpose). It does not, however, conserve the total energy, and thus this deterministic move to (q′, p′) will be accepted with probability min{1, exp(−H(q′, p′) + H(q, p))}.
  • 10. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion HMC algorithm We now have all the elements to construct an MCMC method based on the Hamiltonian dynamics: HMC — given q_0, ε, L, M, for i = 1, . . . , N
    1. p ∼ N(0, M)
    2. set q_i ← q_{i−1}, q′ ← q_{i−1}, p′ ← p
    3. for j = 1, . . . , L update (q′, p′) through the leapfrog
    4. set q_i ← q′ with probability min{1, exp(−H(q′, p′) + H(q_{i−1}, p))}
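A compact sketch of this loop (an illustration reusing the leapfrog sketch above, with M = I; log_pi and grad_log_pi are assumed callables for the log target and its gradient):

    import numpy as np

    def hmc(log_pi, grad_log_pi, q0, eps, L, n_iter, seed=0):
        # Basic HMC with identity mass matrix; a sketch, not the paper's code.
        rng = np.random.default_rng(seed)
        U = lambda q: -log_pi(q)
        grad_U = lambda q: -grad_log_pi(q)
        H = lambda q, p: U(q) + 0.5 * (p @ p)
        q, samples = np.array(q0, dtype=float), []
        for _ in range(n_iter):
            p = rng.standard_normal(q.shape)                  # 1. fresh momentum
            q_prop, p_prop = leapfrog(q, p, grad_U, eps, L)   # 2-3. simulate the dynamics
            if np.log(rng.uniform()) < H(q, p) - H(q_prop, p_prop):   # 4. MH correction
                q = q_prop
            samples.append(q.copy())
        return np.array(samples)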
  • 11. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion HMC benefits HMC makes use of gradient information to move across the space, and so its typical trajectories do not resemble random walks. Moreover the error in the Hamiltonian stays bounded (even if a full theoretical justification for this is missing) and hence we also get a high acceptance rate.
  • 12. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion HMC benefits Random walk: • requires changes proposed with magnitude comparable to the standard deviation in the most constrained direction (square root of the smallest eigenvalue of the covariance matrix) • to reach an almost-independent state it needs a number of iterations mostly determined by how long it takes to explore the least constrained direction • proposals have no tendency to move consistently in the same direction. For HMC the proposal moves according to the gradient for L steps, even if ε is still constrained by the smallest eigenvalue.
  • 13. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Connected difficulties Performance depends strongly on choosing suitable values for ε and L. • ε too large → inaccurate simulation & high reject rate • ε too small → wasted computational power (small steps) • L too small → random walk behavior and slow mixing • L too large → trajectories retrace their steps → U-TURN
  • 14. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Connected difficulties Performance depends strongly on choosing suitable values for ε and L. • ε too large → inaccurate simulation & high reject rate • ε too small → wasted computational power (small steps) • L too small → random walk behavior and slow mixing • L too large → trajectories retrace their steps → U-TURN even worse: PERIODICITY!!
  • 15. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Periodicity Ergodicity of HMC may fail if a produced trajectory of length (Lε) is an exact period for some state. Example Consider q ∼ N(0, 1); then H(q, p) = q²/2 + p²/2 and the resulting Hamiltonian dynamics are dq/dt = p, dp/dt = −q, which have the exact solution q(Lε) = r cos(a + Lε), p(Lε) = −r sin(a + Lε) for some real a and r.
  • 16. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Periodicity Example q ∼ N(0, 1), H(q, p) = q²/2 + p²/2, dq/dt = p, dp/dt = −q, q(Lε) = r cos(a + Lε), p(Lε) = −r sin(a + Lε). If we choose a trajectory length such that Lε = 2π, at the end of the iteration we will return to the starting point!
  • 17. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Periodicity Example q ∼ N(0, 1), H(q, p) = q²/2 + p²/2, dq/dt = p, dp/dt = −q, q(Lε) = r cos(a + Lε), p(Lε) = −r sin(a + Lε). If we choose a trajectory length such that Lε = 2π, at the end of the iteration we will return to the starting point! • Lε near the period makes HMC ergodic but practically useless • interactions between variables prevent exact periodicities, but near periodicities might still slow HMC considerably!
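A quick numerical check of this (an aside reusing the leapfrog sketch above): for the standard normal target U(q) = q²/2, so grad_U(q) = q, and a trajectory with Lε ≈ 2π essentially returns to its starting point.

    import numpy as np

    eps = 0.01
    L = int(round(2 * np.pi / eps))            # trajectory length L*eps ≈ 2π
    q0, p0 = np.array([1.3]), np.array([-0.4])
    q_end, p_end = leapfrog(q0, p0, lambda q: q, eps, L)
    print(q_end, p_end)                        # ≈ (1.3, -0.4): back at the start, the work is wasted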
  • 18. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Other useful tools - Windowed HMC The leapfrog introduces ”random” errors in H at each step, and hence the acceptance probability may have a high variability. Smoothing these oscillations out (over windows of states) could lead to higher acceptance rates.
  • 19. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Other useful tools - Windowed HMC The leapfrog introduces ”random” errors in H at each step, and hence the acceptance probability may have a high variability. Smoothing these oscillations out (over windows of states) could lead to higher acceptance rates. Windows map We map (q, p) → [(q_0, p_0), . . . , (q_{W−1}, p_{W−1})] by writing (q, p) = (q_s, p_s), s ∼ U(0, W − 1), and deterministically recover the other states. This window has probability density P([(q_0, p_0), . . . , (q_{W−1}, p_{W−1})]) = (1/W) Σ_{i=0}^{W−1} P(q_i, p_i).
  • 20. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Windowed HMC Similarly we perform L − W + 1 leapfrog steps starting from (q_{W−1}, p_{W−1}), up to (q_L, p_L), and then accept the window [(q_{L−W+1}, −p_{L−W+1}), . . . , (q_L, −p_L)] with probability min{1, Σ_{i=L−W+1}^{L} P(q_i, p_i) / Σ_{i=0}^{W−1} P(q_i, p_i)} and finally select (q_a, −p_a) with probability P(q_a, p_a) / Σ_{i ∈ accepted window} P(q_i, p_i).
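A sketch of the accept-then-select step (an illustration, not the slides' code; P_forward and P_initial are assumed numpy arrays of the unnormalised densities P(q_i, p_i) over the forward (accept) window and the initial window, states_forward the corresponding states):

    import numpy as np

    def window_accept_and_select(P_forward, P_initial, states_forward, rng=None):
        # Accept the forward window with prob min(1, sum P_forward / sum P_initial);
        # if accepted, pick one of its states with probability proportional to P.
        rng = np.random.default_rng() if rng is None else rng
        if rng.uniform() < min(1.0, P_forward.sum() / P_initial.sum()):
            idx = rng.choice(len(states_forward), p=P_forward / P_forward.sum())
            return states_forward[idx], True
        return None, False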
  • 21. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Slice Sampler Slice Sampling idea Sampling from f(x) is equivalent to uniform sampling from SG(f) = {(x, y) | 0 ≤ y ≤ f(x)}, the subgraph of f. We simply make use of this by sampling from f with an auxiliary-variable Gibbs sampler: • y|x ∼ U(0, f(x)) • x|y ∼ U(S_y), where S_y = {x | y ≤ f(x)} is the slice.
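A minimal sketch of this two-step Gibbs scheme for a univariate, unnormalised f that is (effectively) supported on a known interval [a, b]; the x|y step is done here by naive rejection on [a, b], an illustrative simplification rather than Neal's stepping-out procedure:

    import numpy as np

    def slice_sampler(f, x0, a, b, n_iter, seed=0):
        # y|x ~ U(0, f(x)); x|y uniform on the slice S_y, via rejection on [a, b].
        rng = np.random.default_rng(seed)
        x, draws = float(x0), []
        for _ in range(n_iter):
            y = rng.uniform(0.0, f(x))             # vertical move
            while True:                            # horizontal move: uniform on S_y
                x_prop = rng.uniform(a, b)
                if y <= f(x_prop):
                    x = x_prop
                    break
            draws.append(x)
        return np.array(draws)

    # e.g. an unnormalised standard normal, restricted to [-10, 10] for the sketch
    samples = slice_sampler(lambda x: np.exp(-0.5 * x * x), 0.0, -10.0, 10.0, 2000)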
  • 22. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Outline 1 Hamiltonian MCMC 2 NUTS Idea Flow of the simplest NUTS A more efficient implementation 3 ε tuning 4 Numerical Results 5 Conclusion
  • 23. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion What we (may) have and want to avoid [Figure: example trajectory — opacity and size grow with the number of leapfrog steps made; start and end points in black.]
  • 24. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Main idea Define a criterion that helps us avoid these U-turns, stopping when we have simulated ”long enough”: the instantaneous distance gain C(q, q′) = ∂/∂t [ ½ (q′ − q)ᵀ(q′ − q) ] = (q′ − q)ᵀ ∂/∂t (q′ − q) = (q′ − q)ᵀ p′. Simulating until C(q, q′) < 0 leads to a non-reversible Markov chain, so the authors devised a different scheme.
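In code, the corresponding check on a trajectory segment might look like this (a sketch; q_minus/p_minus and q_plus/p_plus are assumed to be the backward-most and forward-most states of the segment):

    import numpy as np

    def no_u_turn(q_minus, p_minus, q_plus, p_plus):
        # The segment is still expanding if the squared distance between its ends
        # is not instantaneously decreasing at either end: (q+ - q-)^T p >= 0.
        dq = q_plus - q_minus
        return (dq @ p_minus >= 0) and (dq @ p_plus >= 0)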
  • 25. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Main Idea • NUTS augments the model with a slice variable u • Add a finite set C of candidates for the update • C ⊆ B deterministically chosen, with B the set of all leapfrog steps
  • 26. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Main Idea • NUTS augments the model with a slice variable u • Add a finite set C of candidates for the update • C ⊆ B deterministically chosen, with B the set of all leapfrog steps (At a high level) B is built by doubling and checking C (q, q ) on sub-trees
  • 27. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion High level NUTS steps 1 resample momentum p ∼ N(0, I) 2 sample u|q, p ∼ U[0, exp(−H(q_t, p))] 3 generate the proposal from p(B, C|q_t, p, u, ε) 4 sample (q_{t+1}, p) ∼ T(·|q_t, p, C), where T(·|q_t, p, C) is such that it leaves the uniform distribution over C invariant (C contains the (q′, p′) such that u ≤ exp(−H(q′, p′)) and satisfies reversibility).
  • 28. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Justify T(q′, p′|q_t, p, C) Conditions on p(B, C|q, p, u, ε): C.1: Elements in C chosen in a volume-preserving way C.2: p((q, p) ∈ C|q, p, u, ε) = 1 C.3: p(u ≤ exp(−H(q′, p′))|(q′, p′) ∈ C) = 1 C.4: If (q, p) ∈ C and (q′, p′) ∈ C then p(B, C|q, p, u, ε) = p(B, C|q′, p′, u, ε). By these conditions,
    p(q, p|u, B, C, ε) ∝ p(B, C|q, p, u, ε) p(q, p|u)
    ∝ p(B, C|q, p, u, ε) I{u ≤ exp(−H(q, p))}   (by C.1)
    ∝ I{(q, p) ∈ C}   (by C.2–C.4)
  • 29. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion p(B, C|q, p, u, ε) - Building B by doubling Build B by repeatedly doubling a binary tree with (q, p) leaves. • Choose a random ”direction in time” ν_j ∼ U({−1, 1}) • Take 2^j leapfrog steps of size ν_j ε from (q⁻, p⁻) I{−1}(ν_j) + (q⁺, p⁺) I{+1}(ν_j) • Continue until a stopping rule is met.
  • 30. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion p(B, C|q, p, u, ε) - Building B by doubling Build B by repeatedly doubling a binary tree with (q, p) leaves. • Choose a random ”direction in time” ν_j ∼ U({−1, 1}) • Take 2^j leapfrog steps of size ν_j ε from (q⁻, p⁻) I{−1}(ν_j) + (q⁺, p⁺) I{+1}(ν_j) • Continue until a stopping rule is met. Given the start (q, p) and ε: 2^j equi-probable height-j trees. Reconstructing a particular height-j tree from any leaf has probability 2^{−j}.
  • 31. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion p(B, C|q, p, u, ε) - Building B by doubling Build B by repeatedly doubling a binary tree with (q, p) leaves. • Choose a random ”direction in time” ν_j ∼ U({−1, 1}) • Take 2^j leapfrog steps of size ν_j ε from (q⁻, p⁻) I{−1}(ν_j) + (q⁺, p⁺) I{+1}(ν_j) • Continue until a stopping rule is met. Given the start (q, p) and ε: 2^j equi-probable height-j trees. Reconstructing a particular height-j tree from any leaf has probability 2^{−j}. Possible stopping rule: at height j • for one of the 2^j − 1 subtrees, (q⁺ − q⁻)ᵀ p⁻ < 0 or (q⁺ − q⁻)ᵀ p⁺ < 0 • the tree includes a leaf such that log u + H(q, p) > ∆max.
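A rough sketch of this check for a single segment, reusing no_u_turn above (a simplification, since the actual rule inspects every balanced subtree; ∆max = 1000 is the paper's default, and H_max is assumed to be the largest H(q, p) over the leaves just added):

    DELTA_MAX = 1000.0

    def stop_doubling(q_minus, p_minus, q_plus, p_plus, log_u, H_max, delta_max=DELTA_MAX):
        # Stop if the trajectory has started to double back on itself (U-turn),
        # or if some leaf has an extreme energy error: log u + H > delta_max.
        u_turn = not no_u_turn(q_minus, p_minus, q_plus, p_plus)
        divergence = log_u + H_max > delta_max
        return u_turn or divergence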
  • 32. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Satisfying Detailed-Balance We’ve defined p(B|q, p, u, ε) → deterministically select C Remember the conditions: C.1: Elements in C chosen in a volume-preserving way C.2: p((q, p) ∈ C|q, p, u, ε) = 1 C.3: p(u ≤ exp(−H(q , p ))|(q , p ) ∈ C) = 1 C.4: If (q, p) ∈ C and (q , p ) ∈ C then p(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
  • 33. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Satisfying Detailed-Balance We’ve defined p(B|q, p, u, ε) → deterministically select C Remember the conditions: C.1: Satisfied because of the leapfrog C.2: p((q, p) ∈ C|q, p, u, ε) = 1 C.3: p(u ≤ exp(−H(q , p ))|(q , p ) ∈ C) = 1 C.4: If (q, p) ∈ C and (q , p ) ∈ C then p(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
  • 34. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Satisfying Detailed-Balance We’ve defined p(B|q, p, u, ε) → deterministically select C Remember the conditions: C.1: Satisfied because of the leapfrog C.2: Satisfied if C includes the initial state C.3: p(u ≤ exp(−H(q , p ))|(q , p ) ∈ C) = 1 C.4: If (q, p) ∈ C and (q , p ) ∈ C then p(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
  • 35. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Satisfying Detailed-Balance We’ve defined p(B|q, p, u, ε) → deterministically select C Remember the conditions: C.1: Satisfied because of the leapfrog C.2: Satisfied if C includes the initial state C.3: Satisfied if we exclude points outside the slice u C.4: If (q, p) ∈ C and (q , p ) ∈ C then p(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
  • 36. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Satisfying Detailed-Balance We’ve defined p(B|q, p, u, ε) → deterministically select C Remember the conditions: C.1: Satisfied because of the leapfrog C.2: Satisfied if C includes the initial state C.3: Satisfied if we exclude points outside the slice u C.4: As long as C given B is deterministic p(B, C|q, p, u, ε) = 2−j or p(B, C|q, p, u, ε) = 0
  • 37. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Satisfying Detailed-Balance We’ve defined p(B|q, p, u, ε) → deterministically select C Remember the conditions: C.1: Satisfied because of the leapfrog C.2: Satisfied if C includes the initial state C.3: Satisfied if we exclude points outside the slice u C.4: Satisfied if we exclude states that couldn’t generate B
  • 38. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Satisfying Detailed-Balance We’ve defined p(B|q, p, u, ε) → deterministically select C Remember the conditions: C.1: Satisfied because of the leapfrog C.2: Satisfied if C includes the initial state C.3: Satisfied if we exclude points outside the slice u C.4: Satisfied if we exclude states that couldn’t generate B Finally we select (q’,p’) at random from C
  • 39. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion
  • 40. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion
  • 41. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Efficient NUTS The computational cost per leapfrog step is comparable with HMC (just 2^{j+1} − 2 more inner products). However: • it requires storing 2^j position–momentum states • long jumps are not guaranteed • time is wasted if a stopping criterion is met during a doubling.
  • 42. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Efficient NUTS The authors address these issues as follows: • time is wasted if a stopping criterion is met during a doubling → as soon as a stopping rule is met, break out of the loop.
  • 43. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Efficient NUTS The authors address these issues as follows: • long jumps are not guaranteed → consider the kernel
    T(w′|w, C) = I[w′ ∈ C^new] / |C^new|   if |C^new| > |C^old|,
    T(w′|w, C) = (|C^new|/|C^old|) · I[w′ ∈ C^new] / |C^new| + (1 − |C^new|/|C^old|) · I[w′ = w]   if |C^new| ≤ |C^old|,
    where w is short for (q, p), and C^new and C^old are respectively the set of candidates introduced during the final doubling and the older elements already in C.
  • 44. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Efficient NUTS The authors address these issues as follows: • it requires storing 2^j position–momentum states • long jumps are not guaranteed → consider the kernel
    T(w′|w, C) = I[w′ ∈ C^new] / |C^new|   if |C^new| > |C^old|,
    T(w′|w, C) = (|C^new|/|C^old|) · I[w′ ∈ C^new] / |C^new| + (1 − |C^new|/|C^old|) · I[w′ = w]   if |C^new| ≤ |C^old|,
    where w is short for (q, p), and C^new and C^old are respectively the set of candidates introduced during the final doubling and the older elements already in C. Iteratively applying the above after every doubling moreover requires storing only O(j) positions (and sizes) rather than O(2^j).
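One application of this kernel after a doubling could be sketched as follows (an illustration; w_old is the current (q, p), C_new a list of candidate states from the latest doubling, n_old the number of older candidates):

    import numpy as np

    def progressive_update(w_old, n_old, C_new, rng=None):
        # Move to a uniform draw from C_new with prob min(1, |C_new|/|C_old|),
        # otherwise keep the current state; this biases the move towards the newer,
        # typically farther, half of the trajectory.
        rng = np.random.default_rng() if rng is None else rng
        n_new = len(C_new)
        if rng.uniform() < min(1.0, n_new / n_old):
            return C_new[rng.integers(n_new)]
        return w_old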
  • 45. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Outline 1 Hamiltonian MCMC 2 NUTS 3 ε tuning Adaptive MCMC Random ε 4 Numerical Results 5 Conclusion
  • 46. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Adaptive MCMC Classic adaptive MCMC idea: stochastic optimization on the parameter with vanishing adaptation: θ_{t+1} ← θ_t − η_t H_t, where η_t → 0 at the ”correct” pace and H_t = δ − α_t encodes a characteristic of interest of the chain. Example Consider the case of a random-walk MH with x′ = x_t + ξ, ξ ∼ N(0, θ_t), adapted on the acceptance rate (α_t) in order to reach the desired optimum δ = 0.234.
  • 47. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Double Averaging This idea has, however, an intrinsic flaw (here in particular): • the diminishing step sizes η_t give more weight to the early iterations.
  • 48. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Double Averaging This idea has, however, an intrinsic flaw (here in particular): • the diminishing step sizes η_t give more weight to the early iterations. The authors rely instead on the following scheme (introduced by Nesterov (2009) for stochastic convex optimization, and applied in practice to log ε):
    ε_{t+1} = µ − (√t / γ) · (1/(t + t_0)) Σ_{i=1}^{t} H_i   (adaptation)
    ε̃_{t+1} = η_t ε_{t+1} + (1 − η_t) ε̃_t   (sampling)
    where γ controls the shrinkage towards µ, t_0 damps the early iterations, and η_t = t^{−k} with k ∈ (0.5, 1].
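A sketch of one iteration of this scheme in its recursive form, assuming it is run on log ε as in the paper, with H̄ the running average of the H_i and γ, t_0, κ at the paper's suggested defaults:

    import numpy as np

    def dual_averaging_step(t, H_bar, log_eps_bar, H_t, mu,
                            gamma=0.05, t0=10.0, kappa=0.75):
        # H_t = delta - alpha_t for iteration t = 1, 2, ...
        H_bar = (1.0 - 1.0 / (t + t0)) * H_bar + (1.0 / (t + t0)) * H_t   # running average of H_i
        log_eps = mu - np.sqrt(t) / gamma * H_bar                         # adaptation update
        eta = t ** (-kappa)
        log_eps_bar = eta * log_eps + (1.0 - eta) * log_eps_bar           # smoothed (sampling) update
        # use exp(log_eps) while adapting and exp(log_eps_bar) after warm-up
        return H_bar, log_eps, log_eps_bar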
  • 49. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Select the criterion The authors advise using H_t = δ − α_t with δ = 0.65 and
    HMC: α_t = min{1, P(q′, −p′) / P(q_t, −p_t)}
    NUTS: α_t = (1/|B_t^final|) Σ_{(q, p) ∈ B_t^final} min{1, P(q, −p) / P(q_{t−1}, −p_t)}
  • 50. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Alternative: random ε Periodicity is still a problem, and the ”optimal” ε may differ across regions of the target (e.g. mixtures)! A solution might be to randomize ε around some ε_0. Example ε ∼ E(ε_0), or ε ∼ U(c_1 ε_0, c_2 ε_0) with c_1 < 1, c_2 > 1.
  • 51. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Alternative: random ε Periodicity is still a problem, and the ”optimal” ε may differ across regions of the target (e.g. mixtures)! A solution might be to randomize ε around some ε_0. This is helpful especially for HMC; care is needed for NUTS, as L is usually inversely proportional to ε. What was said about adaptation can be combined with this by using ε_0 = ε_t.
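A tiny sketch of such a randomization (the values c_1 = 0.8 and c_2 = 1.2 are illustrative, not from the slides):

    import numpy as np

    def jitter_eps(eps0, how="uniform", c1=0.8, c2=1.2, rng=None):
        # Draw a randomized step size around eps0 before each trajectory.
        rng = np.random.default_rng() if rng is None else rng
        if how == "exponential":
            return rng.exponential(eps0)           # mean eps0
        return rng.uniform(c1 * eps0, c2 * eps0)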
  • 52. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Outline 1 Hamiltonian MCMC 2 NUTS 3 ε tuning 4 Numerical Results Paper’s examples Multivariate Normal 5 Conclusion
  • 53. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Numerical Examples ”It is at least as good as an ’optimally’ tuned HMC.” Tested on: 1 (MVN): a 250-d Normal whose precision matrix was sampled from a Wishart with identity scale matrix and 250 df 2 (LR): the (24+1)-dim regression coefficient distribution on the German credit data with weak priors 3 (HLR): same dataset, with exponentially distributed variance hyperparameters and two-way interactions included → (300+1)*2-d predictors 4 (SV): 3000 days of returns from the S&P 500 index → 3001-d target following this scheme: τ ∼ E(100); ν ∼ E(100); s_1 ∼ E(100); log s_i ∼ N(log s_{i−1}, τ⁻¹); (log y_i − log y_{i−1}) / s_i ∼ t_ν.
  • 54. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion ε and L in NUTS Also the dual averaging usually does a good job of coercing the statistic H to its desired value. Most of the trajectory lengths are integer powers of two, indicating that the U-turn criterion is usually satisfied only after a doubling is completed, which is desirable since it means that we only occasionally have to throw out entire half-trajectories to satisfy DB.
  • 55. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion
  • 56. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Multivariate Normal Highly correlated, 30-dim Multivariate Normal To get the full advantage of HMC, the trajectory has to be long enough (but not much longer) that in the least constrained direction the end-point is distant from the start point.
  • 57. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Multivariate Normal Highly correlated, 30-dim Multivariate Normal
  • 58. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Multivariate Normal Highly correlated, 30-dim Multivariate Normal The trajectory can reverse direction before the least constrained direction has been explored. This can produce a U-turn when the trajectory is much shorter than is optimal.
  • 59. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Multivariate Normal Highly correlated, 30-dim Multivariate Normal Fortunately, too-short trajectories are cheap relative to long-enough trajectories! But this does not help in avoiding random-walk behavior.
  • 60. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Multivariate Normal One way to overcome the problem could be to check for U-turns only in the least constrained direction. But usually such information is not known a priori. Maybe after a trial run to discover the order of magnitude of the variables? (someone said periodicity?)
  • 61. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Outline 1 Hamiltonian MCMC 2 NUTS 3 ε tuning 4 Numerical Results 5 Conclusion Improvements Discussion
  • 62. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Non-Uniform weighting system Qin and Liu (2001) propose a weighting procedure similar to Neal's: • Generate the whole trajectory from (q_0, p_0) = (q, p) to (q_L, p_L) • Select a point in the ”accept window” [(q_{L−W+1}, p_{L−W+1}), . . . , (q_L, p_L)], say (q_{L−k}, p_{L−k}) • equivalent to selecting k ∼ U(0, W − 1).
  • 63. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Non-Uniform weighting system Qin and Liu (2001) propose a weighting procedure similar to Neal's: • Generate the whole trajectory from (q_0, p_0) to (q_L, p_L) • Select a point (q_{L−k}, p_{L−k}) • Take k steps backward from (q_0, p_0) to (q_{−k}, p_{−k}) • Reject window: [(q_{−k}, p_{−k}), . . . , (q_{W−k−1}, p_{W−k−1})].
  • 64. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Non-Uniform weighting system Qin and Liu (2001) propose a weighting procedure similar to Neal's: • Generate the whole trajectory from (q_0, p_0) to (q_L, p_L) • Select a point (q_{L−k}, p_{L−k}) • Take k steps backward from (q_0, p_0) to (q_{−k}, p_{−k}) • Accept (q_{L−k}, p_{L−k}) with probability min{1, Σ_{i=1}^{W} w_i P(q_{L−W+i}, −p_{L−W+i}) / Σ_{i=1}^{W} w_i P(q_{i−k−1}, −p_{i−k−1})}.
  • 65. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Non-Uniform weighting system Qin and Liu (2001) propose a weighting procedure similar to Neal's: • Accept (q_{L−k}, p_{L−k}) with probability min{1, Σ_{i=1}^{W} w_i P(q_{L−W+i}, −p_{L−W+i}) / Σ_{i=1}^{W} w_i P(q_{i−k−1}, −p_{i−k−1})} • With uniform weights w_i = 1/W ∀i • Other weights may favor states further from the start!
  • 66. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Riemann Manifold HMC • In statistical modeling the parameter space is a manifold • d(p(y|θ), p(y|θ + δθ)) can be defined as δθᵀ G(θ) δθ, where G(θ) is the expected Fisher information matrix [Rao (1945)] • G(θ) defines a position-specific Riemann metric. Girolami & Calderhead (2011) used this to tune K(p) using local 2nd-order derivative information encoded in M = G(q).
  • 67. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Riemann Manifold HMC Girolami & Calderhead (2011) used this to tune K(p) using local 2nd-order derivative information encoded in M = G(q). The deterministic proposal is guided not only by the gradient of the target density but also exploits the local geometric structure. Possible optimality: paths produced by the solution of the Hamiltonian equations follow the geodesics on the manifold.
  • 68. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Riemann Manifold HMC Girolami & Calderhead (2011) used this to tune K(p) using local 2nd-order derivative information encoded in M = G(q). Practically: p|q ∼ N(0, G(q)), resolving the scaling issues in HMC • tuning of ε is less critical • the Hamiltonian becomes non-separable: H(q, p) = U(q) + ½ log((2π)^D |G(q)|) + ½ pᵀ G(q)⁻¹ p • need for an (expensive) implicit integrator (generalized leapfrog + fixed-point iterations).
  • 69. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion RMHMC & Related ”In some cases, the computational overhead for solving implicit equations undermines RMHMC's benefits” [Lan et al. (2012)] Lagrangian Dynamics (RMLMC) Replaces the momentum p (mass × velocity) in HMC by the velocity v, with a volume correction through the Jacobian • semi-implicit integrator for ”Hamiltonian” dynamics • different, fully explicit integrator for Lagrangian dynamics, at the cost of two extra matrix inversions to update v. Adaptively Updated (AUHMC) Replaces G(q) with M(q, q′) = ½ [G(q) + G(q′)] • fully explicit leapfrog integrator (M(q, q′) is constant along the trajectory) • less locally adaptive (especially for long trajectories) • is it really reversible?
  • 70. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Split Hamiltonian Variations on HMC obtained by using discretizations of Hamiltonian dynamics that split H into H(q, p) = H_1(q, p) + H_2(q, p) + · · · + H_K(q, p). This may allow much of the movement to be done at low cost. • write U(q) as minus the log of a Gaussian plus a second term: quadratic forms allow for an explicit solution • H_1 and its gradient can be evaluated quickly, with only a slowly-varying H_2 requiring costly computations, e.g. splitting the data.
  • 71. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Split Hamiltonian • write U(q) as minus the log of a Gaussian plus a second term: quadratic forms allow for an explicit solution • H_1 and its gradient can be evaluated quickly, with only a slowly-varying H_2 requiring costly computations, e.g. splitting the data.
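As an illustrative sketch of such a splitting, assuming the simplest case U(q) = q·q/2 + U_1(q) with unit mass, so that the Gaussian-plus-kinetic part is solved exactly as a rotation and only grad_U1 requires costly evaluations:

    import numpy as np

    def split_step(q, p, grad_U1, eps):
        # Strang splitting: half kick from the slow part U_1, exact flow of the
        # Gaussian/kinetic part (a rotation in (q, p)), then another half kick.
        p = p - 0.5 * eps * grad_U1(q)
        q, p = q * np.cos(eps) + p * np.sin(eps), -q * np.sin(eps) + p * np.cos(eps)
        p = p - 0.5 * eps * grad_U1(q)
        return q, p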
  • 72. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion split-wRMLNUTS & co. • NUTS itself provides a nice automatic tuning of HMC • very few citations of their work so far! • integration with RMHMC (just out!) may overcome the ε-related problems and provide better criteria for the stopping rule! HMC may not be the definitive sampler, but it is definitely useful and still too little known. The directions currently being taken may finally overcome the difficulties connected with its use and help it spread.
  • 73. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion Betancourt (2013b) • RMHMC has smoother movements on the surface but.. • ..(q′ − q)ᵀ p no longer has a meaning there. The paper addresses this issue by generalizing the stopping rule to other mass matrices M.
  • 74. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion References I Introduction: • M.D. Hoffman & A. Gelman (2011) - The No-U-Turn Sampler • R.M. Neal (2011) - MCMC using Hamiltonian dynamics, in Handbook of Markov Chain Monte Carlo • Z.S. Qin and J.S. Liu (2001) - Multipoint Metropolis method with application to hybrid Monte Carlo • R.M. Neal (2003) - Slice Sampling
  • 75. Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion References II Further Readings: • M. Girolami, B. Calderhead (2011) - Riemann manifold Langevin and Hamiltonian Monte Carlo methods • Z. Wang, S. Mohamed, N. de Freitas (2013) - Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers • S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami (2012) - Lagrangian Dynamical Monte Carlo • M. Burda, JM. Maheu (2013) - Bayesian Adaptively Updated Hamiltonian Monte Carlo with an Application to High-Dimensional BEKK GARCH Models • Michael Betancourt (2013) - Generalizing the No-U-Turn Sampler to Riemannian Manifolds