Upcoming SlideShare
×

Like this presentation? Why not share!

# HMC and NUTS

## on Apr 25, 2013

• 808 views

Presentation of the NUTS Algorithm by M. Hoffmann and A. Gelman

Presentation of the NUTS Algorithm by M. Hoffmann and A. Gelman

(disclamer: informal work, the huge amount of interesting work by R.Neal is not entirely referenced)

### Views

Total Views
808
Views on SlideShare
774
Embed Views
34

Likes
8
7
0

## HMC and NUTSPresentation Transcript

• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionNo - U - Turn Samplerbased on M.D.Hoﬀmann and A.Gelman (2011)Marco BanterlePresented at the ”Bayes in Paris” reading group11 / 04 / 2013
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOutline1 Hamiltonian MCMC2 NUTS3 ε tuning4 Numerical Results5 Conclusion
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOutline1 Hamiltonian MCMCHamiltonian Monte CarloOther useful tools2 NUTS3 ε tuning4 Numerical Results5 Conclusion View slide
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionHamiltonian dynamicHamiltonian MC techniques have a nice foundation in physicsDescribe the total energy of a system composed by a frictionlessparticle sliding on a (hyper-)surfacePosition q ∈ Rd and momentum p ∈ Rd are necessary to deﬁne thetotal energy H(q, p), which is usually formalized through the sumof a potential energy term U(q) and a kinetic energy term K(p).Hamiltonian dynamics are characterized by∂q∂t=∂H∂p,∂p∂t= −∂H∂qthat describe how the system changes though time. View slide
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionProperties of the Hamiltonian• Reversibility : the mapping from (q, p)t to (q, p)t+s is 1-to-1• Conservation of the Hamiltonian : ∂H∂t = 0total energy is conserved• Volume preservation : applying the map resulting fromtime-shifting to a region R of the (q, p)-space do not changethe volume of the (projected) region• Symplecticness : stronger condition than volumepreservation.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionWhat about MCMC?H(q, p) = U(q) + K(p)We can easily interpret U(q) as minus the log target density forthe variable of interest q, while p will be introduced artiﬁcially.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionWhat about MCMC?H(q, p) = U(q) + K(p) → − log (π(q|y)) − log(f (p))Let’s examine its properties under a statistical lens:• Reversibility : MCMC updates that use these dynamics leavethe desired distribution invariant• Conservation of the Hamiltonian : Metropolis updatesusing Hamiltonians are always accepted• Volume preservation : we don’t need to account for anychange in volume in the acceptance probability for Metropolisupdates (no need to compute the determinant of the Jacobianmatrix for the mapping)
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusionthe LeapfrogHamilton’s equations are not always1 explicitly available, hence theneed for time discretization with some small step-size ε.A numerical integrator which serves our scopes is called LeapfrogLeapfrog IntegratorFor j = 1, . . . , L1 pt+ε/2 = pt − (ε/2)∂U∂q (qt)2 qt+ε = qt + ε∂K∂p (pt+ε/2)3 pt+ε = pt+ε/2 − (ε/2)∂U∂q (qt+ε)t = t + ε1unless they’re quadratic form
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionA few more wordsThe kinetic energy is usually (for simplicity) assumed to beK(p) = pT M−1p which correspond to p|q ≡ p ∼ N(0, M).This ﬁnally implies that in the leapfrog ∂K∂p (pt+ε/2) = M−1pt+ε/2Usually M, for which few guidance exists, is taken to be diagonaland often equal to the identity matrix.The leapfrog method preserves volume exactly and due to itssymmetry it is also reversible by simply negating p, applying thesame number L of steps again, and then negating p again2.It does not however conserve the total energy and thus thisdeterministic move to (q , p ) will be accepted with probabilitymin 1, exp(−H(q , p ) + H(q, p))2negating ε serves the same purpose
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionHMC algorithmWe now have all the elements to construct an MCMC methodbased on the Hamiltoninan dynamics:HMCGiven q0, ε, L, MFor i = 1, . . . , N1 p ∼ N(0, M)2 Set qi ← qi−1, q ← qi−1, p ← p3 for j = 1, . . . , Lupdate(q , p )through leapfrog4 Set qi ← q with probabilitymin 1, exp(−H(q , p ) + H(qi−1, pi−1))
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionHMC beneﬁtsHMC make use of gradient information to move across the spaceand so its typical trajectories do not resemble random-walks.Moreover the error in the Hamiltonian stays bounded3 and hencewe also have high acceptance rate.3even if theoretical justiﬁcation for that are missing
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionHMC beneﬁtsRandom-walk:• require changes proposed with magnitude comparable to thesd in the most constrained direction (square root of thesmallest eigenvalue of the covariance matrix)• to reach a almost-independent state need a number ofiterations mostly determined by how long it takes to explorethe less constrained direction• proposals have no tendency to move consistently in the samedirection.For HMC the proposal move accordingly to the gradient for L step,even if ε is still constrained by the smallest eigenvalue.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionConnected diﬃcultiesPerformance depends strongly on choosing suitable values for εand L.• ε too large → inaccurate simulation & high reject rate• ε too small → wasted computational power (small steps)• L too small → random walk behavior and slow mixing• L too large → trajectories retrace their steps → U-TURN
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionConnected diﬃcultiesPerformance depends strongly on choosing suitable values for εand L.• ε too large → inaccurate simulation & high reject rate• ε too small → wasted computational power (small steps)• L too small → random walk behavior and slow mixing• L too large → trajectories retrace their steps → U-TURNeven worse: PERIODICITY!!
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionPeriodicityErgodicity of HMC may fail if a produced trajectory of length (Lε)is an exact periodicity for some state.ExampleConsider q ∼ N(0, 1), then H(q, p) = q2/2 + p2/2 and theresulting Hamiltonian isdqdt= pdpdt= −qand hence has an exact solutionq(Lε) = r cos(a + Lε), p(Lε) = −r sin(a + Lε)for some real a and r.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionPeriodicityExampleq ∼ N(0, 1), H(q, p) = q2/2 + p2/2dqdt= pdpdt= −qq(Lε) = r cos(a + Lε), p(Lε) = −r sin(a + Lε)If we chose a trajectory length such that Lε = 2π at the end of theiteration we will return to the starting point!
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionPeriodicityExampleq ∼ N(0, 1), H(q, p) = q2/2 + p2/2dqdt= pdpdt= −qq(Lε) = r cos(a + Lε), p(Lε) = −r sin(a + Lε)If we chose a trajectory length such that Lε = 2π at the end of theiteration we will return to the starting point!• Lε near the period makes HMC ergodic but practically useless• interactions between variables prevent exact periodicities, butnear periodicities might still slow HMC considerably!
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOther useful tools - Windowed HMCThe leapfrog introduces ”random” errors in H at each step andhence the acceptance probability may have an high variability.Smoothing (over windows of states) out this oscillations could leadto higher acceptance rates.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOther useful tools - Windowed HMCThe leapfrog introduces ”random” errors in H at each step andhence the acceptance probability may have an high variability.Smoothing (over windows of states) out this oscillations could leadto higher acceptance rates.Windows mapWe map (q, p) → [(q0, p0), . . . , (qW −1, pW −1)] by writing(q, p) = (qs, ps), s ∼ U(0, W − 1)and deterministically recover the other states. This windows hasprobability densityP([(q0, p0), . . . , (qW −1, pW −1)]) =1WW −1i=0P(qi , pi )
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionWindowed HMCSimilarly we perform L - W + 1 leapfrog steps starting from(qW −1, pW −1), up to (qL, pL) and then accept the window[(qL−W +1, −pL−W +1), . . . , (qL, −pL)] with probabilitymin 1,Li=L−W +1 P(qi , pi )W −1i=0 P(qi , pi )and ﬁnally select (qa, −pa) with probabilityP(qa, pa)i∈accepted window P(qi , pi )
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSlice SamplerSlice Sampling ideaSampling from f (x) is equivalent to uniform sampling fromSG(f ) = {(x, y)|0 ≤ y ≤ f (x)}the subgraph of fWe simply make use of this by sampling from f with an auxiliaryvariable Gibbs sampling:• y|x ∼ U(0, f (x))• x|y ∼ USy where Sy = {x|y ≤ f (x)} is the slice
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOutline1 Hamiltonian MCMC2 NUTSIdeaFlow of the simplest NUTSA more eﬃcient implementation3 ε tuning4 Numerical Results5 Conclusion
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionWhat we (may) have and want to avoidOpacity and size grows with leapfrog steps madeIn black start and end-point
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMain ideaDeﬁne a criterion that helps us avoid these U-Turn, stopping whenwe simulated ”long enough”: instantaneous distance gainC(q, q ) =∂∂t12(q −q)T(q −q) = (q −q)T ∂∂t(q −q) = (q −q)TpSimulating until C(q, q ) < 0 lead to a non-reversible MC, so theauthors devised a diﬀerent scheme.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMain Idea• NUTS augments the model with a slice variable u• Add a ﬁnite set C of candidates for the update• C ⊆ B deterministically chosen,with B the set of all leapfrog steps
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMain Idea• NUTS augments the model with a slice variable u• Add a ﬁnite set C of candidates for the update• C ⊆ B deterministically chosen,with B the set of all leapfrog steps(At a high level) B is built by doubling and checking C(q, q ) onsub-trees
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionHigh level NUTS steps1 resample momentum p ∼ N(0, I)2 sample u|q, p ∼ U[0, exp(−H(qt, p))]3 generate the proposal from p(B, C|qt, p, u, ε)4 sample (qt+1, p) ∼ T(·|qt, p, C)T(·|qt, p, C) s.t. leave the uniform distribution over C invariant(C contain (q , p ) s.t. u ≤ exp(−H(q , p )) and sat. reversibility)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionJustify T(q , p |qt, p, C)Conditions on p(B, C|q, p, u, ε):C.1: Elements in C chosen in a volume-preserving wayC.2: p((q, p) ∈ C|q, p, u, ε) = 1C.3: p(u ≤ exp(−H(q , p ))|(q , p ) ∈ C) = 1C.4: If (q, p) ∈ C and (q , p ) ∈ C thenp(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)by themp(q, p|u, B, C, ε) ∝ p(B, C|q, p, u, ε)p(q, p|u)∝ p(B, C|q, p, u, ε)I{u≤exp(−H(q ,p ))} C.1∝ IC C.2&C.4 − C.3
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusionp(B, C|q, p, u, ε) - Building B by doublingBuild B by repeatedly doubling a binary tree with (q,p) leaves.• Chose a random ”direction in time” νj ∼ U({−1, 1})• Take 2j leap(frog)s of size νj ε from(q−, p−)I{−1}(νj ) + (q+, p+)I{+1}(νj )• Continue until a stopping rule is met
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusionp(B, C|q, p, u, ε) - Building B by doublingBuild B by repeatedly doubling a binary tree with (q,p) leaves.• Chose a random ”direction in time” νj ∼ U({−1, 1})• Take 2j leap(frog)s of size νj ε from(q−, p−)I{−1}(νj ) + (q+, p+)I{+1}(νj )• Continue until a stopping rule is metGiven the start (q, p) and ε : 2j equi-probable height-j treesReconstructing a particular height-j tree from any leaf has prob 2−j
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusionp(B, C|q, p, u, ε) - Building B by doublingBuild B by repeatedly doubling a binary tree with (q,p) leaves.• Chose a random ”direction in time” νj ∼ U({−1, 1})• Take 2j leap(frog)s of size νj ε from(q−, p−)I{−1}(νj ) + (q+, p+)I{+1}(νj )• Continue until a stopping rule is metGiven the start (q, p) and ε : 2j equi-probable height-j treesReconstructing a particular height-j tree from any leaf has prob 2−jPossible stopping rule: at height j• for one of the 2j − 1 subtrees(q+, q−)Tp−< 0 or (q+, q−)Tp+< 0• the tree includes a leaf s.t. log(u) − H(q, p) > ∆max
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSatisfying Detailed-BalanceWe’ve deﬁned p(B|q, p, u, ε) → deterministically select CRemember the conditions:C.1: Elements in C chosen in a volume-preserving wayC.2: p((q, p) ∈ C|q, p, u, ε) = 1C.3: p(u ≤ exp(−H(q , p ))|(q , p ) ∈ C) = 1C.4: If (q, p) ∈ C and (q , p ) ∈ C thenp(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSatisfying Detailed-BalanceWe’ve deﬁned p(B|q, p, u, ε) → deterministically select CRemember the conditions:C.1: Satisﬁed because of the leapfrogC.2: p((q, p) ∈ C|q, p, u, ε) = 1C.3: p(u ≤ exp(−H(q , p ))|(q , p ) ∈ C) = 1C.4: If (q, p) ∈ C and (q , p ) ∈ C thenp(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSatisfying Detailed-BalanceWe’ve deﬁned p(B|q, p, u, ε) → deterministically select CRemember the conditions:C.1: Satisﬁed because of the leapfrogC.2: Satisﬁed if C includes the initial stateC.3: p(u ≤ exp(−H(q , p ))|(q , p ) ∈ C) = 1C.4: If (q, p) ∈ C and (q , p ) ∈ C thenp(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSatisfying Detailed-BalanceWe’ve deﬁned p(B|q, p, u, ε) → deterministically select CRemember the conditions:C.1: Satisﬁed because of the leapfrogC.2: Satisﬁed if C includes the initial stateC.3: Satisﬁed if we exclude points outside the slice uC.4: If (q, p) ∈ C and (q , p ) ∈ C thenp(B, C|q, p, u, ε) = p(B, C|q , p , u, ε)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSatisfying Detailed-BalanceWe’ve deﬁned p(B|q, p, u, ε) → deterministically select CRemember the conditions:C.1: Satisﬁed because of the leapfrogC.2: Satisﬁed if C includes the initial stateC.3: Satisﬁed if we exclude points outside the slice uC.4: As long as C given B is deterministicp(B, C|q, p, u, ε) = 2−jor p(B, C|q, p, u, ε) = 0
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSatisfying Detailed-BalanceWe’ve deﬁned p(B|q, p, u, ε) → deterministically select CRemember the conditions:C.1: Satisﬁed because of the leapfrogC.2: Satisﬁed if C includes the initial stateC.3: Satisﬁed if we exclude points outside the slice uC.4: Satisﬁed if we exclude states that couldn’t generate B
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSatisfying Detailed-BalanceWe’ve deﬁned p(B|q, p, u, ε) → deterministically select CRemember the conditions:C.1: Satisﬁed because of the leapfrogC.2: Satisﬁed if C includes the initial stateC.3: Satisﬁed if we exclude points outside the slice uC.4: Satisﬁed if we exclude states that couldn’t generate BFinally we select (q’,p’) at random from C
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionEﬃcient NUTSThe computational cost per-leapfrog step is comparable with HMC(just 2j+1 − 2 more inner products)However:• it requires to store 2j position-momentum states• long jumps are not guaranteed• waste time if a stopping criterion is met during doubling
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionEﬃcient NUTSThe authors address these issues as follow:• waste time if a stopping criterion is met during doublingas soon as a stopping rule is met break out of the loop
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionEﬃcient NUTSThe authors address these issues as follow:• long jumps are not guaranteedConsider the kernel:T(w |w, C) =I[w ∈Cnew]|Cnew | if |Cnew| > |Cold|,I[w ∈Cnew]|Cnew ||Cnew||Cold |+ (1 − |Cnew||Cold |)I[w = w] if |Cnew| ≤ |Cold|,where w is short for (q,p) and let Cnewand Coldbe respectively the set Cintroduced during the ﬁnal iteration and the older elements already in C.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionEﬃcient NUTSThe authors address these issues as follow:• it requires to store 2j position-momentum states• long jumps are not guaranteedConsider the kernel:T(w |w, C) =I[w ∈Cnew]|Cnew | if |Cnew| > |Cold|,I[w ∈Cnew]|Cnew ||Cnew||Cold |+ (1 − |Cnew||Cold |)I[w = w] if |Cnew| ≤ |Cold|,where w is short for (q,p) and let Cnewand Coldbe respectively the set Cintroduced during the ﬁnal iteration and the older elements already in C.Iteratively applying the above after every doubling moreover require thatwe store only O(j) position (and sizes) rather than O(2j)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOutline1 Hamiltonian MCMC2 NUTS3 ε tuningAdaptive MCMCRandom ε4 Numerical Results5 Conclusion
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionAdaptive MCMCClassic adaptive MCMC idea:stochastic optimization on the parameter with vanishingadaptation:θt+1 ← θt − ηtHtwhere R ηt → 0 with the ”correct” pace andHt = δ − αt deﬁnes a characteristic of interest of the chain.ExampleConsider the case of a random-walk MH with x |xt ∼ N(0, θt)adapted on the acceptance rate (αt) in order to get to the desiredoptimum δ = 0.234
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionDouble AveragingThis idea has however (in particoular) an intrinsic ﬂaw:• the diminishing step sizes ηt give more weight to the earlyiterations
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionDouble AveragingThis idea has however (in particoular) an intrinsic ﬂaw:• the diminishing step sizes ηt give more weight to the earlyiterationsThe authors rely then on the following scheme3:εt+1 = µ −1γ√tt + t0ti=1Hi ˜εt+1 = ηtεt+1 + (1 − ηt)˜εtadaptationγ shrinkage → µt0 early iterationsamplingηt = t−kk ∈ (0.5, 1]3introduced by Nesterov (2009) for stochastic convex optimization
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSelect the criterionThe authors advice to use Ht = δ − αt with δ = 0.65 andαHMCt = min 1,P(q , −p )P(qt, −pt)αNUTSt =1|Bﬁnalt |(qt ,pt )∈Bﬁnaltmin 1,P(qt, −pt)P(qt−1, −pt)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionAlternative: random εPeriodicity is still a problem and the ”optimal” ε may diﬀer indiﬀerent region of the target! (e.g. mixtures)The solution might be to randomize ε around some ε0Exampleε ∼ E(ε0) ε ∼ U(c1ε0, c2ε0)c1 < 1, c2 > 1
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionAlternative: random εPeriodicity is still a problem and the ”optimal” ε may diﬀer indiﬀerent region of the target! (e.g. mixtures)The solution might be to randomize ε around some ε0Helpfull especially for HMC, care is needed for NUTS as L isusually inversely proportional to εWhat said about adaptation can be combined using ε0 = εt
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOutline1 Hamiltonian MCMC2 NUTS3 ε tuning4 Numerical ResultsPaper’s examplesMultivariate Normal5 Conclusion
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionNumerical Examples”It at least as good as an ’optimally’ tuned HMC.”Tested on:1 (MVN) : 250-d Normal whose precision matrix was sampledfrom Wishart with identity scale matrix and 250 df2 (LR) : (24+1)-dim regression coeﬃcient distribution on theGerman credit data with weak priors3 (HLR) : same dataset - exponentially distributed variancehyper parameters and included two-way interactions →(300+1)*2-d predictors4 (SV) : 3000 days of returns from S&P 500 index → 3001-dtarget following this scheme:τ ∼ E(100); ν ∼ E(100); s1 ∼ E(100);log si ∼ N(log si−1, τ−1);log yi − log yi−1si∼ tν.
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusionε and L in NUTSAlso the dual averaging usually does a good job of coercing thestatistic H to its desired value.Most of the trajectory lengths are integer powers of two, indicatingthat the U-turn criterion is usually satisﬁed only after a doubling iscompleted, which is desirable since it means that we onlyoccasionally have to throw out entire half-trajectories to satisfy DB.
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusion
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMultivariate NormalHighly correlated, 30-dim Multivariate NormalTo get the full advantage of HMC, the trajectory has to be longenough (but not much longer) that in the least constraineddirection the end-point is distant from the start point.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMultivariate NormalHighly correlated, 30-dim Multivariate Normal
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMultivariate NormalHighly correlated, 30-dim Multivariate NormalTrajectory reverse direction before the least constrained directionhas been explored.This can produce a U-Turn when the trajectory is much shorterthan is optimal.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMultivariate NormalHighly correlated, 30-dim Multivariate NormalFortunately, too-short trajectories are cheap relative tolong-enough trajectories!but it does not help in avoiding RW
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionMultivariate NormalOne way to overcome the problem could be to check for U-Turnonly in the least constrainedBut usually such an information is not known a-priori.Maybe after a trial run to discover the magnitude order of thevariables?someone said periodicity?
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionOutline1 Hamiltonian MCMC2 NUTS3 ε tuning4 Numerical Results5 ConclusionImprovementsDiscussion
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionNon-Uniform weighting systemQin and Liu (2001) propose a similar (to Neal’s) weightingprocedure:• Generate the whole trajectory from (q0, p0)= (q, p) to (qL, pL)• Select a point in the ”accept window”[(qL−W +1, pL−W +1), . . . , (qL, pL)], say (qL−k, pL−k)• equivalent to select k ∼ U(0, W )
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionNon-Uniform weighting systemQin and Liu (2001) propose a similar (to Neal’s) weightingprocedure:• Generate the whole trajectory from (q0, p0) to (qL, pL)• Select a point (qL−k, pL−k)• Take k step backward from (q0, p0) to (q−k, p−k)• Reject window : [(q−k, p−k), . . . , (qW −k−1, pW −k−1)]
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionNon-Uniform weighting systemQin and Liu (2001) propose a similar (to Neal’s) weightingprocedure:• Generate the whole trajectory from (q0, p0) to (qL, pL)• Select a point (qL−k, pL−k)• Take k step backward from (q0, p0) to (q−k, p−k)• Accept (qL−k, pL−k) with probabilitymin1,Wi=1wi P(qL−W +i , −pL−W +i )Wi=1wi P(qi−k−1, −pi−k−1)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionNon-Uniform weighting systemQin and Liu (2001) propose a similar (to Neal’s) weightingprocedure:• Accept (qL−k, pL−k) with probabilitymin1,Wi=1wi P(qL−W +i , −pL−W +i )Wi=1wi P(qi−k−1, −pi−k−1)• With uniform weights wi = 1/W ∀i• Other weights may favor states further from the start!
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionRiemann Manifold HMC• In statistical modeling the parameter space is a manifold• d(p(y|θ), p(y|θ + δθ)) can be deﬁned as δθT G(θ)δθG(θ) is the expected Fisher information matrix [Rao (1945)]• G(θ) deﬁnes a position-speciﬁc Riemann metric.Girolami & Calderhead (2011) used this to tune K(p) using local2nd -order derivative information encoded in M = G(q)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionRiemann Manifold HMCGirolami & Calderhead (2011) used this to tune K(p) using local2nd -order derivative information encoded in M = G(q)The deterministic proposal is guided not only by the gradient ofthe target density but also exploits local geometric structure.Possible optimality : paths that are produced by the solution ofHamiltoninan equations follow the geodesics on the manifold.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionRiemann Manifold HMCGirolami & Calderhead (2011) used this to tune K(p) using local2nd -order derivative information encoded in M = G(q)Practically:p|q ∼ N(0, G(q)) resolving the scaling issues in HMC• tuning of ε less critical• non separable HamiltonianH(q, p) = U(q) +12log (2π)D|G(q)| +12pTG(q)−1p• need for an (expensive) implicit integrator(generalized leapfrog + ﬁxed point iterations)
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionRMHMC & Related”In some cases, the computational overhead for solving implicitequations undermines RMHMCs beneﬁts” . [Lan et al. (2012)]Lagrangian Dynamics (RMLMC)Replaces momentum p (mass×velocity) in HMC by velocity vvolume correction through the Jacobian• semi-implicit integrator for ”Hamiltonian” dynamics• diﬀerent fully explicit integrator for Lagrangian dynamicstwo extra matrix inversions to update vAdaptively Updated (AUHMC)Replace G(q) with M(q, q ) = 12[G(q) + G(q )]• fully explicit leapfrog integrator (M(q, q ) constant throughtrajectory)• less local-adaptive (especially for long trajectories)• is it really reversible?
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSplit HamiltonianVariations on HMC obtained by using discretizations ofHamiltonian dynamics, splitting H into:H(q, p) = H1(q, p) + H2(q, p) + · · · + HK (q, p)This may allows much of the movement to be done at low cost.• log of U(q) as the log of a Gaussian plus a second termquadratic forms allow for explicit solution• H1 and its gradient can be eval. quickly, with only a slowly-varying H2 requiring costly computations. e.g. splitting data
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionSplit Hamiltonian• log of U(q) as the log of a Gaussian plus a second termquadratic forms allow for explicit solution• H1 and its gradient can be eval. quickly, with only a slowly-varying H2 requiring costly computations. e.g. splitting data
• Hamiltonian MCMC NUTS ε tuning Numerical Results Conclusionsplit-wRMLNUTS & co.• NUTS itself provide a nice automatic tune of HMC• very few citation to their work!• integration with RMHMC (just out!) may overcome ε-relatedproblems and provide better criteria for the stopping rule!HMC may not be the deﬁnitive sampler, but is deﬁnitely equallyuseful and unknown.It seems the direction taken may ﬁnally overcome the diﬃcultiesconnected with its use and spread.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionBetancourt (2013b)• RMHMC has smoother movements on the surface but..• .. (q − q)T p has no more a meaningThe paper address this issue by generalizing the stopping rule toother M mass matrices.
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionReferences IIntroduction:• M.D. Hoﬀmann & A. Gelman (2011) - The No-U-TurnSampler• R.M. Neal (2011) - MCMC using Hamiltonian dynamics inHandbook of Markov Chain Monte Carlo• Z. S. Qin, and J.S. Liu (2001) - Multipoint Metropolismethod with application to hybrid Monte Carlo• R.M. Neal (2003) - Slice Sampling
• Hamiltonian MCMC NUTS ε tuning Numerical Results ConclusionReferences IIFurther Readings:• M. Girolami, B. Calderhead (2011) - Riemann manifoldLangevin and Hamiltonian Monte Carlo methods• Z. Wang, S. Mohamed, N. de Freitas (2013) - AdaptiveHamiltonian and Riemann Manifold Monte Carlo Samplers• S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami (2012) -Lagrangian Dynamical Monte Carlo• M. Burda, JM. Maheu (2013) - Bayesian Adaptively UpdatedHamiltonian Monte Carlo with an Application toHigh-Dimensional BEKK GARCH Models• Michael Betancourt (2013) - Generalizing the No-U-TurnSampler to Riemannian Manifolds