# MLPI Lecture 3: Advanced Sampling Techniques

This lecture covers several advanced MCMC sampling techniques, including collapsed Gibbs sampling, slice sampling, parallel tempering, and Hamiltonian Monte Carlo, as well as several software systems for generic sampling.

1. Lecture 3: Advanced Sampling Techniques. Dahua Lin, The Chinese University of Hong Kong.
2. Overview
    - Collapsed Gibbs Sampling
    - Sampling with Auxiliary Variables
    - Slice Sampling
    - Simulated Tempering & Parallel Tempering
    - Swendsen-Wang Algorithm
    - Hamiltonian Monte Carlo
3. Collapsed Gibbs Sampling
4. Motivating Example: […] with […]. We want to sample from […].
5. Gibbs Sampling: Draw […] where […]: […] with […].
6. Gibbs Sampling (cont'd): Draw […]: […]
    - How well can this sampler perform when […]?
7. Collapsed Gibbs Sampling
    - Basic idea: replace the original conditional distribution with a conditional distribution of a marginal distribution, often called a reduced conditional distribution.
    - For the example above, we consider a marginal distribution: […]
8. Collapsed Gibbs Sampling (cont'd)
    - Draw […], with […] marginalized out, as: […]
    - Draw […]
    - Can we exchange the order of these two steps? Why?
9. Basic Guidelines
    - The order of steps matters!
    - Generally, one can move components from "being sampled" to "being conditioned on".
    - Replacing outputs with intermediates would change the stationary distribution.
    - A variable can be updated multiple times in an iteration.
10. Why do collapsed samplers often perform better than full-fledged Gibbs samplers?
11. Rao-Blackwell Theorem: Consider an example $(x, y) \sim p(x, y)$ and we want to estimate $\mu = \mathbb{E}[f(x, y)]$. Suppose we have two tractable ways to do so:
    - (1) draw $(x_i, y_i) \sim p(x, y)$ for $i = 1, \ldots, n$, and compute $\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^{n} f(x_i, y_i)$.
12. Rao-Blackwell Theorem (cont'd)
    - (2) draw $x_i \sim p_X(x)$ for $i = 1, \ldots, n$, where $p_X$ is the marginal distribution, and compute $\hat{\mu}_2 = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[f(x, y) \mid x = x_i]$.
    - Both are correct. By the strong LLN, both $\hat{\mu}_1$ and $\hat{\mu}_2$ converge to $\mu$ almost surely.
    - Which one is better? Can you justify your answer?
13. Rao-Blackwell Theorem (cont'd)
    - (Rao-Blackwell Theorem) Sample variance will be reduced when some components are marginalized out. With the setting above, we have $\mathrm{Var}\left(\mathbb{E}[f(x, y) \mid x]\right) \le \mathrm{Var}\left(f(x, y)\right)$.
    - Generally, reducing the sample variance also reduces the autocorrelation of the chain, thus improving the mixing performance.
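
A small numerical check can make the variance reduction concrete. The sketch below uses a toy model that is an assumption for illustration, not the lecture's example: $x \sim N(0, 1)$, $y \mid x \sim N(x, 1)$, and the target quantity is $\mathbb{E}[y] = 0$. The function names `plain_estimate`, `rao_blackwell_estimate`, and `compare` are hypothetical.

```python
import random
import statistics

def plain_estimate(n, rng):
    """Estimator (1): average f(x, y) = y over joint draws (x, y)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)      # x ~ N(0, 1)
        y = rng.gauss(x, 1.0)        # y | x ~ N(x, 1)
        total += y
    return total / n

def rao_blackwell_estimate(n, rng):
    """Estimator (2): average E[y | x] = x over draws from the marginal of x."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        total += x                   # y has been marginalized out analytically
    return total / n

def compare(n=200, repeats=500, seed=0):
    """Empirical variance of both estimators over repeated runs."""
    rng = random.Random(seed)
    v_plain = statistics.variance([plain_estimate(n, rng) for _ in range(repeats)])
    v_rb = statistics.variance([rao_blackwell_estimate(n, rng) for _ in range(repeats)])
    return v_plain, v_rb
```

In this toy model the per-sample variance is 2 for the plain estimator and 1 for the marginalized one, so `compare()` should report roughly a factor-of-two gap.
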
14. Sampling with Auxiliary Variables
    - The Rao-Blackwell Theorem suggests that in order to achieve better performance, one should try to marginalize out as many components as possible.
    - However, in many cases, one may want to do the opposite, that is, to introduce additional variables to facilitate the simulations.
    - For example, when the target distribution is multimodal, one may use an auxiliary variable to help the chain escape from local traps.
15. Use Auxiliary Variables
    - Specify an auxiliary variable $u$ and the joint distribution $\pi(x, u)$ such that […] for certain […].
    - Design a chain to update $(x, u)$ using the M-H algorithm or the Gibbs sampler.
    - The samples of $x$ can then be obtained through marginalization or conditioning.
16. Slice Sampling
17. Slice Sampler
    - Sampling $x \sim \pi(x)$ is equivalent to sampling uniformly from the area under $\pi$: $\{(x, u) : 0 \le u \le \pi(x)\}$.
    - The slice sampler is Gibbs sampling based on the uniform distribution over this area. Each iteration consists of two steps:
        - Given $x$, draw $u \sim \mathrm{Uniform}(0, \pi(x))$.
        - Given $u$, draw $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$.
18. Slice Sampler (Illustration)
19. Slice Sampler (Discussion)
    - The slice sampler can mix very rapidly, as it will not be locally trapped.
    - The slice sampler is often nontrivial to implement in practice. Drawing $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$ is sometimes very difficult.
    - For distributions of certain forms, which have an easy way to draw $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$, slice sampling is a good strategy.
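
The two-step slice sampler can be sketched for a case where the slice is available in closed form. The example assumes the unnormalized target $f(x) = \exp(-x^2/2)$ (a standard normal, chosen here for illustration), for which the slice $\{x : f(x) \ge u\}$ is the interval $[-w, w]$ with $w = \sqrt{-2 \ln u}$. The function name `slice_sample_gaussian` is hypothetical.

```python
import math
import random

def slice_sample_gaussian(n_samples, x0=0.0, seed=0):
    """Slice sampler for the unnormalized density f(x) = exp(-x^2 / 2)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        fx = math.exp(-0.5 * x * x)
        # Step 1 (given x): draw the height u uniformly from (0, f(x)].
        # 1 - random() lies in (0, 1], which avoids log(0) below.
        u = fx * (1.0 - rng.random())
        # Step 2 (given u): the slice {x : f(x) >= u} is the interval [-w, w].
        w = math.sqrt(-2.0 * math.log(u))
        x = rng.uniform(-w, w)
        samples.append(x)
    return samples
```

For targets whose slices are not intervals with known endpoints, step 2 is the hard part, which is the implementation difficulty the discussion above refers to.
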
20. Simulated Tempering
21. Gibbs Measure: A Gibbs measure is a probability measure with a density of the following form: $\pi(x) = \frac{1}{Z(\beta)} \exp(-\beta E(x))$. Here, $E(x)$ is called the energy function, $\beta$ is called the inverse temperature, and the normalizing constant $Z(\beta)$ depends on $\beta$.
22. Gibbs Measure (cont'd): In the MCMC sampling literature, we often parameterize a Gibbs measure using the temperature parameter $T = 1/\beta$, thus $\pi(x) = \frac{1}{Z(T)} \exp(-E(x)/T)$.
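
To see what the temperature does to a Gibbs measure, the sketch below evaluates $\exp(-E(x)/T)$ on a grid and normalizes by a sum, which is only a discrete approximation. The double-well energy `double_well` and the grid are assumptions made for illustration.

```python
import math

def gibbs_density(energy, grid, temperature):
    """Normalized weights proportional to exp(-E(x) / T) on a finite grid."""
    weights = [math.exp(-energy(x) / temperature) for x in grid]
    z = sum(weights)                 # discrete stand-in for Z(T)
    return [w / z for w in weights]

def double_well(x):
    """Hypothetical energy with minima at x = -1 and x = +1, barrier 8 at x = 0."""
    return 8.0 * (x * x - 1.0) ** 2

# Grid on [-2, 2]; index 40 is x = 0 (barrier top), index 60 is x = 1 (a mode).
grid = [(i - 40) / 20.0 for i in range(81)]
cold = gibbs_density(double_well, grid, 0.5)
hot = gibbs_density(double_well, grid, 10.0)

# Ratio of density at the barrier to density at a mode: larger means flatter.
cold_ratio = cold[40] / cold[60]
hot_ratio = hot[40] / hot[60]
```

At the low temperature the barrier is almost never visited, while at the high temperature the two modes are connected by a region of appreciable probability, which is what tempering methods exploit.
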
23. Tempered MCMC: Typical MCMC methods usually rely on local moves to explore the state space. What is the problem?
24. Tempered MCMC (cont'd): Local traps often lead to very poor mixing. Can we improve this?
25. Simulated Tempering: Suppose we intend to sample from $\pi(x) \propto \exp(-E(x)/T_1)$.
    - Basic idea: augment the target distribution by including a temperature index $k$, with joint distribution given by $\pi(x, k) \propto \rho_k \exp(-E(x)/T_k)$.
26. Simulated Tempering (cont'd)
    - We only collect samples at the lowest temperature, $T_1$.
    - The chain mixes much faster at high temperatures, but we want to collect samples at the lowest temperature. So we have to constantly switch between temperatures.
27. Simulated Tempering (Algorithm): One iteration of simulated tempering has two steps:
    - (Base transition): update $x$ at the same temperature, i.e. holding $k$ fixed.
    - (Temperature switching): with $x$ fixed, propose $k'$ with […] such that […].
    - Accept the change with probability […].
    - Any drawbacks?
28. Simulated Tempering (Discussion)
    - Set […]. Given $T_k$, we should set $T_{k+1}$ such that uphill moves from $T_k$ to $T_{k+1}$ have a considerable probability of being accepted.
    - Build the temperature ladder step by step until we have a sufficiently smooth distribution at the top.
    - The time spent on the base level […] is around […]. If we have too many levels, only a very small portion of samples can be used.
29. Simulated Tempering (Discussion)
    - All temperature levels play an important role, so it is desirable to spend a comparable amount of time at each level. Setting $\rho_k \propto 1/Z_k$ for each $k$, we have […].
    - The normalizing constants $Z_k$ are typically unknown, and estimating them is very difficult and expensive.
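
The following is a minimal sketch of simulated tempering on a state space small enough that the normalizing constants can be summed exactly, which is precisely what is not possible in realistic problems. Everything specific here is an assumption for illustration: the 20-state space, the two-well `energy`, the weights $\rho_k = 1/Z_k$, the joint form $\rho_k \exp(-E(x)/T_k)$, and the $\pm 1$ proposals for both $x$ and the temperature index.

```python
import math
import random

def energy(x):
    """Hypothetical energy on states 0..19 with wells at x = 4 and x = 15."""
    return 0.5 * min((x - 4) ** 2, (x - 15) ** 2)

def simulated_tempering(n_iters, temps, n_states=20, seed=0):
    rng = random.Random(seed)
    # Exact normalizing constants Z_k (feasible only because the space is tiny).
    z = [sum(math.exp(-energy(s) / t) for s in range(n_states)) for t in temps]

    def log_joint(x, k):
        # log of rho_k * exp(-E(x) / T_k), with the assumed choice rho_k = 1 / Z_k
        return -energy(x) / temps[k] - math.log(z[k])

    x, k = 4, 0
    visits = [0] * len(temps)        # time spent at each temperature level
    base_samples = []                # samples collected at the lowest temperature
    for _ in range(n_iters):
        # Base transition: Metropolis move on x with the level k held fixed.
        x_new = x + rng.choice((-1, 1))
        if 0 <= x_new < n_states:
            if math.log(1.0 - rng.random()) < log_joint(x_new, k) - log_joint(x, k):
                x = x_new
        # Temperature switching: Metropolis move on k with x held fixed.
        k_new = k + rng.choice((-1, 1))
        if 0 <= k_new < len(temps):
            if math.log(1.0 - rng.random()) < log_joint(x, k_new) - log_joint(x, k):
                k = k_new
        visits[k] += 1
        if k == 0:
            base_samples.append(x)
    return visits, base_samples
```

With $\rho_k = 1/Z_k$ the chain should spend about the same fraction of time at every level, so with five levels only about one fifth of the iterations yield usable samples.
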
30. Parallel Tempering: (Basic idea) rather than jumping between temperatures, parallel tempering simultaneously simulates multiple chains, each at a temperature level $T_k$, called a replica, and constantly swaps samples between replicas.
31. Parallel Tempering (Algorithm): Each iteration consists of the following steps:
    - (Parallel update): simulate each replica with its own transition kernel.
    - (Replica exchange): propose to swap states between two replicas (say the $i$-th and $j$-th, where […]): […]
32. Parallel Tempering (Algorithm)
    - The proposal is accepted with probability $\min(1, a)$, where $a = \exp\left(\left(\frac{1}{T_i} - \frac{1}{T_j}\right)\left(E(x_i) - E(x_j)\right)\right)$.
    - We collect samples from the base replica (the one with the lowest temperature).
    - Why does this algorithm produce the desired distribution?
33. Parallel Tempering (Justification): Let $\mathbf{x} = (x_1, \ldots, x_m)$. We define $\pi(\mathbf{x}) = \prod_{k=1}^{m} \pi_k(x_k)$ with $\pi_k(x) \propto \exp(-E(x)/T_k)$. Obviously, the step of parallel update preserves the invariant distribution $\pi(\mathbf{x})$.
34. Parallel Tempering (Justification): Note that the step of replica exchange is symmetric, i.e. the probabilities of going up and down are equal. Then, according to the Metropolis algorithm, the swap is accepted with probability $\min(1, a)$ with $a = \frac{\pi_i(x_j)\,\pi_j(x_i)}{\pi_i(x_i)\,\pi_j(x_j)} = \exp\left(\left(\frac{1}{T_i} - \frac{1}{T_j}\right)\left(E(x_i) - E(x_j)\right)\right)$.
35. Parallel Tempering (Discussion)
    - It is efficient and very easy to implement, especially in a parallel computing environment.
    - Tuning a parallel tempering system is often an art rather than a technique.
    - Parallel tempering is a special case of a large family of MCMC methods called Extended Ensemble Monte Carlo, which involves a collection of parallel Markov chains, with the simulation switching between them.
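
A compact sketch of parallel tempering for a one-dimensional bimodal target follows. The double-well `energy`, the random-walk Metropolis kernel with temperature-scaled step size, and the choice of swapping only adjacent replicas are assumptions for illustration; the swap acceptance uses the standard ratio $\exp\left((1/T_i - 1/T_j)(E(x_i) - E(x_j))\right)$.

```python
import math
import random

def energy(x):
    """Hypothetical double-well energy with minima at x = -2 and x = +2."""
    return 0.5 * (x * x - 4.0) ** 2

def parallel_tempering(n_iters, temps, seed=0):
    rng = random.Random(seed)
    xs = [-2.0] * len(temps)         # one state per replica, all start in the left well
    base_samples = []
    for _ in range(n_iters):
        # Parallel update: random-walk Metropolis within each replica.
        for k, t in enumerate(temps):
            prop = xs[k] + rng.gauss(0.0, 0.5 * math.sqrt(t))
            log_ratio = -(energy(prop) - energy(xs[k])) / t
            if math.log(1.0 - rng.random()) < log_ratio:
                xs[k] = prop
        # Replica exchange: propose swapping a random pair of adjacent replicas.
        i = rng.randrange(len(temps) - 1)
        j = i + 1
        log_a = (1.0 / temps[i] - 1.0 / temps[j]) * (energy(xs[i]) - energy(xs[j]))
        if math.log(1.0 - rng.random()) < log_a:
            xs[i], xs[j] = xs[j], xs[i]
        # Only the base replica (lowest temperature) contributes samples.
        base_samples.append(xs[0])
    return base_samples
```

Started in the left well, a single chain at $T = 1$ would rarely cross the barrier of height 8; with the ladder `[1, 2, 4, 8]` the base replica should visit both wells about equally often.
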
36. Swendsen-Wang Algorithm: The Swendsen-Wang algorithm (R. Swendsen and J. Wang, 1987) is an efficient Gibbs sampling algorithm for sampling from the extended Ising model.
37. Standard Ising Model: The standard Ising model is defined as […], where $x_i$ for each $i$ is called a spin, and […].
    - Gibbs sampling is extremely slow, especially when the temperature is low.
38. Extended Ising Model
    - We extend the model by introducing additional bond variables $b_{ij}$, one for each edge. Each bond has two states: $1$ indicating connected and $0$ indicating disconnected.
    - We define a joint distribution that couples the spins and bonds: […]
39. Extended Ising Model (cont'd): Here, […] is described as below:
    - When […], […] for every setting of […].
    - When […], […]
40. Extended Ising Model (cont'd): With this setting, […] can be written as: […], where […]:
    - when the two spins differ, the bond must be $0$;
    - when the two spins agree, the bond is set to zero with probability […].
41. Swendsen-Wang Algorithm: Each iteration consists of two steps:
    - (Clustering): conditioned on the spins, draw the bonds independently. For an edge $(i, j)$:
        - If $x_i \ne x_j$, set $b_{ij} = 0$.
        - If $x_i = x_j$, set $b_{ij} = 1$ with probability […] or $b_{ij} = 0$ otherwise.
42. Swendsen-Wang Algorithm
    - (Swapping): conditioned on the bonds, draw the spins.
        - For each connected component, draw one of the two spin states with equal chance, and assign the resultant value to all nodes in the component.
43. Swendsen-Wang Algorithm (Illustration): In the case of a rectangular grid, this Gibbs sampling algorithm mixes very rapidly. The following figures illustrate Gibbs sampling. Spin states up and down are shown by filled and empty circles. Bond states 1 and 0 are shown by thick lines and thin dotted lines. We start from a state with five connected components. (Remember that isolated spins count as connected components, albeit of size 1.)
    - First, update the bonds. Bonds are forbidden from forming wherever the two adjacent spins are in opposite states (the forbidden bonds are highlighted). The bonds that are not forbidden are set to the 1 state with probability $p$.
    - After updating the bonds, update the spins.
    - Then update the bonds again.
    - Other properties of the extended model: the partition function $Z$ is the same as that of the Ising model.
44. Swendsen-Wang Algorithm (Discussion)
    - When […] is large, a bond has a high probability of being set to one, i.e. the two spins it joins are likely to be connected.
    - Experiments show that the Swendsen-Wang algorithm mixes very rapidly, especially for rectangular grids.
    - Can you provide an intuitive explanation?
45. Swendsen-Wang Algorithm (Discussion)
    - The Swendsen-Wang algorithm can be generalized to Potts models (nodes can take values from a finite set).
    - The Swendsen-Wang algorithm has been widely used in image analysis applications, e.g. image segmentation (in this case, it is called Swendsen-Wang cut).
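
One Swendsen-Wang iteration on a rectangular grid can be sketched with a union-find structure for the connected components. The parameterization is an assumption, since the slide formulas are not recoverable: spins take values in $\{-1, +1\}$, the model is $p(x) \propto \exp\left(J \sum_{(i,j)} x_i x_j\right)$ with a single coupling $J$ and free boundaries, which gives the bond probability $1 - \exp(-2J)$ for equal neighboring spins. The function name `swendsen_wang_step` is hypothetical.

```python
import math
import random

def swendsen_wang_step(spins, rows, cols, coupling, rng):
    """One clustering + swapping sweep; spins is a flat row-major list of -1/+1."""
    n = rows * cols
    parent = list(range(n))          # union-find forest over grid nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    p_bond = 1.0 - math.exp(-2.0 * coupling)
    # Clustering: bonds are forbidden between unequal spins; between equal
    # spins a bond is switched on with probability p_bond.
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    j = rr * cols + cc
                    if spins[i] == spins[j] and rng.random() < p_bond:
                        parent[find(i)] = find(j)
    # Swapping: each connected component gets a fresh spin, -1 or +1 with equal chance.
    new_value = {}
    for i in range(n):
        root = find(i)
        if root not in new_value:
            new_value[root] = rng.choice((-1, 1))
        spins[i] = new_value[root]
    return spins
```

Because a whole component flips at once, a single sweep can change large aligned regions, which single-site Gibbs updates accomplish only very slowly at low temperature.
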
46. Hamiltonian Monte Carlo
    - An MCMC method based on Hamiltonian dynamics. It was originally devised for molecular simulation.
    - In 1987, a seminal paper by Duane et al. unified MCMC and molecular dynamics. They called it Hybrid Monte Carlo, which abbreviates to HMC.
    - In many articles, people call it Hamiltonian Monte Carlo, as this name is considered to be more specific and informative, and it retains the same abbreviation "HMC".
47. Motivating Example: Free Fall
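48. Motivating Example: Free Fall
    - The change of momentum $p$ is caused by the accumulation/release of the potential energy: $\frac{dp}{dt} = -\frac{\partial U}{\partial x}$.
    - The change of location $x$ is caused by velocity, the derivative of the kinetic energy w.r.t. the momentum: $\frac{dx}{dt} = \frac{\partial K}{\partial p}$.
49. Hamiltonian Dynamics
    - Hamiltonian dynamics is a generalized theory of classical mechanics, which provides an elegant and flexible abstraction of a dynamic system in physics.
    - In Hamiltonian dynamics, a physical system is described by $(x, p)$, where $x_i$ and $p_i$ are respectively the position and momentum of the $i$-th entity.
50. Hamilton's Equations: The dynamics of the system is characterized by Hamilton's equations: $\frac{dx_i}{dt} = \frac{\partial H}{\partial p_i}$, $\frac{dp_i}{dt} = -\frac{\partial H}{\partial x_i}$. Here, $H(x, p)$ is called the Hamiltonian, which can be interpreted as the total energy of the system.
51. Hamilton's Equations (cont'd)
    - The Hamiltonian $H$ is often formulated as the sum of the potential energy $U(x)$ and the kinetic energy $K(p)$: $H(x, p) = U(x) + K(p)$.
    - With this setting, Hamilton's equations become: $\frac{dx_i}{dt} = \frac{\partial K}{\partial p_i}$, $\frac{dp_i}{dt} = -\frac{\partial U}{\partial x_i}$.
52. Conservation of Hamiltonian: The Hamiltonian is conserved, i.e., it is invariant over time: $\frac{dH}{dt} = \sum_i \left( \frac{\partial H}{\partial x_i} \frac{dx_i}{dt} + \frac{\partial H}{\partial p_i} \frac{dp_i}{dt} \right) = \sum_i \left( \frac{\partial H}{\partial x_i} \frac{\partial H}{\partial p_i} - \frac{\partial H}{\partial p_i} \frac{\partial H}{\partial x_i} \right) = 0$. Intuitively, this reflects the law of energy conservation.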
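53. Hamiltonian Reversibility
    - Hamiltonian dynamics is reversible.
    - Let the initial states be $(x_0, p_0)$ and the states at time $t$ be $(x_t, p_t)$. Then, if we reverse the process, starting at $(x_t, -p_t)$, the states at time $t$ would be $(x_0, -p_0)$.
    - In the context of MCMC, this leads to the reversibility of the underlying chain.
54. Simulation of Hamiltonian Dynamics: A natural idea to simulate Hamiltonian dynamics is to use Euler's method over discretized time steps: $p_i(t + \epsilon) = p_i(t) - \epsilon \frac{\partial U}{\partial x_i}(x(t))$, $x_i(t + \epsilon) = x_i(t) + \epsilon \frac{\partial K}{\partial p_i}(p(t))$. Is this a good method?
55. Leapfrog Method: Better results can be obtained with leapfrog: $p_i(t + \epsilon/2) = p_i(t) - \frac{\epsilon}{2} \frac{\partial U}{\partial x_i}(x(t))$, $x_i(t + \epsilon) = x_i(t) + \epsilon \frac{\partial K}{\partial p_i}(p(t + \epsilon/2))$, $p_i(t + \epsilon) = p_i(t + \epsilon/2) - \frac{\epsilon}{2} \frac{\partial U}{\partial x_i}(x(t + \epsilon))$. More importantly, the leapfrog update is reversible.
56. Leapfrog Method (cont'd)
57. Example: Consider a Hamiltonian system: […]. Write down Hamilton's equations: […]. Derive the solution: […].
58. Example (Simulation)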
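
The difference between Euler's method and leapfrog is easy to check numerically. The sketch below assumes the harmonic oscillator $H(x, p) = x^2/2 + p^2/2$ as the example system (an assumption; the lecture's own example is not recoverable from the extracted slides), so that $\partial U / \partial x = x$ and $\partial K / \partial p = p$.

```python
def euler_step(x, p, eps):
    """One Euler step: both updates use the values from the start of the step."""
    return x + eps * p, p - eps * x

def leapfrog_step(x, p, eps):
    """One leapfrog step: half step in p, full step in x, half step in p."""
    p_half = p - 0.5 * eps * x
    x_new = x + eps * p_half
    p_new = p_half - 0.5 * eps * x_new
    return x_new, p_new

def hamiltonian(x, p):
    """Total energy of the assumed system H = x^2/2 + p^2/2."""
    return 0.5 * x * x + 0.5 * p * p

def simulate(step, x, p, eps, n_steps):
    """Apply a one-step integrator n_steps times."""
    for _ in range(n_steps):
        x, p = step(x, p, eps)
    return x, p

h0 = hamiltonian(1.0, 0.0)
x_euler, p_euler = simulate(euler_step, 1.0, 0.0, 0.1, 1000)
x_leap, p_leap = simulate(leapfrog_step, 1.0, 0.0, 0.1, 1000)
# Reversibility check: negate the momentum and integrate for the same time.
x_back, p_back = simulate(leapfrog_step, x_leap, -p_leap, 0.1, 1000)
```

For this system each Euler step multiplies the energy by $1 + \epsilon^2$, so the trajectory spirals outward, whereas the leapfrog energy stays within a small band around `h0` and the reversed trajectory returns to the starting point up to rounding error.
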
59. Hamiltonian Monte Carlo: (Basic idea) Consider the potential energy as the Gibbs energy, and introduce the "momenta" as auxiliary variables to control the dynamics.
60. Hamiltonian Monte Carlo (cont'd): Suppose the target distribution is $\pi(x) \propto \exp(-U(x))$; then we form an augmented distribution as $\pi(x, p) \propto \exp(-H(x, p)) = \exp(-U(x) - K(p))$. Here, the locations $x$ represent the variables of interest, and the momenta $p$ control the dynamics of the simulation.
61. Hamiltonian Monte Carlo (cont'd): In practice, the kinetic energy is often formalized as $K(p) = \sum_i \frac{p_i^2}{2 m_i}$.
62. Hamiltonian Monte Carlo (Algorithm): Each iteration of HMC comprises two steps:
    - Gibbs update: sample the momenta $p$ from the Gaussian prior given by $p_i \sim \mathcal{N}(0, m_i)$.
63. Hamiltonian Monte Carlo (Algorithm)
    - Metropolis update: use Hamiltonian dynamics to propose a new state. Starting from $(x, p)$, simulate the dynamic system with the leapfrog method for $L$ steps with step size $\epsilon$, which yields $(x', p')$. The proposed state is accepted with probability $\min\left(1, \exp\left(H(x, p) - H(x', p')\right)\right)$.
64. HMC (Discussion)
    - If the simulation is exact, we will have $H(x', p') = H(x, p)$, and thus the proposed state should always be accepted.
    - In practice, there can be some deviation due to discretization, so we have to use the Metropolis rule to guarantee correctness.
65. HMC (Discussion)
    - HMC has a high acceptance rate while allowing large moves along less-constrained directions at each iteration.
    - This is a key advantage as compared to random-walk proposals, which, in order to maintain a reasonably high acceptance rate, have to keep a very small step size, resulting in substantial correlation between consecutive samples.
66. Tuning HMC
    - For efficient simulation, it is important to choose appropriate values for both the leapfrog step size $\epsilon$ and the number of leapfrog steps per iteration $L$.
    - Tuning HMC (and actually many generic sampling methods) often requires preliminary runs with different trial settings and different initial values, as well as careful analysis of the energy trajectories.
67. Tuning HMC (cont'd)
    - In most cases, $\epsilon$ and $L$ can be tuned independently.
    - Too small a step size wastes computation time, while too large a step size causes unstable simulation and thus a low acceptance rate.
    - One should choose $\epsilon$ such that the energy trajectory is stable and the acceptance rate is maintained at a reasonably high level.
    - One should choose $L$ such that back-and-forth movement of the states can be observed.
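
The two HMC steps (momentum refresh, then a leapfrog proposal with a Metropolis test) can be sketched as follows. The target, a two-dimensional Gaussian with standard deviations 1 and 3, the unit masses, and the values `eps = 0.3` and `n_steps = 15` are assumptions chosen for illustration; the names `leapfrog`, `hmc`, `u`, and `grad_u` are hypothetical.

```python
import math
import random

def leapfrog(x, p, grad_u, eps, n_steps):
    """Leapfrog trajectory with unit masses; returns the end point (x, p)."""
    x, p = list(x), list(p)
    p = [pi - 0.5 * eps * gi for pi, gi in zip(p, grad_u(x))]      # initial half step
    for step in range(n_steps):
        x = [xi + eps * pi for xi, pi in zip(x, p)]                # full position step
        scale = eps if step < n_steps - 1 else 0.5 * eps           # final half step
        p = [pi - scale * gi for pi, gi in zip(p, grad_u(x))]
    return x, p

def hmc(u, grad_u, x0, eps, n_steps, n_samples, seed=0):
    """Basic HMC with K(p) = sum(p_i^2) / 2; returns samples and acceptance rate."""
    rng = random.Random(seed)
    x = list(x0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        # Gibbs update: refresh the momenta from N(0, 1).
        p = [rng.gauss(0.0, 1.0) for _ in x]
        h_old = u(x) + 0.5 * sum(pi * pi for pi in p)
        # Metropolis update: propose via leapfrog, accept with min(1, exp(h_old - h_new)).
        x_new, p_new = leapfrog(x, p, grad_u, eps, n_steps)
        h_new = u(x_new) + 0.5 * sum(pi * pi for pi in p_new)
        if math.log(1.0 - rng.random()) < h_old - h_new:
            x, accepted = x_new, accepted + 1
        samples.append(list(x))
    return samples, accepted / n_samples

def u(x):
    """Potential energy of the assumed target: independent N(0, 1) and N(0, 9)."""
    return 0.5 * (x[0] ** 2 + x[1] ** 2 / 9.0)

def grad_u(x):
    """Gradient of u."""
    return [x[0], x[1] / 9.0]
```

A call such as `hmc(u, grad_u, [0.0, 0.0], 0.3, 15, 5000)` should give a high acceptance rate; raising `eps` toward the stability limit set by the narrowest direction lowers it, which is the trade-off described in the tuning notes above.
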
68. Generic Sampling Systems: A number of software systems are available for sampling from models specified by the user.
    - WinBUGS: based on BUGS (Bayesian inference Using Gibbs Sampling).
        - Provides a friendly language for users to specify the model
        - Runs only on Windows
        - Note: development stopped in 2007.
69. Generic Sampling Systems (cont'd)
    - JAGS: "Just Another Gibbs Sampler"
        - Cross-platform support
        - Uses a dialect of BUGS
        - Extensible: allows users to write customized functions, distributions, and samplers
70. Generic Sampling Systems (cont'd)
    - Stan: "Sampling Through Adaptive Neighborhoods"
        - Core written in C++, with ports available in Python, R, Matlab, and Julia
        - A user-friendly language for model specification
        - Uses Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) as core algorithms
        - Open source (GPLv3 licensed) and under active development on GitHub
71. Stan Example

    ```stan
    data {
      int<lower=0> N;
      vector[N] x;
      vector[N] y;
    }
    parameters {
      real alpha;
      real beta;
      real<lower=0> sigma;
    }
    model {
      for (n in 1:N)
        y[n] ~ normal(alpha + beta * x[n], sigma);
    }
    ```

72. Generic Sampling System vs. Dedicated Algorithms

    | Generic | Dedicated |
    | --- | --- |
    | Easy to use | Require knowledge and experience |
    | High productivity | Time-consuming to develop |
    | Slow | Often remarkably more efficient |
    | Limited flexibility | Necessary for many new models |