Lecture 3
Advanced Sampling Techniques
Dahua Lin
The Chinese University of Hong Kong
Overview
• Collapsed Gibbs Sampling
• Sampling with Auxiliary Variables
• Slice Sampling
• Simulated Tempering & Parallel Tempering
• Swendsen-Wang Algorithm
• Hamiltonian Monte Carlo
Collapsed Gibbs Sampling
Motivating Example
with
We want to sample from .
Gibbs Sampling
Draw  where :
with .
Gibbs Sampling (cont'd)
Draw :
• How well can this sampler perform when ?
Collapsed Gibbs Sampling
• Basic idea: replace the original conditional distribution with a conditional distribution of a marginal distribution, often called a reduced conditional distribution.
• For the example above, we consider a marginal distribution:
Collapsed Gibbs Sampling (cont'd)
• Draw , with  marginalized out, as:
• Draw 
• Can we exchange the order of these two steps? Why?
Basic Guidelines
• Order of steps matters!
• Generally, one can move components from "being sampled" to "being conditioned on".
• Replacing outputs with intermediates would change the stationary distribution.
• A variable can be updated multiple times in an iteration.
Why do collapsed samplers often perform better than full-fledged Gibbs samplers?
Rao-Blackwell Theorem
Consider an example $(x, y) \sim \pi(x, y)$ and we want to estimate $I = \mathbb{E}_\pi[h(x)]$. Suppose we have two tractable ways to do so:
(1) draw $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \sim \pi$, and compute
$\hat{I} = \frac{1}{n} \sum_{t=1}^{n} h(x^{(t)})$
Rao-Blackwell Theorem (cont'd)
(2) draw $y^{(1)}, \ldots, y^{(n)} \sim \pi_y$, where $\pi_y$ is the marginal distribution, and compute
$\tilde{I} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}[h(x) \mid y^{(t)}]$
• Both are correct. By the Strong LLN, both $\hat{I}$ and $\tilde{I}$ converge to $I$ almost surely.
• Which one is better? Can you justify your answer?
Rao-Blackwell Theorem (cont'd)
• (Rao-Blackwell Theorem) Sample variance will be reduced when some components are marginalized out. With the setting above, we have
$\mathrm{var}(\mathbb{E}[h(x) \mid y]) \le \mathrm{var}(h(x)), \quad \text{and hence} \quad \mathrm{var}(\tilde{I}) \le \mathrm{var}(\hat{I})$
• Generally, reducing sample variance would also lead to the reduction of the autocorrelation of the chain, thus improving the mixing performance. A numerical illustration is sketched below.
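To make this concrete, here is a minimal sketch in Python. The setup is an illustrative assumption, not from the lecture: $y \sim \mathcal{N}(0, 1)$, $x \mid y \sim \mathcal{N}(\rho y, 1 - \rho^2)$, with $h(x) = x^2$, so that $I = \mathbb{E}[x^2] = 1$ and the conditional expectation $\mathbb{E}[x^2 \mid y] = \rho^2 y^2 + (1 - \rho^2)$ is available in closed form.

# Sketch (toy setup assumed above): compare the plain estimator I_hat
# with the Rao-Blackwellized estimator I_tilde over repeated runs.
import numpy as np

rng = np.random.default_rng(0)
rho, n, runs = 0.9, 1000, 500

est_plain, est_rb = [], []
for _ in range(runs):
    y = rng.standard_normal(n)
    x = rho * y + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    est_plain.append(np.mean(x**2))                       # I_hat: average of h(x)
    est_rb.append(np.mean(rho**2 * y**2 + (1 - rho**2)))  # I_tilde: average of E[h(x)|y]

print("var(plain) =", np.var(est_plain))
print("var(RB)    =", np.var(est_rb))

Across repeated runs, the Rao-Blackwellized estimator exhibits a visibly smaller variance, consistent with the theorem.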
Sampling with Auxiliary Variables
• The Rao-Blackwell Theorem suggests that, in order to achieve better performance, one should try to marginalize out as many components as possible.
• However, in many cases, one may want to do the opposite, that is, to introduce additional variables to facilitate the simulation.
• For example, when the target distribution is multimodal, one may use an auxiliary variable to help the chain escape from local traps.
Use Auxiliary Variables
• Specify an auxiliary variable $u$ and the joint distribution $p(x, u)$ such that its marginal over $u$ recovers the target $\pi(x)$.
• Design a chain to update $(x, u)$ using the M-H algorithm or the Gibbs sampler.
• The samples of $x$ can then be obtained through marginalization or conditioning.
Slice Sampling
Slice Sampler
• Sampling $x \sim \pi(x)$ is equivalent to sampling uniformly from the area under $\pi$: $\{(x, u) : 0 \le u \le \pi(x)\}$.
• Gibbs sampling based on the uniform distribution over $\{(x, u) : 0 \le u \le \pi(x)\}$. Each iteration consists of two steps (see the sketch below):
• Given $x$, draw $u \sim \mathrm{Uniform}(0, \pi(x))$.
• Given $u$, draw $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$.
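A minimal sketch in Python, assuming a target where the slice has a closed form (this choice is mine, not from the lecture): for the unnormalized standard normal $f(x) = e^{-x^2/2}$, the slice $\{x : f(x) \ge u\}$ is simply the interval $[-\sqrt{-2\log u},\, \sqrt{-2\log u}]$.

# Sketch: slice sampling the standard normal via its closed-form slice.
import numpy as np

def slice_sample(n_samples, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    f = lambda x: np.exp(-0.5 * x * x)    # unnormalized target density
    x, out = x0, []
    for _ in range(n_samples):
        u = rng.uniform(0.0, f(x))        # vertical step: u | x ~ Uniform(0, f(x))
        half = np.sqrt(-2.0 * np.log(u))  # slice is the interval [-half, half]
        x = rng.uniform(-half, half)      # horizontal step: x | u ~ Uniform(slice)
        out.append(x)
    return np.array(out)

samples = slice_sample(10000)
print(samples.mean(), samples.std())      # approximately 0 and 1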
Slice Sampler (Illustration)
Slice Sampler (Discussion)
• The slice sampler can mix very rapidly, as it will not be locally trapped.
• The slice sampler is often nontrivial to implement in practice. Drawing $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$ is sometimes very difficult.
• For distributions of certain forms, where there is an easy way to draw $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$, slice sampling is a good strategy.
Simulated Tempering
Gibbs Measure
A Gibbs measure is a probability measure with a density in the following form:
$p(x) = \frac{1}{Z(\beta)} \exp(-\beta E(x))$
Here, $E(x)$ is called the energy function, $\beta$ is called the inverse temperature, and the normalizing constant $Z(\beta)$ depends on $\beta$.
Gibbs Measure (cont'd)
In the literature of MCMC sampling, we often parameterize a Gibbs measure using the temperature parameter $T = 1/\beta$, thus
$p_T(x) = \frac{1}{Z(T)} \exp(-E(x)/T)$.
Tempered MCMC
Typical MCMC methods usually rely on local moves to explore the state space. What is the problem?
Tempered MCMC (cont'd)
Local traps often lead to very poor mixing. Can we improve this?
Simulated Tempering
Suppose we intend to sample from a Gibbs distribution $\pi(x) \propto \exp(-E(x))$.
Basic idea: Augment the target distribution by including a temperature index $i \in \{1, \ldots, L\}$, with joint distribution given by
$p(x, i) \propto c_i \exp(-E(x)/T_i)$
Simulated Tempering (cont'd)
• We only collect samples at the lowest temperature, $T_1$.
• The chain mixes much faster at high temperatures, but we want to collect samples at the lowest temperature. So we have to constantly switch between temperatures.
Simulated Tempering (Algorithm)
One iteration of Simulated Tempering has two steps (a sketch is given below):
• (Base transition): update $x$ at the same temperature, i.e., holding $i$ fixed.
• (Temperature switching): with $x$ fixed, propose $i \to i'$ with proposal probability $q(i' \mid i)$ such that $|i - i'| = 1$.
• Accept the change with probability $\min\left(1, \frac{c_{i'} \exp(-E(x)/T_{i'})\, q(i \mid i')}{c_i \exp(-E(x)/T_i)\, q(i' \mid i)}\right)$.
• Any drawbacks?
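A minimal sketch in Python. The toy target (a double-well energy with a high barrier), the ladder, and the choice $c_i = 1$ are illustrative assumptions; as discussed on the next slides, the weights should ideally be $c_i \approx 1/Z(T_i)$.

# Sketch: simulated tempering on E(x) = 8(x^2 - 1)^2, symmetric level proposals.
import numpy as np

rng = np.random.default_rng(0)
E = lambda x: 8.0 * (x * x - 1.0) ** 2   # double-well energy, barrier ~ exp(-8) at T=1
T = np.array([1.0, 2.0, 4.0, 8.0])       # temperature ladder, T[0] is the base level
c = np.ones(len(T))                       # pseudo-priors; ideally c[i] ~ 1/Z(T[i])

x, i = 1.0, 0
base_samples = []
for step in range(50000):
    # (Base transition) random-walk Metropolis at the current temperature
    x_new = x + 0.5 * rng.standard_normal()
    if np.log(rng.random()) < (E(x) - E(x_new)) / T[i]:
        x = x_new
    # (Temperature switching) propose a neighboring level; q is symmetric here
    j = i + (1 if rng.random() < 0.5 else -1)
    if 0 <= j < len(T):
        log_r = (np.log(c[j]) - E(x) / T[j]) - (np.log(c[i]) - E(x) / T[i])
        if np.log(rng.random()) < log_r:
            i = j
    if i == 0:
        base_samples.append(x)            # keep samples only at the lowest temperature

print("fraction of time at base level:", len(base_samples) / 50000)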
Simulated Tempering (Discussion)
• Set $T_1 = 1$. Given $T_i$, we should set $T_{i+1}$ such that uphill moves (from $T_i$ to $T_{i+1}$) have a considerable probability of being accepted.
• Build the temperature ladder step by step until we have a sufficiently smooth distribution at the top.
• The time spent on the base level $T_1$ is around $1/L$ of the run. If we have too many levels, only a very small portion of samples can be used.
Simulated Tempering (Discussion)
• All temperature levels play an important role. So it is desirable to spend a comparable amount of time at each level. Setting $c_i = 1/Z(T_i)$ for each $i$, we have $p(i) = 1/L$ for every level.
• The normalizing constants $Z(T_i)$ are typically unknown, and estimating them is very difficult and expensive.
Parallel Tempering
(Basic idea) Rather than jumping between temperatures, it simultaneously simulates multiple chains, each at a temperature level $T_i$, called a replica, and constantly swaps samples between replicas.
Parallel Tempering (Algorithm)
Each iteration consists of the following steps:
• (Parallel update): simulate each replica with its own transition kernel.
• (Replica exchange): propose to swap states between two replicas (say the $i$-th and $j$-th, where $|i - j| = 1$):
Parallel Tempering (Algorithm)
• The proposal is accepted with probability $\min(1, r)$, where
$r = \exp\left( (E(x_i) - E(x_j)) \left( \frac{1}{T_i} - \frac{1}{T_j} \right) \right)$
• We collect samples from the base replica (the one with $T = 1$).
• Why does this algorithm produce the desired distribution? A sketch of the full procedure is given below.
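A minimal sketch in Python, reusing the same toy double-well energy as in the simulated tempering sketch (again an illustrative assumption, not from the lecture):

# Sketch: parallel tempering with one replica per temperature and adjacent swaps.
import numpy as np

rng = np.random.default_rng(0)
E = lambda x: 8.0 * (x * x - 1.0) ** 2   # double-well energy
T = np.array([1.0, 2.0, 4.0, 8.0])       # one replica per temperature
x = np.ones(len(T))                       # current state of each replica
base_samples = []

for step in range(50000):
    # (Parallel update) one random-walk Metropolis move per replica
    prop = x + 0.5 * rng.standard_normal(len(T))
    accept = np.log(rng.random(len(T))) < (E(x) - E(prop)) / T
    x[accept] = prop[accept]
    # (Replica exchange) propose to swap a random adjacent pair (i, i+1)
    i = rng.integers(len(T) - 1)
    log_r = (E(x[i]) - E(x[i + 1])) * (1.0 / T[i] - 1.0 / T[i + 1])
    if np.log(rng.random()) < log_r:
        x[i], x[i + 1] = x[i + 1], x[i]
    base_samples.append(x[0])             # collect from the base replica (T = 1)

print("base-replica mean:", np.mean(base_samples))  # ~0: both wells are visited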
Parallel Tempering (Justification)
Let $\pi_i(x) \propto \exp(-E(x)/T_i)$. We define the product distribution over all replicas:
$\pi(x_1, \ldots, x_L) = \prod_{i=1}^{L} \pi_i(x_i)$
Obviously, the step of parallel update preserves the invariant distribution $\pi$.
Parallel Tempering (Justification)
Note that the step of replica exchange is symmetric, i.e., the probabilities of proposing to go up and down are equal; then, according to the Metropolis algorithm, we accept the swap with probability $\min(1, r)$ with
$r = \frac{\pi_i(x_j)\, \pi_j(x_i)}{\pi_i(x_i)\, \pi_j(x_j)} = \exp\left( (E(x_i) - E(x_j)) \left( \frac{1}{T_i} - \frac{1}{T_j} \right) \right)$
Parallel Tempering (Discussion)
• It is efficient and very easy to implement, especially in a parallel computing environment.
• Tuning a parallel tempering system is often an art rather than a technique.
• Parallel tempering is a special case of a large family of MCMC methods called Extended Ensemble Monte Carlo, which involves a collection of parallel Markov chains, with the simulation switching between them.
Swendsen-Wang Algorithm
The Swendsen-Wang algorithm (R. Swendsen and J. Wang, 1987) is an efficient Gibbs sampling algorithm for sampling from the extended Ising model.
Standard Ising Model
The standard Ising model is defined as
$p(x) = \frac{1}{Z} \exp\left( \beta \sum_{(i,j) \in E} x_i x_j \right)$
where $x_i \in \{-1, +1\}$ for each node $i$ is called a spin, and $E$ is the set of edges between neighboring nodes.
• Gibbs sampling is extremely slow, especially when the temperature is low.
Extended Ising Model
• We extend the model by introducing additional bond variables $u_{ij}$, one for each edge. Each bond has two states: $1$ indicating connected and $0$ indicating disconnected.
• We define a joint distribution that couples the spins and bonds:
$p(x, u) \propto \prod_{(i,j) \in E} \psi(x_i, x_j, u_{ij})$
Extended Ising Model (cont'd)
Here, $\psi(x_i, x_j, u_{ij})$ is described as below:
• When $u_{ij} = 0$, $\psi = e^{-\beta}$ for every setting of $(x_i, x_j)$.
• When $u_{ij} = 1$, $\psi = (e^{\beta} - e^{-\beta})\, \mathbf{1}(x_i = x_j)$.
Summing over $u_{ij}$ recovers $e^{\beta x_i x_j}$, so the marginal over the spins is exactly the original Ising model.
Extended Ising Model (cont'd)
With this setting, $p(u \mid x)$ can be written as a product of independent per-edge conditionals:
$p(u \mid x) = \prod_{(i,j) \in E} p(u_{ij} \mid x_i, x_j)$
where, for each edge $(i, j)$:
• when $x_i \neq x_j$, $u_{ij}$ must be $0$;
• when $x_i = x_j$, $u_{ij}$ is set to zero with probability $e^{-2\beta}$.
Swendsen-Wang Algorithm
Each iteration consists of two steps:
• (Clustering): conditioned on the spins $x$, draw the bonds $u$ independently. For an edge $(i, j)$:
• If $x_i \neq x_j$, set $u_{ij} = 0$.
• If $x_i = x_j$, set $u_{ij} = 1$ with probability $1 - e^{-2\beta}$, or $u_{ij} = 0$ otherwise.
Swendsen-Wang Algorithm
• (Swapping): conditioned on the bonds $u$, draw the spins $x$.
• For each connected component, draw $+1$ or $-1$ with equal chance, and assign the resultant value to all nodes in the component. (A sketch of the full algorithm follows.)
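A minimal sketch in Python for an $n \times n$ grid Ising model; the grid size, $\beta$, and the union-find bookkeeping are implementation choices of this sketch, not prescribed by the lecture.

# Sketch: one Swendsen-Wang iteration = clustering (bonds) + swapping (spins).
import numpy as np

def sw_step(spins, beta, rng):
    n = spins.shape[0]
    parent = np.arange(n * n)                 # union-find over grid sites

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]     # path halving
            a = parent[a]
        return a

    p_bond = 1.0 - np.exp(-2.0 * beta)        # P(u_ij = 1 | x_i = x_j)
    # (Clustering) bonds can only open between equal neighboring spins
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 1), (1, 0)):   # right and down neighbors
                r2, c2 = r + dr, c + dc
                if r2 < n and c2 < n and spins[r, c] == spins[r2, c2]:
                    if rng.random() < p_bond:
                        parent[find(r * n + c)] = find(r2 * n + c2)
    # (Swapping) assign a fresh uniform spin to every connected component
    roots = np.array([find(k) for k in range(n * n)])
    new_spin = {root: rng.choice((-1, 1)) for root in np.unique(roots)}
    return np.array([new_spin[root] for root in roots]).reshape(n, n)

rng = np.random.default_rng(0)
spins = rng.choice((-1, 1), size=(16, 16))
for _ in range(100):
    spins = sw_step(spins, beta=0.6, rng=rng)
print("mean magnetization:", spins.mean())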
Swendsen-Wang Algorithm (Illustration)
In the case of a rectangular grid, this Gibbs sampling algorithm mixes very rapidly. The figures (omitted here) illustrate the procedure: spin states up and down are shown by filled and empty circles, and bond states 1 and 0 by thick lines and thin dotted lines. Starting from a state with five connected components (isolated spins count as connected components, albeit of size 1), we first update the bonds: bonds are forbidden wherever the two adjacent spins are in opposite states, and the bonds that are not forbidden are set to the 1 state with probability $p$. We then update the spins component by component, and update the bonds again.
Note that the partition function $Z$ of the extended model is the same as that of the original Ising model.
Swendsen-Wang Algorithm (Discussion)
• When $\beta$ is large, $u_{ij}$ has a high probability of being set to one, i.e., $x_i$ and $x_j$ are likely to be connected.
• Experiments show that the Swendsen-Wang algorithm mixes very rapidly, especially for rectangular grids.
• Can you provide an intuitive explanation?
Swendsen-Wang Algorithm (Discussion)
• The Swendsen-Wang algorithm can be generalized to Potts models (nodes can take values from a finite set).
• The Swendsen-Wang algorithm has been widely used in image analysis applications, e.g., image segmentation (in this case, it is called Swendsen-Wang cut).
Hamiltonian Monte Carlo
• An MCMC method based on Hamiltonian Dynamics. It was originally devised for molecular simulation.
• In 1987, a seminal paper by Duane et al. unified MCMC and molecular dynamics. They called it Hybrid Monte Carlo, which abbreviates to HMC.
• In many articles, people call it Hamiltonian Monte Carlo, as this name is considered to be more specific and informative, and it retains the same abbreviation "HMC".
Motivating Example: Free Fall
Motivating Example: Free Fall
• The change of momentum $p$ is caused by the accumulation/release of the potential energy:
$\frac{dp}{dt} = -\frac{\partial U}{\partial q}$
• The change of location $q$ is caused by the velocity, the derivative of the kinetic energy w.r.t. the momentum:
$\frac{dq}{dt} = \frac{\partial K}{\partial p}$
Hamiltonian Dynamics
• Hamiltonian Dynamics is a generalized theory of classical mechanics, which provides an elegant and flexible abstraction of a dynamic system in physics.
• In Hamiltonian Dynamics, a physical system is described by $(q_i, p_i)_{i=1}^{d}$, where $q_i$ and $p_i$ are respectively the position and momentum of the $i$-th entity.
Hamilton's Equations
The dynamics of the system is characterized by Hamilton's Equations:
$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}$
Here, $H(q, p)$ is called the Hamiltonian, which can be interpreted as the total energy of the system.
Hamilton's Equations (cont'd)
• The Hamiltonian $H$ is often formulated as the sum of the potential energy $U(q)$ and the kinetic energy $K(p)$:
$H(q, p) = U(q) + K(p)$
• With this setting, Hamilton's Equations become:
$\frac{dq_i}{dt} = \frac{\partial K}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial U}{\partial q_i}$
Conservation of the Hamiltonian
The Hamiltonian is conserved, i.e., it is invariant over time:
$\frac{dH}{dt} = \sum_i \left( \frac{\partial H}{\partial q_i} \frac{dq_i}{dt} + \frac{\partial H}{\partial p_i} \frac{dp_i}{dt} \right) = \sum_i \left( \frac{\partial H}{\partial q_i} \frac{\partial H}{\partial p_i} - \frac{\partial H}{\partial p_i} \frac{\partial H}{\partial q_i} \right) = 0$
Intuitively, this reflects the law of energy conservation.
Hamiltonian Reversibility
• Hamiltonian dynamics is reversible.
• Let the initial states be $(q(0), p(0))$ and the states at time $t$ be $(q(t), p(t))$. Then, if we reverse the process, starting at $(q(t), -p(t))$, the states after time $t$ would be $(q(0), -p(0))$.
• In the context of MCMC, this leads to the reversibility of the underlying chain.
Simulation of Hamiltonian Dynamics
A natural idea to simulate Hamiltonian dynamics is to use Euler's method over discretized time steps:
$p(t + \epsilon) = p(t) - \epsilon\, \frac{\partial U}{\partial q}(q(t)), \qquad q(t + \epsilon) = q(t) + \epsilon\, \frac{\partial K}{\partial p}(p(t))$
Is this a good method?
Leapfrog Method
Better results can be obtained with the leapfrog method:
$p(t + \epsilon/2) = p(t) - \frac{\epsilon}{2}\, \frac{\partial U}{\partial q}(q(t))$
$q(t + \epsilon) = q(t) + \epsilon\, \frac{\partial K}{\partial p}(p(t + \epsilon/2))$
$p(t + \epsilon) = p(t + \epsilon/2) - \frac{\epsilon}{2}\, \frac{\partial U}{\partial q}(q(t + \epsilon))$
More importantly, the leapfrog update is reversible. A sketch in code follows.
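A minimal sketch in Python, assuming the common choice $K(p) = p^T p / 2$ (so $\partial K / \partial p = p$); the harmonic-oscillator test at the end is an illustrative assumption.

# Sketch: leapfrog integration of Hamilton's equations for L steps of size eps.
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    q, p = np.copy(q), np.copy(p)
    p -= 0.5 * eps * grad_U(q)        # initial half step for the momentum
    for _ in range(L - 1):
        q += eps * p                   # full step for the position
        p -= eps * grad_U(q)           # full step for the momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)         # final half step for the momentum
    return q, p

# Harmonic oscillator U(q) = q^2 / 2: the true orbit is a circle in (q, p)
q, p = leapfrog(np.array([1.0]), np.array([0.0]), lambda q: q, eps=0.1, L=63)
print(q, p, 0.5 * (q**2 + p**2))       # H stays close to its initial value 0.5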
Leapfrog Method (cont'd)
Example
Consider a Hamiltonian system:
$H(q, p) = \frac{q^2}{2} + \frac{p^2}{2}$
Write down Hamilton's Equations:
$\frac{dq}{dt} = p, \qquad \frac{dp}{dt} = -q$
Derive the solution:
$q(t) = r \cos(a + t), \qquad p(t) = -r \sin(a + t)$
Example (Simulation)
Hamiltonian Monte Carlo
(Basic idea): Consider the potential energy as the Gibbs energy, and introduce the "momenta" as auxiliary variables to control the dynamics.
Hamiltonian Monte Carlo (cont'd)
Suppose the target distribution is $\pi(q) \propto \exp(-U(q))$; then we form an augmented distribution as
$p(q, p) \propto \exp(-U(q) - K(p)) = \exp(-H(q, p))$
Here, the locations $q$ represent the variables of interest, and the momenta $p$ control the dynamics of the simulation.
Hamiltonian Monte Carlo (cont'd)
In practice, the kinetic energy is often formalized as
$K(p) = \frac{1}{2} p^{T} M^{-1} p = \sum_i \frac{p_i^2}{2 m_i} \quad \text{(for a diagonal mass matrix } M\text{)}$
Hamiltonian Monte Carlo (Algorithm)
Each iteration of HMC comprises two steps:
• Gibbs update: sample the momenta $p$ from the Gaussian prior given by $p \sim \mathcal{N}(0, M)$.
Hamiltonian Monte Carlo (Algorithm)
• Metropolis update: use Hamiltonian dynamics to propose a new state. Starting from $(q, p)$, simulate the dynamic system with the leapfrog method for $L$ steps with step size $\epsilon$, which yields $(q^*, p^*)$. The proposed state is accepted with probability:
$\min\left(1, \exp\left(H(q, p) - H(q^*, p^*)\right)\right)$
A sketch of the full algorithm is given below.
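A minimal sketch in Python with an identity mass matrix $M = I$ (so $K(p) = p^T p / 2$); the 2-D Gaussian target at the end is an illustrative assumption.

# Sketch: HMC = Gibbs update of momenta + leapfrog proposal + Metropolis test.
import numpy as np

def hmc_sample(U, grad_U, q0, eps, L, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    q = np.copy(q0)
    samples = []
    for _ in range(n_iter):
        p = rng.standard_normal(q.shape)       # Gibbs update: p ~ N(0, I)
        q_new, p_new = np.copy(q), np.copy(p)
        # leapfrog trajectory of L steps with step size eps
        p_new -= 0.5 * eps * grad_U(q_new)
        for _ in range(L - 1):
            q_new += eps * p_new
            p_new -= eps * grad_U(q_new)
        q_new += eps * p_new
        p_new -= 0.5 * eps * grad_U(q_new)
        # Metropolis update: accept with probability min(1, exp(H - H_new))
        H = U(q) + 0.5 * p @ p
        H_new = U(q_new) + 0.5 * p_new @ p_new
        if np.log(rng.random()) < H - H_new:
            q = q_new
        samples.append(np.copy(q))
    return np.array(samples)

# Example: standard 2-D Gaussian, U(q) = ||q||^2 / 2, grad_U(q) = q
samples = hmc_sample(lambda q: 0.5 * q @ q, lambda q: q,
                     np.zeros(2), eps=0.2, L=20, n_iter=5000)
print(samples.mean(axis=0), samples.std(axis=0))   # ~0 and ~1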
HMC (Discussion)
• If the simulation were exact, we would have $H(q^*, p^*) = H(q, p)$, and thus the proposed state would always be accepted.
• In practice, there can be some deviation due to discretization, so we have to use the Metropolis rule to guarantee correctness.
HMC (Discussion)
• HMC has a high acceptance rate while allowing large moves along less-constrained directions at each iteration.
• This is a key advantage compared to random-walk proposals, which, in order to maintain a reasonably high acceptance rate, have to keep a very small step size, resulting in substantial correlation between consecutive samples.
Tuning HMC
• For efficient simulation, it is important to choose appropriate values for both the leapfrog step size $\epsilon$ and the number of leapfrog steps per iteration $L$.
• Tuning HMC (and indeed many generic sampling methods) often requires preliminary runs with different trial settings and different initial values, as well as careful analysis of the energy trajectories.
Tuning HMC (cont'd)
• In most cases, $\epsilon$ and $L$ can be tuned independently.
• Too small a step size would waste computation time, while too large a step size would cause unstable simulation and thus a low acceptance rate.
• One should choose $\epsilon$ such that the energy trajectory is stable and the acceptance rate is maintained at a reasonably high level.
• One should choose $L$ such that back-and-forth movement of the states can be observed.
Generic Sampling Systems
A number of software systems are available for sampling from models specified by the user.
• WinBUGS: based on BUGS (Bayesian inference Using Gibbs Sampling).
• Provides a friendly language for users to specify the model.
• Runs only on Windows.
• Note: development stopped in 2007.
Generic Sampling Systems (cont'd)
• JAGS: "Just Another Gibbs Sampler"
• Cross-platform support
• Uses a dialect of BUGS
• Extensible: allows users to write customized functions, distributions, and samplers
Generic Sampling Systems (cont'd)
• Stan: "Sampling Through Adaptive Neighborhoods"
• Core written in C++, with interfaces available for Python, R, Matlab, and Julia
• A user-friendly language for model specification
• Uses Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) as core algorithms
• Open source (GPLv3 licensed) and under active development on GitHub
Stan Example
data {
  int<lower=0> N;        // number of observations
  vector[N] x;           // predictor
  vector[N] y;           // response
}
parameters {
  real alpha;            // intercept
  real beta;             // slope
  real<lower=0> sigma;   // noise scale
}
model {
  // linear regression likelihood: y[n] ~ N(alpha + beta * x[n], sigma)
  for (n in 1:N)
    y[n] ~ normal(alpha + beta * x[n], sigma);
}
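A hedged sketch of fitting the model above from Python via CmdStanPy, one of the interfaces mentioned earlier; the file name "linreg.stan" and the synthetic data are assumptions for illustration.

# Sketch: run the Stan model above with CmdStanPy (Stan samples with NUTS/HMC).
import numpy as np
from cmdstanpy import CmdStanModel

N = 100
x = np.random.randn(N)
y = 1.0 + 2.0 * x + 0.5 * np.random.randn(N)   # synthetic data: alpha=1, beta=2

model = CmdStanModel(stan_file="linreg.stan")   # compile the model
fit = model.sample(data={"N": N, "x": x, "y": y})
print(fit.summary())                            # posterior summaries for alpha, beta, sigma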
Generic Sampling System vs. Dedicated Algorithms

Generic             | Dedicated
--------------------|-----------------------------------
Easy to use         | Requires knowledge and experience
High productivity   | Time-consuming to develop
Slow                | Often remarkably more efficient
Limited flexibility | Necessary for many new models
