Lecture 3
Advanced Sampling Techniques
Dahua Lin
The Chinese University of Hong Kong
Overview
• Collapsed Gibbs Sampling
• Sampling with Auxiliary Variables
• Slice Sampling
• Simulated Tempering & Parallel Tempering
• Swendsen-Wang Algorithm
• Hamiltonian Monte Carlo
Collapsed Gibbs Sampling
Motivating Example
with
We want to sample from .
Gibbs Sampling
Draw  where :
with .
Gibbs Sampling (cont'd)
Draw :
• How well can this sampler perform when ?
Collapsed Gibbs Sampling
• Basic idea: replace the original conditional distribution with a conditional distribution of a marginal distribution, often called a reduced conditional distribution.
• For the example above, we consider a marginal distribution:
Collapsed Gibbs Sampling (cont'd)
• Draw , with  marginalized out, as:
• Draw 
• Can we exchange the order of these two steps? Why?
Basic Guidelines
• Order of steps matters!
• Generally, one can move components from "being sampled" to "being conditioned on".
• Replacing outputs with intermediates would change the stationary distribution.
• A variable can be updated multiple times in an iteration.
Why do collapsed samplers often perform better than full-fledged Gibbs samplers?
Rao-Blackwell Theorem
Consider an example $(x, y) \sim \pi(x, y)$ and we want to estimate $I = \mathbb{E}_\pi[h(x)]$. Suppose we have two tractable ways to do so:
(1) draw $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \sim \pi$, and compute
$\hat{I} = \frac{1}{n} \sum_{t=1}^{n} h(x^{(t)})$
Rao-Blackwell Theorem (cont'd)
(2) draw $y^{(1)}, \ldots, y^{(n)} \sim \pi_y$, where $\pi_y$ is the marginal distribution, and compute
$\tilde{I} = \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}[h(x) \mid y^{(t)}]$
• Both are correct. By the Strong LLN, both $\hat{I}$ and $\tilde{I}$ converge to $I$ almost surely.
• Which one is better? Can you justify your answer?
Rao-Blackwell Theorem (cont'd)
• (Rao-Blackwell Theorem) Sample variance will be reduced when some components are marginalized out. With the setting above, we have
$\mathrm{var}(\mathbb{E}[h(x) \mid y]) \le \mathrm{var}(h(x)), \quad \text{and hence} \quad \mathrm{var}(\tilde{I}) \le \mathrm{var}(\hat{I})$
• Generally, reducing sample variance would also lead to the reduction of the autocorrelation of the chain, thus improving the mixing performance. A numerical illustration is sketched below.
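To make this concrete, here is a minimal sketch in Python. The setup is an illustrative assumption, not from the lecture: $y \sim \mathcal{N}(0, 1)$, $x \mid y \sim \mathcal{N}(\rho y, 1 - \rho^2)$, with $h(x) = x^2$, so that $I = \mathbb{E}[x^2] = 1$ and the conditional expectation $\mathbb{E}[x^2 \mid y] = \rho^2 y^2 + (1 - \rho^2)$ is available in closed form.

# Sketch (toy setup assumed above): compare the plain estimator I_hat
# with the Rao-Blackwellized estimator I_tilde over repeated runs.
import numpy as np

rng = np.random.default_rng(0)
rho, n, runs = 0.9, 1000, 500

est_plain, est_rb = [], []
for _ in range(runs):
    y = rng.standard_normal(n)
    x = rho * y + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    est_plain.append(np.mean(x**2))                       # I_hat: average of h(x)
    est_rb.append(np.mean(rho**2 * y**2 + (1 - rho**2)))  # I_tilde: average of E[h(x)|y]

print("var(plain) =", np.var(est_plain))
print("var(RB)    =", np.var(est_rb))

Across repeated runs, the Rao-Blackwellized estimator exhibits a visibly smaller variance, consistent with the theorem.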
Sampling with Auxiliary Variables
• The Rao-Blackwell Theorem suggests that, in order to achieve better performance, one should try to marginalize out as many components as possible.
• However, in many cases, one may want to do the opposite, that is, to introduce additional variables to facilitate the simulation.
• For example, when the target distribution is multimodal, one may use an auxiliary variable to help the chain escape from local traps.
Use Auxiliary Variables
• Specify an auxiliary variable $u$ and the joint distribution $p(x, u)$ such that its marginal over $u$ recovers the target $\pi(x)$.
• Design a chain to update $(x, u)$ using the M-H algorithm or the Gibbs sampler.
• The samples of $x$ can then be obtained through marginalization or conditioning.
Slice Sampling
Slice Sampler
• Sampling $x \sim \pi(x)$ is equivalent to sampling uniformly from the area under $\pi$: $\{(x, u) : 0 \le u \le \pi(x)\}$.
• Gibbs sampling based on the uniform distribution over $\{(x, u) : 0 \le u \le \pi(x)\}$. Each iteration consists of two steps (see the sketch below):
• Given $x$, draw $u \sim \mathrm{Uniform}(0, \pi(x))$.
• Given $u$, draw $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$.
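A minimal sketch in Python, assuming a target where the slice has a closed form (this choice is mine, not from the lecture): for the unnormalized standard normal $f(x) = e^{-x^2/2}$, the slice $\{x : f(x) \ge u\}$ is simply the interval $[-\sqrt{-2\log u},\, \sqrt{-2\log u}]$.

# Sketch: slice sampling the standard normal via its closed-form slice.
import numpy as np

def slice_sample(n_samples, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    f = lambda x: np.exp(-0.5 * x * x)    # unnormalized target density
    x, out = x0, []
    for _ in range(n_samples):
        u = rng.uniform(0.0, f(x))        # vertical step: u | x ~ Uniform(0, f(x))
        half = np.sqrt(-2.0 * np.log(u))  # slice is the interval [-half, half]
        x = rng.uniform(-half, half)      # horizontal step: x | u ~ Uniform(slice)
        out.append(x)
    return np.array(out)

samples = slice_sample(10000)
print(samples.mean(), samples.std())      # approximately 0 and 1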
Slice Sampler (Illustration)
Slice Sampler (Discussion)
• The slice sampler can mix very rapidly, as it will not be locally trapped.
• The slice sampler is often nontrivial to implement in practice. Drawing $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$ is sometimes very difficult.
• For distributions of certain forms, where there is an easy way to draw $x \sim \mathrm{Uniform}(\{x : \pi(x) \ge u\})$, slice sampling is a good strategy.
Simulated Tempering
Gibbs Measure
A Gibbs measure is a probability measure with a density in the following form:
$p(x) = \frac{1}{Z(\beta)} \exp(-\beta E(x))$
Here, $E(x)$ is called the energy function, $\beta$ is called the inverse temperature, and the normalizing constant $Z(\beta)$ depends on $\beta$.
Gibbs Measure (cont'd)
In the literature of MCMC sampling, we often parameterize a Gibbs measure using the temperature parameter $T = 1/\beta$, thus
$p_T(x) = \frac{1}{Z(T)} \exp(-E(x)/T)$.
Tempered MCMC
Typical MCMC methods usually rely on local moves to explore the state space. What is the problem?
Tempered MCMC (cont'd)
Local traps often lead to very poor mixing. Can we improve this?
Simulated Tempering
Suppose we intend to sample from a Gibbs distribution $\pi(x) \propto \exp(-E(x))$.
Basic idea: Augment the target distribution by including a temperature index $i \in \{1, \ldots, L\}$, with joint distribution given by
$p(x, i) \propto c_i \exp(-E(x)/T_i)$
Simulated Tempering (cont'd)
• We only collect samples at the lowest temperature, $T_1$.
• The chain mixes much faster at high temperatures, but we want to collect samples at the lowest temperature. So we have to constantly switch between temperatures.
Simulated Tempering (Algorithm)
One iteration of Simulated Tempering has two steps (a sketch is given below):
• (Base transition): update $x$ at the same temperature, i.e., holding $i$ fixed.
• (Temperature switching): with $x$ fixed, propose $i \to i'$ with proposal probability $q(i' \mid i)$ such that $|i - i'| = 1$.
• Accept the change with probability $\min\left(1, \frac{c_{i'} \exp(-E(x)/T_{i'})\, q(i \mid i')}{c_i \exp(-E(x)/T_i)\, q(i' \mid i)}\right)$.
• Any drawbacks?
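A minimal sketch in Python. The toy target (a double-well energy with a high barrier), the ladder, and the choice $c_i = 1$ are illustrative assumptions; as discussed on the next slides, the weights should ideally be $c_i \approx 1/Z(T_i)$.

# Sketch: simulated tempering on E(x) = 8(x^2 - 1)^2, symmetric level proposals.
import numpy as np

rng = np.random.default_rng(0)
E = lambda x: 8.0 * (x * x - 1.0) ** 2   # double-well energy, barrier ~ exp(-8) at T=1
T = np.array([1.0, 2.0, 4.0, 8.0])       # temperature ladder, T[0] is the base level
c = np.ones(len(T))                       # pseudo-priors; ideally c[i] ~ 1/Z(T[i])

x, i = 1.0, 0
base_samples = []
for step in range(50000):
    # (Base transition) random-walk Metropolis at the current temperature
    x_new = x + 0.5 * rng.standard_normal()
    if np.log(rng.random()) < (E(x) - E(x_new)) / T[i]:
        x = x_new
    # (Temperature switching) propose a neighboring level; q is symmetric here
    j = i + (1 if rng.random() < 0.5 else -1)
    if 0 <= j < len(T):
        log_r = (np.log(c[j]) - E(x) / T[j]) - (np.log(c[i]) - E(x) / T[i])
        if np.log(rng.random()) < log_r:
            i = j
    if i == 0:
        base_samples.append(x)            # keep samples only at the lowest temperature

print("fraction of time at base level:", len(base_samples) / 50000)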
Simulated Tempering (Discussion)
• Set $T_1 = 1$. Given $T_i$, we should set $T_{i+1}$ such that uphill moves (from $T_i$ to $T_{i+1}$) have a considerable probability of being accepted.
• Build the temperature ladder step by step until we have a sufficiently smooth distribution at the top.
• The time spent on the base level $T_1$ is around $1/L$ of the run. If we have too many levels, only a very small portion of samples can be used.
Simulated Tempering (Discussion)
• All temperature levels play an important role. So it is desirable to spend a comparable amount of time at each level. Setting $c_i = 1/Z(T_i)$ for each $i$, we have $p(i) = 1/L$ for every level.
• The normalizing constants $Z(T_i)$ are typically unknown, and estimating them is very difficult and expensive.
Parallel Tempering
(Basic idea) Rather than jumping between temperatures, it simultaneously simulates multiple chains, each at a temperature level $T_i$, called a replica, and constantly swaps samples between replicas.
Parallel Tempering (Algorithm)
Each iteration consists of the following steps:
• (Parallel update): simulate each replica with its own transition kernel.
• (Replica exchange): propose to swap states between two replicas (say the $i$-th and $j$-th, where $|i - j| = 1$):
Parallel Tempering (Algorithm)
• The proposal is accepted with probability $\min(1, r)$, where
$r = \exp\left( (E(x_i) - E(x_j)) \left( \frac{1}{T_i} - \frac{1}{T_j} \right) \right)$
• We collect samples from the base replica (the one with $T = 1$).
• Why does this algorithm produce the desired distribution? A sketch of the full procedure is given below.
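A minimal sketch in Python, reusing the same toy double-well energy as in the simulated tempering sketch (again an illustrative assumption, not from the lecture):

# Sketch: parallel tempering with one replica per temperature and adjacent swaps.
import numpy as np

rng = np.random.default_rng(0)
E = lambda x: 8.0 * (x * x - 1.0) ** 2   # double-well energy
T = np.array([1.0, 2.0, 4.0, 8.0])       # one replica per temperature
x = np.ones(len(T))                       # current state of each replica
base_samples = []

for step in range(50000):
    # (Parallel update) one random-walk Metropolis move per replica
    prop = x + 0.5 * rng.standard_normal(len(T))
    accept = np.log(rng.random(len(T))) < (E(x) - E(prop)) / T
    x[accept] = prop[accept]
    # (Replica exchange) propose to swap a random adjacent pair (i, i+1)
    i = rng.integers(len(T) - 1)
    log_r = (E(x[i]) - E(x[i + 1])) * (1.0 / T[i] - 1.0 / T[i + 1])
    if np.log(rng.random()) < log_r:
        x[i], x[i + 1] = x[i + 1], x[i]
    base_samples.append(x[0])             # collect from the base replica (T = 1)

print("base-replica mean:", np.mean(base_samples))  # ~0: both wells are visited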
Parallel Tempering (Justification)
Let $\pi_i(x) \propto \exp(-E(x)/T_i)$. We define the product distribution over all replicas:
$\pi(x_1, \ldots, x_L) = \prod_{i=1}^{L} \pi_i(x_i)$
Obviously, the step of parallel update preserves the invariant distribution $\pi$.
Parallel Tempering (Justification)
Note that the step of replica exchange is symmetric, i.e., the probabilities of proposing to go up and down are equal; then, according to the Metropolis algorithm, we accept the swap with probability $\min(1, r)$ with
$r = \frac{\pi_i(x_j)\, \pi_j(x_i)}{\pi_i(x_i)\, \pi_j(x_j)} = \exp\left( (E(x_i) - E(x_j)) \left( \frac{1}{T_i} - \frac{1}{T_j} \right) \right)$
Parallel Tempering (Discussion)
• It is efficient and very easy to implement, especially in a parallel computing environment.
• Tuning a parallel tempering system is often an art rather than a technique.
• Parallel tempering is a special case of a large family of MCMC methods called Extended Ensemble Monte Carlo, which involves a collection of parallel Markov chains, with the simulation switching between them.
Swendsen-Wang Algorithm
The Swendsen-Wang algorithm (R. Swendsen and J. Wang, 1987) is an efficient Gibbs sampling algorithm for sampling from the extended Ising model.
Standard Ising Model
The standard Ising model is defined as
$p(x) = \frac{1}{Z} \exp\left( \beta \sum_{(i,j) \in E} x_i x_j \right)$
where $x_i \in \{-1, +1\}$ for each node $i$ is called a spin, and $E$ is the set of edges between neighboring nodes.
• Gibbs sampling is extremely slow, especially when the temperature is low.
Extended Ising Model
• We extend the model by introducing additional bond variables $u_{ij}$, one for each edge. Each bond has two states: $1$ indicating connected and $0$ indicating disconnected.
• We define a joint distribution that couples the spins and bonds:
$p(x, u) \propto \prod_{(i,j) \in E} \psi(x_i, x_j, u_{ij})$
Extended Ising Model (cont'd)
Here, $\psi(x_i, x_j, u_{ij})$ is described as below:
• When $u_{ij} = 0$, $\psi = e^{-\beta}$ for every setting of $(x_i, x_j)$.
• When $u_{ij} = 1$, $\psi = (e^{\beta} - e^{-\beta})\, \mathbf{1}(x_i = x_j)$.
Summing over $u_{ij}$ recovers $e^{\beta x_i x_j}$, so the marginal over the spins is exactly the original Ising model.
Extended Ising Model (cont'd)
With this setting, $p(u \mid x)$ can be written as a product of independent per-edge conditionals:
$p(u \mid x) = \prod_{(i,j) \in E} p(u_{ij} \mid x_i, x_j)$
where, for each edge $(i, j)$:
• when $x_i \neq x_j$, $u_{ij}$ must be $0$;
• when $x_i = x_j$, $u_{ij}$ is set to zero with probability $e^{-2\beta}$.
Swendsen-Wang Algorithm
Each iteration consists of two steps:
• (Clustering): conditioned on the spins $x$, draw the bonds $u$ independently. For an edge $(i, j)$:
• If $x_i \neq x_j$, set $u_{ij} = 0$.
• If $x_i = x_j$, set $u_{ij} = 1$ with probability $1 - e^{-2\beta}$, or $u_{ij} = 0$ otherwise.
Swendsen-Wang Algorithm
• (Swapping): conditioned on the bonds $u$, draw the spins $x$.
• For each connected component, draw $+1$ or $-1$ with equal chance, and assign the resultant value to all nodes in the component. (A sketch of the full algorithm follows.)
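A minimal sketch in Python for an $n \times n$ grid Ising model; the grid size, $\beta$, and the union-find bookkeeping are implementation choices of this sketch, not prescribed by the lecture.

# Sketch: one Swendsen-Wang iteration = clustering (bonds) + swapping (spins).
import numpy as np

def sw_step(spins, beta, rng):
    n = spins.shape[0]
    parent = np.arange(n * n)                 # union-find over grid sites

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]     # path halving
            a = parent[a]
        return a

    p_bond = 1.0 - np.exp(-2.0 * beta)        # P(u_ij = 1 | x_i = x_j)
    # (Clustering) bonds can only open between equal neighboring spins
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 1), (1, 0)):   # right and down neighbors
                r2, c2 = r + dr, c + dc
                if r2 < n and c2 < n and spins[r, c] == spins[r2, c2]:
                    if rng.random() < p_bond:
                        parent[find(r * n + c)] = find(r2 * n + c2)
    # (Swapping) assign a fresh uniform spin to every connected component
    roots = np.array([find(k) for k in range(n * n)])
    new_spin = {root: rng.choice((-1, 1)) for root in np.unique(roots)}
    return np.array([new_spin[root] for root in roots]).reshape(n, n)

rng = np.random.default_rng(0)
spins = rng.choice((-1, 1), size=(16, 16))
for _ in range(100):
    spins = sw_step(spins, beta=0.6, rng=rng)
print("mean magnetization:", spins.mean())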
Swendsen-Wang Algorithm (Illustration)
In the case of a rectangular grid, this Gibbs sampling algorithm mixes very rapidly. The figures (omitted here) illustrate the procedure: spin states up and down are shown by filled and empty circles, and bond states 1 and 0 by thick lines and thin dotted lines. Starting from a state with five connected components (isolated spins count as connected components, albeit of size 1), we first update the bonds: bonds are forbidden wherever the two adjacent spins are in opposite states, and the bonds that are not forbidden are set to the 1 state with probability $p$. We then update the spins component by component, and update the bonds again.
Note that the partition function $Z$ of the extended model is the same as that of the original Ising model.
Swendsen-Wang Algorithm (Discussion)
• When $\beta$ is large, $u_{ij}$ has a high probability of being set to one, i.e., $x_i$ and $x_j$ are likely to be connected.
• Experiments show that the Swendsen-Wang algorithm mixes very rapidly, especially for rectangular grids.
• Can you provide an intuitive explanation?
Swendsen-Wang Algorithm (Discussion)
• The Swendsen-Wang algorithm can be generalized to Potts models (nodes can take values from a finite set).
• The Swendsen-Wang algorithm has been widely used in image analysis applications, e.g., image segmentation (in this case, it is called Swendsen-Wang cut).
Hamiltonian Monte Carlo
• An MCMC method based on Hamiltonian Dynamics. It was originally devised for molecular simulation.
• In 1987, a seminal paper by Duane et al. unified MCMC and molecular dynamics. They called it Hybrid Monte Carlo, which abbreviates to HMC.
• In many articles, people call it Hamiltonian Monte Carlo, as this name is considered to be more specific and informative, and it retains the same abbreviation "HMC".
Motivating Example: Free Fall
Motivating Example: Free Fall
• The change of momentum $p$ is caused by the accumulation/release of the potential energy:
$\frac{dp}{dt} = -\frac{\partial U}{\partial q}$
• The change of location $q$ is caused by the velocity, the derivative of the kinetic energy w.r.t. the momentum:
$\frac{dq}{dt} = \frac{\partial K}{\partial p}$
Hamiltonian Dynamics
• Hamiltonian Dynamics is a generalized theory of classical mechanics, which provides an elegant and flexible abstraction of a dynamic system in physics.
• In Hamiltonian Dynamics, a physical system is described by $(q_i, p_i)_{i=1}^{d}$, where $q_i$ and $p_i$ are respectively the position and momentum of the $i$-th entity.
Hamilton's Equations
The dynamics of the system is characterized by Hamilton's Equations:
$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}$
Here, $H(q, p)$ is called the Hamiltonian, which can be interpreted as the total energy of the system.
Hamilton's Equations (cont'd)
• The Hamiltonian $H$ is often formulated as the sum of the potential energy $U(q)$ and the kinetic energy $K(p)$:
$H(q, p) = U(q) + K(p)$
• With this setting, Hamilton's Equations become:
$\frac{dq_i}{dt} = \frac{\partial K}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial U}{\partial q_i}$
Conservation of the Hamiltonian
The Hamiltonian is conserved, i.e., it is invariant over time:
$\frac{dH}{dt} = \sum_i \left( \frac{\partial H}{\partial q_i} \frac{dq_i}{dt} + \frac{\partial H}{\partial p_i} \frac{dp_i}{dt} \right) = \sum_i \left( \frac{\partial H}{\partial q_i} \frac{\partial H}{\partial p_i} - \frac{\partial H}{\partial p_i} \frac{\partial H}{\partial q_i} \right) = 0$
Intuitively, this reflects the law of energy conservation.
Hamiltonian Reversibility
• Hamiltonian dynamics is reversible.
• Let the initial states be $(q(0), p(0))$ and the states at time $t$ be $(q(t), p(t))$. Then, if we reverse the process, starting at $(q(t), -p(t))$, the states after time $t$ would be $(q(0), -p(0))$.
• In the context of MCMC, this leads to the reversibility of the underlying chain.
Simulation of Hamiltonian Dynamics
A natural idea to simulate Hamiltonian dynamics is to use Euler's method over discretized time steps:
$p(t + \epsilon) = p(t) - \epsilon\, \frac{\partial U}{\partial q}(q(t)), \qquad q(t + \epsilon) = q(t) + \epsilon\, \frac{\partial K}{\partial p}(p(t))$
Is this a good method?
Leapfrog Method
Better results can be obtained with the leapfrog method:
$p(t + \epsilon/2) = p(t) - \frac{\epsilon}{2}\, \frac{\partial U}{\partial q}(q(t))$
$q(t + \epsilon) = q(t) + \epsilon\, \frac{\partial K}{\partial p}(p(t + \epsilon/2))$
$p(t + \epsilon) = p(t + \epsilon/2) - \frac{\epsilon}{2}\, \frac{\partial U}{\partial q}(q(t + \epsilon))$
More importantly, the leapfrog update is reversible. A sketch in code follows.
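A minimal sketch in Python, assuming the common choice $K(p) = p^T p / 2$ (so $\partial K / \partial p = p$); the harmonic-oscillator test at the end is an illustrative assumption.

# Sketch: leapfrog integration of Hamilton's equations for L steps of size eps.
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    q, p = np.copy(q), np.copy(p)
    p -= 0.5 * eps * grad_U(q)        # initial half step for the momentum
    for _ in range(L - 1):
        q += eps * p                   # full step for the position
        p -= eps * grad_U(q)           # full step for the momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)         # final half step for the momentum
    return q, p

# Harmonic oscillator U(q) = q^2 / 2: the true orbit is a circle in (q, p)
q, p = leapfrog(np.array([1.0]), np.array([0.0]), lambda q: q, eps=0.1, L=63)
print(q, p, 0.5 * (q**2 + p**2))       # H stays close to its initial value 0.5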
Leapfrog Method (cont'd)
Example
Consider a Hamiltonian system:
$H(q, p) = \frac{q^2}{2} + \frac{p^2}{2}$
Write down Hamilton's Equations:
$\frac{dq}{dt} = p, \qquad \frac{dp}{dt} = -q$
Derive the solution:
$q(t) = r \cos(a + t), \qquad p(t) = -r \sin(a + t)$
Example (Simulation)
Hamiltonian Monte Carlo
(Basic idea): Consider the potential energy as the Gibbs energy, and introduce the "momenta" as auxiliary variables to control the dynamics.
Hamiltonian Monte Carlo (cont'd)
Suppose the target distribution is $\pi(q) \propto \exp(-U(q))$; then we form an augmented distribution as
$p(q, p) \propto \exp(-U(q) - K(p)) = \exp(-H(q, p))$
Here, the locations $q$ represent the variables of interest, and the momenta $p$ control the dynamics of the simulation.
Hamiltonian Monte Carlo (cont'd)
In practice, the kinetic energy is often formalized as
$K(p) = \frac{1}{2} p^{T} M^{-1} p = \sum_i \frac{p_i^2}{2 m_i} \quad \text{(for a diagonal mass matrix } M\text{)}$
Hamiltonian Monte Carlo (Algorithm)
Each iteration of HMC comprises two steps:
• Gibbs update: sample the momenta $p$ from the Gaussian prior given by $p \sim \mathcal{N}(0, M)$.
Hamiltonian Monte Carlo (Algorithm)
• Metropolis update: use Hamiltonian dynamics to propose a new state. Starting from $(q, p)$, simulate the dynamic system with the leapfrog method for $L$ steps with step size $\epsilon$, which yields $(q^*, p^*)$. The proposed state is accepted with probability:
$\min\left(1, \exp\left(H(q, p) - H(q^*, p^*)\right)\right)$
A sketch of the full algorithm is given below.
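A minimal sketch in Python with an identity mass matrix $M = I$ (so $K(p) = p^T p / 2$); the 2-D Gaussian target at the end is an illustrative assumption.

# Sketch: HMC = Gibbs update of momenta + leapfrog proposal + Metropolis test.
import numpy as np

def hmc_sample(U, grad_U, q0, eps, L, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    q = np.copy(q0)
    samples = []
    for _ in range(n_iter):
        p = rng.standard_normal(q.shape)       # Gibbs update: p ~ N(0, I)
        q_new, p_new = np.copy(q), np.copy(p)
        # leapfrog trajectory of L steps with step size eps
        p_new -= 0.5 * eps * grad_U(q_new)
        for _ in range(L - 1):
            q_new += eps * p_new
            p_new -= eps * grad_U(q_new)
        q_new += eps * p_new
        p_new -= 0.5 * eps * grad_U(q_new)
        # Metropolis update: accept with probability min(1, exp(H - H_new))
        H = U(q) + 0.5 * p @ p
        H_new = U(q_new) + 0.5 * p_new @ p_new
        if np.log(rng.random()) < H - H_new:
            q = q_new
        samples.append(np.copy(q))
    return np.array(samples)

# Example: standard 2-D Gaussian, U(q) = ||q||^2 / 2, grad_U(q) = q
samples = hmc_sample(lambda q: 0.5 * q @ q, lambda q: q,
                     np.zeros(2), eps=0.2, L=20, n_iter=5000)
print(samples.mean(axis=0), samples.std(axis=0))   # ~0 and ~1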
HMC (Discussion)
• If the simulation were exact, we would have $H(q^*, p^*) = H(q, p)$, and thus the proposed state would always be accepted.
• In practice, there can be some deviation due to discretization, so we have to use the Metropolis rule to guarantee correctness.
HMC (Discussion)
• HMC has a high acceptance rate while allowing large moves along less-constrained directions at each iteration.
• This is a key advantage compared to random-walk proposals, which, in order to maintain a reasonably high acceptance rate, have to keep a very small step size, resulting in substantial correlation between consecutive samples.
Tuning HMC
• For efficient simulation, it is important to choose appropriate values for both the leapfrog step size $\epsilon$ and the number of leapfrog steps per iteration $L$.
• Tuning HMC (and indeed many generic sampling methods) often requires preliminary runs with different trial settings and different initial values, as well as careful analysis of the energy trajectories.
Tuning HMC (cont'd)
• In most cases, $\epsilon$ and $L$ can be tuned independently.
• Too small a step size would waste computation time, while too large a step size would cause unstable simulation and thus a low acceptance rate.
• One should choose $\epsilon$ such that the energy trajectory is stable and the acceptance rate is maintained at a reasonably high level.
• One should choose $L$ such that back-and-forth movement of the states can be observed.
Generic Sampling Systems
A number of software systems are available for sampling from models specified by the user.
• WinBUGS: based on BUGS (Bayesian inference Using Gibbs Sampling).
• Provides a friendly language for users to specify the model.
• Runs only on Windows.
• Note: development stopped in 2007.
Generic Sampling Systems (cont'd)
• JAGS: "Just Another Gibbs Sampler"
• Cross-platform support
• Uses a dialect of BUGS
• Extensible: allows users to write customized functions, distributions, and samplers
Generic Sampling Systems (cont'd)
• Stan: "Sampling Through Adaptive Neighborhoods"
• Core written in C++, with interfaces available for Python, R, Matlab, and Julia
• A user-friendly language for model specification
• Uses Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) as core algorithms
• Open source (GPLv3 licensed) and under active development on GitHub
Stan Example
data {
  int<lower=0> N;        // number of observations
  vector[N] x;           // predictor
  vector[N] y;           // response
}
parameters {
  real alpha;            // intercept
  real beta;             // slope
  real<lower=0> sigma;   // noise scale
}
model {
  // linear regression likelihood: y[n] ~ N(alpha + beta * x[n], sigma)
  for (n in 1:N)
    y[n] ~ normal(alpha + beta * x[n], sigma);
}
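A hedged sketch of fitting the model above from Python via CmdStanPy, one of the interfaces mentioned earlier; the file name "linreg.stan" and the synthetic data are assumptions for illustration.

# Sketch: run the Stan model above with CmdStanPy (Stan samples with NUTS/HMC).
import numpy as np
from cmdstanpy import CmdStanModel

N = 100
x = np.random.randn(N)
y = 1.0 + 2.0 * x + 0.5 * np.random.randn(N)   # synthetic data: alpha=1, beta=2

model = CmdStanModel(stan_file="linreg.stan")   # compile the model
fit = model.sample(data={"N": N, "x": x, "y": y})
print(fit.summary())                            # posterior summaries for alpha, beta, sigma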
Generic Sampling System vs. Dedicated Algorithms

Generic             | Dedicated
--------------------|-----------------------------------
Easy to use         | Requires knowledge and experience
High productivity   | Time-consuming to develop
Slow                | Often remarkably more efficient
Limited flexibility | Necessary for many new models
