On the convergence properties of the Wang-Landau algorithm

Wang–Landau algorithm
Flat Histogram in ﬁnite time
Parallel Wang–Landau: asymptotic behaviour

On the convergence properties of the
Wang-Landau Algorithm

Robin J. Ryder
CEREMADE, Universit´ Paris Dauphine
e
and CREST

ENSAE – 31 January 2012

joint work with Pierre Jacob (National University of Singapore)
and Pierre Del Moral (INRIA & Universit´ de Bordeaux)
e

R. Ryder (Dauphine) & CREST Wang-Landau 1/ 40


Outline

1 Wang–Landau algorithm

2 Flat Histogram in ﬁnite time

3 Parallel Wang–Landau: asymptotic behaviour



Motivation

0.5

0.4

0.3
density

0.2

0.1

0.0
−5 0 5 10 15
X

Figure: A mixture of well-separated normal distributions.



Motivation

0.5

0.4

0.3
density

0.2

0.1

0.0
−5 0 5 10 15
X




Setting

Partition the state space
d
X = Xi
i=1

Desired frequencies

φ = (φ1 , . . . , φd )

Penalized distribution
π(x)
πθ (x) ∝
θ(J(x))
where J(x) such that x ∈ XJ(x) .



First algorithm

Algorithm 1 Wang-Landau with deterministic schedule (γt )
1: Init θ0 > 0, X0 ∈ X .
2: for t = 1 to T do
3: Sample Xt from Kθt−1 (Xt−1 , ·), MH kernel targeting πθt−1 .
4: Update the penalties:

log θt (i) ← log θt−1 (i) + γt (1 Xi (Xt ) − φi )
I

5: end for



Video



Flat Histogram

Issues with the ﬁrst version
Choice of γ has a huge impact on the results.

Flat Histogram
Deﬁne the counters:
t
νt (i) := 1 Xi (Xn )
I
n=1

Flat Histogram (FH) is reached when:

νt (i)
max − φi < c
i∈{1,...,d} t



Flat Histogram

Idea
Instead of decreasing γt at each time step t, decrease only when
the Flat Histogram criterion is reached.

In practice
Denote by κt the number of FH criteria reached up to time t.
Use γκt instead of γt at time t.
If FH is reached at time t, reset νt (i) to 0 for all i.



Wang–Landau with Flat Histogram

Algorithm 2 Wang-Landau with stochastic schedule (γκt )
1: Init θ0 > 0, X0 ∈ X .
2: Init κ0 ← 0.
3: for t = 1 to T do
4: Sample Xt from Kθt−1 (Xt−1 , ·), MH kernel targeting πθt−1 .
5: If (FH) then κt ← κt−1 + 1, otherwise κt ← κt−1 .
6: Update the penalties:

log θt (i) ← log θt−1 (i) + γκt (1 Xi (Xt ) − φi )
I

7: end for



FH is met in ﬁnite time

To be sure that eventually, for any c > 0:

νt (i)
max − φi < c
i∈{1,...,d} t
we want to prove:

νt (i) P
∀i ∈ {1, . . . , d} − − φi
−→
t t→∞
for any ﬁxed γ > 0.



Various updates

Right update

log θt (i) ← log θt−1 (i) + γ (1 Xi (Xt ) − φi )
I (1)
Using this update, FH is met in ﬁnite time.

Wrong update

θt (i) ← θt−1 (i) [1 + γ (1 Xi (Xt ) − φi )]
I
⇔
log θt (i) ← log θt−1 (i) + log [1 + γ (1 Xi (Xt ) − φi )]
I (2)

Here FH is not necessarily met in ﬁnite time.
1
Special case: ∀i φi = d .



Assumptions

In this talk, there are only two bins: d = 2. Additionally:
Assumption

The bins are not empty with respect to µ and π:

∀i ∈ {1, 2} µ(Xi ) > 0 and π(Xi ) > 0

Assumption

The state space X is compact.



Assumptions

Assumption

The proposal distribution Q(x, y ) is such that:

∃qmin > 0 ∀x ∈ X ∀y ∈ X Q(x, y ) > qmin

Assumption

The MH acceptance ratio is bounded from both sides:

π(y ) Q(y , x)
∃m > 0 ∃M > 0 ∀x ∈ X ∀y ∈ X m< <M
π(x) Q(x, y )



Theorem

Theorem
Consider the sequence of penalties θt introduced in the WL
algorithm. We deﬁne:

θt (1)
Zt = log = log θt (1) − log θt (2)
θt (2)

Then:
Zt L1
−− 0
−→
t t→∞
and consequently, with update (1) (FH) is reached in ﬁnite time
for any precision threshold c, whereas this is not guaranteed for
update (2).



Consequence

Using update (1), and starting from Z0 = 0:

X
Zt = log θt (1) 1 XX
IX
X
− log θt (2) 1 XX
IX



Consequence


t X
Zt = log θ0 (1) + γ I IX
n=1 (1 X1 (Xn ) − φ1 ) 1 XX
t X
− log θ0 (2) + γ IX
n=1 (1 X2 (Xn ) − φ2 ) 1 XX
I



Consequence


X
Zt = γ(νt (1) − tφ1 ) 1 XX
IX
X
−γ(νt (2) − tφ2 ) 1 XX
IX



Consequence


X
Zt = γ(νt (1) − tφ1 ) 1 XX
IX
X
−γ(t − νt (1) − t(1 − φ1 )) 1 XX
IX



Consequence


X
Zt = γ(νt (1) − tφ1 ) 1 XX
IX
X
+γ(νt (1) − tφ1 )) 1 XX
IX



Consequence


X
Zt = 2γ(νt (1) − tφ1 ) 1 XX
IX
X
1 XX
IX



Consequence


νt (1) X
Zt
t = 2γ t − φ1 1 XX
IX
X
1 XX
IX



Consequence


νt (1) L X
Zt
t = 2γ t − φ1 − − 0 1 XX
−1→ IX
t→∞
νt (1) L X
⇒ t −1→
− − φ1 1 XX
IX
t→∞



Consequence


νt (1) L X
Zt
t = 2γ t − φ1 − − 0 1 XX
−1→ IX
t→∞
νt (1) L X
⇒ t −1→
− − φ1 1 XX
IX
t→∞

Using update (2), these simpliﬁcations do not occur: νt (i)/t
converges, but not to φi in general.



Proof

To prove that
Zt L1
− − 0,
−→
t t→∞
ﬁrst notice that

Zt+1 − Zt = 2γ (νt+1 (1) − νt (1) − φ1 )
can only take two diﬀerent values, which we note +a and −b.



Proof

Introduce Ut , the increment of Zt :

Zt+1 = Zt + Ut

Lemma
With the introduced processes Zt and Ut , there exists > 0 such
¯ ¯
that for all η > 0, there exists Z hi such that, if Zt ≥ Z hi , we have
the following two inequalities:

P[Ut+1 = −b|Ut = +a, Zt ] >
P[Ut+1 = −b|Ut = −b, Zt ] > 1 − η.

(and the symmetric corollary)



Proof

q

q

q
q

q
q

q

q q
q
q
q q
q
q q q

q
q
q q

q
q q q

q q
q
q

Zs q
q
q
q q q

q q
~ ~
q
q

q q
Zs+T
q

q
q
q
q q q

q q
q
q
Zs+T
q

q
q
q
q

q
q
q
q
q
q
q

q
q
q
q
q q
q

5 10 15 20 25 30 35
time

Figure: We prove that Zt returns below a given horizontal bar whenever
it goes above it, and it does so in ﬁnite time. This implies Zt /t → 0.



Proof

Introduce a crossing time s: Zs−1 ≤ Z hi and Zs ≥ Z hi .
˜
Introduce a new process Zt :
˜ ˜ ˜
Zs = Zs and Zt+1 = Zt + Vt
˜
We can choose Vt such that Zt ≥ Zt and
Lemma
(Vt ) is a Markov chain over the space {+a, −b} with transition
matrix
1−
η 1−η
where the ﬁrst state corresponds to +a and the second state to
−b.



Proof

q

q

q
q

q
q

q

q q
q
q
q q
q
q q q

q
q
q q

q
q q q

q q
q
q

Zs q
q
q
q q q

q q
~ ~
q
q

q q
Zs+T
q

q
q
q
q q q

q q
q
q
Zs+T
q

q
q
q
q

q
q
q
q
q
q
q

q
q
q
q
q q
q

5 10 15 20 25 30 35
time

Figure: We prove that Zt returns below a given horizontal bar whenever
it goes above it, and it does so in ﬁnite time. This implies Zt /t → 0.



Case d = 2

This shows that Zt /t → 0, hence (FH) is met in ﬁnite time.
For the case d > 2, we shall use only update 1.



Case d > 2

To prove that (FH) is met in ﬁnite time in the case with d > 2, we
need an extra assumption:
Assumption
The desired frequencies are all rational numbers:

∀i, φi ∈ Q



Irreducible Markov chain

Under the assumption of rationality, we have the following lemma:
Lemma
Let Θ be the following subset of Rd :
d
d d
Θ = {z ∈ R : ∃(n1 , . . . , nd ) ∈ N zi = ni − φi nj }
j=1

Then denoting by λ the product of the Lebesgue measure µ on X
and of the counting measure on Θ, (Xt , log θt ) is λ-irreducible.

Proof: B´zout’s lemma.
e



A limit exists

Since (Xt , log θt ) is λ-irreducible, the proportion of visits to any
λ-measurable set of X × Θ converges to a limit in [0, 1]. Hence
the vector (ν(i)/t) converges to some limit pi . We now need to
prove that this limit corresponds to (FH).



Reductio ad absurdum

Suppose (reductio ad absurdum) that p = φ. Then there exist i
and j such that
pi − φi < pj − φj
and hence

Ztj,i = −νt (i) + νt (j) + t(φi − φj ) ∼ t(−pi + φi + pj − φj ) → ∞
˜
As before, we can construct Ut and Ut and show that Ut decreases
on average, which contradicts the assumption that Ztj,i → ∞.



Conclusion for d > 2

Conclusion: using update 1, (FH) is met in ﬁnite time for any
number of bins d.



Illustration

1.0 1.0

Proportions of visits to each bin

Proportions of visits to each bin
0.5
0.8 0.8
0.4
0.6 0.6
density

0.3

0.4 0.4
0.2

0.1 0.2 0.2

0.0 0.0 0.0
−4 −2 0 2 4 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000
X iterations iterations

(a) Histogram of the gen- (b) Convergence of the (c) Convergence of the
erated sample proportions of visits to proportions of visits to
each bin, using the right each bin, using the wrong
update update



Parallel chains

(1) (N)
N chains (Xt , . . . , Xt ) instead of one.
targeting the same biased distribution πθt at iteration t,
sharing the same estimated bias θt at iteration t.

Old update of the penalties
X
log θt (i) ← log θt−1 (i) + γ (1 Xi (Xt ) − φi ) 1 XX
I IX

(L. Bornn, P. Jacob, P. Del Moral, A. Doucet)



Parallel chains

(1) (N)
N chains (Xt , . . . , Xt ) instead of one.
targeting the same biased distribution πθt at iteration t,
sharing the same estimated bias θt at iteration t.

New update of the penalties
N (k) X
log θt (i) ← log θt−1 (i) + γ 1
N k=1 1 Xi (Xt )
I − φi 1 XX
IX

(L. Bornn, P. Jacob, P. Del Moral, A. Doucet)



Inﬁnite number of chains

Suppose φi = 1/d. We can use either update and choose the
“wrong” update
N
1 (k) 1
θt (i) ← θt−1 (i) 1 + γ 1 Xi (Xt ) −
I
N d
k=1

(k) iid
Note that if (Xt )N ∼ πθt then
k=1

N −1
1 (k) a.s. ψi ψk
1 Xi (Xt )
I −−→
−− πθt (x)dx =
N N→∞ Xi θt−1 (i) θ(k)
k=1 k



This corresponds to update

1
θt (i) ← θt−1 (i) 1 + γ πθt (x)dx −
X d
  i 
−1
ψi ψk 1 
⇔ θt (i) ← θt−1 (i) 1 + γ  −
θt−1 (i) θ(k) d
k

Normalize:
θt (i)
θt (i) ←
j θt (j)

Then the penalties (θt )t≥0 should converge towards ψ.



ψi ψk −1 XX
θt−1 (i) 1+γ θt−1 (i) ( k θ(k) ) 1
−d I
1 XX
θt (i) ← XX
ψj ψk −1 1
d
j=1 θt−1 (j) 1+γ θt−1 (j) ( k θ(k) −d) I
1 XX



ψi XX
θt−1 (i) 1+γ −1 1
θt−1 (i) Mψ (θt−1 )− d I
1 XX
θt (i) ← ψj XX
d
j=1 θt−1 (j) 1+γ −1 1
θt−1 (j) Mψ (θt−1 )− d I
1 XX



XX
−1
I
θt−1 (i)(1− γ ) + ψi γMψ (θt−1 ) 1 XX
d
θt (i) ← XX
d γ
j=1 θt−1 (j)(1− d )
−1
I
+ ψj γMψ (θt−1 ) 1 XX



XX
−1
I
θt−1 (i)(1− γ ) + ψi γMψ (θt−1 ) 1 XX
d
θt (i) ← XX
−1
I
(1− γ )+γMψ (θt−1 ) 1 XX
d



1− γ γM −1 (θt−1 )
ψ
θt (i) ← θt−1 (i) 1− γ +γM −1 (θ
d
+ ψi 1− γ +γM −1 (θ
d ψ t−1 ) d ψ t−1 )



θt (i) ← θt−1 (i) × (1 − α(θt−1 )) + ψi × α(θt−1 )



Hence the update can be written as a convex combination

θt (i) ← θt−1 (i) × (1 − α(θt−1 )) + ψi × α(θt−1 )

where
−1
γMψ (θt−1 )
α(θt−1 ) = γ −1
1− d + γMψ (θt−1 )
It can easily be shown that

∀t ≥ 0 0 < αmin ≤ α(θt ) ≤ αmax < 1

provided that ∀i, θ0 (i) > 0, ψ(i) > 0.
iid
Of course the assumptions N → ∞ and ∼ sampling are strong.
What happens in practice?



Animation

Toy example

X = {1, 2, 3, 4, 5}, π = (π1 , . . . , π5 )
We take the partition: Xi = {i} and launch the Wang–Landau
algorithm for N = 1, 10, 100 and γ = 1.
In this case we know the value of ψi : ψi = πi , so we can compute
the deterministic sequence of penalties corresponding to N = ∞.

Animation



Animation

N=∞


Animation

N = 100


Animation

N = 10


Animation

N=1


Animation

“



New take on the Parallel Wang–Landau

Call fψ the function such that:

θt ← fψ (θt−1 )

As we saw:
fψ (θ) = θ(1 − α(θ)) + ψα(θ))
where
−1
γMψ (θ)
α(θ) = γ −1
1− d + γMψ (θ)



New take on the Parallel Wang–Landau (in progress)

In the ideal case we have
n
θn = fψ ◦ . . . ◦ fψ (θ0 ) = fψ (θ0 )
×n

If we add perturbations to ψ we also add perturbations to θ, as
follows:
ε
θ1 = fψ (θ0 ) = fψ (θ0 ) + V0




n
θn = fψ ◦ . . . ◦ fψ (θ0 ) = fψ (θ0 )
×n

follows:
ε
θ1 = fψ (θ0 ) = fψ (θ0 ) + V0
ε ε 2
θ2 = fψ (θ1 ) = fψ (fψ (θ0 ) + V0 ) = fψ (θ0 ) + V1
...




n
θn = fψ ◦ . . . ◦ fψ (θ0 ) = fψ (θ0 )
×n

follows:
ε
θ1 = fψ (θ0 ) = fψ (θ0 ) + V0
ε ε 2
θ2 = fψ (θ1 ) = fψ (fψ (θ0 ) + V0 ) = fψ (θ0 ) + V1
...

The properties of fψ should help control the noise terms
(Vn )n≥0 . . .



Bibliography

Atchad´, Y. and Liu, J. (2010). The Wang-Landau algorithm
e
in general state spaces: applications and convergence analysis.
Statistica Sinica, 20:209–233.
Wang, F. and Landau, D. (2001). Determining the density of
states for classical statistical models: A random walk
algorithm to produce a ﬂat histogram. Physical Review E,
64(5):56101.
An Adaptive Interacting Wang-Landau Algorithm for
Automatic Density Exploration, L. Bornn, P. Jacob, P. Del
Moral, A. Doucet, on arXiv.
The Wang-Landau algorithm reaches the Flat Histogram
criterion in ﬁnite time, P. Jacob & RR, on arXiv.


On the convergence properties of the Wang-Landau algorithm

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to On the convergence properties of the Wang-Landau algorithm

Similar to On the convergence properties of the Wang-Landau algorithm (7)

More from Robin Ryder

More from Robin Ryder (11)

On the convergence properties of the Wang-Landau algorithm