Optimally adjusted mixture sampling and locally weighted histogram analysis
Zhiqiang Tan
Department of Statistics
Rutgers University
www.stat.rutgers.edu/home/ztan/
December 13, 2017
SAMSI
1
Outline
• Problem of interest
• Proposed methods
– Self-adjusted mixture sampling (SAMS)
– Locally weighted histogram analysis method (L-WHAM)
• Numerical studies
– Potts model
– Biological host-guest system
• Conclusion
2
Problem of interest
• Target probability distributions:
dPj = {qj(x)/Zj} dµ, j = 1, . . . , m,
where Zj = ∫ qj(x) dµ is called the normalizing constant.
• Examples:
– Boltzmann distributions, including Ising and Potts models: qj(x) = exp{−uj(x)}.
– Missing-data problems and latent-variable models: qj(ymis) = pj(yobs, ymis), Zj = pj(yobs).
– Bayesian analysis: qj(θ) = pj(θ) × Lj(θ; yobs), Zj = marginal likelihood.
3
Problem of interest (cont’d)
• Objectives
(i) To simulate observations from Pj for j = 1, . . . , m.
(ii) To estimate expectations
Ej(g) = ∫ g(x) dPj, j = 1, . . . , m,
for some function g(x).
(iii) To estimate the normalizing constants (Z1, . . . , Zm) up to a multiplicative constant or, equivalently, to estimate the log ratios of normalizing constants:
ζ∗j = log(Zj/Z1), j = 1, . . . , m.
4
Proposed methods
• Self-adjusted mixture sampling (SAMS) for
– sampling
– on-line estimation of expectations
– on-line estimation of normalizing constants
• Locally (vs. globally) weighted histogram analysis method (WHAM) for
– off-line estimation of expectations
– off-line estimation of normalizing constants
• on-line: while the sampling process goes on (using each observation one by one)
off-line: after the sampling process is completed (using all observations simultaneously)
5
Labeled mixture
• Consider a labeled mixture (L = label):
(L, X) ∼ p(j, x; ζ) ∝ πj qj(x)/e^{ζj},
where (π1, . . . , πm): target/fixed mixture proportions (∑_{j=1}^m πj = 1), typically π1 = · · · = πm = 1/m,
(ζ1, . . . , ζm): hypothesized values of (ζ∗1, . . . , ζ∗m).
• The marginal distributions are
p(x; ζ) ∝ ∑_{j=1}^m πj qj(x)/e^{ζj},
p(L = j; ζ) ∝ πj/e^{ζj − ζ∗j}.
• The conditional distributions are
p(x | L = j) ∝ qj(x),
p(L = j | x; ζ) = [πj qj(x)/e^{ζj}] / [∑_{k=1}^m πk qk(x)/e^{ζk}] ∝ πj qj(x)/e^{ζj}
(computed stably in the sketch after this slide).
6
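The conditional label probabilities p(L = j | x; ζ) are the basic computational primitive of the talk: they reappear as wj(x; ζ) in the update rules and in WHAM. A minimal sketch in Python, assuming a user-supplied vector log_q of values log qj(x); the function names are illustrative, not from the paper.

import numpy as np

def label_log_weights(log_q, zeta, log_pi):
    # unnormalized log weights: log{pi_j q_j(x) / exp(zeta_j)}
    return log_pi + log_q - zeta

def label_probs(log_q, zeta, log_pi):
    # p(L = j | x; zeta), computed stably via log-sum-exp
    lw = label_log_weights(log_q, zeta, log_pi)
    lw = lw - lw.max()              # guard against overflow
    w = np.exp(lw)
    return w / w.sum()

# toy usage with m = 3 distributions and equal target proportions
log_q = np.array([-1.2, -0.7, -3.0])    # hypothetical values of log q_j(x) at the current x
zeta = np.zeros(3)
log_pi = np.log(np.full(3, 1.0 / 3.0))
print(label_probs(log_q, zeta, log_pi))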
Unadjusted mixture sampling via Metropolis-within-Gibbs
• Global-jump mixture sampling: For t = 1, 2, . . .,
(i) Global jump: Generate Lt ∼ p(L = ·|Xt−1; ζ);
(ii) Markov move: Generate Xt given (Xt−1, Lt = j), leaving Pj invariant.
• Local-jump mixture sampling: For t = 1, 2, . . .,
(i) Local jump: Generate j ∼ Γ(Lt−1, ·), and then set Lt = j with probability
min{ 1, [Γ(j, Lt−1)/Γ(Lt−1, j)] × [p(j | Xt−1; ζ)/p(Lt−1 | Xt−1; ζ)] },
and, with the remaining probability, set Lt = Lt−1.
(ii) Markov move: Same as above.
• Γ(k, j) is a proposal probability of k → j, e.g., nearest-neighbor jump:
Γ(1, 2) = Γ(m, m − 1) = 1, and Γ(k, k − 1) = Γ(k, k + 1) = 1/2 for k = 2, . . . , m − 1
(see the sketch after this slide).
7
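A sketch of the label move in local-jump mixture sampling, using 0-based labels {0, . . . , m−1}, the nearest-neighbor proposal Γ above, and the probabilities p[j] = p(L = j | x; ζ) from the previous sketch. This is a schematic reading of the slide, not the reference implementation; a full iteration would follow the accepted label with any Markov move leaving the selected Pj invariant.

import numpy as np

def gamma(k, j, m):
    # nearest-neighbor proposal probability Gamma(k, j) on {0, ..., m-1}
    if abs(k - j) != 1:
        return 0.0
    return 1.0 if k in (0, m - 1) else 0.5

def local_jump(L, p, m, rng):
    # one Metropolis label move given p[j] = p(L = j | x; zeta)
    j = 1 if L == 0 else m - 2 if L == m - 1 else L + rng.choice((-1, 1))
    accept = min(1.0, gamma(j, L, m) * p[j] / (gamma(L, j, m) * p[L]))
    return j if rng.random() < accept else L

# one iteration of (unadjusted) local-jump mixture sampling, schematically:
#   p = label_probs(log_q(x), zeta, log_pi)   # previous sketch
#   L = local_jump(L, p, m, rng)
#   x = markov_move(x, L)                     # any transition leaving P_L invariant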
A critical issue
• ζ must be specified reasonably close to ζ∗ in order for the algorithm to work!
• How shall we set ζ = (ζ1, . . . , ζm)?
– Trial and error, using the fact that
p(L = j; ζ) ∝ πj/e^{ζj − ζ∗j}, j = 1, . . . , m.
– Stochastic approximation: updating ζ during the sampling process
8
Self-adjusted mixture sampling via stochastic approximation
• Self-adjusted mixture sampling:
(i) Mixture sampling: Generate (Lt, Xt) given (Lt−1, Xt−1), with ζ set to ζ^(t−1);
(ii) Free energy update: For j = 1, . . . , m,
ζj^(t−1/2) = ζj^(t−1) + aj,t (1{Lt = j} − πj),
ζj^(t) = ζj^(t−1/2) − ζ1^(t−1/2). [keep ζ1^(t) = 0]
→ Self-adjustment mechanism
• Questions:
– Optimal choice of aj,t?
– Other update rules?
9
Related literature
• (Unadjusted) Serial tempering (Geyer & Thompson 1995), originally for qj(x) = exp{−u(x)/Tj}.
• (Unadjusted) Mixed/generalized/expanded ensemble simulation (e.g., Chodera & Shirts 2011).
• (Unadjusted) Reversible jump algorithm (Green 1995) in trans-dimensional settings.
• Wang–Landau (2001) algorithm, based on partition of the state space, and its extensions:
stochastic approximation Monte Carlo (Liang et al. 2007),
generalized Wang–Landau algorithm (Atchade & Liu 2010).
• Our formulation makes explicit the mixture structure, leading to new development.
10
Intro to stochastic approximation (Robbins & Monro 1951, etc.)
• Need to find θ∗ that solves
h(θ) := Eθ{H(Y ; θ)} = 0,
where Y ∼ f(·; θ).
• A stochastic approximation algorithm:
(i) Generate Yt ∼ f(·; θt−1) or, more generally,
Yt ∼ Kθt−1 (Yt−1, ·),
by a Markov kernel which admits f(·; θt−1) as the stationary distribution.
(ii) Update
θt = θt−1 + AtH(Yt; θt−1),
where At is called a gain matrix.
11
Theory of stochastic approximation (Chen 2002; Kushner & Yin 2003; Liang 2010, etc.)
• Let At = A/t for a constant matrix A.
• Under suitable conditions,
√t (θt − θ∗) →D N{0, Σ(A)}.
• The minimum of Σ(A) is C^{−1} V (C^T)^{−1}, achieved by A = C^{−1}, where
C = −(∂h/∂θ^T)(θ∗),
(1/√t) ∑_{i=1}^t H(Y∗i; θ∗) →D N(0, V),
where (Y∗1, Y∗2, . . .) is a Markov chain with Y∗t ∼ Kθ∗(Y∗t−1, ·).
• Key idea of proof: martingale approximation.
12
Theory of stochastic approximation (cont’d)
• In general, C is unknown.
• Possible remedies
– Estimating θ∗ and C simultaneously (Gu & Kong 1998; Gu & Zhu 2001).
– Off-line averaging (Polyak & Juditsky 1992; Ruppert 1988), using the trajectory average
¯θt = (1/t) ∑_{i=1}^t θi,
by setting At = A/t^α with α ∈ (1/2, 1). Then under suitable conditions,
√t (¯θt − θ∗) →D N(0, C^{−1} V (C^T)^{−1}).
13
Back to self-adjusted mixture sampling
• Recall stochastic approximation: solve
h(θ) := Eθ{H(Y ; θ)} = 0, with Y ∼ f(·; θ).
• Recast self-adjusted mixture sampling:
f(y; θ) = p(j, x; ζ),
h(θ) = (p(L = 2; ζ) − π2, . . . , p(L = m; ζ) − πm)^T,
H(Y ; θ) = (1{L = 2} − π2, . . . , 1{L = m} − πm)^T,
Kθ(yt−1, yt) = p(lt | lt−1, xt−1; ζ) × p(xt | lt, xt−1),
i.e., (jump probability) × (transition probability).
14
Optimally adjusted mixture sampling
• We find that C = −∂h(θ∗)/∂θ^T is known, even though θ∗ is unknown!
• The optimal stochastic-approximation update rule:
ζ^(t−1/2) = ζ^(t−1) + (1/t) (δ1,t/π1 − 1, δ2,t/π2 − 1, . . . , δm,t/πm − 1)^T,
ζ^(t) = ζ^(t−1/2) − ζ1^(t−1/2),
where δj,t = 1{Lt = j}, j = 1, . . . , m (see the sketch after this slide).
15
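A minimal sketch of the optimal (first) update rule with gain 1/t and the constraint ζ1 = 0; the surrounding sampler call is only indicated in comments, and all names are illustrative rather than taken from the paper.

import numpy as np

def sa_update(zeta, L_t, pi, t):
    # optimal first rule: zeta <- zeta + (1/t) (delta/pi - 1), then recenter so zeta_1 = 0
    delta = np.zeros_like(zeta)
    delta[L_t] = 1.0                      # delta_{j,t} = 1{L_t = j}
    zeta_half = zeta + (delta / pi - 1.0) / t
    return zeta_half - zeta_half[0]

# inside the SAMS loop (schematic):
#   L, x = mixture_sampling_step(L, x, zeta)   # step (i) on slide 9
#   zeta = sa_update(zeta, L, pi, t)           # step (ii) with a_{j,t} = 1/(t pi_j)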
Second update rule
• Take
H(Y ; θ) = (w2(X; ζ) − π2, . . . , wm(X; ζ) − πm)^T,
where
wj(X; ζ) = [πj qj(X)/e^{ζj}] / [∑_{k=1}^m πk qk(X)/e^{ζk}], j = 1, . . . , m.
• The optimal update rule is
ζ^(t−1/2) = ζ^(t−1) + (1/t) (w1(Xt; ζ^(t−1))/π1 − 1, w2(Xt; ζ^(t−1))/π2 − 1, . . . , wm(Xt; ζ^(t−1))/πm − 1)^T
(see the sketch after this slide).
16
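The second update rule changes only one line of the sketch above: the indicator vector is replaced by the label probabilities w(Xt; ζ^(t−1)), computed as in the label_probs sketch earlier. Again a sketch under those naming assumptions.

def sa_update_rb(zeta, log_q_xt, pi, log_pi, t):
    # second rule: use w_j(X_t; zeta^{(t-1)}) in place of the indicator 1{L_t = j}
    w = label_probs(log_q_xt, zeta, log_pi)   # see the sketch after slide 6
    zeta_half = zeta + (w / pi - 1.0) / t
    return zeta_half - zeta_half[0]           # keep zeta_1 = 0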
Rao-Blackwellization
[Replace the binary δj,t = 1{Lt = j} by a transition probability]
• The second update rule can be seen as a Rao-Blackwellization of the first:
E(1{Lt = j} | Lt−1, Xt−1; ζ) = p(L = j | Xt−1; ζ) = wj(Xt−1; ζ)
under unadjusted global-jump mixture sampling.
• Similarly, we find that under unadjusted local-jump mixture sampling,
E(1{Lt = j} | Lt−1 = k, Xt−1; ζ) =
  Γ(k, j) min{1, [Γ(j, k) p(j | Xt−1; ζ)] / [Γ(k, j) p(k | Xt−1; ζ)]}, if j ∈ N(k),
  1 − ∑_{l∈N(k)} E(1{Lt = l} | Lt−1 = k, Xt−1; ζ), if j = k,
where N(k) denotes the neighborhood of k.
17
Third update rule
• For j = 1, . . . , m, let
uj(L, X; ζ) =
  Γ(L, j) min{1, [Γ(j, L) p(j | X; ζ)] / [Γ(L, j) p(L | X; ζ)]}, if j ∈ N(L),
  1 − ∑_{l∈N(L)} ul(L, X; ζ), if j = L.
• Take H(Y ; θ) = (u2(L, X; ζ) − π2, . . . , um(L, X; ζ) − πm)^T.
• The optimal update rule is
ζ^(t−1/2) = ζ^(t−1) + (1/t) (u1(Lt, Xt; ζ^(t−1))/π1 − 1, u2(Lt, Xt; ζ^(t−1))/π2 − 1, . . . , um(Lt, Xt; ζ^(t−1))/πm − 1)^T.
18
Implementation details
• Initial values
ζ1,0 = ζ2,0 = · · · = ζm,0 = 0.
• Two-stage scheme: replace 1/t by
  min{πmin, 1/t^β}, if t ≤ t0,
  min{πmin, 1/(t − t0 + t0^β)}, if t > t0,
where β ∈ (1/2, 1), e.g., β = 0.6 or 0.8,
t0 is the length of burn-in,
πmin = min(π1, . . . , πm)
(sketched after this slide).
• Monitor ˆπj = (1/t) ∑_{i=1}^t 1{Li = j} for j = 1, . . . , m.
19
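A sketch of the two-stage gain schedule described on this slide, with β, t0, and πmin as defined above; the function name is illustrative.

def gain(t, t0, beta, pi_min):
    # two-stage replacement for 1/t: slow t^{-beta} decay during burn-in, optimal 1/t decay afterwards
    if t <= t0:
        g = 1.0 / t ** beta
    else:
        g = 1.0 / (t - t0 + t0 ** beta)
    return min(pi_min, g)

# e.g., gain(t, t0=2 * 10**5, beta=0.8, pi_min=0.2) would replace 1/t in the update rules above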
Potts model
• Consider a 10-state Potts model on a 20 × 20 lattice with periodic boundary conditions:
q(x; T)/Z = e^{−u(x)/T}/Z,
where u(x) = −∑_{i∼j} 1{si = sj}, with i ∼ j indicating that sites i and j are nearest neighbors,
and Z = ∑_x exp{−u(x)/T}.
• The q-state Potts model on an infinite lattice exhibits a phase transition at 1/Tc = log(1 + √q), about 1.426 for q = 10.
• Take m = 5 and (1/T1, . . . , 1/T5) = (1.4, 1.4065, 1.413, 1.4195, 1.426), evenly spaced
between 1.4 and 1.426.
• The Markov transition kernel for Pj is defined as a random-scan sweep using the single-spin-flip Metropolis algorithm at temperature Tj (see the sketch after this slide).
20
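A sketch of the random-scan single-spin-flip Metropolis kernel for the Potts model, matching u(x) = −∑_{i∼j} 1{si = sj} with periodic boundary conditions. This is a generic textbook kernel written for illustration, not the author's code.

import numpy as np

def potts_sweep(spins, T, q=10, rng=np.random.default_rng()):
    # one random-scan sweep of single-spin-flip Metropolis moves on an n x n periodic lattice
    n = spins.shape[0]
    for _ in range(n * n):
        i, j = rng.integers(n, size=2)
        old, new = spins[i, j], rng.integers(q)
        nbrs = (spins[(i - 1) % n, j], spins[(i + 1) % n, j],
                spins[i, (j - 1) % n], spins[i, (j + 1) % n])
        # u(x) = -sum_{i~j} 1{s_i = s_j}, so the energy change is matches(old) - matches(new)
        delta_u = sum(s == old for s in nbrs) - sum(s == new for s in nbrs)
        if delta_u <= 0 or rng.random() < np.exp(-delta_u / T):
            spins[i, j] = new
    return spins

# usage: spins = np.random.default_rng(1).integers(10, size=(20, 20)); potts_sweep(spins, T=1 / 1.4)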
Potts model (cont’d)
• Comparison of sampling algorithms:
– Parallel tempering (Geyer 1991)
– Self-adjusted mixture sampling: two stages with β = 0.8 and the optimal second-stage gain
  min{πmin, 1/t^β}, if t ≤ t0,
  min{πmin, 1/(t − t0 + t0^β)}, if t > t0;
– Self-adjusted mixture sampling: implementation comparable to Liang et al. (2007), with second-stage gain
  min{πmin, 1/t^β}, if t ≤ t0,
  min{πmin, 10/(t − t0 + t0^β)}, if t > t0;
– Self-adjusted mixture sampling: implementation using a flat-histogram criterion (Atchade & Liu 2010).
• Estimation of U = E{u(x)} and C = var{u(x)}.
21
Figure . Histogram of u(x)/K at the temperatures (T1, T2, . . . , T5) labeled as 1, 2, . . . , 5 under the Potts model. Two vertical lines are placed at −1.7 and −0.85.
where, without loss of generality, Z1 is chosen to be a reference
value. In the following, we provide a brief discussion of existing
methods.
For the sampling problem, it is possible to run separate simu-
lations for (P1, . . . , Pm) by Markov chain Monte Carlo (MCMC)
(e.g., Liu 2001). However, this direct approach tends to be inef-
fective in the presence of multi-modality and irregular contours
in (P1, . . . , Pm). See, for example, bimodal energy histograms
for the Potts model in Figure 1. To address these difficulties,
various methods have been proposed by creating interactions
between samples from different distributions. Such methods
can be divided into at least two categories. On the one hand,
overlap-based algorithms, including parallel tempering (Geyer
1991) and its extensions (Liang and Wong 2001), serial tem-
pering (Geyer and Thompson 1995), resample-move (Gilks and
Berzuini 2001) and its extensions (Del Moral et al. 2006; Tan
2015), require that there exist considerable overlaps between
(P1, . . . , Pm), to exchange or transfer information across distri-
butions. On the other hand, the Wang–Landau (2001) algorithm
and its extensions (Liang et al. 2007; Atchadé and Liu 2010) are
typically based on partitioning of the state space X, and hence
there is no overlap between (P1, . . . , Pm).
For the estimation problem, the expectations {E1(φ), . . . , Em(φ)} can be directly estimated by sample averages from (P1, . . . , Pm). However, additional considerations are generally required for estimating (ζ∗2, . . . , ζ∗m, ζ∗0) and E0(φ), depending on the type of sampling algorithms. For Wang–Landau type algorithms based on partitioning of X, (ζ∗2, . . . , ζ∗m) are estimated during the sampling process, and ζ∗0 and E0(φ) can then be estimated by importance sampling techniques (Liang 2009). For overlap-based settings, both (ζ∗2, . . . , ζ∗m, ζ∗0) and {E1(φ), . . . , Em(φ), E0(φ)} can be estimated after sampling by a methodology known in physics and statistics as the (binless) weighted histogram analysis method (WHAM) (Ferrenberg and Swendsen 1989; Tan et al. 2012), the multi-state Bennett acceptance ratio method (Bennett 1976; Shirts and Chodera 2008), reverse logistic regression (Geyer 1994), bridge sampling (Meng and Wong 1996), and the global likelihood method (Kong et al. 2003; Tan 2004). See Gelman and Meng (1998), Tan (2013a), and Cameron and Pettitt (2014) for reviews on this method and others such as thermodynamic integration or, equivalently, path sampling.
The purpose of this article is twofold, dealing with sampling
and estimation respectively. First, we present a self-adjusted
mixture sampling method, which not only accommodates
adaptive serial tempering and the generalized Wang–Landau
algorithm in Liang et al. (2007), but also facilitates further
methodological development. The sampling method employs
stochastic approximation to estimate the log normalizing con-
stants (or free energies) online, while generating observations
by Markov transitions. We propose two stochastic approxima-
tion schemes by Rao–Blackwellization of the scheme used in
Liang et al. (2007) and Atchadé and Liu (2010). For all the three
schemes, we derive the optimal choice of a gain matrix, resulting
in the minimum asymptotic variance for free energy estimation,
in a simple and feasible form. In practice, we suggest a two-stage
implementation that uses a slow-decaying gain factor during
burn-in before switching to the optimal gain factor.
Second, we make novel connections between self-adjusted
mixture sampling and the global method of estimation (e.g.,
Kong et al. 2003). Based on this understanding, we develop a
new offline method, locally weighted histogram analysis, for
estimating free energies and expectations using all the simulated
data by either self-adjusted mixture sampling or other sampling
algorithms, subject to suitable overlaps between (P1, . . . , Pm).
The local method is expected to be computationally much
faster, with little sacrifice of statistical efficiency, than the global
method, because individual samples are locally pooled from
neighboring distributions, which typically overlap more with
each other than with other distributions. The computational
savings from using the local method are important, especially
when a large number of distributions are involved (i.e., m is
large, in hundreds or more), for example, in physical and chem-
ical simulations (Chodera and Shirts 2011), likelihood inference
(Tan 2013a, 2013b), and Bayesian model selection and sensitiv-
ity analysis (Doss 2010).
2. Labeled Mixture Sampling
We describe a sampling method, labeled mixture sampling,
which is the nonadaptive version of self-adjusted mixture sam-
pling in Section 3. The ideas are recast from several existing
methods, including serial tempering (Geyer and Thompson
1995), the Wang–Landau (2001) algorithm and its extensions
(Liang et al. 2007; Atchadé and Liu 2010). However, we make
explicit the relationship between mixture weights and hypothe-
sized normalizing constants, which is crucial to the new devel-
opment of adaptive schemes in Section 3.2 and offline estimation
in Sections 4 and 5.
The basic idea of labeled mixture sampling is to combine
(P1, . . . , Pm) into a joint distribution on the space {1, . . . , m} ×
• Self-adjusted local-jump mixture sampling, with the gain factor t^{−1} in (9) replaced by 1/(m kt), where kt increases by 1 only when a flat-histogram criterion is met: the observed proportions of labels, Lt = j, are all within 20% of the target 1/m since the last time the criterion was met (e.g., Atchadé and Liu 2010).
See Tan (2015) for a simulation study in the same setup of Potts
distributions, where parallel tempering was found to perform
better than several resampling MCMC algorithms, including
resample-move and equi-energy sampling.
The initial value L0 is set to 1, corresponding to temperature
T1, and X0 is generated by randomly setting each spin. The same
X0 is used in parallel tempering for each of the five chains. For
parallel tempering, the total number of iterations is set to 4.4 × 10^5 per chain, with the first 4 × 10^4 iterations as burn-in. For self-adjusted mixture sampling with comparable cost, the total number of iterations is set to 2.2 × 10^6, with the first 2 × 10^5 treated as burn-in. The data are recorded (or subsampled) every 10 iterations, yielding 5 chains each of length 4.4 × 10^4 with the first 4 × 10^3 iterations as burn-in for parallel tempering, and a single chain of length 2.2 × 10^5 with the first 2 × 10^4 iterations as burn-in for self-adjusted mixture sampling.
6.1.2. Simulation Results
Figure 1 shows the histograms of u(x)/K at the five temperatures, based on a single run of optimally adjusted mixture sampling with t0 set to 2 × 10^5 (the burn-in size before subsampling). There are two modes in these energy histograms. As the temperature decreases from T1 to T5, the mode located at about −1.7 grows in its weight, from being a negligible one, to a minor one, and eventually to a major one, so that the spin system moves from the disordered phase to the ordered one.
Figure 2 shows the trace plots of free energy estimates ζ^(t) and observed proportions ˆπ for three algorithms of self-adjusted mixture sampling. There are striking differences between these algorithms. For optimally adjusted mixture sampling, the free energy estimates ζ^(t) fall quickly toward the truth (as indicated by the final estimates) in the first stage, with large fluctuations due to the gain factor of order t^{−0.8}. The estimates stay stable in the second stage, due to the optimal gain factor of order t^{−1}. The observed proportions ˆπj also fall quickly toward the target πj = 20%. But there are considerable deviations of ˆπj from πj over time, which reflects the presence of strong autocorrelations in the label sequence Lt.
For the second algorithm, the use of a gain factor about 10 times the optimal one forces the observed proportions ˆπj to stay closer to the target ones, but leads to greater fluctuations in the free energy estimates ζ^(t) than when the optimal SA scheme is used. For the third algorithm using the flat-histogram adaptive scheme, the observed proportions ˆπj are forced to be even closer to πj, and the free energy estimates ζ^(t) are associated with even greater fluctuations than when the optimal SA scheme is used. These nonoptimal algorithms seem to control the observed proportions ˆπj tightly about πj, but potentially increase variances for free energy estimates and, as shown below, introduce biases for estimates of expectations.
Figure 3 shows the Monte Carlo means and standard deviations for the estimates of −U/K, the internal energy per spin, C/K, the specific heat times T^2 per spin, and free energies ζ∗, based on 100 repeated simulations. Similar results are obtained by parallel tempering and optimally adjusted mixture sampling. But the latter algorithm achieves noticeable variance reduction for the estimates of −U/K and C/K at temperatures T4 and T5. The algorithm using a nonoptimal SA scheme performs much worse than the first two algorithms: not only are the estimates of −U/K and C/K noticeably biased at temperature T5, but also the online estimates of free energies have greater variances at temperatures T2 to T5. The algorithm using the flat-histogram scheme performs even more poorly, with serious biases for the estimates of −U/K and C/K and large variances for the online estimates of free energies. These results illustrate the advantages of using optimally adjusted mixture sampling.
Figure . Trace plots for self-adjusted mixture sampling witht0 = 2 × 105. The number of iterations is shown after subsampling. A vertical line is placed at the burn-in size.
Figure S4: Trace plots for optimally adjusted mixture sampling with a two-stage SA scheme
(t0 = 2 × 105) as shown in Figure 2.
Index
ζt
0 50 100 150 200
051015
sub
0 5000 10000 20000
051015
sub
0 50000 150000
051015
Index
utK
0 50 100 150 200
−2.0−1.5−1.0−0.5
sub
0 5000 10000 20000
−2.0−1.5−1.0−0.5
sub
0 50000 150000
−2.0−1.5−1.0−0.5
Index
Lt
0 50 100 150 200
12345
sub
0 5000 10000 20000
12345
sub
0 50000 150000
12345
πt
0 50 100 150 200
0.00.20.40.60.81.0
t(πt−π)
0 5000 10000 20000
−15−10−5051015
t(πt−π)
0 50000 150000
−15−10−5051015
21
Figure S5: Trace plots similar to Figure S4 but for self-adjusted mixture sampling with a non-optimal, two-stage SA scheme (t0 = 2 × 10^5) as shown in Figure 2. [Panels show ζ^(t), ut/K, Lt, and t(ˆπt − π) against iteration.]
Figure S6: Trace plots similar to Figure S4 but for self-adjusted mixture sampling with the flat-histogram scheme as shown in Figure 2. [Panels show ζ^(t), ut/K, Lt, and t(ˆπt − π) against iteration.]
Figure . Summary of estimates at the temperatures (T1, . . . , T5) labeled as 1, . . . , 5, based on  repeated simulations. For each vertical bar, the center indicates Monte
Carlomeanminusthatobtainedfromparalleltempering(×:paralleltempering,◦:optimalSAscheme, :nonoptimalSAscheme,∇:flat-histogramscheme),andtheradius
indicates Monte Carlo standard deviation of the  estimates from repeated simulations. For −U/K andC/K, all the estimates are directly based on sample averages ˜Ej (φ).
For free energies, offline estimates ˜ζ(n)
j are shown for parallel tempering (×), whereas online estimates ζ(n)
j are shown for self-adjusted mixture sampling (◦, , or ∇).
In Appendix VI, we provide additional results on offline esti-
mates of free energies and expectations and other versions of
self-adjusted mixture sampling. For optimal or nonoptimal SA,
the single-stage algorithm performs similarly to the correspond-
ing two-stage algorithm, due to the small number (m = 5) of
distributions involved. The version with local jump and update
scheme (14) or with global jump and update scheme (12) yields
similar results to those of the basic version.
6.2 Censored Gaussian Random Field
Consider a Gaussian random field measured on a regular 6 × 6 grid in [0, 1]^2 but right-censored at 0, as in Stein (1992). Let (u1, . . . , uK) be the K = 36 locations of the grid, ξ = (ξ1, . . . , ξK) be the uncensored data, and y = (y1, . . . , yK) be the observed data such that yj = max(ξj, 0). Assume that ξ is multivariate Gaussian with E(ξj) = β and cov(ξj, ξj′) = c e^{−‖uj − uj′‖} for j, j′ = 1, . . . , K, where ‖·‖ is the Euclidean norm. The density function of ξ is p(ξ; θ) = (2πc)^{−K/2} det^{−1/2}(Σ) exp{−(ξ − β)^T Σ^{−1} (ξ − β)/(2c)}, where θ = (β, log c) and Σ denotes the correlation matrix of ξ.
The likelihood of the observed data can be decomposed as L(θ) = p(ξobs; θ) × Lmis(θ) with
Lmis(θ) = ∫_{−∞}^0 · · · ∫_{−∞}^0 p(ξmis | ξobs; θ) ∏_{j: yj=0} dξj,
where ξobs or ξmis denotes the observed or censored subvector of ξ. Then, Lmis(θ) is the normalizing constant for the unnormalized density function p(ξmis | ξobs; θ) in ξmis. For the dataset in Figure 1 of Stein (1992), it is of interest to compute {L(θ) : θ ∈ Θ}, where Θ is a 21 × 21 regular grid in [−2.5, 2.5] × [−2, 1]. There are 17 censored observations in Stein's dataset and hence Lmis(θ) is a 17-dimensional integral.
6.2.1. Simulation Details
We take qj(x) = p(ξmis | ξobs; θj) for j = 1, . . . , m (= 441), where θ_{j1+21×(j2−1)} denotes the grid point (θ^1_{j1}, θ^2_{j2}) for j1, j2 = 1, . . . , 21, and (θ^1_1, . . . , θ^1_21) are evenly spaced in [−2.5, 2.5] and (θ^2_1, . . . , θ^2_21) are evenly spaced in [−2, 1].
The transition kernel for Pj is defined as a systematic scan of Gibbs sampling targeting Pj (Liu 2001). In this example, Gibbs sampling seems to work reasonably well for each Pj. Previously, Gelman and Meng (1998) and Tan (2013a, 2013b) studied the problem of computing {L(θ) : θ ∈ Θ}, up to a multiplicative constant, using Gibbs sampling to simulate m Markov chains independently for (P1, . . . , Pm).
We investigate self-adjusted local-jump mixture sampling, with the two-stage modification (15) of the local update scheme (14), and locally weighted histogram analysis for estimating ζ∗j = log{Lmis(θj)/Lmis(θ^1_{11}, θ^2_{11})} for j = 1, . . . , m. The use of self-adjusted mixture sampling is mainly to provide online estimates of ζ∗j, to be compared with offline estimates, rather than to improve sampling as in the usual use of serial tempering (Geyer and Thompson 1995). As discussed in Sections 2–5, global-jump mixture sampling, the global update scheme (12), and globally weighted histogram analysis are too costly to be implemented for a large m.
The neighborhood N(j) is defined as the set of 2, 3, or 4 indices l such that θl lies within Θ and next to θj in one of the four directions. That is, if j = j1 + 21 × (j2 − 1), then N(j) = {l1 + 21 × (l2 − 1) : l1 = j1 ± 1 (1 ≤ l1 ≤ 21) and l2 = j2, or l1 = j1 and l2 = j2 ± 1 (1 ≤ l2 ≤ 21)}. Additional simulations using larger neighborhoods (e.g., l1 = j1 ± 2 and l2 = j2 ± 2) lead to similar results to those reported in Section 6.2.2.
The initial value L0 is set to (θ^1_{11}, θ^2_{11}), corresponding to the center of Θ, and X0 is generated by independently drawing each censored component ξj from the conditional distribution of ξj, truncated to (−∞, 0], given the observed components of ξ, with θ = (θ^1_{11}, θ^2_{11}). The total number of iterations is set to 441 × 550, with the first 441 × 50 iterations as burn-in, corresponding to the cost in Gelman and Meng (1998) and Tan (2013a, 2013b), which involve simulating a Markov chain of length 550 per distribution, with the first 50 iterations as burn-in.
6.2.2. Simulation Results
Figure 4 shows the output from a single run of self-adjusted mixture sampling with β = 0.8 and t0 set to 441 × 50 (the burn-in size). There are a number of interesting features. First, the estimates ζ^(t) fall steadily toward the truth, with noticeable fluctuations, during the first stage, and then stay stable and close to the truth in the second stage, similarly as in Figure 2 for the Potts model. Second, the locally weighted offline estimates yield a smoother contour than the online estimates. In fact, as shown later in Table 1 from repeated simulations, the offline estimates are orders of magnitude more accurate than the online estimates. Third, some of the observed proportions ˆπj differ
Off-line estimation
• As t → ∞, we expect
(X1, X2, . . . , Xt) approx. ∼ p(x; ζ∗) ∝ ∑_{j=1}^m πj qj(x)/e^{ζ∗j}.
• Law of large numbers: (1/t) ∑_{i=1}^t g(Xi) → ∫ g(x) p(x; ζ∗) dx.
• Taking g(x) = wj(x; ζ∗) yields
(1/t) ∑_{i=1}^t [πj qj(Xi)/e^{ζ∗j}] / [∑_{k=1}^m πk qk(Xi)/e^{ζ∗k}] → πj, j = 1, 2, . . . , m.
• Define ˜ζ = (˜ζ1, ˜ζ2, . . . , ˜ζm)^T with ˜ζ1 = 0 as a solution to
(1/t) ∑_{i=1}^t [πj qj(Xi)/e^{ζj}] / [∑_{k=1}^m πk qk(Xi)/e^{ζk}] = πj, j = 2, . . . , m.
→ Self-consistent unstratified mixture-sampling estimator (a fixed-point sketch follows this slide)
22
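The defining equations for ˜ζ can be rewritten as e^{ζj} = (1/t) ∑_i qj(Xi) / [∑_k πk qk(Xi)/e^{ζk}], which suggests the familiar WHAM/MBAR-style fixed-point iteration. A sketch, assuming a t × m array log_q with log_q[i, j] = log qj(Xi); convergence control is deliberately crude and the names are illustrative, not the paper's solver.

import numpy as np
from scipy.special import logsumexp

def global_wham(log_q, pi, n_iter=200):
    # fixed-point iteration for the self-consistent unstratified estimator; log_q has shape (t, m)
    t, m = log_q.shape
    zeta = np.zeros(m)
    log_pi = np.log(pi)
    for _ in range(n_iter):
        # log D_i = log sum_k pi_k q_k(X_i) / exp(zeta_k)
        log_denom = logsumexp(log_pi - zeta + log_q, axis=1)
        # exp(zeta_j) = (1/t) sum_i q_j(X_i) / D_i
        zeta = logsumexp(log_q - log_denom[:, None], axis=0) - np.log(t)
        zeta = zeta - zeta[0]          # impose zeta_1 = 0
    return zeta

The stratified estimator of the next slide would reuse the same iteration with (ˆπ1, . . . , ˆπm) in place of (π1, . . . , πm).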
Off-line stratified estimation
• As t → ∞, we expect
{Xi : Li = j for i = 1, . . . , t} approx. ∼ qj(x).
• Let ˆπj = (1/t) ∑_{i=1}^t 1{Li = j} for j = 1, . . . , m.
• Stratified approximation:
(X1, X2, . . . , Xt) approx. ∼ ∑_{j=1}^m ˆπj qj(x)/e^{ζ∗j}.
• Define ˆζ = (ˆζ1, ˆζ2, . . . , ˆζm)^T with ˆζ1 = 0 as a solution to
(1/t) ∑_{i=1}^t [qj(Xi)/e^{ζj}] / [∑_{k=1}^m ˆπk qk(Xi)/e^{ζk}] = 1, j = 2, . . . , m.
→ Self-consistent stratified mixture-sampling estimator
23
Connections
• The (off-line) stratified estimator ˆζ is known as
– (binless) weighted histogram analysis method (Ferrenberg & Swendsen 1989; Tan et al. 2012)
– multi-state Bennett acceptance ratio method (Bennett 1976; Shirts & Chodera 2008)
– reverse logistic regression (Geyer 1994)
– multiple bridge sampling (Meng & Wong 1996)
– (global) likelihood method (Kong et al. 2003; Tan 2004)
• A new connection is that the (off-line) unstratified estimator ˜ζ corresponds to a Rao-Blackwellization of the (on-line) stochastic-approximation estimator ζ^(t).
24
Locally weighted histogram analysis method
• The global unstratified estimator ˜ζ is defined by solving
(1/t) ∑_{i=1}^t wj(Xi; ζ) = (1/t) ∑_{i=1}^t [πj qj(Xi)/e^{ζj}] / [∑_{k=1}^m πk qk(Xi)/e^{ζk}] = πj, j = 2, . . . , m.
• Need to evaluate {qj(Xi) : j = 1, . . . , m, i = 1, . . . , t} → tm = (t/m) m^2 evaluations!
• Define a local unstratified estimator ˜ζ^L as a solution to
(1/t) ∑_{i=1}^t uj(Li, Xi; ζ) = πj, j = 2, . . . , m,
where (recall)
uj(L, X; ζ) =
  Γ(L, j) min{1, [Γ(j, L) p(j | X; ζ)] / [Γ(L, j) p(L | X; ζ)]}, if j ∈ N(L),
  1 − ∑_{l∈N(L)} ul(L, X; ζ), if j = L.
• Need to evaluate only {qj(Xi) : j = Li or j ∈ N(Li), i = 1, . . . , t}.
25
Locally weighted histogram analysis method (cont’d)
• The estimator ˜ζ^L is, equivalently, a minimizer of the convex function
κ^L(ζ) = (1/t) ∑_{i=1}^t ∑_{j∈N(Li)} Γ(Li, j) log{ Γ(j, Li) πj qj(Xi)/e^{ζj} + Γ(Li, j) πLi qLi(Xi)/e^{ζLi} } + ∑_{j=1}^m πj ζj.
→ facilitating computation of ˜ζ^L (see the sketch after this slide)
• Define a local stratified estimator ˆζ^L by using (ˆπ1, . . . , ˆπm) in place of (π1, . . . , πm).
• The local stratified estimator, like the global one, is broadly applicable to
– unadjusted or adjusted mixture sampling,
– parallel tempering,
– separate sampling (i.e., running m Markov chains independently)
26
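A sketch of the convex objective κ^L(ζ) for the nearest-neighbor jump scheme, handed to a generic optimizer under the constraint ζ1 = 0. Only the entries log_q[i, j] with j = Li or j ∈ N(Li) are ever accessed, which is the point of the local method; the data layout and names are illustrative, not from the paper.

import numpy as np
from scipy.optimize import minimize

def gamma(k, j, m):
    # nearest-neighbor proposal probability Gamma(k, j) on {0, ..., m-1}
    if abs(k - j) != 1:
        return 0.0
    return 1.0 if k in (0, m - 1) else 0.5

def kappa_local(zeta_free, labels, log_q, pi, m):
    # convex objective kappa^L(zeta) with zeta_1 fixed at 0
    zeta = np.concatenate(([0.0], zeta_free))
    log_pi = np.log(pi)
    total = 0.0
    for i, k in enumerate(labels):
        for j in (k - 1, k + 1):                  # j in N(k)
            if 0 <= j < m:
                a = np.log(gamma(j, k, m)) + log_pi[j] + log_q[i, j] - zeta[j]
                b = np.log(gamma(k, j, m)) + log_pi[k] + log_q[i, k] - zeta[k]
                total += gamma(k, j, m) * np.logaddexp(a, b)
    return total / len(labels) + np.dot(pi, zeta)

# zeta_L = minimize(kappa_local, np.zeros(m - 1), args=(labels, log_q, pi, m)).x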
Biological host-guest system (heptanoate–β-cyclodextrin binding system)
• Boltzmann distribution
q(x; θ) = exp{−u(x; λ, T)},
where θ = (λ, T):
– λ ∈ [0, 1] is the strength of coupling (λ = 0 uncoupled, λ = 1 coupled)
– T is the temperature.
• Replica exchange molecular dynamics (REMD) simulations (Ronald Levy, Temple University)
• Performed 15 independent 1D REMD simulations at the 15 temperatures (K)
200, 206, 212, 218, 225, 231, 238, 245, 252, 260, 267, 275, 283, 291, 300,
where for each temperature, the simulation involves 16 λ values:
0.0, 0.001, 0.002, 0.004, 0.01, 0.04, 0.07, 0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0.
• Data (binding energies) were recorded every 0.5 ps during a 72 ns simulation per replica, resulting in a huge number of data points (144000 × 240 = 34.56 × 10^6) from 16 × 15 = 240 replicas associated with 240 pairs of (λ, T) values.
27
Biological host-guest system (cont’d)
• It is of interest to estimate
– the binding free energies (between λ = 1 and λ = 0) at the 15 temperatures
– the energy histograms in the coupled state (λ = 1) at the 15 temperatures
• Comparison of different WHAMs:
– 2D Global: Global WHAM applied to all data from 16 × 15 = 240 states
– 2D Local: Local WHAM applied to all data.
– 2D S-Local: Stochastic solution of local WHAM
– 1D Global: Global WHAM applied separately to 15 datasets, each obtained from 16 λ values at one of the 15 temperatures.
• To serve as a benchmark, a well-converged dataset was also obtained from 2D REMD simulations
with the same 16 λ values and the same 15 temperatures: a total of 16 × 15 = 240 replicas.
• Data (binding energies) were recorded every 1 ps during the simulation time of 30 ns per replica, resulting in a total of 7.2 × 10^6 data points (30000 × 240).
28
Coupled states: UP and DOWN
31
FIG. 3. Free energies and binding energy distributions for λ = 1 at two different temperatures, calculated by four different reweighting methods from the
well-converged 2D Async REMD simulations. T = 200 K for (a) and (b), and T = 300 K for (c) and (d).
• 2D Local: the minimization method of local WHAM,
based on the non-weighted, nearest-neighbor jump
scheme with one jump attempt per cycle;
• 2D S-Local: the stochastic method of local WHAM
(i.e., the adaptive SOS-GST procedure), based on the
non-weighted nearest-neighbor jump scheme with 10
jump attempts per cycle;
• 1D Global: global WHAM applied separately to 15
datasets, each obtained from 16 λ values at one of the
15 temperatures.
The adaptive SOS-GST procedure is run for a total of T = t0 + 4.8 × 10^7 cycles, with the burn-in size t0 = 4.8 × 10^6 and initial decay rate α = 0.6 in Eq. (32).
As shown in Figs. 3, S1, and S2 (ref. 75),
the free energy
estimates from different WHAM methods are similar over
the entire range of simulation time (i.e., converging as
fast as each other). Moreover, the distributions of binding
energies from different WHAM reweightings agree closely
with those from the raw data, suggesting that the 2D Async
REMD simulations are well equilibrated at all thermodynamic
TABLE I. CPU time for different WHAM methods.
                                 2D Global    2D Local    2D S-Local    1D Global
Data size N = 1.2 × 10^7 (25 ns per state)
  CPU time^a (s)                 19510.1      39.7        436.4         159.7
  Improvement rate               1            491.8       44.7          122.1
Data size N = 3.5 × 10^7 (72 ns per state)
  CPU time (s)                   56411.6      114.0       453.3         458.3
  Improvement rate               1            494.8       124.5         123.1
^a The CPU time during the period of handling data up to the given simulation time, with the initial values of free energies taken from the previous period, in the stagewise scheme described in Section III C. See Table S1 (ref. 75) for the cumulative CPU time.
FIG. 4. Free energies and binding energy distributions for λ = 1 at three different temperatures, calculated by four different reweighting methods from 15
independent 1D REMD simulations. T = 200 K for (a) and (b), T = 206 K for (c) and (d), and T = 300 K for (e) and (f). The bar lines in the distribution plots are
raw histograms from 2D Async REMD simulations, and the dashed lines in the free energy plots are the global WHAM estimates from the 2D simulations.
states. The convergence of binding energy distributions to
the same distribution also ensures the correctness of our
implementations of different reweighting methods. Hence,
there are no significant differences except the Central
Processing Unit (CPU) time (see further discussion below)
when the four different reweighting methods are applied to
the well-converged data from 2D Async REMD simulations.
But such large-scale 2D simulations are computationally intensive (refs. 45 and 46). For the rest of this section, we will focus
on the analysis of another dataset from 15 independent
1D REMD simulations (which are easier to implement than
2D simulations) and find that the differences of reweighting
methods become more pronounced as the raw data are not
fully converged at some thermodynamic states (see Fig. S3, ref. 75, for the binding energy distributions from the raw data).
Table I displays the CPU time for different reweighting
methods we implemented, using the dataset from the
15 independent 1D REMD simulations. Each method is
Conclusion
• We proposed methods
– SAMS for sampling and on-line estimation: optimal SA
– Local (vs. global) WHAM for off-line estimation: similar statistical efficiency but much lower computational cost
• The proposed methods are
– broadly applicable,
– highly modular: taking Markov transitions as a black-box,
– computationally effective for handling a large number of distributions.
• Further development:
– Stochastic solution of L-WHAM
– Trans-dimensional self-adjusted mixture sampling
– Applications to molecular simulations, language modeling, etc.
29
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17How to Manage Call for Tendor in Odoo 17
How to Manage Call for Tendor in Odoo 17
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 

QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Optimally Adjusted Mixture Sampling & Locally Weighted Histogram Analysis Methods - Zhiqiang Tan, Dec 13, 2017

[ keep ζ_1^(t) = 0 ]  →  Self-adjustment mechanism
• Questions:
– Optimal choice of aj,t?
– Other update rules?
9
Related literature
• (Unadjusted) Serial tempering (Geyer & Thompson 1995), originally for qj(x) = exp{−u(x)/Tj}.
• (Unadjusted) Mixed/generalized/expanded ensemble simulation (e.g., Chodera & Shirts 2011).
• (Unadjusted) Reversible jump algorithm (Green 1995) in trans-dimensional settings.
• Wang–Landau (2001) algorithm, based on a partition of the state space, and its extensions: stochastic approximation Monte Carlo (Liang et al. 2007), generalized Wang–Landau algorithm (Atchadé & Liu 2010).
• Our formulation makes explicit the mixture structure, leading to new development.
10
Intro to stochastic approximation (Robbins & Monro 1951, etc.)
• Need to find θ* that solves
h(θ) := E_θ{H(Y; θ)} = 0, where Y ∼ f(·; θ).
• A stochastic approximation algorithm: for t = 1, 2, . . .,
(i) Generate Yt ∼ f(·; θ_{t−1}) or, more generally, Yt ∼ K_{θ_{t−1}}(Y_{t−1}, ·), a Markov kernel which admits f(·; θ_{t−1}) as its stationary distribution.
(ii) Update θt = θ_{t−1} + At H(Yt; θ_{t−1}), where At is called a gain matrix.
11
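As a concrete illustration (not from the slides), here is a minimal Python sketch of this generic recursion applied to a toy quantile-finding problem; the names robbins_monro, H, and sample_y are hypothetical.

```python
import numpy as np

def robbins_monro(H, sample_y, theta0, n_iter=10_000, gain=lambda t: 1.0 / t):
    """Generic stochastic approximation: theta_t = theta_{t-1} + A_t * H(Y_t; theta_{t-1})."""
    theta = float(theta0)
    for t in range(1, n_iter + 1):
        y = sample_y(theta)                    # Y_t ~ f(.; theta_{t-1}) or a Markov kernel step
        theta = theta + gain(t) * H(y, theta)
    return theta

# Toy example: solve E{0.8 - 1(Y <= theta)} = 0 with Y ~ N(0, 1),
# i.e., find the 0.8-quantile of the standard normal (about 0.8416).
rng = np.random.default_rng(0)
theta_hat = robbins_monro(
    H=lambda y, th: 0.8 - float(y <= th),
    sample_y=lambda th: rng.standard_normal(),
    theta0=0.0,
)
```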
Theory of stochastic approximation (Chen 2002; Kushner & Yin 2003; Liang 2010, etc.)
• Let At = A/t for a constant matrix A.
• Under suitable conditions,
√t (θt − θ*) →_D N{0, Σ(A)}.
• The minimum of Σ(A) is C^{−1} V C^{−T}, achieved by A = C^{−1}, where
C = −(∂h/∂θ^T)(θ*),
(1/√t) ∑_{i=1}^t H(Y*_i; θ*) →_D N(0, V),
and (Y*_1, Y*_2, . . .) is a Markov chain with Y*_t ∼ K_{θ*}(Y*_{t−1}, ·).
• Key idea of proof: martingale approximation.
12
Theory of stochastic approximation (cont’d)
• In general, C is unknown.
• Possible remedies
– Estimating θ* and C simultaneously (Gu & Kong 1998; Gu & Zhu 2001).
– Off-line averaging (Polyak & Juditsky 1992; Ruppert 1988), using the trajectory average
θ̄t = (1/t) ∑_{i=1}^t θi,
with At = A/t^α and α ∈ (1/2, 1). Then, under suitable conditions,
√t (θ̄t − θ*) →_D N(0, C^{−1} V C^{−T}).
13
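A possible sketch of the trajectory-averaging remedy under the same generic setup as above, with gain t^{−α}, α ∈ (1/2, 1); burn-in handling is omitted and all names are illustrative.

```python
import numpy as np

def robbins_monro_averaged(H, sample_y, theta0, n_iter=10_000, alpha=0.8):
    """Slowly decaying gain A_t = t^{-alpha}, plus Polyak-Ruppert averaging of the trajectory."""
    theta = np.asarray(theta0, dtype=float)
    theta_bar = theta.copy()
    for t in range(1, n_iter + 1):
        y = sample_y(theta)
        theta = theta + t ** (-alpha) * H(y, theta)
        theta_bar += (theta - theta_bar) / t    # running average of theta_1, ..., theta_t
    return theta, theta_bar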
Back to self-adjusted mixture sampling
• Recall stochastic approximation: solve h(θ) := E_θ{H(Y; θ)} = 0, with Y ∼ f(·; θ).
• Recast self-adjusted mixture sampling:
f(y; θ) = p(j, x; ζ),
h(θ) = (p(L = 2; ζ) − π2, . . . , p(L = m; ζ) − πm)^T,
H(Y; θ) = (1{L = 2} − π2, . . . , 1{L = m} − πm)^T,
K_θ(y_{t−1}, yt) = p(lt | l_{t−1}, x_{t−1}; ζ) × p(xt | lt, x_{t−1})
= (jump probability) × (transition probability).
14
Optimally adjusted mixture sampling
• We find that C = −∂h(θ*)/∂θ^T is known, even though θ* is unknown!
• The optimal stochastic-approximation update rule:
ζ^(t−1/2) = ζ^(t−1) + (1/t) ( δ_{1,t}/π1 − 1, δ_{2,t}/π2 − 1, . . . , δ_{m,t}/πm − 1 )^T,
ζ^(t) = ζ^(t−1/2) − ζ_1^(t−1/2),
where δ_{j,t} = 1{Lt = j}, j = 1, . . . , m.
15
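A minimal sketch of this update (labels coded 0, . . . , m−1; zeta and pi as NumPy arrays); the function name is hypothetical, and in practice the factor 1/t would be replaced by the two-stage gain factor given on the implementation slide below.

```python
import numpy as np

def sams_update_binary(zeta, label, pi, t):
    """First (binary-indicator) SAMS update: delta_{j,t} = 1{L_t = j}, optimal gain 1/t."""
    delta = np.zeros(len(pi))
    delta[label] = 1.0
    zeta_half = zeta + (1.0 / t) * (delta / pi - 1.0)
    return zeta_half - zeta_half[0]        # keep zeta_1 = 0 as the reference
```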
Second update rule
• Take
H(Y; θ) = ( w2(X; ζ) − π2, . . . , wm(X; ζ) − πm )^T,
where
wj(X; ζ) = πj e^{−ζj} qj(X) / ∑_{k=1}^m πk e^{−ζk} qk(X),   j = 1, . . . , m.
• The optimal update rule is
ζ^(t−1/2) = ζ^(t−1) + (1/t) ( w1(Xt; ζ^(t−1))/π1 − 1, w2(Xt; ζ^(t−1))/π2 − 1, . . . , wm(Xt; ζ^(t−1))/πm − 1 )^T.
16
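A sketch of the weights wj(x; ζ) and the corresponding update, computed from the vector of log qj(Xt) values with a log-sum-exp stabilization; the argument names (log_q_x and so on) are assumptions for illustration.

```python
import numpy as np

def mixture_weights(log_q_x, zeta, pi):
    """w_j(x; zeta) proportional to pi_j * exp(-zeta_j) * q_j(x), from log q_j(x)."""
    a = np.log(pi) - zeta + log_q_x
    a -= a.max()                           # stabilize before exponentiating
    w = np.exp(a)
    return w / w.sum()

def sams_update_rao_blackwell(zeta, log_q_x, pi, t):
    """Second SAMS update: replace the indicator 1{L_t = j} by w_j(X_t; zeta^{(t-1)})."""
    w = mixture_weights(log_q_x, zeta, pi)
    zeta_half = zeta + (1.0 / t) * (w / pi - 1.0)
    return zeta_half - zeta_half[0]
```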
Rao-Blackwellization  [ replace the binary δ_{j,t} = 1{Lt = j} by a transition probability ]
• The second update rule can be seen as a Rao-Blackwellization of the first:
E(1{Lt = j} | L_{t−1}, X_{t−1}; ζ) = p(j | X_{t−1}; ζ) = wj(Xt; ζ)
under unadjusted global-jump mixture sampling.
• Similarly, we find that under unadjusted local-jump mixture sampling,
E(1{Lt = j} | L_{t−1} = k, X_{t−1}; ζ) =
  Γ(k, j) min{ 1, [Γ(j, k) p(j | X_{t−1}; ζ)] / [Γ(k, j) p(k | X_{t−1}; ζ)] },   if j ∈ N(k),
  1 − ∑_{l∈N(k)} E(1{Lt = l} | L_{t−1} = k, X_{t−1}; ζ),                          if j = k,
where N(k) denotes the neighborhood of k.
17
Third update rule
• For j = 1, . . . , m, let
uj(L, X; ζ) =
  Γ(L, j) min{ 1, [Γ(j, L) p(j | X; ζ)] / [Γ(L, j) p(L | X; ζ)] },   if j ∈ N(L),
  1 − ∑_{l∈N(L)} ul(L, X; ζ),                                         if j = L.
• Take H(Y; θ) = ( u2(L, X; ζ) − π2, . . . , um(L, X; ζ) − πm )^T.
• The optimal update rule is
ζ^(t−1/2) = ζ^(t−1) + (1/t) ( u1(Lt, Xt; ζ^(t−1))/π1 − 1, . . . , um(Lt, Xt; ζ^(t−1))/πm − 1 )^T.
18
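A sketch of uj and the third update for a Metropolized local jump, where w is the vector of wj(X; ζ) (e.g., from mixture_weights above), neighbors[k] lists N(k), and gamma is an m × m array of proposal probabilities; these data-layout choices are illustrative.

```python
import numpy as np

def local_jump_probs(label, w, neighbors, gamma):
    """u_j(L, X; zeta): expected label-transition probabilities of the local jump."""
    u = np.zeros(len(w))
    for j in neighbors[label]:
        ratio = (gamma[j, label] * w[j]) / (gamma[label, j] * w[label])
        u[j] = gamma[label, j] * min(1.0, ratio)
    u[label] = 1.0 - u.sum()               # probability of keeping the current label
    return u

def sams_update_local(zeta, label, w, neighbors, gamma, pi, t):
    """Third SAMS update: use u_j in place of the binary indicator."""
    u = local_jump_probs(label, w, neighbors, gamma)
    zeta_half = zeta + (1.0 / t) * (u / pi - 1.0)
    return zeta_half - zeta_half[0]
```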
Implementation details
• Initial values ζ_1^(0) = ζ_2^(0) = · · · = ζ_m^(0) = 0.
• Two stages: replace the gain factor 1/t by
  min{ πmin, 1/t^β },               if t ≤ t0,
  min{ πmin, 1/(t − t0 + t0^β) },   if t > t0,
where β ∈ (1/2, 1), e.g., β = 0.6 or 0.8, t0 is the length of burn-in, and πmin = min(π1, . . . , πm).
• Monitor π̂j = (1/t) ∑_{i=1}^t 1{Li = j} for j = 1, . . . , m.
19
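The two-stage gain factor written out as a small helper (hypothetical name); its value replaces the 1/t factor in the update sketches above.

```python
def two_stage_gain(t, t0, beta, pi_min):
    """Slow decay t^{-beta} during burn-in (t <= t0), then the optimal 1/t rate, capped at pi_min."""
    if t <= t0:
        return min(pi_min, 1.0 / t ** beta)
    return min(pi_min, 1.0 / (t - t0 + t0 ** beta))
```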
Potts model
• Consider a 10-state Potts model on a 20 × 20 lattice with periodic boundary conditions:
q(x; T)/Z = e^{−u(x)/T}/Z,
where u(x) = −∑_{i∼j} 1{si = sj}, with i ∼ j indicating that sites i and j are nearest neighbors, and Z = ∑_x exp{−u(x)/T}.
• The q-state Potts model on an infinite lattice exhibits a phase transition at 1/Tc = log(1 + √q), about 1.426 for q = 10.
• Take m = 5 and (1/T1, . . . , 1/T5) = (1.4, 1.4065, 1.413, 1.4195, 1.426), evenly spaced between 1.4 and 1.426.
• The Markov transition kernel for Pj is defined as a random-scan sweep of the single-spin-flip Metropolis algorithm at temperature Tj.
20
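A minimal sketch of the Potts energy u(x) and one random-scan single-spin-flip Metropolis sweep on the 20 × 20 periodic lattice; this is an illustration of the stated setup, not the code behind the reported results.

```python
import numpy as np

Q, L = 10, 20                     # number of spin states, lattice side (periodic boundaries)

def potts_energy(s):
    """u(x) = -sum over nearest-neighbor pairs of 1{s_i = s_j}, each pair counted once."""
    return -(np.sum(s == np.roll(s, 1, axis=0)) + np.sum(s == np.roll(s, 1, axis=1)))

def metropolis_sweep(s, inv_T, rng):
    """One random-scan sweep of single-spin-flip Metropolis moves at inverse temperature 1/T."""
    for _ in range(s.size):
        i, j = rng.integers(L), rng.integers(L)
        new = rng.integers(Q)
        nbrs = (s[(i + 1) % L, j], s[(i - 1) % L, j], s[i, (j + 1) % L], s[i, (j - 1) % L])
        d_u = sum(v == s[i, j] for v in nbrs) - sum(v == new for v in nbrs)   # energy change
        if rng.random() < np.exp(-inv_T * d_u):
            s[i, j] = new
    return s

# Example: one sweep at 1/T = 1.4 from a random initial configuration.
rng = np.random.default_rng(0)
spins = rng.integers(Q, size=(L, L))
spins = metropolis_sweep(spins, inv_T=1.4, rng=rng)
```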
Potts model (cont’d)
• Comparison of sampling algorithms:
– Parallel tempering (Geyer 1991)
– Self-adjusted mixture sampling: two stages with β = 0.8 and optimal second stage,
  min{ πmin, 1/t^β },               if t ≤ t0,
  min{ πmin, 1/(t − t0 + t0^β) },   if t > t0,
– Self-adjusted mixture sampling: implementation comparable to Liang et al. (2007),
  min{ πmin, 1/t^β },               if t ≤ t0,
  min{ πmin, 10/(t − t0 + t0^β) },  if t > t0,
– Self-adjusted mixture sampling: implementation using a flat-histogram criterion (Atchadé & Liu 2010)
• Estimation of U = E{u(x)} and C = var{u(x)}.
21
Figure 1. Histogram of u(x)/K at the temperatures (T1, T2, . . . , T5) labeled as 1, 2, . . . , 5 under the Potts model. Two vertical lines are placed at −1.7 and −0.85.
where, without loss of generality, Z1 is chosen to be a reference value. In the following, we provide a brief discussion of existing methods.
For the sampling problem, it is possible to run separate simulations for (P1, . . . , Pm) by Markov chain Monte Carlo (MCMC) (e.g., Liu 2001). However, this direct approach tends to be ineffective in the presence of multi-modality and irregular contours in (P1, . . . , Pm). See, for example, the bimodal energy histograms for the Potts model in Figure 1. To address these difficulties, various methods have been proposed by creating interactions between samples from different distributions. Such methods can be divided into at least two categories. On the one hand, overlap-based algorithms, including parallel tempering (Geyer 1991) and its extensions (Liang and Wong 2001), serial tempering (Geyer and Thompson 1995), resample-move (Gilks and Berzuini 2001) and its extensions (Del Moral et al. 2006; Tan 2015), require that there exist considerable overlaps between (P1, . . . , Pm), to exchange or transfer information across distributions. On the other hand, the Wang–Landau (2001) algorithm and its extensions (Liang et al. 2007; Atchadé and Liu 2010) are typically based on partitioning of the state space X, and hence there is no overlap between (P1, . . . , Pm).
For the estimation problem, the expectations {E1(φ), . . . , Em(φ)} can be directly estimated by sample averages from (P1, . . . , Pm). However, additional considerations are generally required for estimating (ζ*_2, . . . , ζ*_m, ζ*_0) and E0(φ), depending on the type of sampling algorithm. For Wang–Landau type algorithms based on partitioning of X, (ζ*_2, . . . , ζ*_m) are estimated during the sampling process, and ζ*_0 and E0(φ) can then be estimated by importance sampling techniques (Liang 2009). For overlap-based settings, both (ζ*_2, . . . , ζ*_m, ζ*_0) and {E1(φ), . . . , Em(φ), E0(φ)} can be estimated after sampling by a methodology known in physics and statistics as the (binless) weighted histogram analysis method (WHAM) (Ferrenberg and Swendsen 1989; Tan et al. 2012), the multi-state Bennett acceptance ratio method (Bennett 1976; Shirts and Chodera 2008), reverse logistic regression (Geyer 1994), bridge sampling (Meng and Wong 1996), and the global likelihood method (Kong et al. 2003; Tan 2004). See Gelman and Meng (1998), Tan (2013a), and Cameron and Pettitt (2014) for reviews on this method and others such as thermodynamic integration or, equivalently, path sampling.
The purpose of this article is twofold, dealing with sampling and estimation, respectively. First, we present a self-adjusted mixture sampling method, which not only accommodates adaptive serial tempering and the generalized Wang–Landau algorithm in Liang et al. (2007), but also facilitates further methodological development. The sampling method employs stochastic approximation to estimate the log normalizing constants (or free energies) online, while generating observations by Markov transitions. We propose two stochastic approximation schemes by Rao–Blackwellization of the scheme used in Liang et al. (2007) and Atchadé and Liu (2010). For all three schemes, we derive the optimal choice of a gain matrix, resulting in the minimum asymptotic variance for free energy estimation, in a simple and feasible form. In practice, we suggest a two-stage implementation that uses a slow-decaying gain factor during burn-in before switching to the optimal gain factor.
Second, we make novel connections between self-adjusted mixture sampling and the global method of estimation (e.g., Kong et al. 2003). Based on this understanding, we develop a new offline method, locally weighted histogram analysis, for estimating free energies and expectations using all the simulated data by either self-adjusted mixture sampling or other sampling algorithms, subject to suitable overlaps between (P1, . . . , Pm). The local method is expected to be computationally much faster, with little sacrifice of statistical efficiency, than the global method, because individual samples are locally pooled from neighboring distributions, which typically overlap more with each other than with other distributions. The computational savings from using the local method are important, especially when a large number of distributions are involved (i.e., m is large, in hundreds or more), for example, in physical and chemical simulations (Chodera and Shirts 2011), likelihood inference (Tan 2013a, 2013b), and Bayesian model selection and sensitivity analysis (Doss 2010).
2. Labeled Mixture Sampling
We describe a sampling method, labeled mixture sampling, which is the nonadaptive version of self-adjusted mixture sampling in Section 3. The ideas are recast from several existing methods, including serial tempering (Geyer and Thompson 1995), the Wang–Landau (2001) algorithm and its extensions (Liang et al. 2007; Atchadé and Liu 2010). However, we make explicit the relationship between mixture weights and hypothesized normalizing constants, which is crucial to the new development of adaptive schemes in Section 3.2 and offline estimation in Sections 4 and 5.
The basic idea of labeled mixture sampling is to combine (P1, . . . , Pm) into a joint distribution on the space {1, . . . , m} ×
• Self-adjusted local-jump mixture sampling, with the gain factor t^{−1} in (9) replaced by 1/(m kt), where kt increases by 1 only when a flat-histogram criterion is met: the observed proportions of labels, Lt = j, are all within 20% of the target 1/m since the last time the criterion was met (e.g., Atchadé and Liu 2010).
See Tan (2015) for a simulation study in the same setup of Potts distributions, where parallel tempering was found to perform better than several resampling MCMC algorithms, including resample-move and equi-energy sampling.
The initial value L0 is set to 1, corresponding to temperature T1, and X0 is generated by randomly setting each spin. The same X0 is used in parallel tempering for each of the five chains. For parallel tempering, the total number of iterations is set to 4.4 × 10^5 per chain, with the first 4 × 10^4 iterations as burn-in. For self-adjusted mixture sampling with comparable cost, the total number of iterations is set to 2.2 × 10^6, with the first 2 × 10^5 treated as burn-in. The data are recorded (or subsampled) every 10 iterations, yielding 5 chains each of length 4.4 × 10^4 with the first 4 × 10^3 iterations as burn-in for parallel tempering, and a single chain of length 2.2 × 10^5 with the first 2 × 10^4 iterations as burn-in for self-adjusted mixture sampling.
Simulation Results
Figure 1 shows the histograms of u(x)/K at the five temperatures, based on a single run of optimally adjusted mixture sampling with t0 set to 2 × 10^5 (the burn-in size before subsampling). There are two modes in these energy histograms. As the temperature decreases from T1 to T5, the mode located at about −1.7 grows in its weight, from being a negligible one, to a minor one, and eventually to a major one, so that the spin system moves from the disordered phase to the ordered one.
Figure 2 shows the trace plots of free energy estimates ζ^(t) and observed proportions π̂ for three algorithms of self-adjusted mixture sampling. There are striking differences between these algorithms. For optimally adjusted mixture sampling, the free energy estimates ζ^(t) fall quickly toward the truth (as indicated by the final estimates) in the first stage, with large fluctuations due to the gain factor of order t^{−0.8}. The estimates stay stable in the second stage, due to the optimal gain factor of order t^{−1}. The observed proportions π̂j also fall quickly toward the target πj = 20%. But there are considerable deviations of π̂j from πj over time, which reflects the presence of strong autocorrelations in the label sequence Lt.
For the second algorithm, the use of a gain factor about 10 times the optimal one forces the observed proportions π̂j to stay closer to the target ones, but leads to greater fluctuations in the free energy estimates ζ^(t) than when the optimal SA scheme is used. For the third algorithm, using the flat-histogram adaptive scheme, the observed proportions π̂j are forced to be even closer to πj, and the free energy estimates ζ^(t) are associated with even greater fluctuations than when the optimal SA scheme is used. These nonoptimal algorithms seem to control the observed proportions π̂j tightly about πj, but potentially increase variances for free energy estimates and, as shown below, introduce biases for estimates of expectations.
Figure 3 shows the Monte Carlo means and standard deviations for the estimates of −U/K, the internal energy per spin, C/K, the specific heat times T^2 per spin, and free energies ζ*, based on 100 repeated simulations. Similar results are obtained by parallel tempering and optimally adjusted mixture sampling. But the latter algorithm achieves noticeable variance reduction for the estimates of −U/K and C/K at temperatures T4 and T5. The algorithm using a nonoptimal SA scheme performs much worse than the first two algorithms: not only are the estimates of −U/K and C/K noticeably biased at temperature T5, but also the online estimates of free energies have greater variances at temperatures T2 to T5. The algorithm using the flat-histogram scheme performs even more poorly, with serious biases for the estimates of −U/K and C/K and large variances for the online estimates of free energies. These results illustrate the advantages of using optimally adjusted mixture sampling.
Figure 2. Trace plots for self-adjusted mixture sampling with t0 = 2 × 10^5. The number of iterations is shown after subsampling. A vertical line is placed at the burn-in size.
Figure S4: Trace plots for optimally adjusted mixture sampling with a two-stage SA scheme (t0 = 2 × 10^5) as shown in Figure 2.
21
Figure S5: Trace plots similar to Figure S4 but for self-adjusted mixture sampling with a non-optimal, two-stage SA scheme (t0 = 2 × 10^5) as shown in Figure 2.
22
Figure S6: Trace plots similar to Figure S4 but for self-adjusted mixture sampling with the flat-histogram scheme as shown in Figure 2.
23
Figure 3. Summary of estimates at the temperatures (T1, . . . , T5) labeled as 1, . . . , 5, based on 100 repeated simulations. For each vertical bar, the center indicates the Monte Carlo mean minus that obtained from parallel tempering (×: parallel tempering, ◦: optimal SA scheme, : nonoptimal SA scheme, ∇: flat-histogram scheme), and the radius indicates the Monte Carlo standard deviation of the estimates from repeated simulations. For −U/K and C/K, all the estimates are directly based on sample averages Ẽj(φ). For free energies, offline estimates ζ̃_j^(n) are shown for parallel tempering (×), whereas online estimates ζ_j^(n) are shown for self-adjusted mixture sampling (◦, , or ∇).
In Appendix VI, we provide additional results on offline estimates of free energies and expectations and other versions of self-adjusted mixture sampling. For optimal or nonoptimal SA, the single-stage algorithm performs similarly to the corresponding two-stage algorithm, due to the small number (m = 5) of distributions involved. The version with local jump and update scheme (14) or with global jump and update scheme (12) yields similar results to those of the basic version.
6.2 Censored Gaussian Random Field
Consider a Gaussian random field measured on a regular 6 × 6 grid in [0, 1]^2 but right-censored at 0, as in Stein (1992). Let (u1, . . . , uK) be the K = 36 locations of the grid, ξ = (ξ1, . . . , ξK) be the uncensored data, and y = (y1, . . . , yK) be the observed data such that yj = max(ξj, 0). Assume that ξ is multivariate Gaussian with E(ξj) = β and cov(ξj, ξj′) = c e^{−‖uj − uj′‖} for j, j′ = 1, . . . , K, where ‖·‖ is the Euclidean norm. The density function of ξ is
p(ξ; θ) = (2πc)^{−K/2} det^{−1/2}(Σ) exp{−(ξ − β)^T Σ^{−1} (ξ − β)/(2c)},
where θ = (β, log c) and Σ is the correlation matrix of ξ. The likelihood of the observed data can be decomposed as L(θ) = p(ξobs; θ) × Lmis(θ) with
Lmis(θ) = ∫_{−∞}^0 · · · ∫_{−∞}^0 p(ξmis | ξobs; θ) ∏_{j: yj=0} dξj,
where ξobs or ξmis denotes the observed or censored subvector of ξ. Then, Lmis(θ) is the normalizing constant for the unnormalized density function p(ξmis | ξobs; θ) in ξmis. For the dataset in Figure 1 of Stein (1992), it is of interest to compute {L(θ) : θ ∈ Θ}, where Θ is a 21 × 21 regular grid in [−2.5, 2.5] × [−2, 1]. There are 17 censored observations in Stein’s dataset and hence Lmis(θ) is a 17-dimensional integral.
Simulation Details
We take qj(ξmis) = p(ξmis | ξobs; θj) for j = 1, . . . , m (= 441), where θ_{j1+21×(j2−1)} denotes the grid point (θ1_{j1}, θ2_{j2}) for j1, j2 = 1, . . . , 21, and (θ1_1, . . . , θ1_21) are evenly spaced in [−2.5, 2.5] and (θ2_1, . . . , θ2_21) are evenly spaced in [−2, 1]. The transition kernel for Pj is defined as a systematic scan of Gibbs sampling for the target distribution Pj (Liu 2001). In this example, Gibbs sampling seems to work reasonably well for each Pj. Previously, Gelman and Meng (1998) and Tan (2013a, 2013b) studied the problem of computing {L(θ) : θ ∈ Θ}, up to a multiplicative constant, using Gibbs sampling to simulate m Markov chains independently for (P1, . . . , Pm).
We investigate self-adjusted local-jump mixture sampling, with the two-stage modification (15) of the local update scheme (14), and locally weighted histogram analysis for estimating ζ*_j = log{Lmis(θj)/Lmis(θ1_11, θ2_11)} for j = 1, . . . , m. The use of self-adjusted mixture sampling is mainly to provide online estimates of ζ*_j, to be compared with offline estimates, rather than to improve sampling as in the usual use of serial tempering (Geyer and Thompson 1995).
As discussed in Sections 2–5, global-jump mixture sampling, the global update scheme (12), and globally weighted histogram analysis are too costly to be implemented for a large m. The neighborhood N(j) is defined as the set of 2, 3, or 4 indices l such that θl lies within Θ and next to θj in one of the four directions. That is, if j = j1 + 21 × (j2 − 1), then N(j) = {l1 + 21 × (l2 − 1) : either l1 = j1 ± 1 (1 ≤ l1 ≤ 21) and l2 = j2, or l1 = j1 and l2 = j2 ± 1 (1 ≤ l2 ≤ 21)}. Additional simulations using larger neighborhoods (e.g., l1 = j1 ± 2 or l2 = j2 ± 2) lead to similar results to those reported in Section 6.2.2.
The initial value L0 is set to the index corresponding to (θ1_11, θ2_11), the center of Θ, and X0 is generated by independently drawing each censored component ξj from the conditional distribution of ξj, truncated to (−∞, 0], given the observed components of ξ, with θ = (θ1_11, θ2_11). The total number of iterations is set to 441 × 550, with the first 441 × 50 iterations as burn-in, corresponding to the cost in Gelman and Meng (1998) and Tan (2013a, 2013b), which involve simulating a Markov chain of length 550 per distribution, with the first 50 iterations as burn-in.
Simulation Results
Figure 4 shows the output from a single run of self-adjusted mixture sampling with β = 0.8 and t0 set to 441 × 50 (the burn-in size). There are a number of interesting features. First, the estimates ζ^(t) fall steadily toward the truth, with noticeable fluctuations, during the first stage, and then stay stable and close to the truth in the second stage, similarly as in Figure 2 for the Potts model. Second, the locally weighted offline estimates yield a smoother contour than the online estimates. In fact, as shown later in Table 1 from repeated simulations, the offline estimates are orders of magnitude more accurate than the online estimates. Third, some of the observed proportions π̂j differ
Off-line estimation
• As t → ∞, we expect
(X1, X2, . . . , Xt) approx. ∼ p(x; ζ*) ∝ ∑_{j=1}^m πj e^{−ζ*_j} qj(x).
• Law of large numbers: (1/t) ∑_{i=1}^t g(Xi) → ∫ g(x) p(x; ζ*) dx.
• Taking g(x) = wj(x; ζ*) yields
(1/t) ∑_{i=1}^t [ πj e^{−ζ*_j} qj(Xi) / ∑_{k=1}^m πk e^{−ζ*_k} qk(Xi) ] → πj,   j = 1, 2, . . . , m.
• Define ζ̃ = (ζ̃1, ζ̃2, . . . , ζ̃m)^T with ζ̃1 = 0 as a solution to
(1/t) ∑_{i=1}^t [ πj e^{−ζj} qj(Xi) / ∑_{k=1}^m πk e^{−ζk} qk(Xi) ] = πj,   j = 2, . . . , m.
→ Self-consistent unstratified mixture-sampling estimator
22
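A sketch of solving these equations by a standard self-consistent (fixed-point) iteration in log space, assuming a (t × m) array log_q with log_q[i, j] = log qj(Xi); the function name is hypothetical and convergence checking is omitted.

```python
import numpy as np
from scipy.special import logsumexp

def global_unstratified_wham(log_q, pi, n_iter=200):
    """Solve (1/t) sum_i w_j(X_i; zeta) = pi_j, j = 2..m, with zeta_1 = 0."""
    t, m = log_q.shape
    zeta = np.zeros(m)
    for _ in range(n_iter):
        # log of the mixture denominator for each sample: log sum_k pi_k e^{-zeta_k} q_k(X_i)
        log_denom = logsumexp(np.log(pi) - zeta + log_q, axis=1)
        # fixed-point update: zeta_j = log( (1/t) sum_i q_j(X_i) / denom_i )
        zeta = logsumexp(log_q - log_denom[:, None], axis=0) - np.log(t)
        zeta -= zeta[0]
    return zeta
```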
Off-line stratified estimation
• As t → ∞, we expect {Xi : Li = j, i = 1, . . . , t} approx. ∼ qj(x).
• Let π̂j = (1/t) ∑_{i=1}^t 1{Li = j} for j = 1, . . . , m.
• Stratified approximation:
(X1, X2, . . . , Xt) approx. ∼ ∑_{j=1}^m π̂j e^{−ζ*_j} qj(x).
• Define ζ̂ = (ζ̂1, ζ̂2, . . . , ζ̂m)^T with ζ̂1 = 0 as a solution to
(1/t) ∑_{i=1}^t [ e^{−ζj} qj(Xi) / ∑_{k=1}^m π̂k e^{−ζk} qk(Xi) ] = 1,   j = 2, . . . , m.
→ Self-consistent stratified mixture-sampling estimator
23
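The stratified estimator can be obtained from the same iteration by substituting the observed proportions π̂ for π; a minimal sketch, reusing global_unstratified_wham from the sketch above and assuming every label is visited at least once.

```python
import numpy as np

def stratified_wham(log_q, labels, n_iter=200):
    """Stratified estimator: plug in pi_hat_j = (1/t) sum_i 1{L_i = j} for pi_j."""
    t, m = log_q.shape
    pi_hat = np.bincount(labels, minlength=m) / t     # labels coded 0..m-1
    return global_unstratified_wham(log_q, pi_hat, n_iter=n_iter)
```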
Connections
• The (off-line) stratified estimator ζ̂ is known as
– (binless) weighted histogram analysis method (Ferrenberg & Swendsen 1989; Tan et al. 2012)
– multi-state Bennett acceptance ratio method (Bennett 1976; Shirts & Chodera 2008)
– reverse logistic regression (Geyer 1994)
– multiple bridge sampling (Meng & Wong 1996)
– (global) likelihood method (Kong et al. 2003; Tan 2004)
• A new connection is that the (off-line) unstratified estimator ζ̃ corresponds to a Rao-Blackwellization of the (on-line) stochastic-approximation estimator ζ^(t).
24
Locally weighted histogram analysis method
• The global unstratified estimator ζ̃ is defined by solving
(1/t) ∑_{i=1}^t wj(Xi; ζ) = (1/t) ∑_{i=1}^t [ πj e^{−ζj} qj(Xi) / ∑_{k=1}^m πk e^{−ζk} qk(Xi) ] = πj,   j = 2, . . . , m.
• Need to evaluate {qj(Xi) : j = 1, . . . , m, i = 1, . . . , t} → tm = (t/m) m^2 evaluations!
• Define a local unstratified estimator ζ̃^L as a solution to
(1/t) ∑_{i=1}^t uj(Li, Xi; ζ) = πj,   j = 2, . . . , m,
where (recall)
uj(L, X; ζ) =
  Γ(L, j) min{ 1, [Γ(j, L) p(j | X; ζ)] / [Γ(L, j) p(L | X; ζ)] },   if j ∈ N(L),
  1 − ∑_{l∈N(L)} ul(L, X; ζ),                                         if j = L.
• Need to evaluate only {qj(Xi) : j = Li or j ∈ N(Li), i = 1, . . . , t}.
25
Locally weighted histogram analysis method (cont’d)
• The estimator ζ̃^L is, equivalently, a minimizer of the convex function
κ^L(ζ) = (1/t) ∑_{i=1}^t ∑_{j∈N(Li)} Γ(Li, j) log{ Γ(j, Li) πj e^{−ζj} qj(Xi) + Γ(Li, j) πLi e^{−ζLi} qLi(Xi) } + ∑_{j=1}^m πj ζj
→ facilitating computation of ζ̃^L.
• Define a local stratified estimator ζ̂^L by using (π̂1, . . . , π̂m) in place of (π1, . . . , πm).
• The local stratified estimator, like the global one, is broadly applicable to
– unadjusted or adjusted mixture sampling
– parallel tempering
– separate sampling (i.e., running m Markov chains independently)
26
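A sketch of computing the local estimator by minimizing a reconstruction of κ^L(ζ) with ζ1 fixed at 0, using a generic quasi-Newton solver; the data layout (a full log_q array, neighbors, and an m × m proposal matrix gamma) is illustrative, and in practice only the qj(Xi) values for j = Li or j ∈ N(Li) would be computed and stored.

```python
import numpy as np
from scipy.optimize import minimize

def local_wham_objective(zeta_free, log_q, labels, neighbors, gamma, pi):
    """kappa^L(zeta) with zeta_1 = 0; log_q[i, j] = log q_j(X_i) (only neighbor columns are used)."""
    t, m = log_q.shape
    zeta = np.concatenate(([0.0], zeta_free))
    log_pi = np.log(pi)
    total = 0.0
    for i in range(t):
        k = labels[i]
        base_k = log_pi[k] - zeta[k] + log_q[i, k]       # log(pi_k e^{-zeta_k} q_k(X_i))
        for j in neighbors[k]:
            base_j = log_pi[j] - zeta[j] + log_q[i, j]
            inner = np.logaddexp(np.log(gamma[j, k]) + base_j,
                                 np.log(gamma[k, j]) + base_k)
            total += gamma[k, j] * inner
    return total / t + np.dot(pi, zeta)

def local_wham(log_q, labels, neighbors, gamma, pi):
    """Local unstratified estimator: minimize the convex objective kappa^L."""
    m = log_q.shape[1]
    res = minimize(local_wham_objective, x0=np.zeros(m - 1),
                   args=(log_q, labels, neighbors, gamma, pi), method="BFGS")
    return np.concatenate(([0.0], res.x))
```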
Biological host-guest system (heptanoate β-cyclodextrin binding system)
• Boltzmann distribution q(x; θ) = exp{−u(x; λ, T)}, where θ = (λ, T):
– λ ∈ [0, 1] is the strength of coupling (λ = 0 uncoupled, λ = 1 coupled)
– T is the temperature.
• Replica exchange molecular dynamics (REMD) simulations (Ronald Levy, Temple University).
• Performed 15 independent 1D REMD simulations at the 15 temperatures
200, 206, 212, 218, 225, 231, 238, 245, 252, 260, 267, 275, 283, 291, 300,
where for each temperature the simulation involves 16 λ values
0.0, 0.001, 0.002, 0.004, 0.01, 0.04, 0.07, 0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0.
• Data (binding energies) were recorded every 0.5 ps during the 72 ns simulation per replica, resulting in a huge number of data points (144000 × 240 = 34.56 × 10^6) from 16 × 15 = 240 replicas associated with 240 pairs of (λ, T) values.
27
Biological host-guest system (cont’d)
• It is of interest to estimate
– the binding free energies (between λ = 1 and λ = 0) at the 15 temperatures
– the energy histograms in the coupled state (λ = 1) at the 15 temperatures
• Comparison of different WHAMs:
– 2D Global: global WHAM applied to all data from 16 × 15 = 240 states
– 2D Local: local WHAM applied to all data
– 2D S-Local: stochastic solution of local WHAM
– 1D Global: global WHAM applied separately to 15 datasets, each obtained from 16 λ values at one of the 15 temperatures
• To serve as a benchmark, a well-converged dataset was also obtained from 2D REMD simulations with the same 16 λ values and the same 15 temperatures: a total of 16 × 15 = 240 replicas.
• Data (binding energies) were recorded every 1 ps during the simulation time of 30 ns per replica, resulting in a total of 7.2 × 10^6 data points (30000 × 240).
28
  • 35. Coupled states: UP and DOWN 31
034107-11, Tan et al., J. Chem. Phys. 144, 034107 (2016)
FIG. 3. Free energies and binding energy distributions for λ = 1 at two different temperatures, calculated by four different reweighting methods from the well-converged 2D Async REMD simulations. T = 200 K for (a) and (b), and T = 300 K for (c) and (d).
• 2D Local: the minimization method of local WHAM, based on the non-weighted, nearest-neighbor jump scheme with one jump attempt per cycle;
• 2D S-Local: the stochastic method of local WHAM (i.e., the adaptive SOS-GST procedure), based on the non-weighted nearest-neighbor jump scheme with 10 jump attempts per cycle;
• 1D Global: global WHAM applied separately to 15 datasets, each obtained from 16 λ values at one of the 15 temperatures.
The adaptive SOS-GST procedure is run for a total of T = t0 + 4.8 × 10^7 cycles, with the burn-in size t0 = 4.8 × 10^6 and initial decay rate α = 0.6 in Eq. (32). As shown in Figs. 3, S1, and S2 (Ref. 75), the free energy estimates from the different WHAM methods are similar over the entire range of simulation time (i.e., converging as fast as each other). Moreover, the distributions of binding energies from the different WHAM reweightings agree closely with those from the raw data, suggesting that the 2D Async REMD simulations are well equilibrated at all thermodynamic states.
TABLE I. CPU time for different WHAM methods.
                                             2D Global    2D Local    2D S-Local    1D Global
Data size N = 1.2 × 10^7 (25 ns per state)
  CPU time (a)                               19510.1 s    39.7        436.4         159.7
  Improvement rate                           1            491.8       44.7          122.1
Data size N = 3.5 × 10^7 (72 ns per state)
  CPU time                                   56411.6 s    114.0       453.3         458.3
  Improvement rate                           1            494.8       124.5         123.1
(a) The CPU time during the period of handling data up to the given simulation time, with the initial values of free energies taken from the previous period, in the stagewise scheme described in Section III C. See Table S1 (Ref. 75) for the cumulative CPU time.
FIG. 4. Free energies and binding energy distributions for λ = 1 at three different temperatures, calculated by four different reweighting methods from 15 independent 1D REMD simulations. T = 200 K for (a) and (b), T = 206 K for (c) and (d), and T = 300 K for (e) and (f). The bar lines in the distribution plots are raw histograms from 2D Async REMD simulations, and the dashed lines in the free energy plots are the global WHAM estimates from the 2D simulations.
The convergence of the binding energy distributions to the same distribution also ensures the correctness of our implementations of the different reweighting methods. Hence, there are no significant differences, except in the Central Processing Unit (CPU) time (see further discussion below), when the four different reweighting methods are applied to the well-converged data from 2D Async REMD simulations. But such large-scale 2D simulations are computationally intensive (Refs. 45 and 46). For the rest of this section, we will focus on the analysis of another dataset from 15 independent 1D REMD simulations (which are easier to implement than 2D simulations) and find that the differences between reweighting methods become more pronounced, as the raw data are not fully converged at some thermodynamic states (see Fig. S3, Ref. 75, for the binding energy distributions from the raw data).
Table I displays the CPU time for the different reweighting methods we implemented, using the dataset from the 15 independent 1D REMD simulations. Each method is
Conclusion
• We proposed methods:
– SAMS for sampling and on-line estimation: optimal SA
– Local (vs. global) WHAM for off-line estimation: similar efficiency but lower computational cost
• The proposed methods are
– broadly applicable,
– highly modular: taking Markov transitions as a black box,
– computationally effective for handling a large number of distributions.
• Further development:
– Stochastic solution of L-WHAM
– Trans-dimensional self-adjusted mixture sampling
– Applications to molecular simulations, language modeling, etc.
29