1.
Wang–Landau algorithm | Improvements | Example: variable selection | Conclusion

Parallel Adaptive Wang–Landau Algorithm
Pierre E. Jacob
CEREMADE, Université Paris Dauphine, funded by AXA Research
GPU in Computational Statistics, January 25th, 2012
Joint work with Luke Bornn (UBC), Arnaud Doucet (Oxford), Pierre Del Moral (INRIA & Université de Bordeaux), Robin J. Ryder (Dauphine)
2.
Outline
1 Wang–Landau algorithm
2 Improvements
    Automatic Binning
    Adaptive proposals
    Parallel Interacting Chains
3 Example: variable selection
4 Conclusion
3.
Wang–Landau
Context
Unnormalized target density $\pi$ on a state space $\mathcal{X}$.
A kind of adaptive MCMC algorithm
It iteratively generates a sequence $X_t$.
The stationary distribution is not $\pi$ itself.
At each iteration a different stationary distribution is targeted.
4.
Wang–Landau
Partition the space
The state space $\mathcal{X}$ is cut into $d$ bins:
$$\mathcal{X} = \bigcup_{i=1}^{d} \mathcal{X}_i \quad \text{and} \quad \forall i \neq j, \; \mathcal{X}_i \cap \mathcal{X}_j = \emptyset$$
Goal
The generated sequence spends a desired proportion $\phi_i$ of time in each bin $\mathcal{X}_i$;
within each bin $\mathcal{X}_i$ the sequence is asymptotically distributed according to the restriction of $\pi$ to $\mathcal{X}_i$.
5.
Wang–Landau
Stationary distribution
Define the mass of $\pi$ over $\mathcal{X}_i$ by:
$$\psi_i = \int_{\mathcal{X}_i} \pi(x)\,dx$$
The stationary distribution of the WL algorithm is:
$$\tilde{\pi}(x) \propto \pi(x) \times \frac{\phi_{J(x)}}{\psi_{J(x)}}$$
where $J(x)$ is the index such that $x \in \mathcal{X}_{J(x)}$.
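As a small illustration of the biased stationary distribution, the sketch below works on a finite state space, where the integral defining $\psi_i$ becomes a sum. The function name `biased_distribution` and the four-state target are made up for this example; they are not taken from the talk or from the PAWL package.

```python
def biased_distribution(pi, bin_of, phi):
    """Normalized Wang-Landau stationary distribution on a finite state space.

    pi     : unnormalized target weights, one per state
    bin_of : bin index J(x) of each state
    phi    : desired proportion of time in each bin
    """
    d = len(phi)
    # psi[i] is the mass of pi over bin i (a sum replaces the integral here).
    psi = [0.0] * d
    for w, j in zip(pi, bin_of):
        psi[j] += w
    # pi_tilde(x) is proportional to pi(x) * phi[J(x)] / psi[J(x)].
    unnorm = [w * phi[j] / psi[j] for w, j in zip(pi, bin_of)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical target: bin 0 holds most of the mass of pi, bin 1 very little.
pi = [8.0, 1.0, 0.5, 0.5]
bin_of = [0, 0, 1, 1]
phi = [0.5, 0.5]
pi_tilde = biased_distribution(pi, bin_of, phi)
# Mass of pi_tilde in each bin.
bin_mass = [sum(p for p, j in zip(pi_tilde, bin_of) if j == i) for i in range(2)]
```

Under $\tilde{\pi}$ each bin carries mass $\phi_i$, while the ratios between states of the same bin are those of $\pi$.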
6.
Wang–Landau
Example with a bimodal, univariate target density: $\pi$ and two $\tilde{\pi}$ corresponding to different partitions. Here $\phi_i = d^{-1}$.
[Figure: three panels, "Original Density, with partition lines", "Biased by X" and "Biased by Log Density"; x-axis: X, y-axis: Log Density.]
7.
Wang–Landau
Plugging estimates
In practice we cannot compute $\psi_i$ analytically. Instead we plug in estimates $\theta_t(i)$ of $\psi_i/\phi_i$ at iteration $t$, and define the distribution $\pi_{\theta_t}$ by:
$$\pi_{\theta_t}(x) \propto \pi(x) \times \frac{1}{\theta_t(J(x))}$$
Metropolis–Hastings
At iteration $t$ the algorithm does a Metropolis–Hastings step targeting $\pi_{\theta_t}$, generates a new point $X_t$, updates $\theta_t$, and so on.
8.
Wang–Landau
Estimate of the bias
The update of the estimated bias $\theta_t(i)$ is done according to:
$$\theta_t(i) \leftarrow \theta_{t-1}(i)\,\left[1 + \gamma_t\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$$
with $d$ the number of bins and $\gamma_t$ a decreasing sequence or "step size", e.g. $\gamma_t = 1/t$.
If $X_t \in \mathcal{X}_i$, that is $\mathbb{1}_{\mathcal{X}_i}(X_t) = 1$, then $\theta_t(i)$ increases; otherwise $\theta_t(i)$ decreases.
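A single application of this update can be sketched as follows. The helper name `update_bias` and the numbers are illustrative assumptions, not part of the original slides.

```python
def update_bias(theta, phi, gamma, visited_bin):
    """Multiplicative bias update: theta(i) <- theta(i) * [1 + gamma * (1{i visited} - phi_i)]."""
    return [
        th * (1.0 + gamma * ((1.0 if i == visited_bin else 0.0) - p))
        for i, (th, p) in enumerate(zip(theta, phi))
    ]


# Two bins with equal desired proportions; the chain has just landed in bin 0.
theta = update_bias(theta=[1.0, 1.0], phi=[0.5, 0.5], gamma=0.1, visited_bin=0)
```

The visited bin is penalized more heavily at the next iteration, since $\pi_{\theta_t}$ divides by $\theta_t(J(x))$, which pushes the chain towards the bins it has visited less than desired.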
9.
Wang–Landau
The algorithm itself
First, $\forall i \in \{1, \dots, d\}$ set $\theta_0(i) \leftarrow 1$.
Choose a decreasing sequence $\{\gamma_t\}$, typically $\gamma_t = 1/t$.
Sample $X_0$ from an initial distribution $\pi_0$.
for $t = 1$ to $T$ do
    Sample $X_t$ from $P_{t-1}(X_{t-1}, \cdot)$, a MH kernel with invariant distribution $\pi_{\theta_{t-1}}(x)$.
    Update the bias: $\theta_t(i) \leftarrow \theta_{t-1}(i)\,[1 + \gamma_t(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i)]$.
end for
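The loop above can be sketched in a few lines of Python. This is a toy version under stated assumptions, not the PAWL implementation: the state space is a ring of `n_states` integers, the proposal is a symmetric one-step random walk, $X_0$ is drawn uniformly, and the function name `wang_landau` and the bimodal weights are invented for the example.

```python
import math
import random


def wang_landau(log_pi, bin_of, phi, n_states, T, seed=0):
    """Toy Wang-Landau sampler on the states 0..n_states-1 arranged on a ring."""
    rng = random.Random(seed)
    theta = [1.0] * len(phi)          # theta_0(i) = 1 for every bin
    x = rng.randrange(n_states)       # X_0 from a uniform initial distribution (assumption)
    chain = [x]
    for t in range(1, T + 1):
        gamma = 1.0 / t               # decreasing step size gamma_t = 1/t
        # Symmetric proposal: move one step left or right on the ring.
        y = (x + rng.choice((-1, 1))) % n_states
        # MH log-ratio for pi_theta(x) proportional to pi(x) / theta(J(x)),
        # using the bias estimate from the previous iteration.
        log_ratio = (log_pi(y) - math.log(theta[bin_of(y)])
                     - log_pi(x) + math.log(theta[bin_of(x)]))
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_ratio:
            x = y
        chain.append(x)
        # Bias update: the visited bin is inflated, the others are deflated.
        j = bin_of(x)
        theta = [th * (1.0 + gamma * ((1.0 if i == j else 0.0) - p))
                 for i, (th, p) in enumerate(zip(theta, phi))]
    return chain, theta


# Hypothetical bimodal target on 10 states, split into two bins of 5 states.
weights = [1, 4, 9, 4, 1, 0.1, 0.4, 0.9, 0.4, 0.1]
chain, theta = wang_landau(
    log_pi=lambda s: math.log(weights[s]),
    bin_of=lambda s: 0 if s < 5 else 1,
    phi=[0.5, 0.5],
    n_states=10,
    T=5000,
)
```

With $\gamma_t \le 1$ and $\phi_i < 1$ every multiplicative factor stays positive, so `math.log(theta[...])` is well defined throughout the run.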
10.
Wang–Landau
Result
In the end we get:
a sequence $X_t$ asymptotically following $\tilde{\pi}$,
as well as estimates $\theta_t(i)$ of $\psi_i/\phi_i$.
11.
Wang–Landau
Usual improvement: Flat Histogram
Wait for the FH criterion to occur before decreasing $\gamma_t$.
$$\text{(FH)} \quad \max_{i=1,\dots,d} \left| \frac{\nu_t(i)}{t} - \phi_i \right| < c$$
where $\nu_t(i) = \sum_{k=1}^{t} \mathbb{1}_{\mathcal{X}_i}(X_k)$ and $c > 0$.
WL with stochastic schedule
Let $\kappa_t$ be the number of times FH was reached by iteration $t$. Use $\gamma_{\kappa_t}$ at iteration $t$ instead of $\gamma_t$. If FH is reached, reset $\nu_t(i)$ to 0.
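A check of the FH criterion might look like the sketch below. It assumes a single chain and that `counts` holds the visits $\nu_t(i)$ accumulated since the last reset, so that the denominator is the sum of the counts; the function name `flat_histogram_reached` is hypothetical.

```python
def flat_histogram_reached(counts, phi, c):
    """Return True when max_i |counts[i] / total - phi[i]| < c."""
    total = sum(counts)
    if total == 0:
        # No visit recorded yet (for instance just after a reset).
        return False
    return max(abs(n / total - p) for n, p in zip(counts, phi)) < c


balanced = flat_histogram_reached(counts=[48, 52], phi=[0.5, 0.5], c=0.05)
unbalanced = flat_histogram_reached(counts=[80, 20], phi=[0.5, 0.5], c=0.05)
```

In the first call the largest deviation is 0.02, below the threshold; in the second it is 0.3, so the step size would not yet be decreased.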
12.
Wang–Landau
Theoretical Understanding of WL with deterministic schedule
The schedule $\gamma_t$ decreases at each iteration, hence $\theta_t$ converges, hence $P_t(\cdot, \cdot)$ converges: this is approximately "diminishing adaptation".
Theoretical Understanding of WL with stochastic schedule
Flat Histogram is reached in finite time for any $\gamma$, $\phi$, $c$ if one uses the following update:
$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)$$
instead of
$$\theta_t(i) \leftarrow \theta_{t-1}(i)\,\left[1 + \gamma\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$$
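The two updates can be compared numerically. The sketch below only illustrates one practical difference and is not the argument behind the finite-time result: for a large fixed $\gamma$ the multiplicative factor can become negative, whereas the update on the log scale always yields a positive $\theta$. The function names are made up for this example.

```python
import math


def multiplicative_update(theta_i, gamma, indicator, phi_i):
    """theta(i) <- theta(i) * [1 + gamma * (indicator - phi_i)]."""
    return theta_i * (1.0 + gamma * (indicator - phi_i))


def log_update(log_theta_i, gamma, indicator, phi_i):
    """log theta(i) <- log theta(i) + gamma * (indicator - phi_i)."""
    return log_theta_i + gamma * (indicator - phi_i)


# A bin that was NOT visited (indicator = 0), with a deliberately large gamma.
theta_mult = multiplicative_update(1.0, gamma=3.0, indicator=0.0, phi_i=0.5)
theta_log = math.exp(log_update(0.0, gamma=3.0, indicator=0.0, phi_i=0.5))
```

Here `theta_mult` is negative, which is not a valid penalty, while `theta_log` stays strictly positive.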
13.
Automatic Binning
Maintain some kind of uniformity within bins. If non-uniform, split the bin.
[Figure: histograms of frequency against log density, (a) before the split, (b) after the split.]
14.
Adaptive proposals
Target a specific acceptance rate:
$$\sigma_{t+1} = \sigma_t + \rho_t\left(2\,\mathbb{1}(A > 0.234) - 1\right)$$
where $A$ is the observed acceptance rate.
Or use the empirical covariance of the already-generated chain:
$$\Sigma_t = \delta \times \mathrm{Cov}(X_1, \dots, X_t)$$
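The acceptance-rate rule can be sketched as below. The function name `adapt_scale` and the `target` parameter are assumptions made for the example; the value 0.234 is the one on the slide.

```python
def adapt_scale(sigma, rho, acceptance_rate, target=0.234):
    """sigma_{t+1} = sigma_t + rho_t * (2 * 1(A > target) - 1)."""
    step = 1.0 if acceptance_rate > target else -1.0
    return sigma + rho * step


# Accepting too often: enlarge the proposal scale.
larger = adapt_scale(sigma=1.0, rho=0.1, acceptance_rate=0.5)
# Accepting too rarely: shrink the proposal scale.
smaller = adapt_scale(sigma=1.0, rho=0.1, acceptance_rate=0.1)
```

Nothing in this rule keeps `sigma` positive, so an implementation would probably need a lower bound or an update on the log scale; that safeguard is a suggestion, not something stated in the talk.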
15.
Parallel Interacting Chains
$N$ chains $(X_t^{(1)}, \dots, X_t^{(N)})$ instead of one,
targeting the same biased distribution $\pi_{\theta_t}$ at iteration $t$,
sharing the same estimated bias $\theta_t$ at iteration $t$.
The update of the estimated bias becomes:
$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_{\kappa_t}\left(\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}(X_t^{(j)}) - \phi_i\right)$$
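The shared update can be sketched as follows, assuming the bin index of each of the $N$ chains is already known at iteration $t$. The function name `parallel_log_update` and the input values are illustrative.

```python
def parallel_log_update(log_theta, phi, gamma, chain_bins):
    """Update log theta from the proportion of chains currently in each bin."""
    n_chains = len(chain_bins)
    counts = [0] * len(phi)
    for j in chain_bins:
        counts[j] += 1
    # proportions[i] = (1/N) * sum_j 1{chain j is in bin i}
    proportions = [c / n_chains for c in counts]
    return [lt + gamma * (pr - p)
            for lt, pr, p in zip(log_theta, proportions, phi)]


# Four chains, three of them currently in bin 0.
log_theta = parallel_log_update(
    log_theta=[0.0, 0.0], phi=[0.5, 0.5], gamma=1.0, chain_bins=[0, 0, 0, 1]
)
```

With a single chain the proportion is either 0 or 1; averaging over $N$ chains gives values closer to $\phi_i$, and therefore smaller moves of $\log \theta_t$ for the same step size.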
16.
Parallel Interacting Chains
How "parallel" is PAWL?
The algorithm's additional cost compared to independent parallel MCMC chains lies in:
getting the proportions $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}(X_t^{(j)})$,
updating $(\theta_t(1), \dots, \theta_t(d))$.
17.
Parallel Interacting Chains
Example: Normal distribution
[Figure: histogram of the binned coordinate; x-axis: binned coordinate, y-axis: Density.]
18.
Parallel Interacting Chains
Reaching Flat Histogram
[Figure: number of times FH was reached (#FH) against iterations, for N = 1, N = 10 and N = 100.]
19.
Parallel Interacting Chains
Stabilization of the log penalties
Figure: $\log \theta_t$ against $t$, for N = 1
20.
Parallel Interacting Chains
Stabilization of the log penalties
Figure: $\log \theta_t$ against $t$, for N = 10
21.
Parallel Interacting Chains
Stabilization of the log penalties
Figure: $\log \theta_t$ against $t$, for N = 100
22.
Parallel Interacting Chains
Multiple effects of parallel chains
$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_{\kappa_t}\left(\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}(X_t^{(j)}) - \phi_i\right)$$
FH is reached more often when $N$ increases, hence $\gamma_{\kappa_t}$ decreases more quickly;
$\log \theta_t$ tends to vary much less when $N$ increases, even for a fixed value of $\gamma$.
23.
Variable selection
Settings
Pollution data as in McDonald & Schwing (1973). For 60 metropolitan areas:
15 possible explanatory variables (including precipitation, population per household, ...), denoted by $X$;
the response variable $Y$ is the age-adjusted mortality rate.
This leads to $2^{15} = 32{,}768$ possible models to explain the data.
24.
Variable selection
Introduce
$\gamma \in \{0, 1\}^p$ the "variable selector",
$q_\gamma$ the number of variables in model "$\gamma$",
$g$ some large value ($g$-prior, see Zellner 1986, Marin & Robert 2007).
Posterior distribution
$$\pi(\gamma \mid y, X) \propto (g+1)^{-(q_\gamma+1)/2} \left[ y^T y - \frac{g}{g+1}\, y^T X_\gamma \left(X_\gamma^T X_\gamma\right)^{-1} X_\gamma^T y \right]^{-n/2}$$
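The posterior above can be evaluated on the log scale with a short, dependency-free sketch. It transcribes the formula as written on the slide, with $X_\gamma$ the selected columns of $X$; the helper names `solve` and `log_posterior` and the three-observation data set are invented for illustration and have nothing to do with the pollution data.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def log_posterior(selector, X, y, g):
    """Unnormalized log pi(gamma | y, X) for a 0/1 selector over the columns of X."""
    n = len(y)
    cols = [k for k, s in enumerate(selector) if s]
    q = len(cols)
    yty = sum(v * v for v in y)
    fit = 0.0
    if q > 0:
        # X_gamma^T X_gamma and X_gamma^T y for the selected columns.
        xtx = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in cols] for a in cols]
        xty = [sum(X[r][a] * y[r] for r in range(n)) for a in cols]
        beta = solve(xtx, xty)
        # y^T X_gamma (X_gamma^T X_gamma)^{-1} X_gamma^T y
        fit = sum(b * v for b, v in zip(beta, xty))
    return (-(q + 1) / 2 * math.log(g + 1)
            - n / 2 * math.log(yty - g / (g + 1) * fit))


# Made-up data: 3 observations, 2 candidate variables.
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [1.0, 2.0, 3.0]
lp_empty = log_posterior([0, 0], X, y, g=3.0)
lp_first = log_posterior([1, 0], X, y, g=3.0)
lp_both = log_posterior([1, 1], X, y, g=3.0)
```

Working with the logarithm avoids overflow of the $-n/2$ power, and only differences of log-posteriors are needed in a Metropolis–Hastings ratio.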
25.
Variable selection
Most naive MH algorithm
The proposal is flipping a variable on / off at random, at each iteration.
Binning
Along values of $\log \pi(x)$, found with a preliminary exploration, in 20 bins.
26.
Variable selection
[Figure: three panels (N = 1, N = 10, N = 100) showing Log(θ) against Iteration.]
Figure: Each run took 2 minutes (+/- 5 seconds). Dotted lines show the real $\psi$.
27.
Variable selection
[Figure: four panels (Wang–Landau; Metropolis–Hastings with Temp = 1, Temp = 10 and Temp = 100) showing Model Saturation against Iteration.]
Figure: $q_\gamma / p$ (mean and 95% interval) along iterations, for N = 100.
28.
Conclusion
Automatic binning but...
We still have to define a range of plausible (or "interesting") values.
Parallel Chains
It seems reasonable to use more than N = 1 chain, with or without GPUs.
No theoretical validation of this yet.
Optimal N for a given computational effort?
Need of a stochastic schedule?
It seems that using a large N makes the use, and hence the choice, of $\gamma_t$ irrelevant.
29.
Would you like to know more?
Article: An Adaptive Interacting Wang-Landau Algorithm for Automatic Density Exploration, with L. Bornn, P. Del Moral, A. Doucet.
Article: The Wang-Landau algorithm reaches the Flat Histogram criterion in finite time, with R. Ryder.
Software: PAWL, available on CRAN: install.packages("PAWL")
References:
F. Wang, D. Landau, Physical Review E, 64(5):56101
Y. Atchadé, J. Liu, Statistica Sinica, 20:209-233