Minisymposium on "Selected topics in computation and dynamics: machine learning and multiscale methods" at SciCADE 2019, Innsbruck, July 2019.
https://scicade2019.uibk.ac.at/
Slides are based on the article in https://arxiv.org/abs/1509.08787
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Gibbs flow transport for Bayesian inference
1. Gibbs flow transport for Bayesian inference
Jeremy Heng
ESSEC Business School
Joint work with Arnaud Doucet (Oxford) & Yvo Pokern (UCL)
SciCADE 2019
Selected topics in computation and dynamics: machine learning and
multiscale methods
Innsbruck
22 July 2019
Jeremy Heng Flow transport 1/ 23
2. Problem specification
• Target distribution on Rd
⇡(dx) =
(x) dx
Z
where : Rd
! R+ can be evaluated pointwise and
Z =
Z
Rd
(x) dx
is unknown
• Problem 1: Obtain consistent estimator of ⇡(') :=
R
Rd '(x) ⇡(dx)
• Problem 2: Obtain unbiased and consistent estimator of Z
• Main challenge: dimension d is typically large
Jeremy Heng Flow transport 2/ 23
3. Motivation: Bayesian computation
• Prior distribution ⇡0 on unknown parameters of a model
• Likelihood function L : Rd
! R+ of data y
• Bayes update gives posterior distribution on Rd
⇡(dx) =
⇡0(x)L(x) dx
Z
,
where Z =
R
Rd ⇡0(x)L(x) dx is the marginal likelihood of y
• Problem: ⇡(') and Z are typically intractable
• Main challenge: complex models require large number of
parameters d
Jeremy Heng Flow transport 3/ 23
4. Monte Carlo methods
• Typically sampling from ⇡ is intractable, so we rely on Markov
chain Monte Carlo (MCMC) methods
• MCMC constructs a ⇡-invariant Markov transition kernel
K : Rd
⇥ B(Rd
) ! [0, 1]
• Sample X0 ⇠ ⇡0 and iterate Xn ⇠ K(Xn 1, ·) until convergence
• MCMC methods have been successful in many applications, but can
also fail in practice, for e.g. when ⇡ is highly multi-modal
Jeremy Heng Flow transport 4/ 23
5. Annealed importance sampling
• If ⇡0 and ⇡ are distant, define bridges
⇡ m
(dx) =
⇡0(x)L(x) m
dx
Z( m)
,
with 0 = 0 < 1 < . . . < M = 1 so
that ⇡1 = ⇡
• Initialize X0 ⇠ ⇡0 and move Xm ⇠ Km(Xm 1, ·) for m = 1, . . . , M,
where Km is ⇡ m -invariant
• Annealed importance sampling constructs w : (Rd
)M+1
! R+ so
that
⇡(') =
E ['(XM )w(X0:M )]
E [w(X0:M )]
, Z = E [w(X0:M )]
• AIS (Neal, 2001) and SMC samplers (Del Moral et al., 2006) are
considered state-of-the-art in statistics and machine learning
Jeremy Heng Flow transport 5/ 23
6. Jarzynski nonequilibrium equality
• Consider M ! 1, i.e. define the curve
of distribution {⇡t}t2[0,1]
⇡t(dx) =
⇡0(x)L(x) (t)
dx
Z(t)
,
where : [0, 1] ! [0, 1] is a strictly
increasing C1
function
• Initialize X0 ⇠ ⇡0 and run time-inhomogenous Langevin dynamics
dXt =
1
2
r log ⇡t(Xt)dt + dWt, t 2 [0, 1]
• Jarzynski equality (Jarzynski, 1997; Crooks, 1998) constructs
w : C([0, 1], Rd
) ! R+ so that
⇡(') =
E
⇥
'(X1)w(X[0,1])
⇤
E
⇥
w(X[0,1])
⇤ , Z = E
⇥
w(X[0,1])
⇤
Jeremy Heng Flow transport 6/ 23
7. Optimal dynamics
• Dynamical lag kLaw(Xt) ⇡tk impacts variance of estimators
• Vaikuntanathan & Jarzynski (2011) considered adding drift
f : [0, 1] ⇥ Rd
! Rd
to reduce lag
dXt = f (t, Xt)dt +
1
2
r log ⇡t(Xt)dt + dWt, t 2 [0, 1], X0 ⇠ ⇡0
• An optimal choice of f results in zero lag, i.e. Xt ⇠ ⇡t for t 2 [0, 1],
and zero variance estimator of Z
• Any optimal choice f satisfies Liouville PDE/continuity equation
r · (⇡t(x)f (t, x)) = @t⇡t(x)
• Zero lag also achieved by running deterministic dynamics
dXt = f (t, Xt)dt, X0 ⇠ ⇡0
• Main idea: solve Liouville PDE for f and run ODE to get trajectory
Jeremy Heng Flow transport 7/ 23
8. Time evolution of distributions
• Time evolution of ⇡t is given by
@t⇡t(x) = 0
(t) (log L(x) It) ⇡t(x),
where
It =
1
0(t)
d
dt
log Z(t)
!
= E⇡t
[log L(Xt)] < 1
• Integrating recovers path sampling (Gelman and Meng, 1998) or
thermodynamic integration (Kirkwood, 1935) identity
log
✓
Z(1)
Z(0)
◆
=
Z 1
0
0
(t)It dt.
Jeremy Heng Flow transport 8/ 23
9. Defining the flow transport problem
• We want to solve the Liouville equation
r · (⇡t(x)f (t, x)) = @t⇡t(x),
for a drift f ... but not all solutions will work!
• Validity relies on following result:
Theorem. Ambrosio et al. (2005)
Under the following assumptions:
A1 f is locally Lipschitz;
A2
R 1
0
R
Rd |f (t, x)|⇡t(x) dx dt < 1;
Eulerian Liouville PDE () Lagrangian ODE
• Define flow transport problem as solving Liouville for f that
satisfies [A1] & [A2]
Jeremy Heng Flow transport 9/ 23
10. Ill-posedness and regularization
• Under-determined: consider ⇡t = N((0, 0) , I2) for t 2 [0, 1],
f (x1, x2) = (0, 0) and f (x1, x2) = ( x2, x1)
are both solutions
• Regularization: seek minimal kinetic energy solution
argminf
⇢Z 1
0
Z
Rd
|f (t, x)|2
⇡t(x) dx dt : f solves Liouville
Euler-Lagrange
=) f ⇤
(t, x) = r t(x) where r · (⇡t(x)r t(x)) = @t⇡t(x)
• Analytical solution available when distributions are (mixtures of)
Gaussians (Reich, 2012)
Jeremy Heng Flow transport 10/ 23
11. Flow transport problem on R
• Minimal kinetic energy solution
f (t, x) =
R x
1
@t⇡t(u) du
⇡t(x)
• Checking Liouville
r · (⇡t(x)f (t, x)) = @x
Z x
1
@t⇡t(u) du = @t⇡t(x)
A1 For f to be locally Lipschitz, assume
⇡0, L 2 C1
(R, R+) =) f 2 C1
([0, 1] ⇥ R, R)
A2 For integrability of
R 1
0
R
Rd |f (t, x)|⇡t(x) dx dt < 1, necessarily
|⇡tf |(t, x) =
Z x
1
@t⇡t(u) du ! 0 as |x| ! 1
since
R 1
1
@t⇡t(u) du = 0
• Optimality: f (t, x) = r t(x) holds trivially
Jeremy Heng Flow transport 11/ 23
12. Flow transport problem on R
• Re-write solution as
f (t, x) =
0
(t)It {Ft(x) Ix
t /It}
⇡t(x)
where Ix
t = E⇡t
[ ( 1,x] log L(Xt)] and Ft is CDF of ⇡t
• Speed is controlled by 0
(t) and ⇡t(x)
• Sign is given by di↵erence between Ft(x) and Ix
t /It 2 [0, 1]
Jeremy Heng Flow transport 12/ 23
13. Flow transport problem on Rd
, d 1
• Multivariate solution for d = 3
(⇡tf1)(t, x1:3) =
Z x1
1
@t⇡t(u1, x2, x3) du1
+ g1(t, x1)
Z 1
1
@t⇡t(u1, x2, x3) du1
(⇡tf2)(t, x1:3) = g0
1(t, x1)
Z 1
1
Z x2
1
@t⇡t(u1, u2, x3) du1:2
+ g0
1(t, x1)g2(t, x2)
Z 1
1
Z 1
1
@t⇡t(u1, u2, x3) du1:2
(⇡tf3)(t, x1:3) = g0
1(t, x1)g0
2(t, x2)
Z 1
1
Z 1
1
Z x3
1
@t⇡t(u1, u2, u3) du1:3
where g1, g2 2 C2
([0, 1] ⇥ R, [0, 1])
Jeremy Heng Flow transport 13/ 23
14. Flow transport problem on Rd
, d 1
• Checking Liouville
@x1 (⇡tf1)(t, x1:3) = @t⇡t(x1, x2, x3)
+ g0
1(t, x1)
Z 1
1
@t⇡t(u1, x2, x3) du1
@x2 (⇡tf2)(t, x1:3) = g0
1(t, x1)
Z 1
1
@t⇡t(u1, x2, x3) du1:2
+ g0
1(t, x1)g0
2(t, x2)
Z 1
1
Z 1
1
@t⇡t(u1, u2, x3) du1:2
@x3 (⇡tf3)(t, x1:3) = g0
1(t, x1)g0
2(t, x2)
Z 1
1
Z 1
1
@t⇡t(u1, u2, x3) du1:2
• Taking divergence gives telescopic sum
r · (⇡tf )(t, x1:3) =
3X
i=1
@xi
(⇡tfi )(t, x1:3) = @t⇡t(x1:3)
Jeremy Heng Flow transport 14/ 23
15. Flow transport problem on Rd
, d 1
A1 For f to be locally Lipschitz, assume
⇡0, L 2 C1
(Rd
, R+) =) f 2 C1
([0, 1] ⇥ Rd
, Rd
)
A2 For integrability of
R 1
0
R
Rd |f (t, x)|⇡t(x) dx dt < 1, necessarily
|⇡tf |(t, x)| ! 0 as |x| ! 1
if {gi } are non-decreasing functions with tail behaviour
gi (t, xi ) ! 0 as xi ! 1,
gi (t, xi ) ! 1 as xi ! 1
• Choosing gi (t, xi ) = Ft(xi ) as marginal CDF of ⇡t allows f to
decouple if distributions are independent
Jeremy Heng Flow transport 15/ 23
16. Approximate Gibbs flow transport
• Solution involved integrals of increasing dimension as it tracks
increasing conditional distributions
⇡t(x1|x2:d ), ⇡t(x2|3:d ), . . . , ⇡t(xd ), xi 2 R
• Trade-o↵ accuracy for computational tractability: track full
conditional distributions
⇡t(xi |x i ), xi 2 R
• System of Liouville equations
@xi
·
n
⇡t(xi |x i )˜fi (t, x)
o
= @t⇡t(xi |x i ),
each defined on (0, 1) ⇥ R
Jeremy Heng Flow transport 16/ 23
17. Approximate Gibbs flow transport
• The solution is
˜fi (t, x) =
R xi
1
@t⇡t(ui |x i ) dui
⇡t(xi |x i )
• If ⇡0, L 2 C1
(Rd
, R+) with appropriate tail behaviour, the ODE
dXt = ˜f (t, Xt)dt, X0 ⇠ ⇡0
admits a unique solution on [0, 1], referred to as Gibbs flow
Jeremy Heng Flow transport 17/ 23
18. Error control
• Define local error
"t(x) = @t⇡t(x) + r · (⇡t(x)˜f (t, x))
= @t⇡t(x)
dX
i=1
@t⇡t(xi |x i )⇡t(x i )
• Gibbs flow exploits local independence:
⇡t(x) =
dY
i=1
⇡t(xi ) =) "t(x) = 0
• If Gibbs flow induces {˜⇡t}t2[0,1] with ˜⇡0 = ⇡0
k˜⇡t ⇡tk2
L2 t
Z t
0
k"uk2
L2 du · exp
✓
1 +
Z t
0
kr · ˜f (u, ·) k1 du
◆
Jeremy Heng Flow transport 18/ 23
19. Numerical implementation of Gibbs flow
• Implementation of the Gibbs flow involves one-dimensional
quadrature and numerical integration
• Previously, we considered the forward Euler scheme
Ym = Ym 1 + t ˜f (tm 1, Ym 1) = m(Ym 1)
• To get Law(Ym), we need Jacobian determinant of m which
typically costs O(d3
)
• In contrast, this scheme that cycles through each dimension
Ym[i] = Ym 1[i] + t ˜f (tm 1, Ym[1 : i 1], Ym 1[i : d])
Ym = m,d · · · m,1(Ym 1)
is also order one, but costs O(d)
Jeremy Heng Flow transport 19/ 23
20. Mixture modelling example
• Lack of identifiability induces ⇡ on R4
with 4! = 24 well-separated
and identical modes
• Gibbs flow approximation
t =0.0006
x1
x2
−10 −5 0 5 10
−10
−5
0
5
10
t =0.0058
x1
x2
−10 −5 0 5 10
−10
−5
0
5
10
t =0.0542
x1
x2
−10 −5 0 5 10
−10
−5
0
5
10
t =1.0000
x1
x2
−10 −5 0 5 10
−10
−5
0
5
10
Jeremy Heng Flow transport 20/ 23
21. Mixture modelling example
• Proportion of samples in each of the 24 modes
0.00
0.01
0.02
0.03
0.04
1 2 3 4 5 6 7 8 9 101112131415161718192021222324
Mode
Proportionofparticles
• Pearson’s Chi-squared test for uniformity gives p-value of 0.85
Jeremy Heng Flow transport 21/ 23
22. Cox point process model
• E↵ective sample size % in dimension d
• AIS: AIS with HMC moves
• GF-SIS: Gibbs flow
• GF-AIS: Gibbs flow with HMC moves
0
25
50
75
100
0.00 0.25 0.50 0.75 1.00
time
ESS%
method AIS GF−SIS GF−AIS
d = 100
0
25
50
75
100
0.00 0.25 0.50 0.75 1.00
time
ESS%
method AIS GF−SIS GF−AIS
d = 225
0
25
50
75
100
0.00 0.25 0.50 0.75 1.00
time
ESS%
method AIS GF−SIS GF−AIS
d = 400
Jeremy Heng Flow transport 22/ 23
23. End
• Slides will be uploaded to my webpage
• Heng, J., Doucet, A., & Pokern, Y. (2015). Gibbs Flow for
Approximate Transport with Applications to Bayesian Computation.
arXiv preprint arXiv:1509.08787.
• Updated article and R package coming soon!
Jeremy Heng Flow transport 23/ 23