Progressively applying Gaussian noise transforms complex data distributions to approximately Gaussian. Reversing this dynamic defines a generative model. When the forward noising process is given by a Stochastic Differential Equation (SDE), Song et al. (2021) demonstrate how the time inhomogeneous drift of the associated reverse-time SDE may be estimated using score-matching. A limitation of this approach is that the forward-time SDE must be run for a sufficiently long time for the final distribution to be approximately Gaussian. In contrast, solving the Schrödinger Bridge problem (SB), i.e. an entropy-regularized optimal transport problem on path spaces, yields diffusions which generate samples from the data distribution in finite time. We present Diffusion SB (DSB), an original approximation of the Iterative Proportional Fitting (IPF) procedure to solve the SB problem, and provide theoretical analysis along with generative modeling experiments. The first DSB iteration recovers the methodology proposed by Song et al. (2021), with the flexibility of using shorter time intervals, as subsequent DSB iterations reduce the discrepancy between the final-time marginal of the forward (resp. backward) SDE with respect to the prior (resp. data) distribution. Beyond generative modeling, DSB offers a widely applicable computational optimal transport tool as the continuous state-space analogue of the popular Sinkhorn algorithm (Cuturi, 2013).
Diffusion Schrödinger bridges for score-based generative modeling
1. Diffusion Schrödinger bridges for score-based generative modeling
Jeremy Heng
Joint work with Valentin De Bortoli, James Thornton and Arnaud Doucet
ESSEC Business School
ICSA China - 1 July 2022
2. Generative Modeling and Score-Based Generative Models
Diffusion Models Beat GANs on Image Synthesis - OpenAI, 2021
Over recent years, massive advances in generative modeling have been driven by VAEs (Kingma & Welling, 2014), GANs (Goodfellow et al., 2014), and autoregressive models (van den Oord et al., 2016).
Score-based generative models, a.k.a. denoising diffusion models (Ho et al., 2020; Song et al., 2021), provide state-of-the-art results in a large number of domains.
3. One basic idea: ancestral sampling
From Song et al., ICLR 2021
Consider a Markov chain with $X_0 \sim p_0$ and $X_{k+1} \sim p_{k+1|k}(\cdot \mid X_k)$; then
$$p(x_{0:N}) = p_0(x_0) \prod_{k=0}^{N-1} p_{k+1|k}(x_{k+1} \mid x_k).$$
Denote by $p_k$ the marginal of $X_k$, satisfying
$$p_k(x_k) = \int p_{k|k-1}(x_k \mid x_{k-1})\, p_{k-1}(x_{k-1})\, \mathrm{d}x_{k-1}.$$
Backward decomposition ($p_{k|k+1}$ obtained with Bayes' rule):
$$p(x_{0:N}) = p_N(x_N) \prod_{k=0}^{N-1} p_{k|k+1}(x_k \mid x_{k+1}).$$
In particular, one can sample from $p(x_{0:N})$ by ancestral sampling:
sample $X_N \sim p_N(\cdot)$, then $X_k \sim p_{k|k+1}(\cdot \mid X_{k+1})$ for $k \in \{N-1, \dots, 0\}$.
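To make the recursion concrete, here is a minimal sketch (not from the slides) for a small discrete-state chain, where the marginals and backward kernels above are computed exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
S, N = 3, 4                              # 3 states, N = 4 steps
p0 = np.array([0.5, 0.3, 0.2])           # p_0
P = np.array([[0.8, 0.1, 0.1],           # forward kernel p_{k+1|k}(x'|x), row = x
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# marginals: p_k(x_k) = sum_{x_{k-1}} p_{k|k-1}(x_k|x_{k-1}) p_{k-1}(x_{k-1})
marg = [p0]
for _ in range(N):
    marg.append(marg[-1] @ P)

# ancestral sampling: X_N ~ p_N, then X_k ~ p_{k|k+1}(.|X_{k+1}) via Bayes' rule
x = rng.choice(S, p=marg[N])
for k in range(N - 1, -1, -1):
    backward = P[:, x] * marg[k] / marg[k + 1][x]   # p_{k|k+1}(x_k | x_{k+1})
    x = rng.choice(S, p=backward)
# x is now a draw of X_0; the whole path is distributed as p(x_{0:N})
```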
4. Generative Modeling with ancestral sampling
From Song et al., ICLR 2021
Let $p_0 = p_{\mathrm{data}}$ and set $p_{k+1|k}$ such that $p_N \approx p_{\mathrm{ref}}$ for $N \gg 1$, where $p_{\mathrm{ref}}$ is a "reference" easy-to-sample density.
Usual choice: $p_{k+1|k}(x' \mid x) = \mathcal{N}(x'; \alpha x, (1-\alpha^2) I_d)$, so that $p_N(x) \approx p_{\mathrm{ref}}(x)$ with $p_{\mathrm{ref}}(x) = \mathcal{N}(x; 0_d, I_d)$ for $N$ large enough.
Use ancestral sampling, replacing $p_N$ by $p_{\mathrm{ref}}$:
sample $X_N \sim p_{\mathrm{ref}}(\cdot)$, then $X_k \sim p_{k|k+1}(\cdot \mid X_{k+1})$ for $k \in \{N-1, \dots, 0\}$.
Key problem: not only does one need forward transitions $p_{k+1|k}$ and $N$ large enough so that $p_N(x) \approx p_{\mathrm{ref}}(x)$, but one also needs to approximate the backward transitions $p_{k|k+1}$.
5. Approximating Backward Transitions
We restrict ourselves to discretized Ornstein-Uhlenbeck processes
$$p_{k+1|k}(x_{k+1} \mid x_k) = \mathcal{N}(x_{k+1}; \alpha x_k, (1-\alpha^2) I_d),$$
where $\alpha > 0$ is close to 1.
Using a Taylor expansion we get
$$p_{k|k+1}(x_k \mid x_{k+1}) = p_{k+1|k}(x_{k+1} \mid x_k) \exp[\log p_k(x_k) - \log p_{k+1}(x_{k+1})]$$
$$\approx \mathcal{N}\big(x_k;\; (2-\alpha)\, x_{k+1} + (1-\alpha^2) \underbrace{\nabla \log p_{k+1}(x_{k+1})}_{\text{score}},\; (1-\alpha^2) I_d\big).$$
The score is not available, but using
$$p_{k+1}(x_{k+1}) = \int p_0(x_0)\, p_{k+1|0}(x_{k+1} \mid x_0)\, \mathrm{d}x_0,$$
we get
$$\nabla \log p_{k+1}(x_{k+1}) = \mathbb{E}_{X_0 \sim p_{0|k+1}}\big[\nabla_{x_{k+1}} \log p_{k+1|0}(x_{k+1} \mid X_0)\big].$$
6. Estimating the Scores using Score Matching and Sampling
Conditional expectation → regression problem:
$$s_{k+1} = \arg\min_s\, \mathbb{E}_{p_{0,k+1}}\big[\| s(X_{k+1}) - \nabla_{x_{k+1}} \log p_{k+1|0}(X_{k+1} \mid X_0) \|^2\big].$$
In practice, we restrict ourselves to neural networks and estimate all scores simultaneously, i.e. $s_{\theta^\star}(k, x_k) \approx \nabla \log p_k(x_k)$ where
$$\theta^\star \approx \arg\min_\theta\, \sum_{k=1}^{N} \mathbb{E}_{p_{0,k}}\big[\| s_\theta(k, X_k) - \nabla_{x_k} \log p_{k|0}(X_k \mid X_0) \|^2\big].$$
Generate samples from the backward process using $X_N \sim p_{\mathrm{ref}}$ and the recursion
$$X_k = (2-\alpha)\, X_{k+1} + (1-\alpha^2)\, s_{\theta^\star}(k+1, X_{k+1}) + \sqrt{1-\alpha^2}\, Z_{k+1}.$$
Code available (JAX and PyTorch):
https://github.com/yang-song/score_sde
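Putting slides 5 and 6 together, here is a 1-D toy sketch (illustrative only, not the authors' code): for Gaussian data the marginal scores are linear in $x$, so the score-matching regression can be solved by least squares, and the backward recursion approximately recovers $p_{\mathrm{data}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 50, 0.95
m, s = 2.0, 0.5                                   # toy data: p_data = N(m, s^2), d = 1

# forward noising: p_{k|0}(x_k | x_0) = N(x_k; alpha^k x_0, 1 - alpha^{2k})
x0 = rng.normal(m, s, size=10_000)
coeffs = []
for k in range(1, N + 1):
    mean_k, var_k = alpha**k * x0, 1.0 - alpha**(2 * k)
    xk = mean_k + np.sqrt(var_k) * rng.standard_normal(x0.shape)
    target = -(xk - mean_k) / var_k               # grad_{x_k} log p_{k|0}(x_k | x_0)
    coeffs.append(np.polyfit(xk, target, 1))      # linear model in place of s_theta(k, .)

def score(k, x):
    return np.polyval(coeffs[k - 1], x)

# backward recursion with the fitted scores, starting from X_N ~ p_ref = N(0, 1)
x = rng.standard_normal(10_000)
for k in range(N - 1, -1, -1):
    x = ((2 - alpha) * x + (1 - alpha**2) * score(k + 1, x)
         + np.sqrt(1 - alpha**2) * rng.standard_normal(x.shape))

print(x.mean(), x.std())                          # approximately (2.0, 0.5)
```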
7. From Discrete to Continuous-Time
The Markov chain is an Euler discretization of the Ornstein-Uhlenbeck process
$$\mathrm{d}X_t = -\beta X_t\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}B_t, \qquad X_0 \sim p_{\mathrm{data}},$$
where $\beta > 0$ is a parameter and $p_{\mathrm{ref}} = \mathcal{N}(0, \beta^{-1} I_d)$.
The reverse-time process $(Y_t)_{t \in [0,T]} = (X_{T-t})_{t \in [0,T]}$ satisfies
$$\mathrm{d}Y_t = \{\beta Y_t + 2 \nabla \log p_{T-t}(Y_t)\}\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}B_t, \qquad Y_0 \sim p_T,$$
and the generative model is
$$\mathrm{d}Y_t = \{\beta Y_t + 2\, s_{\theta^\star}(T-t, Y_t)\}\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}B_t, \qquad Y_0 \sim p_{\mathrm{ref}}.$$
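A minimal Euler-Maruyama sketch of this generative SDE, assuming a score approximation `score(t, y)` is available (the function name and signature are illustrative):

```python
import numpy as np

def reverse_sde_sample(score, T, n_steps, beta, d, rng=np.random.default_rng(0)):
    """Euler-Maruyama scheme for dY_t = {beta Y_t + 2 score(T-t, Y_t)} dt + sqrt(2) dB_t."""
    dt = T / n_steps
    y = rng.standard_normal(d) / np.sqrt(beta)    # Y_0 ~ p_ref = N(0, (1/beta) I_d)
    for i in range(n_steps):
        t = i * dt
        y = (y + (beta * y + 2.0 * score(T - t, y)) * dt
             + np.sqrt(2.0 * dt) * rng.standard_normal(d))
    return y                                      # approximate draw from p_data
```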
8. From Discrete to Continuous-Time
Convergence of diffusion models (De Bortoli et al., 2021)
Assume there exists $M \geq 0$ such that for any $t \in [0, T]$ and $x \in \mathbb{R}^d$,
$$\| s_{\theta^\star}(t, x) - \nabla \log p_t(x) \| \leq M,$$
with $s_{\theta^\star} \in C([0, T] \times \mathbb{R}^d, \mathbb{R}^d)$ and regularity conditions on $p_{\mathrm{data}}$ and its gradients.
Then there exist $B_\beta, C_\beta, D_\beta \geq 0$ such that for any $N \in \mathbb{N}$ and $\{\gamma_k\}_{k=1}^N$ the following holds:
$$\| \mathcal{L}(X_0) - p_{\mathrm{data}} \|_{\mathrm{TV}} \leq B_\beta \exp[-\beta^{1/2} T] + C_\beta (M + \bar{\gamma}^{1/2}) \exp[D_\beta T],$$
where $T = \sum_{k=1}^N \gamma_k$ and $\bar{\gamma} = \sup_{k \in \{1,\dots,N\}} \gamma_k$ ($\{\gamma_k\}_{k=1}^N$ is the sequence of stepsizes in the Euler-Maruyama discretization).
Take-home message: the “mixing time” of the reversal is entirely
given by the forward process. The bottleneck is not the mixing of the
chain but the approximation of the drift.
9. Practical Limitations
Too few steps lead to a poor approximation (the Ornstein-Uhlenbeck process does not mix fast enough).
Illustration of failure: $N$ is too small, so $p_N$ is very different from $p_{\mathrm{ref}}$. This harms the quality of the reconstruction by the time-reversal.
Our contribution: “iterating” diffusion models to force the correct
marginal distributions.
10. Revisiting Generative Modeling using Schrödinger Bridges
The Schrödinger Bridge problem: given a base process $p(x_{0:N})$, find $\pi^\star(x_{0:N})$ such that
$$\pi^\star = \arg\min \{\, \mathrm{KL}(\pi \,\|\, p) : \pi_0 = p_{\mathrm{data}},\ \pi_N = p_{\mathrm{ref}} \,\}.$$
If $\pi^\star$ is available: sample $X_N \sim p_{\mathrm{ref}}$, then $X_k \sim \pi^\star_{k|k+1}(\cdot \mid X_{k+1})$ for $k \in \{N-1, \dots, 0\}$.
We have $\pi^\star(x_{0:N}) = \pi^{s,\star}(x_0, x_N)\, p(x_{1:N-1} \mid x_0, x_N)$, where
$$\pi^{s,\star} = \arg\min \{\, -\mathbb{E}_{\pi^s}[\log p_{N|0}(X_N \mid X_0)] - H(\pi^s) : \pi^s_0 = p_{\mathrm{data}},\ \pi^s_N = p_{\mathrm{ref}} \,\}$$
and, if $p_{N|0}(x_N \mid x_0) = \mathcal{N}(x_N; x_0, \sigma^2 I_d)$, then
$$\pi^{s,\star} = \arg\min \{\, \mathbb{E}_{\pi^s}[\| X_0 - X_N \|^2] - 2\sigma^2 H(\pi^s) : \pi^s_0 = p_{\mathrm{data}},\ \pi^s_N = p_{\mathrm{ref}} \,\}.$$
This is an entropy-regularized Wasserstein-2 cost, i.e. as $\sigma \to 0$, $\pi^{s,\star}$ converges to the optimal transport plan (Mikami, 2004).
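In discrete state space, this entropy-regularized OT problem is what the Sinkhorn algorithm (Cuturi, 2013) solves. A minimal sketch, with $\varepsilon$ playing the role of $2\sigma^2$ (values are illustrative; log-domain updates are preferred for small $\varepsilon$):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iter=500):
    """Entropy-regularized OT between discrete marginals mu and nu, cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)                # enforce second marginal
        u = mu / (K @ v)                  # enforce first marginal
    return u[:, None] * K * v[None, :]    # coupling with marginals (mu, nu)

# toy usage: two uniform point clouds on the line, squared-distance cost
x, y = np.linspace(-2.0, 0.0, 50), np.linspace(1.0, 3.0, 50)
C = (x[:, None] - y[None, :]) ** 2        # ||x_0 - x_N||^2
pi = sinkhorn(np.full(50, 1 / 50), np.full(50, 1 / 50), C, eps=1.0)
```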
11. Solving the Schrödinger Bridge Problem
The SB problem can be solved using Iterative Proportional Fitting (IPF) (Fortet, 1940; Kullback, 1968): set $\pi^0 = p$ and for $n \geq 0$
$$\pi^{2n+1} = \arg\min \{\, \mathrm{KL}(\pi \,\|\, \pi^{2n}) : \pi_N = p_{\mathrm{ref}} \,\},$$
$$\pi^{2n+2} = \arg\min \{\, \mathrm{KL}(\pi \,\|\, \pi^{2n+1}) : \pi_0 = p_{\mathrm{data}} \,\}.$$
$\lim_{n \to +\infty} \pi^n = \pi^\star$ under regularity conditions (Rüschendorf, 1995; Léger, 2021; De Bortoli et al., 2021).
Explicit solution of the first IPF step:
$$\mathrm{KL}(\pi \,\|\, \pi^0) = \mathrm{KL}(\pi_N \,\|\, p_N) + \mathbb{E}_{\pi_N}\big[\mathrm{KL}(\pi_{|N} \,\|\, p_{|N})\big].$$
Therefore,
$$\pi^1(x_{0:N}) = p_{\mathrm{ref}}(x_N)\, p(x_{0:N-1} \mid x_N) = p_{\mathrm{ref}}(x_N) \prod_{k=0}^{N-1} p_{k|k+1}(x_k \mid x_{k+1}).$$
Take-home message: an approximation to the first IPF iteration corresponds to current score-based generative models.
12. Solving the Schrödinger Bridge Problem
The second iteration requires solving
$$\pi^2 = \arg\min \{\, \mathrm{KL}(\pi \,\|\, \pi^1) : \pi_0 = p_{\mathrm{data}} \,\}.$$
Therefore,
$$\pi^2(x_{0:N}) = p_{\mathrm{data}}(x_0)\, \pi^1(x_{1:N} \mid x_0) = p_{\mathrm{data}}(x_0) \prod_{k=0}^{N-1} \pi^1_{k+1|k}(x_{k+1} \mid x_k).$$
On an algorithmic level:
• IPF1: the time-reversal of the forward process $\pi^0 = p$ is initialized by $p_{\mathrm{ref}}$ at time $N$ to define the backward process $\pi^1$.
• IPF2: the time-reversal of the backward process $\pi^1$ is initialized by $p_{\mathrm{data}}$ at time $0$ to define the forward process $\pi^2$.
• IPF3: the time-reversal of the forward process $\pi^2$ is initialized by $p_{\mathrm{ref}}$ at time $N$ to define the backward process $\pi^3$.
• ...
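A minimal sketch of these two alternating KL projections in a static two-state setting (illustrative only; in DSB the projections act on path measures and are implemented with score matching): each half-step keeps the conditionals of the current coupling and swaps in the required marginal.

```python
import numpy as np

def ipf_backward(pi, p_ref):
    """pi^{2n+1}(x0, xN) = pi^{2n}(x0 | xN) p_ref(xN): fix the time-N marginal."""
    return (pi / pi.sum(axis=0, keepdims=True)) * p_ref[None, :]

def ipf_forward(pi, p_data):
    """pi^{2n+2}(x0, xN) = p_data(x0) pi^{2n+1}(xN | x0): fix the time-0 marginal."""
    return p_data[:, None] * (pi / pi.sum(axis=1, keepdims=True))

p_data, p_ref = np.array([0.7, 0.3]), np.array([0.5, 0.5])
pi = p_data[:, None] * np.array([[0.9, 0.1],     # pi^0 = p_data x forward kernel
                                 [0.2, 0.8]])
for _ in range(20):                               # IPF1, IPF2, IPF3, ...
    pi = ipf_backward(pi, p_ref)
    pi = ipf_forward(pi, p_data)
# pi now (approximately) has marginals (p_data, p_ref), with minimal KL to pi^0
```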
13. Continuous-Time IPF
IPF can be formulated in continuous time:
$$\Pi^\star = \arg\min \{\, \mathrm{KL}(\Pi \,\|\, P) : \Pi \in \mathcal{P}(\mathcal{C}),\ \Pi_0 = p_{\mathrm{data}},\ \Pi_T = p_{\mathrm{ref}} \,\}.$$
Similarly, we define the IPF sequence $(\Pi^n)$ recursively with $\Pi^0 = P$:
$$\Pi^{2n+1} = \arg\min \{\, \mathrm{KL}(\Pi \,\|\, \Pi^{2n}) : \Pi \in \mathcal{P}(\mathcal{C}),\ \Pi_T = p_{\mathrm{ref}} \,\},$$
$$\Pi^{2n+2} = \arg\min \{\, \mathrm{KL}(\Pi \,\|\, \Pi^{2n+1}) : \Pi \in \mathcal{P}(\mathcal{C}),\ \Pi_0 = p_{\mathrm{data}} \,\}.$$
Under regularity conditions,
$$(\Pi^{2n+1})^R: \quad \mathrm{d}Y^{2n+1}_t = b^n_{T-t}(Y^{2n+1}_t)\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}B_t, \qquad Y^{2n+1}_0 \sim p_{\mathrm{ref}},$$
$$\Pi^{2n+2}: \quad \mathrm{d}X^{2n+2}_t = f^{n+1}_t(X^{2n+2}_t)\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}B_t, \qquad X^{2n+2}_0 \sim p_{\mathrm{data}},$$
where
$$b^n_t(x) = -f^n_t(x) + 2 \nabla \log p^n_t(x), \qquad f^{n+1}_t(x) = -b^n_t(x) + 2 \nabla \log q^n_t(x),$$
with $f^0_t(x) = f(x)$, and $p^n_t$, $q^n_t$ the densities of $\Pi^{2n}_t$ and $\Pi^{2n+1}_t$.
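At the level of drifts, each IPF half-step is the time-reversal formula applied to the current process. A schematic sketch (the scores are assumed given here, whereas DSB estimates them by score matching):

```python
def reversed_drift(drift, score):
    """Drift of the time-reversed diffusion: -f_t(x) + 2 grad log p_t(x)."""
    return lambda t, x: -drift(t, x) + 2.0 * score(t, x)

# IPF alternation on drifts, starting from f^0 = f:
# b^n     = reversed_drift(f^n, score_of_Pi_2n)         (backward process)
# f^{n+1} = reversed_drift(b^n, score_of_Pi_2n_plus_1)  (forward process)
```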
15. Diffusion Schrödinger Bridge: 2D example
Diffusion Schrödinger Bridge (DSB) gives a solution to the “small
time problem”.
Approximation of Optimal Transport.
18. Applications: Dataset Interpolation
First row: Swiss-roll to S-curve (2D). Step 9 of DSB with T = 1
(N = 50). From left to right: t = 0, 0.4, 0.6, 1. Second row: EMNIST to
MNIST. Step 10 of DSB with T = 1.5 (N = 30). From left to right:
t = 0, 0.4, 1.25, 1.5.
19. Discussion
Quick summary
• Theoretical results for denoising diffusion models.
• Generative modeling can be reformulated as a Schrödinger Bridge problem.
• Diffusion Schrödinger Bridge approximates its solution using (discretized) forward-backward diffusions and score-matching ideas.
20. References
V. De Bortoli, J. Thornton, J. Heng and A. Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. NeurIPS 2021.
V. De Bortoli, G. Deligiannidis and A. Doucet. Quantitative uniform stability of the iterative proportional fitting procedure. arXiv:2108.08129.
J. Ho, A. Jain and P. Abbeel. Denoising diffusion probabilistic models. NeurIPS 2020.
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon and B. Poole. Score-based generative modeling through stochastic differential equations. ICLR 2021.