NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Chenfei Wu1∗ Jian Liang2∗ Lei Ji1 Fan Yang1 Yuejian Fang2 Daxin Jiang1 Nan Duan1†
1Microsoft Research Asia 2Peking University
Presenter: Kai Katsumata
Nakayama Lab.
∗ Both authors contributed equally to this research.
† Corresponding author.

Basic information
Title: NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
Authors: Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan
Affiliation: Microsoft Research Asia, Peking University
Date: 2021/11/24 (arXiv) https://arxiv.org/abs/2111.12417
Project url: https://github.com/microsoft/NUWA

Abstract
“This paper presents a unified multimodal pre-trained model called NÜWA that can generate
new or manipulate existing visual data (i.e., images and videos) for various visual synthesis
tasks. To cover language, image, and video at the same time for different scenarios, a 3D
transformer encoder-decoder framework is designed, which can not only deal with videos as
3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby
Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and
reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared
to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation,
text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good
zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is
https://github.com/microsoft/NUWA.” (Wu et al., 2021b)

Teaser figure
[Figure: example inputs and outputs for eight tasks: Text-To-Image (T2I), Sketch-To-Image (S2I), Image Completion (I2I), Image Manipulation (TI2I), Text-To-Video (T2V), Sketch-To-Video (S2V), Video Prediction (V2V), and Video Manipulation (TV2V).]
Figure 1: NÜWA supports several typical generation and manipulation tasks. The figure is cited from (Wu et al., 2021b).

VQ-VAE-based Visual Auto-Regressive Models (previous work)
[Figure: VQ-VAE pipeline with the labels Conv3D Encoder, Codebook, Discrete Latents, Flattened sequence, Transformer, Conv3D Decoder, and Target.]
Figure 2: The figure is cited from (Yan et al., 2021), CC BY.

VQ-VAE-based Visual Auto-Regressive Models (previous work)
[Figure: the input text passes through a text tokenizer (sentence pieces) and the input image through an image tokenizer (discrete autoencoder) whose encoder discretizes and whose decoder recovers the image; the text tokens and the flattened image tokens, delimited by [ROI1], [BASE], [BOI1], and [EOI1], form one sequence for a transformer (GPT).]
Figure 3: The figure is cited from (Ding et al., 2021).

Difference from previous work (DALL-E (Ramesh et al., 2021), CogView (Ding et al., 2021), GODIVA (Wu et al., 2021a))
DALL-E (Ramesh et al., 2021), CogView (Ding et al., 2021)
→ extend to video: GODIVA (Wu et al., 2021a), VideoGPT (Yan et al., 2021)
↓ extend to multimodal: NÜWA

Overview of NÜWA
[Figure: a 1D-Encoder takes the input text ("A light wind blew across the country road."), a 2D-Encoder takes an input image sketch or input image parts, and a 3D-Encoder takes an input video sketch or input video frames; a 3D-Decoder produces the output image, output video, remaining parts, or future frames. The outputs are grouped into visual generation and visual completion, prediction, and manipulation.]
Figure 4: Structure of NÜWA. The figure is cited from (Wu et al., 2021b).

VQGAN
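$$z_i = \arg\min_{j \in \{0, \ldots, N-1\}} \left\| E(I)_i - B_j \right\|^2, \tag{1}$$
$$\hat{I} = G(B[z]), \tag{2}$$
$$L^V = \| I - \hat{I} \|_2^2 + \| \mathrm{sg}[E(I)] - B[z] \|_2^2 + \| E(I) - \mathrm{sg}[B[z]] \|_2^2, \tag{3}$$
$$L^P = \| \mathrm{CNN}(I) - \mathrm{CNN}(\hat{I}) \|_2^2, \tag{4}$$
$$L^G = \log D(I) + \log\left(1 - D(\hat{I})\right), \tag{5}$$
where $E(I)_i \in \mathbb{R}^{d_B}$, $I \in \mathbb{R}^{H \times W \times C}$, $E(I) \in \mathbb{R}^{h \times w \times d_B}$, $d_B = 256$, $N = 12{,}288$, $H = W \in \{256, 336\}$, and $h = w \in \{16, 21, 32\}$.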
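The quantization step in Eq. (1) and the codebook lookup B[z] in Eq. (2) can be sketched in NumPy. This is only an illustration under stated assumptions: the encoder E and decoder G are not implemented, the feature map is random stand-in data, and the function names `quantize` and `lookup` and the toy sizes are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Eq. (1): index of the nearest codebook entry for each feature vector."""
    h, w, d = features.shape
    flat = features.reshape(-1, d)                       # (h*w, d_B)
    # Squared Euclidean distance between every feature and every codebook entry.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(h, w)            # token grid z

def lookup(tokens, codebook):
    """B[z]: the quantized features that a decoder G would consume in Eq. (2)."""
    return codebook[tokens]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))     # toy N = 32, d_B = 8 (slide: N = 12,288, d_B = 256)
features = rng.normal(size=(4, 4, 8))   # random stand-in for E(I) with h = w = 4
z = quantize(features, codebook)        # integer grid of shape (h, w)
quantized = lookup(z, codebook)         # array of shape (h, w, d_B)
```

With the slide's settings, an image encoded at h = w = 16 becomes a grid of 16 × 16 = 256 discrete tokens.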

3D Nearby Self-Attention (3DNA)
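$$Y = \mathrm{3DNA}(X, C; W), \quad X \in \mathbb{R}^{h \times w \times s \times d^{in}}, \; C \in \mathbb{R}^{h' \times w' \times s' \times d^{in}}, \tag{6}$$
$$N^{(i,j,k)} = \left\{ C_{abc} \;\middle|\; |a - i'| \le e^h, \; |b - j'| \le e^w, \; |c - k'| \le e^s \right\} \in \mathbb{R}^{e^h \times e^w \times e^s \times d^{in}}, \tag{7}$$
$$Q^{(i,j,k)} = X W^Q \in \mathbb{R}^{h \times w \times s \times d^{out}}, \tag{8}$$
$$K^{(i,j,k)} = N^{(i,j,k)} W^K \in \mathbb{R}^{e^h \times e^w \times e^s \times d^{out}}, \tag{9}$$
$$V^{(i,j,k)} = N^{(i,j,k)} W^V \in \mathbb{R}^{e^h \times e^w \times e^s \times d^{out}}, \tag{10}$$
$$y_{ijk} = \mathrm{softmax}\left( \frac{(Q^{(i,j,k)})^{T} (K^{(i,j,k)})^{T}}{\sqrt{d^{in}}} \right) V^{(i,j,k)}, \tag{11}$$
where $W^Q, W^K, W^V \in \mathbb{R}^{d^{in} \times d^{out}}$.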
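A minimal NumPy sketch of Eqs. (7) to (11) for a single query position may help make the neighbourhood lookup concrete. Several things here are assumptions rather than content of the slide: the centre (i', j', k') is taken to be the query position rescaled to the condition grid, the window follows the inequalities of Eq. (7) literally (so up to 2e+1 entries per axis) and is clipped at the borders, and the causal restriction to previous tokens is ignored. The function name `nearby_attention` and the toy shapes are hypothetical.

```python
import numpy as np

def nearby_attention(X, C, Wq, Wk, Wv, i, j, k, extent):
    """Output y_ijk of Eq. (11) for one query position (i, j, k)."""
    h, w, s, d_in = X.shape
    hc, wc, sc, _ = C.shape
    eh, ew, es = extent
    # Assumed mapping of the query position onto the condition grid: (i', j', k').
    ic, jc, kc = i * hc // h, j * wc // w, k * sc // s
    # Neighbourhood N^(i,j,k) of Eq. (7), clipped at the borders of C.
    block = C[max(ic - eh, 0): ic + eh + 1,
              max(jc - ew, 0): jc + ew + 1,
              max(kc - es, 0): kc + es + 1]
    N = block.reshape(-1, d_in)             # (n, d_in), n = number of neighbours
    q = X[i, j, k] @ Wq                     # query for this position, (d_out,)
    K = N @ Wk                              # keys, (n, d_out)
    V = N @ Wv                              # values, (n, d_out)
    scores = K @ q / np.sqrt(d_in)          # one score per neighbour
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                      # (d_out,)

rng = np.random.default_rng(0)
d_in, d_out = 6, 5
Wq, Wk, Wv = (rng.normal(size=(d_in, d_out)) for _ in range(3))
X = rng.normal(size=(4, 4, 2, d_in))
# Self-attention case: the condition C is X itself.
y = nearby_attention(X, X, Wq, Wk, Wv, i=2, j=1, k=0, extent=(1, 1, 1))
```

Each query attends only to its local window, so the cost per query depends on the window size rather than on the full grid.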

3DNA
3D sparse attentions, each considering previous tokens:
3D block-sparse: in a fixed 3D block.
3D axial-sparse (row): in each 3D axis.
3D nearby-sparse (ours): in a 3D nearby sliding window.
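To give a rough sense of why a nearby window reduces computation, the sketch below counts query-key pairs for full attention and for a nearby window. It is an upper-bound count under assumptions: the window spans 2e+1 positions per axis as in the inequalities of Eq. (7), border clipping and the restriction to previous tokens are ignored, and the grid size and extents are hypothetical values chosen only for illustration.

```python
def attention_pairs(h, w, s, extent=None):
    """Number of query-key pairs; extent=None means full attention."""
    tokens = h * w * s
    if extent is None:
        return tokens * tokens
    eh, ew, es = extent
    # Window size per query, capped by the grid size along each axis.
    window = min(2 * eh + 1, h) * min(2 * ew + 1, w) * min(2 * es + 1, s)
    return tokens * window

full = attention_pairs(16, 16, 10)                      # 2560 * 2560 pairs
nearby = attention_pairs(16, 16, 10, extent=(1, 1, 1))  # 2560 * 27 pairs
print(full, nearby)  # prints "6553600 69120"
```

The full count grows quadratically with the number of tokens, while the windowed count grows linearly for a fixed window.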

3D Encoder-Decoder
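$$Y_{ijk} := Y_{ijk} + P^{h}_{i} + P^{w}_{j} + P^{s}_{k} \tag{12}$$
$$C_{ijk} := C_{ijk} + P^{h'}_{i} + P^{w'}_{j} + P^{s'}_{k} \tag{13}$$
$$C^{(l)} = \mathrm{3DNA}(C^{(l-1)}, C^{(l-1)}), \tag{14}$$
$$Y^{(l)}_{ijk} = \mathrm{3DNA}\left(Y^{(l-1)}_{<i,<j,<k}, Y^{(l-1)}_{<i,<j,<k}\right) + \mathrm{3DNA}\left(Y^{(l-1)}_{<i,<j,<k}, C^{(L)}\right), \tag{15}$$
where $Y \in \mathbb{R}^{h \times w \times s \times d^{out}}$, $C \in \mathbb{R}^{h' \times w' \times s' \times d^{in}}$, and $V^{(1)}_{0,0,0}$ is the $\langle \mathrm{bos} \rangle$ token.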
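The axis-wise positional terms in Eq. (12) amount to a broadcast sum of one embedding table per axis. The sketch below illustrates this under assumptions: the tables are random stand-ins for whatever embeddings the model uses, and the name `add_axis_positions` and the toy shapes are hypothetical.

```python
import numpy as np

def add_axis_positions(Y, P_h, P_w, P_s):
    """Eq. (12): Y_ijk + P^h_i + P^w_j + P^s_k via broadcasting."""
    return (Y
            + P_h[:, None, None, :]    # depends only on the height index i
            + P_w[None, :, None, :]    # depends only on the width index j
            + P_s[None, None, :, :])   # depends only on the temporal index k

rng = np.random.default_rng(0)
h, w, s, d = 4, 4, 2, 8
Y = rng.normal(size=(h, w, s, d))
P_h = rng.normal(size=(h, d))          # stand-in embedding table for the height axis
P_w = rng.normal(size=(w, d))          # stand-in embedding table for the width axis
P_s = rng.normal(size=(s, d))          # stand-in embedding table for the temporal axis
Y_pos = add_axis_positions(Y, P_h, P_w, P_s)
```

Eq. (13) has the same form for the condition C, using tables sized for its own grid.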

Training Objective
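$$L = -\sum_{t=1}^{h \times w} \log p_\theta\left(y_t \mid y_{<t}, C^{text}; \theta\right) - \sum_{t=1}^{h \times w \times s} \log p_\theta\left(y_t \mid y_{<t}, c; \theta\right) - \sum_{t=1}^{h \times w \times s} \log p_\theta\left(y_t \mid y_{<t}, C^{text}; \theta\right) \tag{16}$$
The model is trained on Text-to-Image (T2I), Video Prediction (V2V), and Text-to-Video (T2V) with a cross-entropy loss; the three sums in Eq. (16) correspond to these tasks in that order.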
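Each sum in Eq. (16) is a negative log-likelihood over a token sequence, which can be sketched in NumPy as follows. This is an illustration under assumptions: the per-step logits are random stand-ins for the decoder outputs (which in the model would depend on the previous tokens and the condition), and `sequence_nll`, `toy_task`, and the toy sizes are hypothetical.

```python
import numpy as np

def sequence_nll(logits, targets):
    """-sum_t log p(y_t | ...), given per-step logits of shape (T, vocab)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
vocab, h, w, s = 16, 2, 2, 3

def toy_task(num_tokens):
    """Random stand-ins for decoder logits and ground-truth tokens."""
    logits = rng.normal(size=(num_tokens, vocab))
    targets = rng.integers(0, vocab, size=num_tokens)
    return logits, targets

# T2I uses h*w tokens; V2V and T2V use h*w*s tokens, as in Eq. (16).
loss = sum(sequence_nll(*toy_task(n)) for n in (h * w, h * w * s, h * w * s))
```

As a sanity check, uniform predictions over a vocabulary of size N cost log N per token.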

Experiments - quantitative results
Model                          FID-0↓  FID-1  FID-2  FID-4  FID-8  IS↑   CLIPSIM↑
AttnGAN (Xu et al., 2018)      35.2    44.0   72.0   108.0  100.0  23.3  0.2772
DM-GAN (Zhu et al., 2019)      26.0    39.0   73.0   119.0  112.3  32.2  0.2838
DF-GAN (Tao et al., 2020)      26.0    33.8   55.9   91.0   97.0   18.7  0.2928
DALL-E (Ramesh et al., 2021)   27.5    28.0   45.5   83.5   85.0   17.9  -
CogView (Ding et al., 2021)    27.1    19.4   13.9   19.4   23.6   18.2  0.3325
XMC-GAN (Zhang et al., 2021)   9.3     -      -      -      -      30.5  -
NÜWA                           12.9    13.8   15.7   19.3   24     27.2  0.3429
Table 1: T2I task on MSCOCO (256×256).

Model                                   Acc↑  FID-img↓  FID-vid↓  CLIPSIM↑
T2V (64×64) (Li et al., 2018)           42.6  82.13     14.65     0.2853
SC (128×128) (Balaji et al., 2019)      74.7  33.51     7.34      0.2915
TFGAN (128×128) (Balaji et al., 2019)   76.2  31.76     7.19      0.2961
NÜWA (128×128)                          77.9  28.46     7.05      0.3012
Table 2: T2V task on the Kinetics dataset.

Model                                              Cond.  FVD↓
MoCoGAN (Tulyakov et al., 2018)                    4      503
SVG-FP (Denton and Fergus, 2018)                   2      315
CNDA (Finn et al., 2016)                           2      297
SV2P (Babaeizadeh et al., 2017)                    2      263
SRVP (Franceschi et al., 2020)                     2      181
VideoFlow (Kumar et al., 2019)                     3      131
LVT (Rakhimov et al., 2020)                        1      126±3
SAVP (Lee et al., 2018)                            2      116
DVD-GAN-FP (Clark et al., 2019)                    1      110
Video Transformer (S) (Weissenborn et al., 2020)   1      106±3
TriVD-GAN-FP (Luc et al., 2020)                    1      103
CCVS (Moing et al., 2021)                          1      99±2
Video Transformer (L) (Weissenborn et al., 2020)   1      94±2
NÜWA                                               1      86.9
Table 3: V2V task on BAIR (64×64).

Experiments - qualitative results
[Figure: MSCOCO captions (for example "A very cute cat laying by a big bike." and "A green train is coming down the tracks.") with images generated by XMC-GAN, DALL-E, and NÜWA, all at 256×256.]
Figure 6: T2I task on MSCOCO. The figure is cited from (Wu et al., 2021b).

[Figure: videos for the input texts "playing golf on grass", "playing golf at swimming pool", and "running on the sea", with results from T2V (64×64), TFGAN (128×128), GODIVA (128×128), and NÜWA (336×336).]
Figure 7: T2V task on the Kinetics dataset. The figure is cited from (Wu et al., 2021b).

[Figure: input sketches, ground truth, and images generated by Taming (256×256), SPADE (256×256), and NÜWA (256×256).]
Figure 8: S2I task on the MSCOCO stuff dataset. The figure is cited from (Wu et al., 2021b).
[Figure: input image parts with completions by Taming (256×256) and NÜWA (256×256).]
Figure 9: I2I in a zero-shot manner. The figure is cited from (Wu et al., 2021b).
