NÜWA: Visual Synthesis Pre-training for
Neural visUal World creAtion
Chenfei Wu1∗ Jian Liang2∗ Lei Ji1 Fan Yang1 Yuejian Fang2 Daxin Jiang1 Nan Duan1†
1Microsoft Research Asia 2Peking University
Presenter: Kai Katsumata
Nakayama Lab.
∗ Both authors contributed equally to this research.
† Corresponding author.
Basic information
Title NÜWA: Visual Synthesis Pre-training for Neural visUal World
creAtion
Authors Chenfei Wu Jian Liang Lei Ji Fan Yang Yuejian Fang
Daxin Jiang Nan Duan
Affiliation Microsoft Research Asia Peking University
Date 2021/11/24 (arXiv) https://arxiv.org/abs/2111.12417
Project URL https://github.com/microsoft/NUWA
1 / 26
Abstract
“This paper presents a unified multimodal pre-trained model called NÜWA that can generate
new or manipulate existing visual data (i.e., images and videos) for various visual synthesis
tasks. To cover language, image, and video at the same time for different scenarios, a 3D
transformer encoder-decoder framework is designed, which can not only deal with videos as
3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby
Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and
reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared
to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation,
text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good
zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is
https://github.com/microsoft/NUWA.” (Wu et al., 2021b)
2 / 26
Teaser figure
[Figure 1 montage: panels for Text-To-Image (T2I), Sketch-To-Image (S2I), Image Completion (I2I), Image Manipulation (TI2I), Text-To-Video (T2V), Sketch-To-Video (S2V), Video Prediction (V2V), and Video Manipulation (TV2V), with prompts such as "A dog with goggles staring at the camera", "a horse is running on the grassland", and "The car is reversing", and sketch labels such as grass, water, house, sky, and tree.]
Figure 1: NÜWA supports several typical generation and manipulation tasks. The figure is cited from (Wu et al., 2021b).
3 / 26
VQ-VAE-based Visual Auto-Regressive Models (previous
work)
[Figure 2: a Conv3D encoder maps the input to discrete latents via a codebook, the flattened latent sequence is modeled autoregressively by a Transformer, and a Conv3D decoder reconstructs the target.]
Figure 2: The figure is cited from (Yan et al., 2021), CC BY.
4 / 26
VQ-VAE-based Visual Auto-Regressive Models (previous
work)
[Figure 3: the input text passes through a sentence-piece text tokenizer and the input image through a discrete auto-encoder image tokenizer; the flattened sequence [ROI1] text tokens [BASE] [BOI1] image tokens [EOI1] is modeled by a GPT-style Transformer, whose decoder recovers the image from the discrete image tokens.]
Figure 3: The figure is cited from (Ding et al., 2021)
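To make the shared recipe in Figures 2-3 concrete, here is a minimal sketch (my own illustration, not the authors' code) of the tokenize-then-autoregress pipeline: a discrete auto-encoder turns the image into codebook indices, which are flattened and appended to the text tokens for a GPT-style transformer trained with next-token prediction. The objects text_tokenizer, image_tokenizer, and gpt are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def autoregressive_t2i_loss(text, image, text_tokenizer, image_tokenizer, gpt):
    """One teacher-forced training step of the tokenize-then-autoregress recipe."""
    text_ids = text_tokenizer(text)              # (L_text,) discrete text tokens
    image_ids = image_tokenizer.encode(image)    # (h, w) codebook indices from the discrete auto-encoder
    tokens = torch.cat([text_ids, image_ids.flatten()])  # raster-scan flatten, then concatenate

    logits = gpt(tokens[:-1])                    # predict token t from tokens < t
    return F.cross_entropy(logits, tokens[1:])   # next-token cross-entropy over the whole sequence
```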
5 / 26
Difference between previous work (DALL-E (Ramesh et al., 2021), CogView (Ding et al., 2021), GODIVA (Wu et al., 2021a))
DALL-E (Ramesh et al., 2021), CogView (Ding et al., 2021)
→ extended to video: GODIVA (Wu et al., 2021a), VideoGPT (Yan et al., 2021)
→ extended to multimodal: NÜWA
6 / 26
Overview of NÜWA
[Figure 4: a 1D encoder for input text (e.g. "A light wind blew across the country road."), a 2D encoder for input image sketches, and a 3D encoder for input video sketches, image parts, and video frames all feed a shared 3D decoder, which produces output images, output videos, remaining image parts, or future frames, covering visual generation as well as visual completion, prediction, and manipulation.]
Figure 4: Structure of NÜWA. The figure is cited from (Wu et al., 2021b)
7 / 26
VQGAN
z_i = \arg\min_{j \in \{0,\dots,N-1\}} \lVert E(I)_i - B_j \rVert_2, \quad E(I)_i \in \mathbb{R}^{d_B},   (1)
\hat{I} = G(B[z]),   (2)
\mathcal{L}_V = \lVert I - \hat{I} \rVert_2^2 + \lVert \mathrm{sg}[E(I)] - B[z] \rVert_2^2 + \lVert E(I) - \mathrm{sg}[B[z]] \rVert_2^2,   (3)
\mathcal{L}_P = \lVert \mathrm{CNN}(I) - \mathrm{CNN}(\hat{I}) \rVert_2^2,   (4)
\mathcal{L}_G = \log D(I) + \log(1 - D(\hat{I})),   (5)
where I \in \mathbb{R}^{H \times W \times C}, E(I) \in \mathbb{R}^{h \times w \times d_B}, d_B = 256, N = 12{,}288, H = W \in \{256, 336\}, and h = w \in \{16, 21, 32\}.
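As a minimal PyTorch sketch (an assumption about the implementation, not the released VQ-GAN code), the quantization step behind Eqs. (1)-(3) can be written as a nearest-codebook lookup with straight-through gradients; the two MSE terms mirror the sg[·] losses above.

```python
import torch
import torch.nn.functional as F

def vector_quantize(e, codebook):
    """Nearest-codebook lookup as in Eqs. (1)-(3).

    e:        encoder output E(I), shape (h*w, d_B)
    codebook: embedding matrix B,  shape (N, d_B)
    """
    distances = torch.cdist(e, codebook)         # pairwise L2 distances, shape (h*w, N)
    z = distances.argmin(dim=1)                  # z_i = argmin_j ||E(I)_i - B_j||_2   (Eq. 1)
    quantized = codebook[z]                      # B[z]

    # Codebook and commitment terms of L_V (Eq. 3); sg[.] corresponds to .detach()
    codebook_loss = F.mse_loss(e.detach(), quantized)
    commit_loss = F.mse_loss(e, quantized.detach())

    # Straight-through estimator: gradients flow to the encoder, bypassing the argmin.
    quantized = e + (quantized - e).detach()
    return z, quantized, codebook_loss + commit_loss

# Example with d_B = 256 and N = 12288 as on this slide.
e = torch.randn(16 * 16, 256)
B = torch.randn(12288, 256)
z, q, loss = vector_quantize(e, B)
```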
8 / 26
3D Nearby Self-Attention (3DNA)
Y = \mathrm{3DNA}(X, C; W), \quad X \in \mathbb{R}^{h \times w \times s \times d_{in}}, \; C \in \mathbb{R}^{h' \times w' \times s' \times d_{in}},   (6)
N^{(i,j,k)} = \left\{ C_{abc} \;\middle|\; |a - i'| \le e^h,\; |b - j'| \le e^w,\; |c - k'| \le e^s \right\} \in \mathbb{R}^{e^h \times e^w \times e^s \times d_{in}},   (7)
Q^{(i,j,k)} = X W^Q \in \mathbb{R}^{h \times w \times s \times d_{out}},   (8)
K^{(i,j,k)} = N^{(i,j,k)} W^K \in \mathbb{R}^{e^h \times e^w \times e^s \times d_{out}},   (9)
V^{(i,j,k)} = N^{(i,j,k)} W^V \in \mathbb{R}^{e^h \times e^w \times e^s \times d_{out}},   (10)
y_{ijk} = \mathrm{softmax}\!\left( \frac{\bigl(Q^{(i,j,k)}_{ijk}\bigr)^{T} \bigl(K^{(i,j,k)}\bigr)^{T}}{\sqrt{d_{in}}} \right) V^{(i,j,k)},   (11)
where W^Q, W^K, W^V \in \mathbb{R}^{d_{in} \times d_{out}}.
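A naive loop-form sketch of Eqs. (6)-(11) under my reading of the notation (for clarity only; a real implementation would be vectorized and batched): each position (i, j, k) attends only to the nearby window of the condition C, so the cost per layer scales with the window size rather than with the full 3D grid.

```python
import torch
import torch.nn.functional as F

def nearby_attention_3d(X, C, Wq, Wk, Wv, extent=(2, 2, 2)):
    """Loop-form 3D nearby attention following Eqs. (6)-(11).

    X: (h, w, s, d_in) query tensor; C: (h', w', s', d_in) condition tensor.
    """
    h, w, s, d_in = X.shape
    hc, wc, sc, _ = C.shape
    eh, ew, es = extent
    Y = torch.empty(h, w, s, Wv.shape[1])
    for i in range(h):
        for j in range(w):
            for k in range(s):
                # Map (i, j, k) onto the condition grid: i' = i * h'/h, etc.
                ic, jc, kc = i * hc // h, j * wc // w, k * sc // s
                # Nearby window N^(i,j,k) of Eq. (7), flattened to (n, d_in).
                N = C[max(ic - eh, 0):ic + eh + 1,
                      max(jc - ew, 0):jc + ew + 1,
                      max(kc - es, 0):kc + es + 1].reshape(-1, d_in)
                q = X[i, j, k] @ Wq                   # query vector, (d_out,)
                K, V = N @ Wk, N @ Wv                 # keys/values over the window, (n, d_out)
                attn = F.softmax(q @ K.T / d_in ** 0.5, dim=-1)
                Y[i, j, k] = attn @ V                 # Eq. (11)
    return Y

# Tiny usage example with made-up sizes.
X = torch.randn(4, 4, 4, 8)
C = torch.randn(8, 8, 8, 8)
Wq = Wk = Wv = torch.randn(8, 16)
Y = nearby_attention_3d(X, C, Wq, Wk, Wv)
```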
9 / 26
3DNA
• 3D block-sparse: considers previous tokens in a fixed 3D block.
• 3D axial-sparse (row): considers previous tokens along each 3D axis.
• 3D nearby-sparse (ours): considers previous tokens in a 3D nearby sliding window.
3D sparse attentions.
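To make the three patterns concrete, the following sketch (my own illustration, not from the paper) builds boolean attention masks over a flattened h×w×s token grid; the extent and block parameters are hypothetical, and a causal "previous tokens only" mask would additionally be intersected with each of these in the decoder.

```python
import torch

def sparse_masks(h, w, s, extent=1, block=2):
    """Boolean masks (True = may attend) for the three 3D sparse patterns above."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(h), torch.arange(w), torch.arange(s), indexing="ij"),
        dim=-1).reshape(-1, 3)                              # (T, 3) grid coordinates
    d = (coords[:, None, :] - coords[None, :, :]).abs()     # (T, T, 3) coordinate distances

    nearby = (d <= extent).all(dim=-1)                      # sliding 3D window
    axial = (d == 0).sum(dim=-1) >= 2                       # differ in at most one axis
    blocks = coords // block
    blockwise = (blocks[:, None, :] == blocks[None, :, :]).all(dim=-1)  # same fixed 3D block
    return nearby, axial, blockwise

nearby, axial, blockwise = sparse_masks(4, 4, 4)
print(nearby.float().mean(), axial.float().mean(), blockwise.float().mean())
```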
10 / 26
3D Encoder-Decoder
Y_{ijk} := Y_{ijk} + P^{h}_{i} + P^{w}_{j} + P^{s}_{k},   (12)
C_{ijk} := C_{ijk} + P^{h'}_{i} + P^{w'}_{j} + P^{s'}_{k},   (13)
C^{(l)} = \mathrm{3DNA}(C^{(l-1)}, C^{(l-1)}),   (14)
Y^{(l)}_{ijk} = \mathrm{3DNA}\bigl(Y^{(l-1)}_{<i,j,k}, Y^{(l-1)}_{<i,j,k}\bigr) + \mathrm{3DNA}\bigl(Y^{(l-1)}_{<i,j,k}, C^{(L)}\bigr),   (15)
where Y \in \mathbb{R}^{h \times w \times s \times d_{out}}, C \in \mathbb{R}^{h' \times w' \times s' \times d_{in}}, and V^{(1)}_{0,0,0} is \langle\mathrm{bos}\rangle.
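A hedged PyTorch sketch of Eqs. (12)-(15): learnable per-axis positional embeddings are broadcast-added, and each decoder layer sums a causal self-3DNA term with a cross-3DNA term against the final encoder output C^(L). The module and function names are mine, and the attention callables are placeholders.

```python
import torch
import torch.nn as nn

class AxisPositionEmbedding3D(nn.Module):
    """Adds the three learnable axis embeddings of Eqs. (12)-(13): P^h_i + P^w_j + P^s_k."""

    def __init__(self, h, w, s, dim):
        super().__init__()
        self.ph = nn.Parameter(torch.zeros(h, 1, 1, dim))
        self.pw = nn.Parameter(torch.zeros(1, w, 1, dim))
        self.ps = nn.Parameter(torch.zeros(1, 1, s, dim))

    def forward(self, y):                     # y: (h, w, s, dim)
        return y + self.ph + self.pw + self.ps  # broadcasting adds one embedding per axis

def decoder_layer(y_prev, c_final, self_3dna, cross_3dna):
    """Eq. (15): causally masked self-3DNA on the generated tokens plus
    cross-3DNA against the final encoder output C^(L); both modules are placeholders."""
    return self_3dna(y_prev, y_prev) + cross_3dna(y_prev, c_final)

# Tiny usage example with made-up sizes.
pos = AxisPositionEmbedding3D(4, 4, 4, 16)
y = pos(torch.randn(4, 4, 4, 16))
```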
11 / 26
Training Objective
\mathcal{L} = -\sum_{t=1}^{h \times w} \log p_\theta\bigl(y_t \mid y_{<t}, C^{\mathrm{text}}; \theta\bigr) - \sum_{t=1}^{h \times w \times s} \log p_\theta\bigl(y_t \mid y_{<t}, c; \theta\bigr) - \sum_{t=1}^{h \times w \times s} \log p_\theta\bigl(y_t \mid y_{<t}, C^{\mathrm{text}}; \theta\bigr)   (16)
Training on Text-to-Image (T2I), Video Prediction (V2V), and Text-to-Video (T2V) with a cross-entropy loss; the three terms correspond to these three pre-training tasks (text-conditioned image tokens, video tokens conditioned on c, and text-conditioned video tokens, respectively).
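A schematic of the three-term objective in Eq. (16), assuming the token sequences and task conditions have already been produced by the tokenizers and encoders above (the naming is mine, not the authors').

```python
import torch.nn.functional as F

def nuwa_multitask_loss(model, t2i_batch, v2v_batch, t2v_batch):
    """Sum of the three cross-entropy terms in Eq. (16): T2I, V2V, and T2V.

    Each batch is a (tokens, condition) pair; tokens has shape (B, T).
    """
    losses = []
    for tokens, condition in (t2i_batch, v2v_batch, t2v_batch):
        # Teacher forcing: predict y_t from y_<t and the task's condition C.
        logits = model(tokens[:, :-1], condition)           # (B, T-1, vocab)
        losses.append(F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)))
    return sum(losses)
```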
12 / 26
Experiments - quantitative results
Model FID-0↓ FID-1 FID-2 FID-4 FID-8 IS↑ CLIPSIM↑
AttnGAN (Xu et al., 2018) 35.2 44.0 72.0 108.0 100.0 23.3 0.2772
DM-GAN (Zhu et al., 2019) 26.0 39.0 73.0 119.0 112.3 32.2 0.2838
DF-GAN (Tao et al., 2020) 26.0 33.8 55.9 91.0 97.0 18.7 0.2928
DALL-E (Ramesh et al., 2021) 27.5 28.0 45.5 83.5 85.0 17.9 -
CogView (Ding et al., 2021) 27.1 19.4 13.9 19.4 23.6 18.2 0.3325
XMC-GAN (Zhang et al., 2021) 9.3 - - - - 30.5 -
NÜWA 12.9 13.8 15.7 19.3 24 27.2 0.3429
Table 1: T2I task on MSCOCO (256×256).
Model Acc↑ FID-img↓ FID-vid↓ CLIPSIM↑
T2V (64×64) (Li et al., 2018) 42.6 82.13 14.65 0.2853
SC (128×128) (Balaji et al., 2019) 74.7 33.51 7.34 0.2915
TFGAN (128×128) (Balaji et al., 2019) 76.2 31.76 7.19 0.2961
NÜWA (128×128) 77.9 28.46 7.05 0.3012
Table 2: T2V task on the Kinetics dataset.
Model Cond. FVD↓
MoCoGAN (Tulyakov et al., 2018) 4 503
SVG-FP (Denton and Fergus, 2018) 2 315
CNDA (Finn et al., 2016) 2 297
SV2P (Babaeizadeh et al., 2017) 2 263
SRVP (Franceschi et al., 2020) 2 181
VideoFlow (Kumar et al., 2019) 3 131
LVT (Rakhimov et al., 2020) 1 126±3
SAVP (Lee et al., 2018) 2 116
DVD-GAN-FP (Clark et al., 2019) 1 110
Video Transformer (S) (Weissenborn et al., 2020) 1 106±3
TriVD-GAN-FP (Luc et al., 2020) 1 103
CCVS (Moing et al., 2021) 1 99±2
Video Transformer (L) (Weissenborn et al., 2020) 1 94±2
NÜWA 1 86.9
Table 3: V2V task on BAIR (64×64).
13 / 26
Experiments - qualitative results
[Figure 6 montage: MSCOCO captions such as "A very cute cat laying by a big bike.", "A green train is coming down the tracks.", "A living area with a television and a table.", and "A child eating a birthday cake near some balloons.", with 256×256 samples from XMC-GAN, DALL-E, and NÜWA (ours).]
Figure 6: T2I task on MSCOCO. The figure is cited from (Wu et al., 2021b).
14 / 26
Experiments - qualitative results
[Figure 7: T2V samples for the input texts "playing golf at swimming pool", "running on the sea", and "playing golf on grass", comparing T2V (64×64), TFGAN (128×128), GODIVA (128×128), and NÜWA (ours, 336×336).]
Figure 7: T2V task on the Kinetics dataset. The figure is cited from (Wu et al., 2021b).
15 / 26
Experiments - qualitative results
[Figure 8: sketch inputs, ground truth, and 256×256 outputs from Taming, SPADE, and NÜWA (ours).]
Figure 8: S2I task on the MSCOCO stuff dataset. The figure is cited from (Wu et al., 2021b).
[Figure 9: partially masked inputs and 256×256 completions from Taming and NÜWA (ours).]
Figure 9: I2I in a zero-shot manner. The figure is cited from (Wu et al., 2021b).
16 / 26
Experiments - qualitative results
[Figure 10: manipulation prompts "A photo of a camping tent", "A photo of a bouquet of flowers", and "A photo of a blue firetruck", with the raw image and results from Paint By Word and NÜWA (ours).]
Figure 10: TI2I in a zero-shot manner. The figure is cited from (Wu et al., 2021b).
[Figure 11: raw images and raw sketches alongside their reconstructions.]
Figure 11: Reconstruction samples of VQ-GAN and VQ-GAN-Seg. The figure is cited from (Wu et al., 2021b).
17 / 26
Experiments - qualitative results
Raw video plus three manipulations of the same video:
Manipulation 1: The diver is swimming to the surface.
Manipulation 2: The diver is swimming to the bottom.
Manipulation 3: The diver is flying to the sky.
Figure 12: Samples of different manipulations on the same video. The figure is cited from (Wu et al., 2021b).
18 / 26
Experiments - quantitative results
Model Dataset R → D Rate SSIM FID
VQ-VAE ImageNet 256² → 16² F16 0.7026 13.3
VQ-GAN ImageNet 256² → 16² F16 0.7105 6.04
VQ-GAN ImageNet 256² → 32² F8 0.8285 2.03
VQ-GAN ImageNet 336² → 21² F16 0.7213 4.79
VQ-GAN OpenImages 336² → 21² F16 0.7527 4.31
Model Dataset R → D Rate PA FWIoU
VQ-GAN-Seg MSCOCO 336² → 21² F16 96.82 93.91
VQ-GAN-Seg VSPW 336² → 21² F16 95.36 91.82
Table 4: Effectiveness of different VQ-VAE (VQ-GAN) settings.
Model Pre-trained Tasks FID-vid↓ CLIPSIM↑
NÜWA-TV T2V 52.98 0.2314
NÜWA-TV-TI T2V+T2I 53.92 0.2379
NÜWA-TV-VV T2V+V2V 51.81 0.2335
NÜWA T2V+T2I+V2V 47.68 0.2439
Table 5: Effectiveness of multi-task pre-training for the T2V task on MSRVTT.
Model Encoder Decoder FID-vid↓ Detected PA↑
NÜWA-FF Full Full 35.21 0.5220
NÜWA-NF Nearby Full 33.63 0.5357
NÜWA-FN Full Nearby 32.06 0.5438
NÜWA-AA Axis Axis 29.18 0.5957
NÜWA Nearby Nearby 27.79 0.6085
Table 6: Effectiveness of 3D nearby attention for the S2V task on VSPW.
19 / 26
Experiments - qualitative results
[Figure 13: raw images and reconstructions from VQ-VAE (ImageNet), VQ-GAN (ImageNet), and VQ-GAN (OpenImages) under the different compression settings above.]
Figure 13: Reconstruction results of R → D compression settings on VQ-VAE (VQ-GAN). The figure is cited from (Wu et al., 2021b) but does not appear in the submitted paper.
20 / 26
Questions / weaknesses
• Needs an ablation study.
• Previous works use different decoders (dVAE, VQ-VAE, VQ-GAN).
• Needs more comparison with CogView.
• More pre-training tasks?
21 / 26
Summary
• NÜWA is a novel multimodal zero-shot image generation model.
• 3DNA enables a unified representation of text, image, and video.
• Important: training took 64 A100 GPUs for two weeks.
22 / 26
References i
Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. Conditional GAN with discriminative filter generation for text-to-video synthesis. In IJCAI, pages 1995–2001, 2019.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341, 2019.
Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1174–1183. PMLR, 2018.
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. 2021. URL http://arxiv.org/abs/2105.13290.
Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. Advances in Neural Information Processing Systems, 29:64–72, 2016.
Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari. Stochastic latent residual video prediction. In International Conference on Machine Learning, pages 3233–3246. PMLR, 2020.
23 / 26
References ii
Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A conditional flow-based model for stochastic video generation. arXiv preprint arXiv:1903.01434, 2019.
Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020.
Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. CCVS: Context-aware Controllable Video Synthesis. arXiv preprint arXiv:2107.08037, 2021.
Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, and Evgeny Burnaev. Latent Video Transformer. arXiv preprint arXiv:2006.10704, 2020.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. 2021.
24 / 26
References iii
Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535, 2018.
Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR, 2020.
Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions. arXiv preprint arXiv:2104.14806, April 2021a.
Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NUWA: Visual Synthesis Pre-training for Neural visUal World creAtion. 2021b. URL http://arxiv.org/abs/2111.12417.
25 / 26
References iv
Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video Generation using VQ-VAE and Transformers. arXiv preprint arXiv:2104.10157, 2021.
Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–842, 2021.
Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.
26 / 26