Latent diffusions
vs
DALL-E v2
by Vitaly Bondar
johngull @ ODS | gmail
Recap: diffusion models
Recap: diffusion models
Latent diffusions
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann et al.
https://arxiv.org/pdf/2112.10752.pdf
https://github.com/CompVis/latent-diffusion
https://colab.research.google.com/github/multimodalart/latent-diffusion-notebook/blob/main/Latent_Diffusion_LAION_400M_model_text_to_image.ipynb
https://huggingface.co/spaces/multimodalart/latentdiffusion
Latent diffusions
Latent diffusions: long story short
1. Take “Taming Transformers” (VQGAN)
2. Replace the transformer with a conditional diffusion model
3. PROFIT
Thank you.
Questions?
Latent diffusions
● VQGAN used for encoding/decoding
● Generation happens in a compact, perceptually equivalent latent space
● The UNet of the diffusion model keeps its convolutional inductive bias and scales well
● Cross-attention or channel stacking is used for conditioning (sketched below)
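A minimal sketch of the cross-attention conditioning mentioned above, assuming PyTorch; this is not the CompVis code, and module/variable names are illustrative. Flattened UNet latent features act as queries, and conditioning tokens (e.g. text-encoder outputs) act as keys and values.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim, cond_dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, x, cond):
        # x:    (B, H*W, latent_dim) flattened UNet feature map -> queries
        # cond: (B, T, cond_dim) conditioning tokens (e.g. text encoder output) -> keys/values
        out, _ = self.attn(self.norm(x), cond, cond, need_weights=False)
        return x + out  # residual connection

x = torch.randn(2, 16 * 16, 320)    # latent features of a 16x16 map with 320 channels
cond = torch.randn(2, 77, 768)      # 77 conditioning tokens of dim 768
print(CrossAttention(320, 768)(x, cond).shape)   # -> torch.Size([2, 256, 320])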
Latent diffusions: training
2 training phases:
1. Autoencoder
a. Loss: patch-based GAN loss + perceptual loss
b. Regularization: KL loss (close to a VAE) OR quantization in the decoder (like VQGAN)
2. Various generative tasks
a. All training runs done on a single A100
b. Loss: the classical diffusion L2 noise-prediction loss (sketched below)
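A hedged sketch of the phase-2 objective: the standard noise-prediction L2 loss, computed in the latent space of the frozen phase-1 autoencoder. Here `encoder`, `unet`, and `alphas_cumprod` are placeholders for your concrete modules and noise schedule, not the actual LDM code.

import torch
import torch.nn.functional as F

def latent_diffusion_loss(unet, encoder, x0, cond, alphas_cumprod):
    # x0: batch of images; cond: conditioning (text tokens, class labels, ...)
    with torch.no_grad():
        z0 = encoder(x0)                                   # image -> compact latent
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # forward process q(z_t | z_0)
    eps_hat = unet(z_t, t, cond)                           # UNet predicts the added noise
    return F.mse_loss(eps_hat, eps)                        # classical diffusion L2 loss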
Latent diffusions: autoencoder
Downsampling by 4-16x: speeds up generative training without loss of sample quality
KL regularization gives better autoencoder metrics, but the variant with quantization in the
decoder produces better-looking samples.
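For reference, a minimal sketch of the KL option just mentioned: a small KL penalty that keeps the autoencoder latent close to a standard normal, as in a VAE. The code is illustrative, not the CompVis implementation.

import torch

def kl_regularization(mean, logvar):
    # KL( N(mean, diag(exp(logvar))) || N(0, I) ), averaged over the batch;
    # added with a small weight to the GAN + perceptual reconstruction loss.
    return 0.5 * torch.mean(mean.pow(2) + logvar.exp() - 1.0 - logvar)

# reparameterized latent sample fed to the decoder during training:
# z = mean + (0.5 * logvar).exp() * torch.randn_like(mean)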
Latent diffusions: results
Latent diffusions: results
Latent diffusions: results
Without space conditioning
Latent diffusions: results
Latent diffusions: results
Latent diffusions: results
Text-to-image: a sunset behind a mountain range, vector image
Latent diffusions: results
Text-to-image
DALL-E 2 (unCLIP)
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
https://cdn.openai.com/papers/dall-e-2.pdf
https://openai.com/dall-e-2/
P.S.: the paper is 💩
DALL-E 2 (unCLIP)
DALL-E 2 (unCLIP)
2 stages (4 in reality ;) )
● Generate the CLIP image embedding from the text encoding (or an image encoding)
● Decode the image embedding into an image (decoder + 2 stages of diffusion SR, sketched below)
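A hedged sketch of how the stages fit together. Every component is passed in as a placeholder callable (the names are made up), since the actual unCLIP models are not public.

def unclip_text_to_image(text, clip_text_encode, prior, decoder, sr_64_to_256, sr_256_to_1024):
    z_t = clip_text_encode(text)     # CLIP text embedding
    z_i = prior(z_t)                 # stage 1: prior maps text embedding -> CLIP image embedding
    img_64 = decoder(z_i, text)      # stage 2: GLIDE-style decoder renders a 64x64 image
    img_256 = sr_64_to_256(img_64)   # stage 3: diffusion super-resolution to 256x256
    return sr_256_to_1024(img_256)   # stage 4: diffusion super-resolution to 1024x1024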
DALL-E 2: Decoder
A modified GLIDE model (3.5B) converts the embedding
into an image, followed by diffusion upsampling 64->256 and 256->1024
GLIDE input: CLIP embedding projections, timestep, 4 context tokens (?)
Training:
● Use ¼ of the image
● Set the CLIP embeddings to zero (or a learned embedding) 10% of the time
● Drop the text caption 50% of the time (conditioning dropout sketched below)
● For the upsampling models, slightly corrupt the inputs (stage 1: Gaussian blur, stage 2: BSR degradations)
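A sketch of the conditioning dropout from the bullets above (10% for the CLIP embedding, 50% for the caption), which is what later enables classifier-free guidance. Tensor shapes and the null placeholders are illustrative, not OpenAI's code.

import torch

def drop_conditioning(clip_emb, text_tokens, null_emb, null_tokens,
                      p_clip=0.10, p_text=0.50):
    # clip_emb: (B, D) CLIP image embeddings; text_tokens: (B, T) caption tokens.
    # null_emb: zero or learned unconditional embedding; null_tokens: tokens of the empty caption.
    b = clip_emb.shape[0]
    drop_clip = torch.rand(b, device=clip_emb.device) < p_clip
    drop_text = torch.rand(b, device=clip_emb.device) < p_text
    clip_emb = torch.where(drop_clip[:, None], null_emb, clip_emb)
    text_tokens = torch.where(drop_text[:, None], null_tokens, text_tokens)
    return clip_emb, text_tokens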
“Our decoder model provides a unique opportunity to explore CLIP latent space by
allowing us to directly visualize what the CLIP image encoder is seeing.”
DALL-E 2: Decoder
“Our decoder model provides a unique opportunity to explore CLIP latent space by
allowing us to directly visualize what the CLIP image encoder is seeing.”
DALL-E 2: Prior model
2 types:
● Autoregressive
○ GPT-like
○ 319 principal components of the 1024-d CLIP image embedding, each quantized into 1024 discrete values
○ Dot product of the text and image embeddings as an input token (0.5 at inference)
● Diffusion model conditioned on input
○ Transformer-based
○ Causal mask for the predicted embedding
○ Prompt: encoded text, the CLIP text embedding, an embedding for the diffusion timestep, the noised CLIP image embedding
○ Generate two samples and select the one with the higher dot product with z_t (sketched below)
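A small sketch of the last bullet: sample two candidate image embeddings from the prior and keep the one with the higher dot product with the text embedding z_t. `sample_prior` is a placeholder for one reverse-diffusion run of the prior.

import torch

def pick_image_embedding(sample_prior, z_t, n=2):
    # sample_prior(z_t) -> one candidate CLIP image embedding of shape (D,)
    candidates = torch.stack([sample_prior(z_t) for _ in range(n)])  # (n, D)
    scores = candidates @ z_t                                        # dot products with z_t
    return candidates[scores.argmax()]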
DALL-E 2: results
DALL-E 2: results
DALL-E 2: results
DALL-E 2: results
DALL-E 2: results
Latent diffusions vs DALL-E 2
Latent diffusions vs DALL-E 2
Latent diffusions vs DALL-E 2
Latent diffusions
● Good results
● Open-source
● Trained on open dataset
● Quick generation
DALL-E 2
● Amazing results (no independent check)
● No source code
● Proprietary huge dataset
● Unknown speed (probably slow)
Thank you.
