8. Latent diffusions
● VQGAN used for encoding/decoding
● Generation happens in a compact, perceptually equivalent latent space
● The UNet in the diffusion model keeps its convolutional inductive bias and scales well
● Cross-attention or channel concatenation used for conditioning (sketch below)
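A minimal PyTorch sketch of the two conditioning routes above; the class and function names are illustrative, not from the LDM codebase.

```python
import torch
import torch.nn as nn

class CrossAttnCondition(nn.Module):
    """Cross-attention: UNet feature tokens attend to conditioning tokens
    (e.g. text encoder outputs)."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, x, cond):
        # x: (B, H*W, dim) flattened UNet features; cond: (B, T, cond_dim)
        out, _ = self.attn(query=x, key=cond, value=cond)
        return x + out  # residual connection

def channel_concat_condition(latents, cond_map):
    """Channel stacking: concatenate a spatial conditioning map
    (e.g. a semantic layout or a low-res image) along the channel axis."""
    return torch.cat([latents, cond_map], dim=1)  # (B, C1 + C2, H, W)
```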
9. Latent diffusions: training
2 training phases:
1. Autoencoder
a. Loss: Patch-based GAN loss + Perceptual loss
b. Regularization: KL penalty (VAE-like) OR vector quantization in the decoder (VQGAN-like)
2. Various generative tasks
a. All training runs done on a single A100
b. Loss: the classical diffusion L2 reconstruction loss (sketch below)
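A hedged sketch of the phase-2 objective: the standard epsilon-prediction L2 loss computed on latents. `unet` and the `alphas_cumprod` noise schedule are assumed to be given; this is not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def diffusion_l2_loss(unet, z0, alphas_cumprod, cond=None):
    # z0: (B, C, H, W) clean latents; alphas_cumprod: (T,) noise schedule tensor
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    eps = torch.randn_like(z0)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps  # forward noising q(z_t | z_0)
    eps_pred = unet(z_t, t, cond)               # UNet predicts the added noise
    return F.mse_loss(eps_pred, eps)            # classical L2 reconstruction loss
```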
10. Latent diffusions: autoencoder
Downsampling by 4-16x speeds up generative training without loss of sampling quality.
KL-regularization gives better autoencoder metrics, but the variant with quantization in the
decoder shows better sample quality.
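Back-of-envelope arithmetic behind the 4-16x claim: with downsampling factor f, an H×W image becomes an (H/f)×(W/f) latent, so the UNet processes f² fewer spatial positions (a 256×256 example below).

```python
for f in (4, 8, 16):
    h, w = 256 // f, 256 // f
    print(f"f={f}: latent grid {h}x{w}, spatial positions cut by {f * f}x")
```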
20. DALL-E 2 (unCLIP)
2 stages (4 in reality ;) )
● Generate a CLIP image embedding from the text encoding (or an image encoding)
● Decode the image embedding into the image (decoder + 2 diffusion SR stages; sketch below)
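A schematic of the full pipeline as summarized above; every callable is a stand-in passed in by the caller, not a real API.

```python
def unclip_generate(caption, clip_text_encode, prior, decoder, upsamplers):
    z_text = clip_text_encode(caption)  # CLIP text embedding
    z_image = prior(z_text)             # stage 1: text -> CLIP image embedding
    img = decoder(z_image, caption)     # stage 2: embedding -> 64x64 image
    for sr in upsamplers:               # the two diffusion SR stages: 64->256, 256->1024
        img = sr(img)
    return img
```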
21. DALL-E 2: Decoder
A modified GLIDE model (3.5B) converts the embedding
into a 64x64 image, followed by diffusion upsampling 64->256 and 256->1024.
GLIDE input: CLIP embedding projections, the timestep embedding, 4 extra context tokens (?)
Training:
● Train the upsamplers on random crops ¼ the size of the target image
● Set the CLIP embeddings to zero (or a learned embedding) 10% of the time
● Drop the text caption 50% of the time (enables classifier-free guidance; sketch below)
● For the upsampling models, corrupt the inputs with noise (stage 1: gaussian blur, stage 2: BSR degradations)
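A sketch of the conditioning dropout that enables classifier-free guidance, with the rates from the bullets above; `null_clip_emb` (the learned null embedding) and the batch layout are assumptions.

```python
import random
import torch

def drop_conditioning(clip_emb, captions, null_clip_emb):
    # clip_emb: (B, D); captions: list of B strings; null_clip_emb: (D,)
    B = clip_emb.shape[0]
    drop = torch.rand(B, device=clip_emb.device) < 0.10     # 10% of the time
    clip_emb = torch.where(drop[:, None], null_clip_emb, clip_emb)
    captions = ["" if random.random() < 0.50 else c for c in captions]  # 50%
    return clip_emb, captions
```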
22. DALL-E 2: Decoder
“Our decoder model provides a unique opportunity to explore CLIP latent space by
allowing us to directly visualize what the CLIP image encoder is seeing.”
23. DALL-E 2: Prior model
2 types:
● Autoregressive
○ GPT-like
○ 319 principal PCA components kept from the 1024 CLIP dimensions (each quantized into 1024 discrete values)
○ Dot product of the text and image embeddings fed as an input token (fixed to 0.5 at inference)
● Diffusion model conditioned on the caption
○ Transformer-based
○ Causal mask for the predicted embedding
○ Input sequence: the encoded text, the CLIP text embedding, an embedding for the diffusion
timestep, and the noised CLIP image embedding
○ Generate two samples and select the one with the higher dot product with z_t (sketch below).
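A sketch of the reranking trick in the last bullet: draw two candidate image embeddings from the diffusion prior and keep the one better aligned with the text embedding z_t; `sample_fn` is a stand-in for a full prior sampling loop.

```python
import torch

def sample_prior(z_t, sample_fn, n=2):
    # z_t: (D,) CLIP text embedding; sample_fn: z_t -> candidate image embedding
    candidates = [sample_fn(z_t) for _ in range(n)]
    scores = [torch.dot(z_t, z_i) for z_i in candidates]  # alignment with text
    return candidates[int(torch.stack(scores).argmax())]
```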