Latent diffusions
vs
DALL-E v2
by Vitaly Bondar
johngull @ ODS | gmail
Recap: diffusion models
Recap: diffusion models
Latent diffusions
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann et al.
https://arxiv.org/pdf/2112.10752.pdf
https://github.com/CompVis/latent-diffusion
https://colab.research.google.com/github/multimodalart/latent-diffusion-notebook/blob/main/Latent_Diffusion_LAION_400M_model_text_to_image.ipynb
https://huggingface.co/spaces/multimodalart/latentdiffusion
Latent diffusions
Latent diffusions: long story short
1. Take “Taming Transformers” (VQGAN)
2. Replace the transformer with a conditional diffusion model
3. PROFIT
Thank you.
Questions?
Latent diffusions
● VQGAN used for encoding/decoding
● Generation happens in a compact, perceptually equivalent latent space
● The UNet of the diffusion model keeps its convolutional inductive bias and scales well
● Cross-attention or channel stacking is used for conditioning (sketched below)
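A minimal sketch of the cross-attention conditioning mentioned above, assuming PyTorch; this is not the CompVis code, and module/variable names are illustrative. Flattened UNet latent features act as queries, and conditioning tokens (e.g. text-encoder outputs) act as keys and values.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim, cond_dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, x, cond):
        # x:    (B, H*W, latent_dim) flattened UNet feature map -> queries
        # cond: (B, T, cond_dim) conditioning tokens (e.g. text encoder output) -> keys/values
        out, _ = self.attn(self.norm(x), cond, cond, need_weights=False)
        return x + out  # residual connection

x = torch.randn(2, 16 * 16, 320)    # latent features of a 16x16 map with 320 channels
cond = torch.randn(2, 77, 768)      # 77 conditioning tokens of dim 768
print(CrossAttention(320, 768)(x, cond).shape)   # -> torch.Size([2, 256, 320])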
Latent diffusions: training
2 training phases:
1. Autoencoder
a. Loss: patch-based GAN loss + perceptual loss
b. Regularization: KL loss (close to a VAE) OR quantization in the decoder (like VQGAN)
2. Various generative tasks
a. All training runs done on a single A100
b. Loss: the classical diffusion L2 noise-prediction loss (sketched below)
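A hedged sketch of the phase-2 objective: the standard noise-prediction L2 loss, computed in the latent space of the frozen phase-1 autoencoder. Here `encoder`, `unet`, and `alphas_cumprod` are placeholders for your concrete modules and noise schedule, not the actual LDM code.

import torch
import torch.nn.functional as F

def latent_diffusion_loss(unet, encoder, x0, cond, alphas_cumprod):
    # x0: batch of images; cond: conditioning (text tokens, class labels, ...)
    with torch.no_grad():
        z0 = encoder(x0)                                   # image -> compact latent
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # forward process q(z_t | z_0)
    eps_hat = unet(z_t, t, cond)                           # UNet predicts the added noise
    return F.mse_loss(eps_hat, eps)                        # classical diffusion L2 loss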
Latent diffusions: autoencoder
Downsampling by 4-16x: speeds up generative training without loss of sample quality
KL regularization gives better autoencoder metrics, but the variant with quantization in the
decoder produces better-looking samples.
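For reference, a minimal sketch of the KL option just mentioned: a small KL penalty that keeps the autoencoder latent close to a standard normal, as in a VAE. The code is illustrative, not the CompVis implementation.

import torch

def kl_regularization(mean, logvar):
    # KL( N(mean, diag(exp(logvar))) || N(0, I) ), averaged over the batch;
    # added with a small weight to the GAN + perceptual reconstruction loss.
    return 0.5 * torch.mean(mean.pow(2) + logvar.exp() - 1.0 - logvar)

# reparameterized latent sample fed to the decoder during training:
# z = mean + (0.5 * logvar).exp() * torch.randn_like(mean)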
Latent diffusions: results
Latent diffusions: results
Latent diffusions: results
Without space conditioning
Latent diffusions: results
Latent diffusions: results
Latent diffusions: results
Text-to-image: a sunset behind a mountain range, vector image
Latent diffusions: results
Text-to-image
DALL-E 2 (unCLIP)
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
https://cdn.openai.com/papers/dall-e-2.pdf
https://openai.com/dall-e-2/
P.S.: the paper is 💩
DALL-E 2 (unCLIP)
DALL-E 2 (unCLIP)
2 stages (4 in reality ;) )
● Generate the CLIP image embedding from the text encoding (or an image encoding)
● Decode the image embedding into an image (decoder + 2 stages of diffusion SR, sketched below)
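A hedged sketch of how the stages fit together. Every component is passed in as a placeholder callable (the names are made up), since the actual unCLIP models are not public.

def unclip_text_to_image(text, clip_text_encode, prior, decoder, sr_64_to_256, sr_256_to_1024):
    z_t = clip_text_encode(text)     # CLIP text embedding
    z_i = prior(z_t)                 # stage 1: prior maps text embedding -> CLIP image embedding
    img_64 = decoder(z_i, text)      # stage 2: GLIDE-style decoder renders a 64x64 image
    img_256 = sr_64_to_256(img_64)   # stage 3: diffusion super-resolution to 256x256
    return sr_256_to_1024(img_256)   # stage 4: diffusion super-resolution to 1024x1024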
DALL-E 2: Decoder
A modified GLIDE model (3.5B) converts the embedding
into an image, followed by diffusion upsampling 64->256 and 256->1024
GLIDE input: CLIP embedding projections, timestep, 4 context tokens (?)
Training:
● Use ¼ of the image
● Set the CLIP embeddings to zero (or a learned embedding) 10% of the time
● Drop the text caption 50% of the time (conditioning dropout sketched below)
● For the upsampling models, slightly corrupt the inputs (stage 1: Gaussian blur, stage 2: BSR degradations)
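A sketch of the conditioning dropout from the bullets above (10% for the CLIP embedding, 50% for the caption), which is what later enables classifier-free guidance. Tensor shapes and the null placeholders are illustrative, not OpenAI's code.

import torch

def drop_conditioning(clip_emb, text_tokens, null_emb, null_tokens,
                      p_clip=0.10, p_text=0.50):
    # clip_emb: (B, D) CLIP image embeddings; text_tokens: (B, T) caption tokens.
    # null_emb: zero or learned unconditional embedding; null_tokens: tokens of the empty caption.
    b = clip_emb.shape[0]
    drop_clip = torch.rand(b, device=clip_emb.device) < p_clip
    drop_text = torch.rand(b, device=clip_emb.device) < p_text
    clip_emb = torch.where(drop_clip[:, None], null_emb, clip_emb)
    text_tokens = torch.where(drop_text[:, None], null_tokens, text_tokens)
    return clip_emb, text_tokens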
“Our decoder model provides a unique opportunity to explore CLIP latent space by
allowing us to directly visualize what the CLIP image encoder is seeing.”
DALL-E 2: Decoder
“Our decoder model provides a unique opportunity to explore CLIP latent space by
allowing us to directly visualize what the CLIP image encoder is seeing.”
DALL-E 2: Prior model
2 types:
● Autoregressive
○ GPT-like
○ 319 principal components of the 1024-d CLIP image embedding, each quantized into 1024 discrete values
○ Dot product of the text and image embeddings as an input token (0.5 at inference)
● Diffusion model conditioned on input
○ Transformer-based
○ Causal mask for the predicted embedding
○ Prompt: encoded text, the CLIP text embedding, an embedding for the diffusion timestep, the noised CLIP image embedding
○ Generate two samples and select the one with the higher dot product with z_t (sketched below)
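A small sketch of the last bullet: sample two candidate image embeddings from the prior and keep the one with the higher dot product with the text embedding z_t. `sample_prior` is a placeholder for one reverse-diffusion run of the prior.

import torch

def pick_image_embedding(sample_prior, z_t, n=2):
    # sample_prior(z_t) -> one candidate CLIP image embedding of shape (D,)
    candidates = torch.stack([sample_prior(z_t) for _ in range(n)])  # (n, D)
    scores = candidates @ z_t                                        # dot products with z_t
    return candidates[scores.argmax()]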
DALL-E 2: results
DALL-E 2: results
DALL-E 2: results
DALL-E 2: results
DALL-E 2: results
Latent diffusions vs DALL-E 2
Latent diffusions vs DALL-E 2
Latent diffusions vs DALL-E 2
Latent diffusions
● Good results
● Open-source
● Trained on open dataset
● Quick generation
DALL-E 2
● Amazing results (no independent check)
● No source code
● Proprietary huge dataset
● Unknown speed (probably slow)
Thank you.
