Decoding Stable Diffusion:
a journey through key concepts
Vitaly Bondar
ML Team Lead @ theMind
ML Enthusiast
Latent diffusion
Rombach, Blattmann et al. High-Resolution Image Synthesis with Latent Diffusion Models, 2021
Autoencoder
Illustration: Lilian Weng
LeCun, Y. Modèles connexionnistes de l'apprentissage. Ph.D. thesis, 1987
Variational autoencoder (VAE)
Illustrations: Lilian Weng, D. Kingma
Kingma, Welling. Auto-Encoding Variational Bayes. 2013
Variational autoencoder (VAE)
Kingma, Welling. Auto-Encoding Variational Bayes. 2013
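The two ingredients that distinguish a VAE from a plain autoencoder can be sketched in a few lines. This is a minimal illustration, assuming a diagonal Gaussian posterior and a standard normal prior; the function names `reparameterize` and `kl_to_standard_normal` are hypothetical and not taken from the paper or the slides.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) per dimension.

    Writing the sample this way keeps z differentiable with respect to
    mu and log_var, which is the point of the reparameterization trick.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)                 # fixed seed, for reproducibility only
z = reparameterize([0.5, -1.0], [0.0, 0.0], rng)
kl = kl_to_standard_normal([0.5, -1.0], [0.0, 0.0])
```

The training objective would add this KL term to a reconstruction loss; the encoder and decoder networks themselves are omitted here.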
Vector Quantized Variational Autoencoder (VQ-VAE)
Oord et al. Neural Discrete Representation Learning, 2017
Vector Quantized Variational Autoencoder (VQ-VAE)
Oord et al. Neural Discrete Representation Learning, 2017
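The quantization step at the heart of VQ-VAE replaces each encoder output vector with its nearest codebook entry. A minimal sketch of that lookup, assuming squared Euclidean distance and a tiny hand-written codebook; the name `quantize` and the example values are illustrative only.

```python
def quantize(z, codebook):
    """Return (index, vector) of the codebook entry closest to z."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return idx, codebook[idx]

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]   # toy codebook with 3 entries
index, z_q = quantize([0.9, 1.2], codebook)        # nearest entry is [1.0, 1.0]
```

In training, the argmin is not differentiable, so the paper copies gradients from the quantized vector straight through to the encoder output; that part is not shown here.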
VQ-GAN
Esser, Rombach et al. Taming Transformers for High-Resolution Image Synthesis, 2020
VQ-GAN
Esser, Rombach et al. Taming Transformers for High-Resolution Image Synthesis, 2020
VQ-GAN
Esser, Rombach et al. Taming Transformers for High-Resolution Image Synthesis, 2020
Diffusion process
Diffusion models
Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015;
Song & Ermon, 2019; DDPM (Ho et al., 2020); … ?
Yang, Ling, et al. "Diffusion models: A comprehensive survey of methods and applications.", 2023
Diffusion models
Ho et al. Denoising diffusion probabilistic models, 2020
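The forward (noising) process in DDPM has a closed form, so a noisy sample at any step t can be drawn directly from the clean data. A minimal sketch, assuming the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps) and plain Python lists instead of tensors; the helper names are hypothetical.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    return [a * x + s * e for x, e in zip(x0, eps)], eps

abar = alpha_bars(linear_betas(1000))
x_t, eps = q_sample([1.0, -1.0], 500, abar, random.Random(0))
```

The denoising network is trained to predict `eps` from `x_t` and `t`; by the last step almost no signal from `x0` remains.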
Best diffusion paper so far
Tero Karras et al. Elucidating the Design Space of Diffusion-Based Generative Models, 2022
Diffusion choices
- ODE/SDE solver
- Noise scheduling
- Time steps
- Scalings
- Model architecture
- Training noise distribution
- Training loss
- Training loss weighting
Tero Karras et al. Elucidating the Design Space of Diffusion-Based Generative Models, 2022
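As one concrete instance of the "Time steps" choice, Karras et al. space the sampling noise levels so that steps are denser near low noise. A minimal sketch of that schedule, assuming the paper's default values (sigma_min = 0.002, sigma_max = 80, rho = 7); the function name `karras_sigmas` is a label chosen here.

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels interpolated linearly in sigma^(1/rho) space, high to low."""
    lo = sigma_min ** (1.0 / rho)
    hi = sigma_max ** (1.0 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

sigmas = karras_sigmas(10)   # starts at sigma_max, ends at sigma_min
```

A sampler would step through these levels in order and usually appends a final sigma of 0; larger rho shifts more of the steps toward small noise levels.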
U-Net
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. 2015
CLIP
Radford, Alec, et al. Learning transferable visual models from natural language supervision. 2021.
Latent diffusion
Rombach, Blattmann et al. High-Resolution Image Synthesis with Latent Diffusion Models, 2021
Latent diffusion
Training phases:
● Autoencoder
○ Loss: Patch-based GAN loss + Perceptual loss
○ Regularization: KL-loss (close to VAE) OR quantization in Decoder (like VQ-GAN)
● Various generative tasks
○ Loss: classical diffusion L2 restoration loss
○ All training runs done on a single A100
Rombach, Blattmann et al. High-Resolution Image Synthesis with Latent Diffusion Models, 2021
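For reference, the "classical diffusion L2 restoration loss" of the second phase can be written out as follows. This is a sketch in the notation of the latent diffusion paper, where \mathcal{E} is the frozen encoder from the first phase, z_t is the noised latent of \mathcal{E}(x) at step t, and \epsilon_\theta is the denoising U-Net; for conditional tasks the network additionally receives an encoding of the condition.

```latex
L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t}
          \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert_2^2 \right]
```

The only change from pixel-space diffusion is that the noise is added to, and predicted in, the autoencoder's latent space.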
Latent diffusion
Rombach, Blattmann et al. High-Resolution Image Synthesis with Latent Diffusion Models, 2021
Stable Diffusion v1
Key components:
● High quality decoder from VQ-GAN
● Diffusion in latent space
● Frozen language model (CLIP ViT-L/14 embeddings)
● Classifier-free guidance
● A lot of data (LAION-5B and its subsets)
● A lot of compute power
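Of the components listed above, classifier-free guidance is the easiest to show in code: at each sampling step the model is run with and without the text condition, and the two noise predictions are extrapolated. A minimal sketch, assuming the predictions are given as plain lists; the name `cfg_combine` and the numbers are illustrative only.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Return eps_uncond + w * (eps_cond - eps_uncond), element-wise.

    w = 0 gives the unconditional prediction, w = 1 the conditional one,
    and w > 1 pushes the result further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

guided = cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0)   # gives [2.0, 5.0]
```

In practice the unconditional branch is usually obtained by feeding an empty prompt to the same network, which is possible because the condition is randomly dropped during training.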
A lot to discover
● Stable Diffusion 2, XL, 3, 3.5
● Imagen
● PixArt-α
● …
● Guidance (classifier, classifier-free, …)
● …
● Distillations
● Rectified flow
● Flow matching
● Cold diffusion
● 𝛼-(de)Blending
● Consistency models
● Continuous normalizing flows
● …
Illustration: FLUX.1 [dev]

Vitaly Bondar: Decoding Stable Diffusion: a journey through key concepts (UA)
