8. Latent diffusions
● VQGAN used for encoding/decoding
● Generation happens in a compact, perceptually equivalent latent space
● The UNet in the diffusion model keeps its convolutional inductive bias and scales well
● Cross-attention or channel concatenation used for conditioning (sketch below)
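A minimal PyTorch sketch of the two conditioning routes above; the class and function names are illustrative, not from the LDM codebase.

```python
import torch
import torch.nn as nn

class CrossAttnCondition(nn.Module):
    """Cross-attention: UNet feature tokens attend to conditioning tokens
    (e.g. text encoder outputs)."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, x, cond):
        # x: (B, H*W, dim) flattened UNet features; cond: (B, T, cond_dim)
        out, _ = self.attn(query=x, key=cond, value=cond)
        return x + out  # residual connection

def channel_concat_condition(latents, cond_map):
    """Channel stacking: concatenate a spatial conditioning map
    (e.g. a semantic layout or a low-res image) along the channel axis."""
    return torch.cat([latents, cond_map], dim=1)  # (B, C1 + C2, H, W)
```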
9. Latent diffusions: training
2 training phases:
1. Autoencoder
a. Loss: Patch-based GAN loss + Perceptual loss
b. Regularization: KL penalty (VAE-like) OR vector quantization in the decoder (VQGAN-like)
2. Various generative tasks
a. All training runs done on a single A100
b. Loss: the classical diffusion L2 reconstruction loss (sketch below)
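A hedged sketch of the phase-2 objective: the standard epsilon-prediction L2 loss computed on latents. `unet` and the `alphas_cumprod` noise schedule are assumed to be given; this is not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def diffusion_l2_loss(unet, z0, alphas_cumprod, cond=None):
    # z0: (B, C, H, W) clean latents; alphas_cumprod: (T,) noise schedule tensor
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    eps = torch.randn_like(z0)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps  # forward noising q(z_t | z_0)
    eps_pred = unet(z_t, t, cond)               # UNet predicts the added noise
    return F.mse_loss(eps_pred, eps)            # classical L2 reconstruction loss
```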
10. Latent diffusions: autoencoder
Downsampling by 4-16x speeds up generative training without loss of sampling quality.
KL-regularization gives better autoencoder metrics, but the variant with quantization in the
decoder shows better sample quality.
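Back-of-envelope arithmetic behind the 4-16x claim: with downsampling factor f, an H×W image becomes an (H/f)×(W/f) latent, so the UNet processes f² fewer spatial positions (a 256×256 example below).

```python
for f in (4, 8, 16):
    h, w = 256 // f, 256 // f
    print(f"f={f}: latent grid {h}x{w}, spatial positions cut by {f * f}x")
```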
20. DALL-E 2 (unCLIP)
2 stages (4 in reality ;) )
● Generate a CLIP image embedding from the text encoding (or an image encoding)
● Decode the image embedding into the image (decoder + 2 diffusion SR stages; sketch below)
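A schematic of the full pipeline as summarized above; every callable is a stand-in passed in by the caller, not a real API.

```python
def unclip_generate(caption, clip_text_encode, prior, decoder, upsamplers):
    z_text = clip_text_encode(caption)  # CLIP text embedding
    z_image = prior(z_text)             # stage 1: text -> CLIP image embedding
    img = decoder(z_image, caption)     # stage 2: embedding -> 64x64 image
    for sr in upsamplers:               # the two diffusion SR stages: 64->256, 256->1024
        img = sr(img)
    return img
```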
21. DALL-E 2: Decoder
A modified GLIDE model (3.5B) converts the embedding
into a 64x64 image, followed by diffusion upsampling 64->256 and 256->1024.
GLIDE input: CLIP embedding projections, the timestep embedding, 4 extra context tokens (?)
Training:
● Train the upsamplers on random crops ¼ the size of the target image
● Set the CLIP embeddings to zero (or a learned embedding) 10% of the time
● Drop the text caption 50% of the time (enables classifier-free guidance; sketch below)
● For the upsampling models, corrupt the inputs with noise (stage 1: gaussian blur, stage 2: BSR degradations)
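A sketch of the conditioning dropout that enables classifier-free guidance, with the rates from the bullets above; `null_clip_emb` (the learned null embedding) and the batch layout are assumptions.

```python
import random
import torch

def drop_conditioning(clip_emb, captions, null_clip_emb):
    # clip_emb: (B, D); captions: list of B strings; null_clip_emb: (D,)
    B = clip_emb.shape[0]
    drop = torch.rand(B, device=clip_emb.device) < 0.10     # 10% of the time
    clip_emb = torch.where(drop[:, None], null_clip_emb, clip_emb)
    captions = ["" if random.random() < 0.50 else c for c in captions]  # 50%
    return clip_emb, captions
```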
22. DALL-E 2: Decoder
“Our decoder model provides a unique opportunity to explore CLIP latent space by
allowing us to directly visualize what the CLIP image encoder is seeing.”
23. DALL-E 2: Prior model
2 types:
● Autoregressive
○ GPT-like
○ 319 principal PCA components kept from the 1024 CLIP dimensions (each quantized into 1024 discrete values)
○ Dot product of the text and image embeddings fed as an input token (fixed to 0.5 at inference)
● Diffusion model conditioned on the caption
○ Transformer-based
○ Causal mask for the predicted embedding
○ Input sequence: the encoded text, the CLIP text embedding, an embedding for the diffusion
timestep, and the noised CLIP image embedding
○ Generate two samples and select the one with the higher dot product with z_t (sketch below).
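A sketch of the reranking trick in the last bullet: draw two candidate image embeddings from the diffusion prior and keep the one better aligned with the text embedding z_t; `sample_fn` is a stand-in for a full prior sampling loop.

```python
import torch

def sample_prior(z_t, sample_fn, n=2):
    # z_t: (D,) CLIP text embedding; sample_fn: z_t -> candidate image embedding
    candidates = [sample_fn(z_t) for _ in range(n)]
    scores = [torch.dot(z_t, z_i) for z_i in candidates]  # alignment with text
    return candidates[int(torch.stack(scores).argmax())]
```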