Denoising Diffusion Probabilistic Model
3. INTRODUCTION
• Evolution of Generative Models:
• Rapid progress in recent years.
• Capable of generating human-like language, synthetic
images, and diverse audio.
• Challenges with Current Techniques:
• Existing computer vision techniques predict a fixed set of
predetermined object categories.
• This restricts their generality and usability.
• Alternative Approach - Learning from Raw Text:
• Proposal to learn directly from raw text describing the image.
• Overcomes limitations of predefined labels.
4. INTRODUCTION – CONT.
• Introduction to CLIP:
• Contrastive models like CLIP as a key inspiration.
• Demonstrates robust image representations capturing
both semantics and style.
• Project Objectives:
• Two-stage model proposed:
• Prior generating a CLIP image embedding from a
given text.
• Decoder generating an image based on these CLIP
image embeddings.
6. METHODOLOGY
• CLIP as a Representation Learner:
• CLIP (Contrastive Language-Image Pretraining) recognized for
robust image representations.
• Desirable properties, including robustness and applicability to
various tasks.
• Two-Stage Model: Prior and Decoder:
• Introduction of the proposed two-stage model for image generation.
• Prior Mechanism:
• Generates a CLIP image embedding from a given text caption.
• Decoder:
• Generates an image based on CLIP image embeddings.
• Leveraging CLIP Image Embeddings:
• CLIP image embeddings used as a bridge between textual
descriptions and image generation.
• Enhances generality and adaptability in the image generation
process.
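The two-stage design above can be sketched end to end. This is a minimal illustration, not the actual implementation: `prior` and `decoder` are hypothetical placeholders standing in for the learned models, and the 512-dimensional embedding size is an assumption (it matches CLIP ViT-B/32).

```python
import numpy as np

EMBED_DIM = 512  # assumed CLIP image-embedding size (ViT-B/32)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Hypothetical prior: maps a CLIP text embedding to a CLIP
    image embedding. Placeholder for the learned prior model."""
    return text_embedding  # identity stand-in

def decoder(image_embedding: np.ndarray) -> np.ndarray:
    """Hypothetical decoder: maps a CLIP image embedding to pixels.
    Placeholder for the learned diffusion decoder."""
    rng = np.random.default_rng(0)
    return rng.random((64, 64, 3))  # placeholder 64x64 RGB image

def text_to_image(text_embedding: np.ndarray) -> np.ndarray:
    """Two-stage generation: text embedding -> image embedding -> image."""
    z_image = prior(text_embedding)
    return decoder(z_image)

caption_embedding = np.zeros(EMBED_DIM)  # stand-in for a real CLIP text embedding
img = text_to_image(caption_embedding)
print(img.shape)
```

The CLIP image embedding is the only interface between the two stages, which is what makes the pipeline modular: either stage can be swapped without retraining the other.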
7. METHODOLOGY
• Combining CLIP and Diffusion Models:
• Synergy between CLIP and diffusion models for improved
image synthesis.
• Language-guided image manipulations achieved through the
joint embedding space of CLIP.
• Language-Guided Image Manipulations:
• Application of CLIP for language-guided modifications in
generated images.
• Explores the joint embedding space of CLIP for enhanced
control and manipulation.
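Language-guided manipulation relies on comparing text and image vectors in CLIP's joint embedding space, which is done with cosine similarity. A minimal sketch with toy 4-d vectors standing in for real 512-d CLIP embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the score used to compare CLIP embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d embeddings standing in for real CLIP vectors.
text_emb  = np.array([0.2, 0.9, 0.1, 0.3])    # e.g. "a red car"
image_emb = np.array([0.25, 0.85, 0.05, 0.35])  # a matching image
other_emb = np.array([-0.7, 0.1, 0.6, -0.2])    # an unrelated image

print(cosine_similarity(text_emb, image_emb))  # high: matching pair
print(cosine_similarity(text_emb, other_emb))  # low: mismatched pair
```

Because text and images live in the same space, moving an image embedding toward a caption embedding (then decoding) is what enables language-guided edits.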
8. DIFFUSION MODELS
• Introduction to Diffusion Models:
• Diffusion models represent a state-of-the-art family of deep
generative models.
• Break the long-standing dominance of GANs in challenging
image synthesis tasks.
• Role of Denoising Diffusion Probabilistic Models (DDPM):
• DDPMs use two Markov chains: forward and reverse.
• Forward chain perturbs data to noise, while the reverse chain
converts noise back to data.
• Three Predominant Formulations:
• DDPMs, Score-Based Generative Models (SGMs), and
Stochastic Differential Equations (Score SDEs).
• Each formulation offers unique features and benefits for the
diffusion model.
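The forward chain's single perturbation step can be sketched directly. Assuming the standard Gaussian transition kernel q(x_t | x_{t-1}) = N(sqrt(1 - β_t) x_{t-1}, β_t I) and a linear noise schedule (both conventional DDPM choices, not specifics stated in the slides):

```python
import numpy as np

def forward_step(x_prev: np.ndarray, beta_t: float, rng) -> np.ndarray:
    """One forward (noising) step of the DDPM Markov chain:
    q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))        # toy 8x8 "image"
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear noise schedule
for beta in betas:
    x = forward_step(x, beta, rng)
# After enough steps, x is approximately standard Gaussian noise:
print(float(x.std()))
```

Running the full schedule destroys all structure in the input, which is exactly the "perturbs data to noise" behavior the reverse chain must learn to undo.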
9. DIFFUSION MODELS
• Training the Diffusion Model on Stanford Cars Dataset:
• Dataset: Stanford Cars dataset with approximately 8000
images in the train set.
• Training process involves progressively destructing data by
injecting noise.
• Forward and Reverse Markov Chains in Diffusion Models:
• Forward process injects noise into data until all structures are
lost.
• Reverse process gradually removes noise by running a
learnable Markov chain.
10. DENOISING DIFFUSION PROBABILISTIC MODELS (DDPM)
A denoising diffusion probabilistic model (DDPM) makes use of two
Markov chains: a forward chain that perturbs data to noise, and a
reverse chain that converts noise back to data.
New data points are subsequently generated by first sampling a
random vector from the prior distribution, followed by ancestral
sampling through the reverse Markov chain.
Using the chain rule of probability and the Markov property, we can
factorize the joint distribution of x1, x2, …, xT conditioned on x0,
denoted q(x1, …, xT | x0), into

q(x1, …, xT | x0) = ∏_{t=1}^{T} q(x_t | x_{t−1})
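Because each factor in this chain is Gaussian, the steps compose into a closed form, q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I) with ᾱ_t = ∏_{s≤t} (1 − β_s), so x_t can be sampled in one shot. A sketch under an assumed linear schedule:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: np.ndarray, t: int, rng) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
x_early = q_sample(x0, t=10, rng=rng)   # mostly signal
x_late  = q_sample(x0, t=999, rng=rng)  # almost pure noise
print(float(alpha_bars[10]), float(alpha_bars[999]))
```

This shortcut is what makes training efficient: any timestep can be reached without simulating the whole forward chain.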
11. TRAINING
We trained the diffusion model to generate images of cars as per the
theory explained above. As a dataset, we used the Stanford Cars
dataset, which consists of around 8,000 images in the training set.
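The training objective can be sketched as the standard DDPM noise-prediction loss: sample a timestep and noise, form x_t in closed form, and regress the network's prediction onto the true noise. The `model` below is a hypothetical placeholder (a fixed linear map, not the UNet actually trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the denoising network: a real model
# would be a UNet taking (x_t, t); here a fixed linear map.
W = rng.standard_normal((64, 64)) * 0.01

def model(x_t: np.ndarray, t: int) -> np.ndarray:
    return x_t @ W  # placeholder noise prediction

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(x0: np.ndarray) -> float:
    """One DDPM training step: sample t and eps, form x_t, regress eps."""
    t = rng.integers(0, 1000)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return float(np.mean((model(x_t, t) - eps) ** 2))

x0 = rng.standard_normal((8, 64))  # toy batch of flattened "images"
print(training_loss(x0))
```

In practice this loss is minimized over minibatches of Stanford Cars images with an optimizer such as Adam; the sketch shows only the loss computation itself.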
12. TRAINING RESULTS
• Training the Diffusion Model on Stanford Cars Dataset:
• Utilized the Stanford Cars dataset comprising around 8000
images in the training set.
• Focus on generating images of cars as per the project's
objectives.
• Results from the Forward Markov Chain:
• Forward process involves injecting noise into data
progressively.
• Reverse Markov Chain Using a Simple UNet Architecture:
• Utilized a simple UNet architecture for the reverse Markov
chain.
• Trained on the dataset for 100 epochs to optimize the
learnable transition kernel.
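The reverse chain described above (ancestral sampling from noise back to data) can be sketched as follows. The `predict_noise` function is a hypothetical placeholder for the trained UNet, and the schedule is an assumed linear one:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t: np.ndarray, t: int) -> np.ndarray:
    """Hypothetical trained UNet; a zero predictor as a placeholder."""
    return np.zeros_like(x_t)

def sample(shape, rng):
    """Ancestral sampling through the reverse Markov chain:
    start from pure Gaussian noise and denoise step by step."""
    x = rng.standard_normal(shape)
    for t in range(999, -1, -1):
        eps_hat = predict_noise(x, t)
        # Posterior mean of the learnable transition kernel (DDPM form).
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
img = sample((8, 8), rng)
print(img.shape)
```

With a trained UNet in place of the zero predictor, each step removes a little of the noise the forward chain injected, recovering a car image at t = 0.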
15. RESULTS
• Explored diffusion models comprehensively.
• Showed diffusion models, as likelihood-based models,
outperform state-of-the-art GANs.
• Highlighted the stationary training objective's role in
achieving superior sample quality.
• Introduced an improved architecture, successfully
applied to unconditional image generation.
• Presented a classifier guidance technique for extending
quality to class-conditional tasks.
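Classifier guidance, mentioned above, biases each reverse step toward a target class by shifting the step's mean along the gradient of a noise-aware classifier, mean + s · Σ · ∇_x log p(y | x_t). A minimal sketch with a hypothetical placeholder gradient:

```python
import numpy as np

def classifier_grad(x_t: np.ndarray, t: int, y: int) -> np.ndarray:
    """Hypothetical grad of log p(y | x_t) from a noise-aware classifier;
    placeholder gradient pointing toward the origin."""
    return -x_t

def guided_mean(mean: np.ndarray, variance: float,
                grad: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Classifier guidance: shift the reverse-step mean by
    scale * variance * grad_x log p(y | x_t)."""
    return mean + scale * variance * grad

x_t = np.ones((4, 4))
mean, variance = x_t.copy(), 0.01
shifted = guided_mean(mean, variance,
                      classifier_grad(x_t, t=10, y=3), scale=2.0)
print(float(shifted[0, 0]))  # 1 + 2.0 * 0.01 * (-1) = 0.98
```

The guidance scale trades diversity for fidelity: larger scales push samples harder toward the class at the cost of variety.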