This document provides an overview of deep generative learning and summarizes several key generative models including GANs, VAEs, diffusion models, and autoregressive models. It discusses the motivation for generative models and their applications such as image generation, text-to-image synthesis, and enhancing other media like video and speech. Example state-of-the-art models are provided for each application. The document also covers important concepts like the difference between discriminative and generative modeling, sampling techniques, and the training procedures for GANs and VAEs.
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
1. Deep Generative
Learning for All
(a.k.a. The GenAI Hype)
Xavier Giro-i-Nieto
@DocXavi
xavigiro.upc@gmail.com
Associate Professor (on leave)
Universitat Politècnica de Catalunya
Institut de Robòtica Industrial
ELLIS Unit Barcelona
2. 2
Acknowledgements
Santiago Pascual
@santty128
PhD 2019
Universitat Politècnica de Catalunya
Technical University of Catalonia
Albert Pumarola
@AlbertPumarola
PhD 2021
Institut de Robòtica Industrial (IRI)
Universitat Politècnica de Catalunya (UPC)
Kevin McGuinness
Research Fellow
Insight Centre for Data Analytics
Dublin City University
Gerard I. Gállego
PhD Student
Universitat Politècnica de Catalunya
gerard.ion.gallego@upc.edu
@geiongallego
3. 3
Acknowledgements
Eduard Ramon
Applied Scientist
Amazon Barcelona
@eram1205
Wentong Liao
Applied Scientist
Amazon Barcelona
Ciprian Corneanu
Applied Scientist
Amazon Seattle
Laia Tarrés
PhD Student
Universitat Pompeu Fabra
laia.tarres@upf.edu
4. Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Latent variable
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
5. Image generation
5
#StyleGAN3 (NVIDIA) Karras, Tero, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and
Timo Aila. "Alias-free generative adversarial networks." NeurIPS 2021. [code]
6. 6
#DiT (UC Berkeley, NYU) Peebles, William, and Saining Xie. "Scalable Diffusion Models with Transformers." ICCV 2023.
Image generation
7. 7
#Imagen2 (Google Deepmind) Blog post
Text-to-Image (T2I) generation
“A shot of a 32-year old female, up and coming conservationist in a jungle; athletic with short, curly hair and a warm smile.”
8. 8
#DALL-E-2 (OpenAI) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen "Hierarchical Text-Conditional Image Generation with CLIP
Latents." 2022. [blog]
#DALL·E-3 (OpenAI) James Betker, Gabriel Goh, et al, “Improving Image Generation with Better Captions” 2023 [blog]
Text-to-Image (T2I) generation
“An expressive oil painting of a basketball player dunking, depicted as an explosion of a nebula.”
9. 9
Text-to-Music (T2M) generation
(Stability AI) Evans, Zach, Julian D. Parker, C. J. Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. "Long-form music generation with latent diffusion."
arXiv 2024.
10. 10
Text-to-Video (T2V) generation
#Sora (OpenAI) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman,
Clarence Wing Yin Ng, Ricky Wang, Aditya Ramesh. Video generation models as world simulators. OpenAI 2024.
11. Human Motion Transfer
11
#DreamMoving Mengyang Feng, Jinlin Liu, Kai Yu, et al. "DreaMoving: A Human Video Generation Framework based on Diffusion Models." arXiv 2023.
“A girl, smiling, dancing in a wooden
house, wearing sweater, and long
pants.”
“A girl, smiling, dancing in the park
with golden leaves in autumn,
wearing light blue dress.”
“A girl, smiling, dancing in Times
Square, wearing dress-like white
shirt, with long sleeves, long pants.”
12. 12
Image Editing & Text-to-Video (T2V) generation
#EMU (Meta) Girdhar, Rohit, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and
Ishan Misra. "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning." arXiv 2023. [blog]
13. 13
#Imagine Flash (Meta) Jonas Kohler, Albert Pumarola, Edgar Schönfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, and Ali Thabet. Imagine
Flash: Accelerating Emu Diffusion Models with Backward Distillation. Meta 2024.
Text-to-Image (T2I) generation
Albert Pumarola
@AlbertPumarola
PhD 2021
Institut de Robòtica Industrial (IRI)
Universitat Politècnica de Catalunya (UPC)
14. Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Latent variable
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
16. Outline
1. Motivation
2. Discriminative vs Generative Models
a. Pθ(Y|X): Discriminative Models
b. Pθ(X): Generative Models
c. Pθ(X|Y): Conditioned Generative Models
3. Latent variable
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
17. Pθ(Y|X): Discriminative Models
17
Slide credit:
Albert Pumarola (UPC 2019)
Classification Regression
Text Prob. of being a Potential Customer
Image
Audio Speech Translation
Jim Carrey
What Language?
X=Data
Y=Labels
θ = Model parameters
Discriminative Modeling
Pθ(Y|X)
18. 18
0.01
0.09
0.9
input
Network (θ) output
class
Figure credit: Javier Ruiz (UPC TelecomBCN)
Discriminative model: tell me the probability of some ‘Y’ responses given ‘X’ inputs.
Pθ(Y | X = [pixel1, pixel2, …, pixel784])
Pθ(Y|X): Discriminative Models
19. Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Sampling
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
20. 20
Slide Concept: Albert Pumarola (UPC 2019)
Pθ(X): Generative Models
Classification Regression Generative
Text Prob. of being a Potential Customer
“What about Ron magic?” offered Ron.
To Harry, Ron was loud, slow and soft
bird. Harry did not like to think about
birds.
Image
Audio Language Translation
Music Composer and Interpreter
MuseNet Sample
Jim Carrey
What Language?
Discriminative Modeling: Pθ(Y|X)
Generative Modeling: Pθ(X)
X=Data
Y=Labels
θ = Model parameters
21. Each real sample xi comes from an M-dimensional probability distribution P(X).
X = {x1, x2, …, xN}
Pθ(X): Generative Models
22. 22
1) We want our model with parameters θ to output samples with distribution Pθ(X), matching the distribution of our training data P(X).
2) We can then sample points from Pθ(X) that plausibly look as if they were drawn from P(X).
P(X): distribution of the training data
Pλ,μ,σ(X): distribution estimated by the model
Example: Gaussian Mixture Models (GMM)
Pθ(X): Generative Models
23. 23
What are the parameters θ we need to estimate in deep neural networks?
θ = (weights & biases)
Network (θ) → output
Pθ(X): Generative Models
24. Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Sampling
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
25. Pθ(X|Y): Conditioned Generative Models
Conditional probabilities P(X|Y) introduce conditioning variables Y into the generative process:
X = {x1, x2, …, xN}
Y = {y1, y2, …, yN}
DOG
CAT
TRUCK
PIZZA
THRILLER
SCI-FI
HISTORY
/aa/
/e/
/o/
26. Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Sampling
4. Architectures
a. Generative Adversarial Networks (GANs)
b. Auto-regressive
c. Variational Autoencoders (VAEs)
d. Diffusion
27. Our learned model should be able to make up new samples from the distribution,
not just copy and paste existing samples!
27
Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
Sampling
29. Slide concept: Albert Pumarola (UPC 2019)
Learn → Sample → Out
Training Dataset → Generated Samples
Feature space: manifold Pθ(X)
“Model the data distribution so that we can sample new points out of the distribution.”
Sampling
31. Sampling
How could we generate diverse samples from a deterministic deep neural network?
Sample z from a known prior, for example a multivariate normal distribution N(0, I).
Example: dim(z) = 2
z → Generator (θ) → x’ (Generated Samples)
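A minimal sketch of this idea, where a fixed random linear map stands in for a trained generator network: the network itself is deterministic, and all diversity comes from the prior z ~ N(0, I).

```python
import numpy as np

rng = np.random.default_rng(0)

# A deterministic "generator": a fixed linear map from z-space (dim 2)
# to data space (dim 4), standing in for a trained network.
W = rng.normal(size=(4, 2))

def generator(z):
    return np.tanh(W @ z)

# All diversity comes from sampling the prior z ~ N(0, I):
z1, z2 = rng.normal(size=2), rng.normal(size=2)
x1, x2 = generator(z1), generator(z2)
```

Two different draws of z give two different samples x, even though `generator` has no randomness of its own.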
32. Slide concept: Albert Pumarola (UPC 2019)
Training Dataset → Learn → Interpolated Samples
Feature space: manifold Pθ(X)
Traversing the learned manifold through interpolation.
Interpolation
37. 37
Generator & Discriminator
We have two modules: the Generator (G) and the Discriminator (D).
● They “fight” against each other during training → Adversarial Learning.
D’s goal: classify between real samples and those produced by G.
G’s goal: fool D into misclassifying.
Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. "Generative Adversarial Nets." NeurIPS 2014.
38. 38
Discriminator
Discriminator network D → a binary classifier between real (x) and generated (x’) samples.
x’ → Discriminator (θ) → Generated (1)
x → Discriminator (θ) → Real (0)
41. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to
detect whether money is real or fake.
100
100
FAKE: It’s
not even
green
Adversarial Training Analogy: is it fake money?
Figure: Santiago Pascual (UPC)
42. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect
whether money is real or fake.
100
100
FAKE:
There is no
watermark
Adversarial Training Analogy: is it fake money?
Figure: Santiago Pascual (UPC)
43. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect
whether money is real or fake.
100
100
FAKE:
Watermark
should be
rounded
Adversarial Training Analogy: is it fake money?
Figure: Santiago Pascual (UPC)
44. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to
detect whether money is real or fake.
After enough iterations, and if the counterfeiter is good enough (in terms of G network it
means “has enough parameters”), the police should be confused.
REAL?
FAKE?
Adversarial Training Analogy: is it fake money?
Figure: Santiago Pascual (UPC)
46. Adversarial Training: Discriminator
Generator
Real world
images
Discriminator
Real
Loss
Latent
random
variable
Sample
Sample
Generated
1. Fix generator weights, draw samples from both real world and generated images
2. Train discriminator to distinguish between real world and generated images
Backprop error to
update discriminator
weights
Figure: Kevin McGuinness (DCU)
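The two steps above can be sketched with a toy 1-D logistic-regression discriminator (a stand-in for a deep network; the data, learning rate, and label convention real=1 / fake=0 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setup: real data ~ N(2, 0.5); "generated" samples ~ N(0, 0.5)
# come from a frozen generator (its weights are fixed in this phase).
real = rng.normal(2.0, 0.5, size=(64, 1))
fake = rng.normal(0.0, 0.5, size=(64, 1))

w, b = np.zeros((1, 1)), 0.0  # discriminator parameters

def d(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))  # P(real | x)

# Train D only: binary cross-entropy with labels real=1, fake=0.
lr = 0.5
for _ in range(100):
    grad_w = real.T @ (d(real) - 1) / len(real) + fake.T @ d(fake) / len(fake)
    grad_b = (d(real) - 1).mean() + d(fake).mean()
    w -= lr * grad_w
    b -= lr * grad_b
```

After training, D assigns high probability of "real" to the real batch and low probability to the generated batch, which is exactly what the backpropagated discriminator error optimizes.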
48. Adversarial Training: Generator
1. Fix discriminator weights
2. Sample from generator by injecting noise.
3. Backprop error through discriminator to update generator weights
Generator
Real world
images
Discriminator
Real
Loss
Latent
random
variable
Sample
Sample
Backprop error to
update generator
weights
Figure: Kevin McGuinness (DCU)
Generated
49. Adversarial Training: Generator
Generator
Real world
images
Discriminator
Real
Loss
Latent
random
variable
Sample
Sample
Backprop error to
update generator
weights
Figure: Kevin McGuinness (DCU)
Generated
Consider a binary encoding of “1” (Real) and “0” (Fake). Which of these two values should the
discriminator predict when we train the generator with a generated image ?
50. Conditional GANs (cGAN)
#StyleGAN-T Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila, "StyleGAN-T: Unlocking the Power of GANs for
Fast Large-Scale Text-to-Image Synthesis". ICML 2023
57. Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
58. Encode Decode
“Generate”
58
Auto-Encoder (AE)
z
Feature
space
● Trained with a reconstruction loss.
● Proposed as a pre-training stage for the encoder (“self-supervised learning”).
#AE Hinton, Geoffrey E., Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 2006.
x’
Learned
manifold
Loss term:
Reconstruction
loss
59. 59
Auto-Encoder (AE) for generation?
Could we generate new samples by sampling from a normal distribution and feeding it into the encoder? And into the decoder (as in GANs)?
Encode Decode
“Generate”
?
Learned
manifold
Feature
space
60. 60
Auto-Encoder (AE) for generation?
No, because the noise (or the encoded noise) would fall outside the learned manifold.
Encode → Decode (“Generate”)
Could we generate new samples by sampling from a normal distribution and feeding it into the encoder, or into the decoder (as in GANs)?
Learned
manifold
Feature
space
61. Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
62. 62
Source: Wikipedia. Image by Bscan - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=25235145
Maths 101: multivariate normal distribution
If all dimensions are independent: Σ = σI, where σ ∈ R^k and I ∈ R^(k×k) is the identity matrix.
63. 63
VAE - Training
Encoder (training only): instead of mapping the input x to a fixed vector, we map it into a multivariate normal distribution z ~ N(μ(X), Σ(X)).
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
Loss term #1: KL divergence DKL[ N(μ(X), Σ(X)) || N(μ, σI) ]
Encode → Sample z
64. 64
VAE - Training
Encode
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
Decoder: Reconstruct the input data from z ~ N( μ(X), ∑(X) ).
Decode
Loss term #2:
Reconstruction
loss
z
Sample ~
65. 65
VAE - Training
Encode
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
Decode
Loss term #2:
Reconstruction
loss
z
Sample ~
Do you see any challenge in this scheme if we use deep neural networks to estimate Q(z|X) and P(X|z)?
66. Sample ~
66
VAE - Training
Encode
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
Decode
Loss term #2:
Reconstruction
loss
z
Challenge: We cannot backprop through “Sampling” because it is not a differentiable
operation.
67. 67
VAE - Training
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
Sampling: The reparametrization trick uses an auxiliary variable ε~N(0,I) to obtain
a sample z from the distribution N( μ, σI ).
ε~N(0,I)
Decode
Loss term #2:
Reconstruction
loss
z
68. Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
72. Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
73. 73
VQ-VAE
#VQ-VAE (Deepmind) Van Den Oord, Aaron, Oriol Vinyals, Koray Kavukcuoglu. "Neural discrete representation learning."
NeurIPS 2017.
The z space is a discretized, learned embedding space (vector quantization): every encoded point z(x) is mapped to its nearest embedding e, which is the information passed to the decoder to reconstruct the sample.
74. Learn more about VAEs
74
Andriy Mnih (UCL - Deepmind 2020)
Max Welling - University of Amsterdam (2020)
81. Forward Diffusion Process
Data: x0 → x1 → x2 → x3 → x4 → … → xT: Noise
Forward diffusion process (fixed)
What is the distribution q(xt | xt−1)?
It is a normal distribution with:
● mean: (1 − βt)^(1/2) · xt−1
● covariance: βt · I
…with βt increasing from 0 to 1 with t.
83. Forward Diffusion Process
Data: x0 → x1 → x2 → x3 → x4 → … → xT: Noise
Forward diffusion process (fixed)
What is q(xT | x0), the approximate distribution of xT when T is large (e.g., 1000)?
Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. "Deep unsupervised learning using
nonequilibrium thermodynamics." ICML 2015.
85. Related work: Denoising Autoencoder (DAE)
Encode Decode
“Generate”
#DAE Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust
features with denoising autoencoders." ICML 2008.
87. Reverse Denoising Process
xt: sample at time t; the timestep t is an input, to account for the noise schedule.
Data: x0 → x1 → x2 → x3 → x4 → … → xT: Noise
What is the distribution of xt−1, given xt?
A normal distribution with:
● mean: μθ(xt, t)
● covariance: Σθ(xt, t)
88. Reverse Denoising Process
What is the dimension of the latent variable in diffusion models?
The same dimensionality as the diffused data.
Data manifold P(x0): x0 → … → xT (Noise)
The network (e.g., UNet, ViT) learns to denoise step by step: xt → xt−1.
89. Relation with VAEs
Diffusion models can be considered a special form of VAE. However, in diffusion models:
● the encoder is fixed;
● the latent variables have the same dimension as the data;
● the decoder is run multiple times, in an autoregressive fashion.
VAE: Encoder qθ(z|x) → z → Decoder pθ(x|z)
Diffusion: forward “encoding” → fixed; reverse “decoding” → trainable
91. Latent Diffusion
#LDM (Munich, Heidelberg) Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image
synthesis with latent diffusion models." CVPR 2022. [talk]
92. Latent Diffusion
#LDM (Munich, Heidelberg) Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image
synthesis with latent diffusion models." CVPR 2022. [talk]
Figure: a CLIP text encoder (768-d embedding) conditions a denoising UNet operating on a 4-channel latent; the denoised latent is decoded by a VQ-VAE decoder. Example prompt: “The girl with pearl earring by Johannes Vermeer.”
95. Joint probability distribution
Model the joint probability distribution of data x as a product of element-wise conditional distributions: each element xi is conditioned on the previous elements x1, …, xi−1.
Example:
○ An image x of size (n, n) is understood as a sequence of pixels x1, …, xn·n.
● Apply the probability chain rule: xi is the i-th pixel in the image.
95
#PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016.
96. PixelRNN used a deep neural network (an RNN) to model the conditional distribution of each pixel given the previous ones.
96
#PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016.
Joint probability distribution
Recurrent layer (RNN)
97. PixelRNN
97
#PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016.
Why are the completions not all identical?
(aka: how can AR models offer generative behaviour?)
99. 99
Auto-Regressive (AR)
An auto-regressive approach, where previous outputs become inputs at future steps, also allows modelling the data distribution.
x[t−L], …, x̂[t] → x̂[t+1]
x[t−L+1], …, x̂[t+1] → x̂[t+2]
x[t−L+2], …, x̂[t+2] → x̂[t+3]
100. Wavenet
100
Wavenet used dilated convolutions to produce synthetic audio, sample by sample, conditioned on a receptive field of size T:
#Wavenet Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal
Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv 2016. [blog]
101. 101
Training AR with Teacher Forcing
#TeacherForcing Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent
neural networks." Neural computation 1, no. 2 (1989): 270-280.
AR model (e.g., CNN, MLP, RNN…)
x[t−L], …, x̂[t] → x̂[t+1]
x[t−L+1], …, x[t+1] → x̂[t+2]
Known label: during training, the ground-truth x[t+1] replaces the model’s own prediction x̂[t+1] as input.
103. The Transformer
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
The Transformer removed the recurrence mechanism thanks to self-attention...
104. The Transformer
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
…which allows parallelizing across multiple machines using teacher forcing.
AR
106. Zero-shot learning
#GPT-2 Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, Ilya Sutskever, “Better
Language Models and Their Implications”. OpenAI Blog 2019.
“GPT-2 is trained with a simple objective: predict the next word, given all of the
previous words within some text.”
Zero-shot task performances
(GPT-2 was never trained for these tasks)
107. #iGPT Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. Generative Pretraining from Pixels. ICML
2020. [blog]
iGPT - 1.5B params (similar to GPT-2)
Training: Next-pixel prediction or masked pixel prediction.
Inference: autoregressive generation of zq’s with a transformer (global attention).
Input resolutions = {32², 48², 96², 192²} × 3
Model resolutions = {32², 48²}
108. #iGPT Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. Generative Pretraining from Pixels. ICML
2020. [blog]
iGPT - 1.5B params (similar to GPT-2)
109. #DALL·E (OpenAI) Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya
Sutskever. "Zero-shot text-to-image generation." ICML 2021.
Two-Stage Approaches: DALL·E (12B params)
Stage 1: Discrete variational autoencoder (dVAE).
Stage 2: Autoregressive transformer
110. 110
#Parti (Google) Yu, Jiahui, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan et al. "Scaling autoregressive models
for content-rich text-to-image generation." TMLR 2023. [blog]
Scaling up to 20B params
Two-Stage Approaches: Parti (20B params)
111. Learn more about AR models
Nal Kalchbrenner, Mediterranean Machine Learning
Summer School 2022.
115. Deep Generative
Learning for All
(a.k.a. The GenAI Hype)
Xavier Giro-i-Nieto
@DocXavi
xavigiro.upc@gmail.com
Associate Professor (on leave)
Universitat Politècnica de Catalunya
Institut de Robòtica Industrial
ELLIS Unit Barcelona