Deep Generative
Learning for All
(a.k.a. The GenAI Hype)
Xavier Giro-i-Nieto
@DocXavi
xavigiro.upc@gmail.com
Associate Professor (on leave)
Universitat Politècnica de Catalunya
Institut de Robòtica Industrial
ELLIS Unit Barcelona
Acknowledgements
Santiago Pascual
@santty128
PhD 2019
Universitat Politècnica de Catalunya
Technical University of Catalonia
Albert Pumarola
@AlbertPumarola
PhD 2021
Institut de Robòtica Industrial (IRI)
Universitat Politècnica de Catalunya (UPC)
Kevin McGuinness
Research Fellow
Insight Centre for Data Analytics
Dublin City University
Gerard I. Gállego
PhD Student
Universitat Politècnica de Catalunya
gerard.ion.gallego@upc.edu
@geiongallego
Acknowledgements
Eduard Ramon
Applied Scientist
Amazon Barcelona
@eram1205
Wentong Liao
Applied Scientist
Amazon Barcelona
Ciprian Corneanu
Applied Scientist
Amazon Seattle
Laia Tarrés
PhD Student
Universitat Pompeu Fabra
laia.tarres@upf.edu
Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Latent variable
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
Image generation
#StyleGAN3 (NVIDIA) Karras, Tero, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and
Timo Aila. "Alias-free generative adversarial networks." NeurIPS 2021. [code]
#DiT (UC Berkeley, NYU) Peebles, William, and Saining Xie. "Scalable Diffusion Models with Transformers." ICCV 2023.
Image generation
#Imagen2 (Google Deepmind) Blog post
Text-to-Image (T2I) generation
“A shot of a 32-year-old female, up-and-coming conservationist in a jungle; athletic, with short, curly hair and a warm smile.”
#DALL-E-2 (OpenAI) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen "Hierarchical Text-Conditional Image Generation with CLIP
Latents." 2022. [blog]
#DALL·E-3 (OpenAI) James Betker, Gabriel Goh, et al, “Improving Image Generation with Better Captions” 2023 [blog]
Text-to-Image (T2I) generation
“An expressive oil painting of a basketball player dunking, depicted as an explosion of a nebula.”
Text-to-Music (T2M) generation
(Stability AI) Evans, Zach, Julian D. Parker, C. J. Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. "Long-form music generation with latent diffusion."
arXiv 2024.
Text-to-Video (T2V) generation
#Sora (OpenAI) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman,
Clarence Wing Yin Ng, Ricky Wang, Aditya Ramesh. Video generation models as world simulators. OpenAI 2024.
Human Motion Transfer
#DreamMoving Mengyang Feng, Jinlin Liu, Kai Yu, et al., DreaMoving: A Human Video Generation Framework based on Diffusion Models. arXiv 2023.
“A girl, smiling, dancing in a wooden
house, wearing sweater, and long
pants.”
“A girl, smiling, dancing in the park
with golden leaves in autumn,
wearing light blue dress.”
“A girl, smiling, dancing in Times
Square, wearing dress-like white
shirt, with long sleeves, long pants.”
Image Editing & Text-to-Video (T2V) generation
#EMU (Meta) Girdhar, Rohit, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and
Ishan Misra. "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning." arXiv 2023. [blog]
#Imagine Flash (Meta) Jonas Kohler, Albert Pumarola, Edgar Schönfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, and Ali Thabet. Imagine
Flash: Accelerating Emu Diffusion Models with Backward Distillation. Meta 2024.
Text-to-Image (T2I) generation
Albert Pumarola
@AlbertPumarola
PhD 2021
Institut de Robòtica Industrial (IRI)
Universitat Politècnica de Catalunya (UPC)
Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Latent variable
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
Discriminative vs Generative Models
Philip Isola, Generative Models of Images. MIT 2023.
Outline
1. Motivation
2. Discriminative vs Generative Models
a. Pθ(Y|X): Discriminative Models
b. Pθ(X): Generative Models
c. Pθ(X|Y): Conditioned Generative Models
3. Latent variable
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
Pθ(Y|X): Discriminative Models
Slide credit:
Albert Pumarola (UPC 2019)
Examples of classification & regression tasks:
● Text → prob. of being a potential customer (regression)
● Image → “Jim Carrey” (classification)
● Audio → “What language?” (classification), speech translation
X = Data, Y = Labels, θ = Model parameters
Discriminative Modeling: Pθ(Y|X)
[Figure: an input image is fed to a network (θ) whose output is a distribution over classes, e.g. 0.01, 0.09, 0.90.]
Figure credit: Javier Ruiz (UPC TelecomBCN)
Discriminative model: tell me the probability of some ‘Y’ responses given ‘X’ inputs.
Pθ(Y | X = [pixel1, pixel2, …, pixel784])
Pθ(Y|X): Discriminative Models
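A minimal sketch of such a discriminative model, assuming PyTorch, 784-pixel inputs as in the slide, and 10 classes:

```python
import torch
import torch.nn as nn

# Small classifier: 784 flattened pixels -> probabilities over 10 classes.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.rand(1, 784)                  # one flattened image
p_y_given_x = model(x).softmax(dim=-1)  # P_theta(Y | X): a distribution over classes
print(p_y_given_x.sum())                # sums to 1
```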
Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Sampling
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
Slide Concept: Albert Pumarola (UPC 2019)
Pθ(X): Generative Models
Examples of classification, regression, and generative tasks:
● Text → prob. of being a potential customer (regression); generated text: “What about Ron magic?” offered Ron. To Harry, Ron was loud, slow and soft bird. Harry did not like to think about birds.
● Image → “Jim Carrey” (classification)
● Audio → “What language?” (classification), language translation; music composer and interpreter (MuseNet sample)
Discriminative Modeling: Pθ(Y|X)
Generative Modeling: Pθ(X)
X = Data, Y = Labels, θ = Model parameters
Each real sample xi comes from an M-dimensional probability distribution P(X).
X = {x1, x2, …, xN}
Pθ(X): Generative Models
1) We want our model with parameters θ to output samples with distribution Pθ(X), matching the distribution of our training data P(X).
2) We can then sample points from Pθ(X) that plausibly look like they were drawn from P(X).
[Figure: P(X), the distribution of the training data, approximated by a parametric model Pλ,μ,σ(X).]
Example: Gaussian Mixture Models (GMM)
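A minimal GMM sketch, assuming scikit-learn and a toy 2-D dataset: fit Pλ,μ,σ(X) to training data, then sample new points from it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy P(X): two well-separated 2-D clusters.
X = np.concatenate([np.random.randn(500, 2) - 3, np.random.randn(500, 2) + 3])
gmm = GaussianMixture(n_components=2).fit(X)  # estimate weights, means, covariances
new_samples, _ = gmm.sample(10)               # draw new points from the fitted model
```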
Pθ(X): Generative Models
What are the parameters θ we need to estimate in deep neural networks?
θ = (weights & biases)
[Figure: ? → Network (θ) → output.]
Pθ(X): Generative Models
Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Sampling
4. Architectures
a. GAN
b. Auto-regressive
c. VAE
d. Diffusion
Pθ(X|Y): Conditioned Generative Models
Conditional probabilities P(X|Y) model conditioning variables on the generative process:
X = {x1, x2, …, xN}
Y = {y1, y2, …, yN}
Example conditioning labels: object classes (DOG, CAT, TRUCK, PIZZA), genres (THRILLER, SCI-FI, HISTORY), phonemes (/aa/, /e/, /o/).
Outline
1. Motivation
2. Discriminative vs Generative Models
a. P(Y|X): Discriminative Models
b. P(X): Generative Models
c. P(X|Y): Conditioned Generative Models
3. Sampling
4. Architectures
a. Generative Adversarial Networks (GANs)
b. Auto-regressive
c. Variational Autoencoders (VAEs)
d. Diffusion
Our learned model should be able to make up new samples from the distribution,
not just copy and paste existing samples!
Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
Sampling
Philip Isola, Generative Models of Images. MIT 2023.
Sampling
Slide concept: Albert Pumarola (UPC 2019)
[Figure: learn a manifold Pθ(X) in feature space from the training dataset, then sample new points out of it to obtain generated samples.]
“Model the data distribution so that we can sample new points out of the distribution”
Sampling
[Figure: z → Generator (θ) → generated samples x’.]
How could we generate diverse samples from a deterministic deep neural network?
Sample z from a known prior, for example, a multivariate normal distribution N(0, I).
Example: dim(z) = 2
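A minimal sketch, assuming PyTorch: a deterministic generator becomes a sampler once its input z is drawn from a known prior N(0, I).

```python
import torch
import torch.nn as nn

# Toy generator: 2-D latent -> 784-dim sample (e.g. a flattened 28x28 image).
generator = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 784))
z = torch.randn(16, 2)   # 16 draws from the prior; dim(z) = 2 as in the slide
x_hat = generator(z)     # 16 diverse samples from the same deterministic network
```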
Interpolation
[Figure: interpolated samples obtained by traversing the learned manifold Pθ(X) in feature space.]
Traversing the learned manifold through interpolation.
Slide concept: Albert Pumarola (UPC 2019)
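A minimal sketch of latent-space traversal, reusing the toy generator above (linear interpolation is an assumption; spherical interpolation is also common):

```python
import torch

z1, z2 = torch.randn(2), torch.randn(2)  # two latent codes
for alpha in torch.linspace(0, 1, 8):
    z = (1 - alpha) * z1 + alpha * z2    # traverse the latent space
    x = generator(z)                     # decoded samples morph smoothly
```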
Disentanglement
Philip Isola, Generative Models of Images. MIT 2023.
Disentanglement
Philip Isola, Generative Models of Images. MIT 2023.
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
■ Generator & Discriminator Networks
■ Adversarial Training
○ Auto-regressive
○ Variational Autoencoders (VAEs)
○ Diffusion
Credit: Santiago Pascual [slides] [video]
Generator & Discriminator
We have two modules: Generator (G) and Discriminator (D).
● They “fight” against each other during training → Adversarial Learning
● D’s goal: classify between real samples and those produced by G.
● G’s goal: fool D into misclassifying.
Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. "Generative Adversarial Nets." NeurIPS 2014.
Discriminator
Discriminator network D → a binary classifier between real (x) and generated (x’) samples.
[Figure: x’ → Discriminator (θ) → Generated (1); x → Discriminator (θ) → Real (0).]
Generator & Discriminator
[Figure: a latent random variable z is sampled and fed to the Generator, which produces generated samples; real-world samples are drawn from a database; the Discriminator classifies both as Real/Generated, producing the loss.]
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
■ Generator & Discriminator Networks
■ Adversarial Training
○ Auto-regressive
○ Variational Autoencoders (VAEs)
○ Diffusion
Adversarial Training Analogy: is it fake money?
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake.
[Figure: D rejects a counterfeit banknote: “FAKE: It’s not even green”.]
Figure: Santiago Pascual (UPC)
Adversarial Training Analogy: is it fake money?
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake.
[Figure: D rejects an improved counterfeit: “FAKE: There is no watermark”.]
Figure: Santiago Pascual (UPC)
Adversarial Training Analogy: is it fake money?
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake.
[Figure: D rejects a still better counterfeit: “FAKE: Watermark should be rounded”.]
Figure: Santiago Pascual (UPC)
Adversarial Training Analogy: is it fake money?
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake.
After enough iterations, and if the counterfeiter is good enough (in terms of the G network, it means “has enough parameters”), the police should be confused.
[Figure: D can no longer tell: “REAL? FAKE?”.]
Figure: Santiago Pascual (UPC)
Adversarial Training
Alternate between training the discriminator and generator.
[Figure: a latent random variable is sampled and fed to the Generator (a neural network); its generated samples and real-world image samples both feed the Discriminator (a neural network), which predicts Real/Generated and produces the loss.]
Figure: Kevin McGuinness (DCU)
Adversarial Training: Discriminator
1. Fix generator weights; draw samples from both real-world and generated images.
2. Train the discriminator to distinguish between real-world and generated images.
[Figure: the error is backpropagated to update the discriminator weights only.]
Figure: Kevin McGuinness (DCU)
Adversarial Training: Discriminator
[Figure: the error is backpropagated to update the discriminator weights.]
Consider a binary encoding of “1” (Real) and “0” (Fake). Which of these two values should the discriminator predict when we train the discriminator with a generated image?
Figure: Kevin McGuinness (DCU)
Adversarial Training: Generator
1. Fix discriminator weights.
2. Sample from the generator by injecting noise.
3. Backprop the error through the discriminator to update the generator weights.
[Figure: the error is backpropagated to update the generator weights only.]
Figure: Kevin McGuinness (DCU)
Adversarial Training: Generator
[Figure: the error is backpropagated to update the generator weights.]
Consider a binary encoding of “1” (Real) and “0” (Fake). Which of these two values should the discriminator predict when we train the generator with a generated image?
Figure: Kevin McGuinness (DCU)
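A minimal sketch of the alternating loop, assuming PyTorch; `generator`, `discriminator`, `dataloader`, and `latent_dim` are placeholders for any suitable modules and data (labels: 1 = Real, 0 = Fake):

```python
import torch
import torch.nn.functional as F

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for real in dataloader:                        # hypothetical loader of real images
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)
    fake = generator(torch.randn(real.size(0), latent_dim))

    # Discriminator step: real -> 1, generated -> 0 (generator frozen via detach()).
    d_loss = F.binary_cross_entropy_with_logits(discriminator(real), ones) + \
             F.binary_cross_entropy_with_logits(discriminator(fake.detach()), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: discriminator weights stay fixed; G is rewarded when D says 1 (Real).
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```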
Conditional GANs (cGAN)
#StyleGAN-T Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila, "StyleGAN-T: Unlocking the Power of GANs for
Fast Large-Scale Text-to-Image Synthesis". ICML 2023
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
■ Generator & Discriminator Networks
■ Adversarial Training
■ Conditional GANs
○ Variational Autoencoders (VAEs)
○ Diffusion
○ Auto-regressive
Non-Conditional GANs
[Figure: a random seed (z) feeds the Generator G(·); its output and real-world samples feed the Discriminator D(·), which predicts Real/Generated.]
Slide credit: Víctor Garcia
Conditional GANs (cGAN)
Conditional Adversarial Networks
[Figure: a condition is fed to both the Generator G(·) and the Discriminator D(·); D predicts Real/Generated against real-world samples.]
Slide credit: Víctor Garcia
Learn more about GANs
Ian Goodfellow.
NeurIPS Barcelona 2016.
Mihaela Rosca & Jeff Donahue.
UCL x Deepmind 2020.
Soumith Chintala, “How to train a GAN ? Tips and
tricks to make GAN work”. Github 2016.
Learn more about GANs
Ian Goodfellow @ Lex Fridman (2020)
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Auto-Encoder (AE)
[Figure: x → Encode → z (feature space, learned manifold) → Decode → x’ (“Generate”); loss term: reconstruction loss.]
● Trained with a reconstruction loss.
● Proposed as a pre-training stage for the encoder (“self-supervised learning”).
#AE Hinton, Geoffrey E., Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 2006.
Auto-Encoder (AE) for generation?
Could we generate new samples by sampling from a normal distribution and feeding it into the encoder? And into the decoder (as in GANs)?
[Figure: ? → Encode → Decode → “Generate”, with the learned manifold in feature space.]
Auto-Encoder (AE) for generation?
Could we generate new samples by sampling from a normal distribution and feeding it into the encoder, or the decoder (as in GANs)?
No, because the noise (or encoded noise) would be out of the learned manifold.
[Figure: random samples fall off the learned manifold in feature space.]
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Source: Wikipedia. Image by Bscan - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=25235145
Maths 101: Multivariate normal distribution
If all dimensions are independent: Σ = σI, where σ ∈ ℝᵏ and I ∈ ℝᵏˣᵏ is the identity matrix.
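A minimal sketch, assuming NumPy: with independent dimensions the covariance is diagonal, Σ = σI.

```python
import numpy as np

k, sigma = 3, 0.5
cov = sigma * np.eye(k)                              # Σ = σI
z = np.random.multivariate_normal(np.zeros(k), cov, size=100000)
print(np.cov(z.T).round(2))                          # ≈ σ on the diagonal, ≈ 0 elsewhere
```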
VAE - Training
Encoder (training only): instead of mapping the input x to a fixed vector, we map it into a multivariate normal distribution z ~ N( μ(X), Σ(X) ).
Loss term #1: KL divergence, DKL[ N( μ(X), Σ(X) ) || N( μ, σI ) ]
[Figure: x → Encode → Sample z ~ N( μ(X), Σ(X) ).]
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
VAE - Training
Decoder: reconstruct the input data from z ~ N( μ(X), Σ(X) ).
Loss term #2: reconstruction loss.
[Figure: x → Encode → Sample z → Decode → x’.]
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
VAE - Training
Do you see any challenge in this scheme if we use deep neural networks to estimate Q(z|X) and P(X|z)?
[Figure: x → Encode → Sample z → Decode → x’; loss term #2: reconstruction loss.]
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
VAE - Training
Challenge: we cannot backprop through “Sampling” because it is not a differentiable operation.
[Figure: x → Encode → Sample z → Decode → x’; loss term #2: reconstruction loss.]
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
VAE - Training
Sampling: the reparametrization trick uses an auxiliary variable ε ~ N(0, I) to obtain a sample z = μ + σ ⊙ ε from the distribution N( μ, σI ), so gradients can flow through μ and σ.
[Figure: ε ~ N(0, I) → z → Decode → x’; loss term #2: reconstruction loss.]
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
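A minimal VAE training sketch, assuming PyTorch, a diagonal Gaussian posterior, a standard normal prior, and the reparametrization trick:

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 16)   # predicts mu and log-variance of a 16-dim z
dec = nn.Linear(16, 784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(32, 784)                      # a batch of flattened images
mu, logvar = enc(x).chunk(2, dim=-1)
eps = torch.randn_like(mu)                   # auxiliary variable eps ~ N(0, I)
z = mu + (0.5 * logvar).exp() * eps          # reparametrization: differentiable sampling
x_hat = dec(z)

recon = ((x_hat - x) ** 2).sum(-1).mean()                           # loss term #2
kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(-1)).mean()  # loss term #1
(recon + kl).backward()
opt.step()
```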
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
VAE Inference
How can we generate new samples from a trained VAE?
We can sample from our prior ε ~ N(0, I), discarding the encoder path.
[Figure: ε ~ N(0, I) → z → Decode → new sample.]
Generative behaviour
[Figure: new samples decoded from z ~ N(0, I).]
Generative behaviour
#NVAE Vahdat, Arash, and Jan Kautz. "NVAE: A deep hierarchical variational autoencoder." NeurIPS 2020. [code]
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
■ Limitations of the AE
■ VAE training
■ VAE inference
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
VQ-VAE
#VQ-VAE (Deepmind) Van Den Oord, Aaron, Oriol Vinyals, Koray Kavukcuoglu. "Neural discrete representation learning." NeurIPS 2017.
The Z space is a discretized, learned embedding space (vector quantization): every encoded point z(x) is mapped to its nearest embedding e, which is the information given to the decoder to reconstruct the sample.
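A minimal vector-quantization sketch, assuming PyTorch: each encoder output is snapped to its nearest codebook embedding.

```python
import torch

codebook = torch.randn(512, 64)        # 512 learned embeddings of dimension 64
z = torch.randn(32, 64)                # encoder outputs z(x) for a batch
dist = torch.cdist(z, codebook)        # distances to every embedding
e = codebook[dist.argmin(dim=-1)]      # nearest embedding per point -> decoder input
```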
Learn more about VAEs
Andriy Mnih (UCL - Deepmind 2020)
Max Welling - University of Amsterdam (2020)
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Diffusion
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
■ Forward diffusion process
■ Reverse denoising process
■ Latent diffusion models
○ Auto-regressive
#DDPM Jonathan Ho, Ajay Jain, Pieter Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
Diffusion-based
[Figure: data x0 → x1 → x2 → x3 → x4 → … → xT (noise): forward diffusion process (fixed) 😊. Noise xT → … → x0 (data): reverse diffusion process (must learn) 🤔.]
#DDPM Jonathan Ho, Ajay Jain, Pieter Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
Diffusion-based
[Figure: forward diffusion takes x0 on the data manifold P(x0) through x1, … to noise xT; reverse denoising maps the noise back onto the manifold.]
In which stage are neural networks needed?
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
■ Forward diffusion process
■ Reverse denoising process
■ Latent diffusion models
○ Auto-regressive
Forward Diffusion Process
Philip Isola, Generative Models of Images. MIT 2023.
Forward Diffusion Process
[Figure: data x0 → x1 → x2 → x3 → x4 → … → xT (noise); forward diffusion process (fixed).]
What is the distribution q(xt | xt-1)?
It is a normal distribution with:
● mean: (1 - βt)^(1/2) xt-1
● covariance: βt I
…with βt increasing from 0 to 1 with t.
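A minimal forward-diffusion sketch, assuming PyTorch and a linear β schedule: q(xt | xt-1) = N( (1 - βt)^(1/2) xt-1, βt I ).

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # beta_t grows with t
x = torch.rand(1, 784)                  # x_0: a data sample
for t in range(T):
    x = (1 - betas[t]).sqrt() * x + betas[t].sqrt() * torch.randn_like(x)  # x_t
# After T steps, x is approximately distributed as N(0, I).
```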
Forward Diffusion Process
[Figure: data x0 → x1 → x2 → x3 → x4 → … → xT (noise); forward diffusion process (fixed).]
What is q(xT | x0), the approximate distribution of xT when T is large (e.g. 1000)?
Forward Diffusion Process
[Figure: for large T, xT converges to pure noise, an isotropic Gaussian N(0, I).]
What is q(xT | x0), the approximate distribution of xT when T is large (e.g. 1000)?
Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. "Deep unsupervised learning using
nonequilibrium thermodynamics." ICML 2015.
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
■ Forward diffusion process
■ Reverse denoising process
■ Latent diffusion models
○ Auto-regressive
Related work: Denoising Autoencoder (DAE)
[Figure: a corrupted input → Encode → Decode → “Generate” (a clean reconstruction).]
#DAE Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features with denoising autoencoders." ICML 2008.
Philip Isola, Generative Models of Images. MIT 2023.
Reverse Denoising Process
[Figure: the network receives xt (the sample at time t) and the timestep t, to account for the schedule; data x0 ← x1 ← x2 ← x3 ← x4 ← … ← xT (noise).]
What is the distribution of xt-1, given xt?
A normal distribution with:
● mean: μθ(xt, t)
● covariance: Σθ(xt, t)
Reverse Denoising Process
What is the dimension of the latent variable in diffusion models?
The same dimensionality as the diffused data.
[Figure: a neural network (e.g. UNet, ViT) learns to denoise step by step, predicting xt-1 from xt, from noise xT back to an image x0 on the data manifold P(x0).]
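A minimal reverse-denoising (sampling) sketch in the DDPM style, reusing `T` and `betas` from the forward sketch above; `eps_model(x, t)` is a hypothetical trained noise-prediction network:

```python
import torch

alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

x = torch.randn(1, 784)    # start from pure noise x_T ~ N(0, I)
for t in reversed(range(T)):
    eps = eps_model(x, t)  # predict the noise present in x_t
    mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
    x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
# x is now a generated sample x_0.
```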
Relation with VAEs
Diffusion models can be considered a special form of VAE. However, in diffusion models:
● the encoder is fixed,
● the latent variables have the same dimension as the data,
● the decoder is run multiple times in an autoregressive fashion.
[Figure: VAE: Encoder qθ(z|x) → z → Decoder pθ(x|z). Diffusion: the forward “encoding” is fixed; the reverse “decoding” is trainable.]
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
■ Forward diffusion process
■ Reverse denoising process
■ Latent diffusion models
○ Auto-regressive
Latent Diffusion
#LDM (Munich, Heidelberg) Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image
synthesis with latent diffusion models." CVPR 2022. [talk]
Latent Diffusion
[Figure: a text prompt (“The girl with pearl earring by Johannes Vermeer.”) is embedded by a CLIP text encoder (dim 768) and conditions a denoising UNet; the sampler iteratively produces a denoised latent, which a VQ-VAE decoder maps back to pixel space.]
#LDM (Munich, Heidelberg) Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image synthesis with latent diffusion models." CVPR 2022. [talk]
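A minimal sketch of running such a pipeline, assuming the Hugging Face `diffusers` library and the public Stable Diffusion v1.5 checkpoint (which follows the LDM recipe: CLIP text encoder + denoising UNet + VAE decoder):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe("The girl with pearl earring by Johannes Vermeer.").images[0]
image.save("vermeer_girl.png")
```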
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
○ Auto-regressive
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
○ Auto-regressive Models (AR)
■ PixelRNN
■ PixelCNN
■ Transformer-based solutions
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Joint probability distribution
Model the joint probability distribution of data x as a product of element-wise conditional distributions for each element xi, conditioned on the previous elements x1, …, xi-1:
p(x) = ∏i p(xi | x1, …, xi-1)
● Example: an image x of size (n, n) is understood as a sequence of pixels x1, …, xn*n.
● Apply the probability chain rule: xi is the i-th pixel in the image.
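A minimal sketch of this factorization, assuming PyTorch, binary pixels, and a toy conditional model in place of a real PixelRNN:

```python
import torch

def p_next(prev):   # hypothetical p(x_i | x_1, ..., x_{i-1})
    return torch.sigmoid(prev.sum() * 0.01)

n, pixels = 28, []
for i in range(n * n):                         # one pixel at a time, in raster order
    p = p_next(torch.tensor(pixels, dtype=torch.float32))
    pixels.append(torch.bernoulli(p).item())   # sampling here makes completions diverse
image = torch.tensor(pixels).reshape(n, n)
```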
#PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016.
PixelRNN used a deep neural network (an RNN) to model p(xi | x1, …, xi-1).
PixelRNN
[Figure: the joint probability distribution is modelled with a recurrent layer (RNN).]
#PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016.
#PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016.
Why are the completions not all identical? (a.k.a. how can AR offer a generative behaviour?)
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
○ Auto-regressive Models (AR)
■ PixelRNN
■ PixelCNN
■ Transformer-based solutions
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Auto-Regressive (AR)
An auto-regressive approach, where the previous outputs become inputs in future steps, also allows modelling p(x̂[t+1] | x[t-L], …, x[t]).
[Figure: a sliding window over the last L samples predicts the next one: x[t-L], …, x̂[t] → x̂[t+1]; x[t-L+1], …, x̂[t+1] → x̂[t+2]; x[t-L+2], …, x̂[t+2] → x̂[t+3]; and so on.]
Wavenet
Wavenet used dilated convolutions to produce synthetic audio, sample by sample, conditioned on a receptive field of size T:
#Wavenet Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv 2016. [blog]
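A minimal dilated-convolution sketch, assuming PyTorch: stacking dilations 1, 2, 4, 8 grows the receptive field exponentially with depth, as in Wavenet (causal padding omitted for brevity).

```python
import torch.nn as nn

layers = []
for d in (1, 2, 4, 8):
    layers += [nn.Conv1d(32, 32, kernel_size=2, dilation=d), nn.ReLU()]
stack = nn.Sequential(*layers)
# Receptive field: 1 + (1 + 2 + 4 + 8) = 16 input samples per output step.
```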
Training AR with Teacher Forcing
#TeacherForcing Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[Figure: during training, the AR model (e.g. CNN, MLP, RNN…) predicts x̂[t+1], x̂[t+2], … while its inputs are the known labels x[t-L], …, x[t+1] rather than its own previous predictions.]
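A minimal teacher-forcing sketch, assuming PyTorch: the model learns to predict x[t+1] from the ground-truth history, never from its own outputs.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)         # toy AR model over windows of L = 16 samples
x = torch.randn(1000)            # a training signal
windows = x.unfold(0, 16, 1)     # all length-16 ground-truth histories
inputs, targets = windows[:-1], x[16:]   # known labels serve as inputs and targets
loss = ((model(inputs).squeeze(-1) - targets) ** 2).mean()
loss.backward()
```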
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
○ Auto-regressive Models (AR)
■ PixelRNN
■ PixelCNN
■ Transformer-based solutions
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
The Transformer
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
The Transformer removed the recurrence mechanism thanks to self-attention...
The Transformer
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
…which allows parallelizing across multiple machines using teacher forcing.
The Transformer
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Zero-shot learning
#GPT-2 Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, Ilya Sutskever, “Better
Language Models and Their Implications”. OpenAI Blog 2019.
“GPT-2 is trained with a simple objective: predict the next word, given all of the
previous words within some text.”
Zero-shot task performances
(GPT-2 was never trained for these tasks)
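A minimal next-word sketch, assuming the Hugging Face `transformers` library and the public `gpt2` checkpoint:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
ids = tok("Deep generative models can", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=10, do_sample=True)  # sample the next words
print(tok.decode(out[0]))
```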
#iGPT Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. Generative Pretraining from Pixels. ICML 2020. [blog]
iGPT - 1.5B params (similar to GPT-2)
Training: next-pixel prediction or masked pixel prediction.
Inference: autoregressive generation of zq’s with a transformer (global attention).
Input resolutions = {32², 48², 96², 192²} × 3
Model resolutions = {32², 48²}
#iGPT Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. Generative Pretraining from Pixels. ICML
2020. [blog]
iGPT - 1.5B params (similar to GPT-2)
#DALL·E (OpenAI) Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya
Sutskever. "Zero-shot text-to-image generation." ICML 2021.
Two-Stage Approaches: DALL·E (12B params)
Stage 1: Discrete variational autoencoder (dVAE).
Stage 2: Autoregressive transformer
#Parti (Google) Yu, Jiahui, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan et al. "Scaling autoregressive models
for content-rich text-to-image generation." TMLR 2023. [blog]
Scaling up to 20B params
Two-Stage Approaches: Parti (20B params)
Learn more about AR models
Nal Kalchbrenner, Mediterranean Machine Learning
Summer School 2022.
Outline
1. Motivation
2. Discriminative vs Generative Models
3. Sampling
4. Architectures
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Denoising Diffusion Models (DDM)
○ Auto-regressive Models (AR)
Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
Recommended books
Source: David Foster
Interview of David Foster for Machine Learning Street Talk (2023)
Deep Generative
Learning for All
(a.k.a. The GenAI Hype)
Xavier Giro-i-Nieto
@DocXavi
xavigiro.upc@gmail.com
Associate Professor (on leave)
Universitat Politècnica de Catalunya
Institut de Robòtica Industrial
ELLIS Unit Barcelona

More Related Content

Similar to Deep Generative Learning for All - The Gen AI Hype (Spring 2024)

Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...
Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...
Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...Chris Hammerschmidt
 
Adversarial examples in deep learning (Gregory Chatel)
Adversarial examples in deep learning (Gregory Chatel)Adversarial examples in deep learning (Gregory Chatel)
Adversarial examples in deep learning (Gregory Chatel)MeetupDataScienceRoma
 
Yolos you only look one sequence
Yolos you only look one sequenceYolos you only look one sequence
Yolos you only look one sequencetaeseon ryu
 
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...NTNU
 
Ex nihilo nihil fit: A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...
Ex nihilo nihil fit:  A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...Ex nihilo nihil fit:  A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...
Ex nihilo nihil fit: A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...Antonio Lieto
 
The Success of Deep Generative Models
The Success of Deep Generative ModelsThe Success of Deep Generative Models
The Success of Deep Generative Modelsinside-BigData.com
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Striving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational ModellingStriving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational ModellingMarco Wirthlin
 
ISBA 2022 Susie Bayarri lecture
ISBA 2022 Susie Bayarri lectureISBA 2022 Susie Bayarri lecture
ISBA 2022 Susie Bayarri lecturePierre Jacob
 
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
variational bayes in biophysics
variational bayes in biophysicsvariational bayes in biophysics
variational bayes in biophysicschris wiggins
 
Steps Towards Building a Story Understanding Engine
Steps Towards Building a Story Understanding  EngineSteps Towards Building a Story Understanding  Engine
Steps Towards Building a Story Understanding EngineChristos Rodosthenous
 
Using model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolutionUsing model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolutionErick Matsen
 
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...Amit Sheth
 
Some Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdf
Some Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdfSome Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdf
Some Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdfKent Bye
 
[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...
[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...
[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...Sou Yoshihara
 
Knowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsKnowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsEnrico Palumbo
 

Similar to Deep Generative Learning for All - The Gen AI Hype (Spring 2024) (20)

Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...
Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...
Generative Adversarial Networks (GANs) at the Data Science Meetup Luxembourg ...
 
Adversarial examples in deep learning (Gregory Chatel)
Adversarial examples in deep learning (Gregory Chatel)Adversarial examples in deep learning (Gregory Chatel)
Adversarial examples in deep learning (Gregory Chatel)
 
Yolos you only look one sequence
Yolos you only look one sequenceYolos you only look one sequence
Yolos you only look one sequence
 
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
 
Deep Learning Representations for All (a.ka. the AI hype)
Deep Learning Representations for All (a.ka. the AI hype)Deep Learning Representations for All (a.ka. the AI hype)
Deep Learning Representations for All (a.ka. the AI hype)
 
Ex nihilo nihil fit: A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...
Ex nihilo nihil fit:  A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...Ex nihilo nihil fit:  A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...
Ex nihilo nihil fit: A COMMONSENSE REASONING FRAMEWORK FOR DYNAMIC KNOWLEDGE...
 
The Success of Deep Generative Models
The Success of Deep Generative ModelsThe Success of Deep Generative Models
The Success of Deep Generative Models
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Striving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational ModellingStriving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational Modelling
 
ISBA 2022 Susie Bayarri lecture
ISBA 2022 Susie Bayarri lectureISBA 2022 Susie Bayarri lecture
ISBA 2022 Susie Bayarri lecture
 
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
 
variational bayes in biophysics
variational bayes in biophysicsvariational bayes in biophysics
variational bayes in biophysics
 
Steps Towards Building a Story Understanding Engine
Steps Towards Building a Story Understanding  EngineSteps Towards Building a Story Understanding  Engine
Steps Towards Building a Story Understanding Engine
 
Using model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolutionUsing model-based statistical inference to learn about evolution
Using model-based statistical inference to learn about evolution
 
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
From NLP to NLU: Why we need varied, comprehensive, and stratified knowledge,...
 
Some Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdf
Some Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdfSome Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdf
Some Preliminary Thoughts on Artificial Intelligence - April 20, 2023.pdf
 
[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...
[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...
[Explained] "Partial Success in Closing the Gap between Human and Machine Vis...
 
Knowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsKnowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender Systems
 
graziani_bias.pdf
graziani_bias.pdfgraziani_bias.pdf
graziani_bias.pdf
 
Multimodal Deep Learning
Multimodal Deep LearningMultimodal Deep Learning
Multimodal Deep Learning
 

More from Universitat Politècnica de Catalunya

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...Universitat Politècnica de Catalunya
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoUniversitat Politècnica de Catalunya
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Universitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosUniversitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Universitat Politècnica de Catalunya
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Universitat Politècnica de Catalunya
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Universitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
 

Recently uploaded

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 

Recently uploaded (20)

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)

  • 14. Outline 1. Motivation 2. Discriminative vs Generative Models a. P(Y|X): Discriminative Models b. P(X): Generative Models c. P(X|Y): Conditioned Generative Models 3. Latent variable 4. Architectures a. GAN b. Auto-regressive c. VAE d. Diffusion
  • 15. 15 Discriminative vs Generative Models Philip Isola, Generative Models of Images. MIT 2023.
  • 16. Outline 1. Motivation 2. Discriminative vs Generative Models a. Pθ (Y|X): Discriminative Models b. Pθ (X): Generative Models c. Pθ (X|Y): Conditioned Generative Models 3. Latent variable 4. Architectures a. GAN b. Auto-regressive c. VAE d. Diffusion
  • 17. Pθ (Y|X): Discriminative Models 17 Slide credit: Albert Pumarola (UPC 2019) Classification Regression Text Prob. of being a Potential Customer Image Audio Speech Translation Jim Carrey What Language? X=Data Y=Labels θ = Model parameters Discriminative Modeling Pθ (Y|X)
  • 18. 18 Pθ(Y|X): Discriminative Models. Discriminative model: tell me the probability of some ‘Y’ responses given ‘X’ inputs: Pθ(Y | X = [pixel_1, pixel_2, …, pixel_784]). [Figure: input image → Network (θ) → output class probabilities, e.g. 0.01, 0.09, 0.9.] Figure credit: Javier Ruiz (UPC TelecomBCN)
  • 19. Outline 1. Motivation 2. Discriminative vs Generative Models a. P(Y|X): Discriminative Models b. P(X): Generative Models c. P(X|Y): Conditioned Generative Models 3. Sampling 4. Architectures a. GAN b. Auto-regressive c. VAE d. Diffusion
  • 20. 20 Slide Concept: Albert Pumarola (UPC 2019) Pθ (X): Generative Models Classification Regression Generative Text Prob. of being a Potential Customer “What about Ron magic?” offered Ron. To Harry, Ron was loud, slow and soft bird. Harry did not like to think about birds. Image Audio Language Translation Music Composer and Interpreter MuseNet Sample Jim Carrey What Language? Discriminative Modeling Pθ (Y|X) Generative Modeling Pθ (X) X=Data Y=Labels θ = Model parameters
  • 21. Each real sample x_i comes from an M-dimensional probability distribution P(X): X = {x_1, x_2, …, x_N}. Pθ(X): Generative Models
  • 22. 22 1) We want our model with parameters θ to output samples with distribution Pθ(X), matching the distribution of our training data P(X). 2) We can then sample points from Pθ(X) that plausibly look as if they were drawn from P(X). P(X): distribution of the training data. P_{λ,μ,σ}(X): distribution learned by the model. Example: Gaussian Mixture Models (GMM); see the sketch below. Pθ(X): Generative Models
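To make the GMM example concrete, here is a minimal sketch with scikit-learn; the toy two-cluster dataset and all names are illustrative, not taken from the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy training data standing in for an unknown distribution P(X).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, 500),
                    rng.normal(3.0, 1.0, 500)]).reshape(-1, 1)

# Fit theta = (weights, means, covariances) so that P_theta(X) ~ P(X).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Sample new points from the learned distribution P_theta(X).
new_samples, _ = gmm.sample(10)
print(new_samples.ravel())
```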
  • 23. 23 What are the parameters θ we need to estimate in deep neural networks ? θ = (weights & biases) output Network (θ) ? Pθ (X): Generative Models
  • 24. Outline 1. Motivation 2. Discriminative vs Generative Models a. P(Y|X): Discriminative Models b. P(X): Generative Models c. P(X|Y): Conditioned Generative Models 3. Sampling 4. Architectures a. GAN b. Auto-regressive c. VAE d. Diffusion
  • 25. Pθ(X|Y): Conditioned Generative Models Conditional probabilities P(X|Y) model conditioning variables Y on the generative process: X = {x_1, x_2, …, x_N} Y = {y_1, y_2, …, y_N} DOG CAT TRUCK PIZZA THRILLER SCI-FI HISTORY /aa/ /e/ /o/
  • 26. Outline 1. Motivation 2. Discriminative vs Generative Models a. P(Y|X): Discriminative Models b. P(X): Generative Models c. P(X|Y): Conditioned Generative Models 3. Sampling 4. Architectures a. Generative Adversarial Networks (GANs) b. Auto-regressive c. Variational Autoencoders (VAEs) d. Diffusion
  • 27. Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples! 27 Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow) Sampling
  • 28. Philip Isola, Generative Models of Images. MIT 2023. Sampling
  • 29. Slide concept: Albert Pumarola (UPC 2019) Learn Sample Out Training Dataset Generated Samples Feature space Manifold Pθ (X) “Model the data distribution so that we can sample new points out of the distribution” Sampling
  • 30. Sampling z Generated Samples How could we generate diverse samples from a deterministic deep neural network ? Generator (θ)
  • 31. Sampling Generated Samples How could we generate diverse samples from a deterministic deep neural network ? Generator (θ) Sample z from a known prior, for example, a multivariate normal distribution N(0, I). Example: dim(z)=2 x’ z
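A minimal sketch of this idea in PyTorch; the generator architecture below is an illustrative, untrained stand-in rather than a model from the slides. Diversity comes entirely from the random z, since the network itself is deterministic:

```python
import torch
import torch.nn as nn

# Illustrative (untrained) generator: maps a 2-D latent z to a 784-D sample x'.
generator = nn.Sequential(
    nn.Linear(2, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Tanh(),
)

z = torch.randn(16, 2)   # 16 latent codes sampled from the prior N(0, I)
x_prime = generator(z)   # 16 different generated samples
print(x_prime.shape)     # torch.Size([16, 784])
```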
  • 32. Slide concept: Albert Pumarola (UPC 2019) Learn Training Dataset Interpolated Samples Feature space Manifold Pθ (X) Traversing the learned manifold through interpolation. Interpolation
  • 33. Disentanglement Philip Isola, Generative Models of Images. MIT 2023.
  • 34. Disentanglement Philip Isola, Generative Models of Images. MIT 2023.
  • 35. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ■ Generator & Discriminator Networks ■ Adversarial Training ○ Auto-regressive ○ Variational Autoencoders (VAEs) ○ Diffusion
  • 36. 36 Credit: Santiago Pascual [slides] [video]
  • 37. 37 Generator & Discriminator We have two modules: Generator (G) and Discriminator (D). ● They “fight” against each other during training → Adversarial Learning. D’s goal: classify between real samples and those produced by G. G’s goal: fool D into misclassifying. Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative Adversarial Nets." NeurIPS 2014.
  • 38. 38 Discriminator Discriminator network D → binary classifier between real (x) and generated (x’) samples. [Figure: x’ → Discriminator (θ) → Generated (0); x → Discriminator (θ) → Real (1).]
  • 40. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ■ Generator & Discriminator Networks ■ Adversarial Training ○ Auto-regressive ○ Variational Autoencoders (VAEs) ○ Diffusion
  • 41. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 FAKE: It’s not even green Adversarial Training Analogy: is it fake money? Figure: Santiago Pascual (UPC)
  • 42. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 FAKE: There is no watermark Adversarial Training Analogy: is it fake money? Figure: Santiago Pascual (UPC)
  • 43. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 FAKE: Watermark should be rounded Adversarial Training Analogy: is it fake money? Figure: Santiago Pascual (UPC)
  • 44. Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. After enough iterations, and if the counterfeiter is good enough (in terms of G network it means “has enough parameters”), the police should be confused. REAL? FAKE? Adversarial Training Analogy: is it fake money? Figure: Santiago Pascual (UPC)
  • 45. Adversarial Training Generator Real world images Discriminator Real Loss Latent random variable Sample Sample Generated Alternate between training the discriminator and generator Neural Network Neural Network Figure: Kevin McGuinness (DCU)
  • 46. Adversarial Training: Discriminator Generator Real world images Discriminator Real Loss Latent random variable Sample Sample Generated 1. Fix generator weights, draw samples from both real world and generated images 2. Train discriminator to distinguish between real world and generated images Backprop error to update discriminator weights Figure: Kevin McGuinness (DCU)
  • 47. Adversarial Training: Discriminator Generator Discriminator Real Loss Latent random variable Sample Backprop error to update discriminator weights Figure: Kevin McGuinness (DCU) Generated Consider a binary encoding of “1” (Real) and “0” (Fake). Which of these two values should the discriminator predict when we train the discriminator with a generated image ?
  • 48. Adversarial Training: Generator 1. Fix discriminator weights 2. Sample from generator by injecting noise. 3. Backprop error through discriminator to update generator weights Generator Real world images Discriminator Real Loss Latent random variable Sample Sample Backprop error to update generator weights Figure: Kevin McGuinness (DCU) Generated
  • 49. Adversarial Training: Generator Generator Real world images Discriminator Real Loss Latent random variable Sample Sample Backprop error to update generator weights Figure: Kevin McGuinness (DCU) Generated Consider a binary encoding of “1” (Real) and “0” (Fake). Which of these two values should the discriminator predict when we train the generator with a generated image ?
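Putting slides 45–49 together: when training D, generated images carry the target label 0 (Fake); when training G, the same generated images are scored against the target 1 (Real), so the error backpropagated through D pushes G towards fooling it. A minimal alternating loop in PyTorch; the MLP architectures and the stand-in data loader are illustrative assumptions, not the networks from the slides:

```python
import torch
import torch.nn as nn

# Illustrative G and D; a real setup would use conv nets and a real dataset.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
real_batches = [torch.rand(32, 784) * 2 - 1 for _ in range(100)]  # stand-in data

for x_real in real_batches:
    B = x_real.size(0)
    ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)

    # Discriminator step: G fixed; D learns real -> 1, generated -> 0.
    x_fake = G(torch.randn(B, 64)).detach()   # detach: no G update here
    loss_D = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: D fixed; G is rewarded when D says "real" (target 1).
    loss_G = bce(D(G(torch.randn(B, 64))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```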
  • 50. Conditional GANs (cGAN) #StyleGAN-T Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila, "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis". ICML 2023
  • 51. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ■ Generator & Discriminator Networks ■ Adversarial Training ■ Conditional GANs ○ Variational Autoencoders (VAEs) ○ Diffusion ○ Auto-regressive
  • 52. Non-Conditional GANs 52 Slide credit: Víctor Garcia Discriminator D(·) Generator G(·) Real World Random seed (z) Real/Generated
  • 53. 53 Conditional GANs (cGAN) Slide credit: Víctor Garcia Conditional Adversarial Networks Real World Real/Generated Condition Discriminator D(·) Generator G(·)
  • 54. 54 Learn more about GANs Ian Goodfellow. NeurIPS Barcelona 2016. Mihaela Rosca & Jeff Donahue. UCL x Deepmind 2020.
  • 55. Soumith Chintala, “How to train a GAN ? Tips and tricks to make GAN work”. Github 2016. Learn more about GANs Ian Goodfellow @ Lex Friedman (2020)
  • 56. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Diffusion ○ Auto-regressive Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
  • 57. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ■ Limitations of the AE ■ VAE training ■ VAE inference ○ Diffusion ○ Auto-regressive Figure source: Lilian Weng, What are diffusion models?, Lil’Log 2021.
  • 58. Encode Decode “Generate” 58 Auto-Encoder (AE) z Feature space ● Trained with a reconstruction loss. ● Proposed as a pre-training stage for the encoder (“self-supervised learning”). #AE Hinton, Geoffrey E., Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 2006. x’ Learned manifold Loss term: Reconstruction loss
  • 59. 59 Auto-Encoder (AE) for generation ? Could we generate new samples by sampling from a normal distribution and feeding it into the encoder ? And into the decoder (as in GANs) ? Encode Decode “Generate” ? Learned manifold Feature space
  • 60. 60 Auto-Encoder (AE) for generation ? No, because the noise (or encoded noise) would be out of the learned manifold. Encode Decode “Generate” Could we generate new samples by sampling from a normal distribution and feeding it into the encoder, or the decoder (as in GANs) ? Learned manifold Feature space
  • 61. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ■ Limitations of the AE ■ VAE training ■ VAE inference ○ Diffusion ○ Auto-regressive Figure source: Lilian Weng, What are diffusion models?, Lil’Log 2021.
  • 62. 62 Source: Wikipedia. Image by Bscan - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=25235145 Maths 101: Multivariate normal distribution If all dimensions are independent: Σ = σI, where σ ∈ R^k and I ∈ R^{k×k} is the identity matrix.
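With a diagonal covariance, sampling reduces to scaling and shifting independent standard normals; a short NumPy illustration (all values arbitrary):

```python
import numpy as np

# Diagonal Gaussian: each coordinate has its own mean and standard deviation,
# so each dimension can be sampled independently (k = 3 here).
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([1.0, 0.5, 2.0])        # per-dimension standard deviations
z = mu + sigma * np.random.randn(3)      # one sample from the diagonal Gaussian
print(z)
```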
  • 63. 63 VAE - Training Encoder (training only): instead of mapping the input x to a fixed vector, we map it into a multivariate normal distribution z ~ N(μ(X), Σ(X)), from which we sample. Loss term #1: KL divergence D_KL[ N(μ(X), Σ(X)) ‖ N(0, I) ]. Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014.
  • 64. 64 VAE - Training Encode Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014. Decoder: Reconstruct the input data from z ~ N( μ(X), ∑(X) ). Decode Loss term #2: Reconstruction loss z Sample ~
  • 65. 65 VAE - Training Encode Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014. Decode Loss term #2: Reconstruction loss z Sample ~ Do you see any challenge in this scheme if we use deep neural networks to estimate Q(z|X) and P(X|z)?
  • 66. Sample ~ 66 VAE - Training Encode Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014. Decode Loss term #2: Reconstruction loss z Challenge: We cannot backprop through “Sampling” because it is not a differentiable operation.
  • 67. 67 VAE - Training Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." ICLR 2014. Sampling: The reparametrization trick uses an auxiliary variable ε~N(0,I) to obtain a sample z from the distribution N( μ, σI ). ε~N(0,I) Decode Loss term #2: Reconstruction loss z
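A minimal PyTorch sketch of the reparametrization trick and the two VAE loss terms, assuming an encoder that outputs μ and log σ²; the log-variance parameterization is a common convention, not something specified on the slides:

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I): the randomness is moved to an
    # input, so gradients can flow through mu and logvar during backprop.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x, x_recon, mu, logvar):
    # Loss term #2: reconstruction; loss term #1: closed-form KL divergence
    # between the diagonal Gaussian N(mu, sigma I) and the prior N(0, I).
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```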
  • 68. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ■ Limitations of the AE ■ VAE training ■ VAE inference ○ Diffusion ○ Auto-regressive Figure source: Lilian Weng, What are diffusion models?, Lil’Log 2021.
  • 69. VAE Inference 69 How can we generate new samples from a trained VAE ? ε~N(0,I) Decode z
  • 70. We can sample from our prior ε~N(0,I), discarding the encoder path. 70 Generative behaviour ε~N(0,I) Decode z
  • 71. 71 Generative behaviour #NVAE Vahdat, Arash, and Jan Kautz. "NVAE: A deep hierarchical variational autoencoder." NeurIPS 2020. [code]
  • 72. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ■ Limitations of the AE ■ VAE training ■ VAE inference ○ Diffusion ○ Auto-regressive Figure source: Lilian Weng, What are diffusion models?, Lil’Log 2021.
  • 73. 73 VQ-VAE #VQ-VAE (Deepmind) Van Den Oord, Aaron, Oriol Vinyals, Koray Kavukcuoglu. "Neural discrete representation learning." NeurIPS 2017. The Z space is a discrete, learned embedding space (vector quantization): every encoded point z(x) is mapped to its nearest embedding e, and e is what the decoder receives to reconstruct the sample.
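The nearest-embedding lookup at the heart of vector quantization, as a small PyTorch sketch; shapes and the helper name `quantize` are illustrative, and the paper additionally uses a straight-through gradient estimator and codebook losses, omitted here:

```python
import torch

def quantize(z_e, codebook):
    # Map each encoder output z_e (B, D) to its nearest codebook vector.
    # codebook: (K, D) learned embeddings.
    dists = torch.cdist(z_e, codebook)   # pairwise L2 distances, (B, K)
    idx = dists.argmin(dim=1)            # index of nearest embedding e
    return codebook[idx], idx            # quantized vectors (B, D), indices (B,)

z_e = torch.randn(5, 64)                 # toy encoder outputs
codebook = torch.randn(512, 64)          # toy codebook of K = 512 embeddings
z_q, idx = quantize(z_e, codebook)
```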
  • 74. Learn more about VAEs 74 Andriy Mnih (UCL - Deepmind 2020) Max Welling - University of Amsterdam (2020)
  • 75. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Diffusion ○ Auto-regressive Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
  • 76. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ■ Forward diffusion process ■ Reverse denoising process ■ Latent diffusion models ○ Auto-regressive
  • 77. #DDPM Jonathan Ho, Ajay Jain, Pieter Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. Diffusion-based Data x0 x1 x2 x3 x4 … xT Forward diffusion process (fixed) 😊 Data Reverse diffusion process (must learn) 🤔 Noise Noise
  • 78. #DDPM Jonathan Ho, Ajay Jain, Pieter Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. Diffusion-based Data Manifold P(x0 ) x0 x1 xT … Reverse denoising Forward diffusion Noise In which stage are neural networks needed ?
  • 79. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ■ Forward diffusion process ■ Reverse denoising process ■ Latent diffusion models ○ Auto-regressive
  • 80. Forward Diffusion Process Philip Isola, Generative Models of Images. MIT 2023.
  • 81. Forward Diffusion Process Data Noise x_0 x_1 x_2 x_3 x_4 … x_T Forward diffusion process (fixed). What is the distribution q(x_t | x_{t−1})? It is a normal distribution with: ● mean: (1 − β_t)^{1/2} x_{t−1} ● covariance: β_t I …with β_t increasing from 0 to 1 with t.
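One forward step, written out directly from that definition; the linear β schedule below is the one used in the DDPM paper, and treating it as a default here is an assumption:

```python
import torch

def forward_step(x_prev, beta_t):
    # q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    noise = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # beta_t grows with t
x = torch.randn(1, 784)                 # stand-in for a data sample x_0
for t in range(T):
    x = forward_step(x, betas[t])       # after T steps, x is close to N(0, I)
```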
  • 82. Forward Diffusion Process Data Noise x_0 x_1 x_2 x_3 x_4 … x_T Forward diffusion process (fixed). What is q(x_T | x_0), the approximate distribution of x_T when T is large (e.g. 1000)?
  • 83. Forward Diffusion Process Data Noise x_0 x_1 x_2 x_3 x_4 … x_T Forward diffusion process (fixed). What is q(x_T | x_0), the approximate distribution of x_T when T is large (e.g. 1000)? It approaches the standard normal N(0, I). Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. "Deep unsupervised learning using nonequilibrium thermodynamics." ICML 2015.
  • 84. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ■ Forward diffusion process ■ Reverse denoising process ■ Latent diffusion models ○ Auto-regressive
  • 85. Related work: Denoising Autoencoder (DAE) Encode Decode “Generate” #DAE Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. "Extracting and composing robust features with denoising autoencoders." ICML 2008.
  • 86. Philip Isola, Generative Models of Images. MIT 2023. Reverse Denoising process
  • 87. Reverse Denoising process Data Noise x_0 x_1 x_2 x_3 x_4 … x_T x_t: sample at time t; the timestep t is also an input, to account for the noise schedule. What is the distribution of x_{t−1}, given x_t? A normal distribution with: ● mean: μ_θ(x_t, t) ● covariance: Σ_θ(x_t, t)
  • 88. Reverse Denoising process What is the dimension of the latent variable in diffusion models ? Same dimensionality as the diffused data. Data Manifold P(x0 ) x0 xT Noise Image Network learns to denoise step by step NN (eg.UNet, ViT) xt-1 xt
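A sketch of the resulting step-by-step sampler (ancestral sampling as in DDPM, with σ_t² = β_t); `model` stands for a trained network that predicts the added noise, so the dummy predictor in the smoke test below is purely illustrative:

```python
import torch

@torch.no_grad()
def sample(model, betas, shape):
    # Start from pure noise x_T and apply the learned reverse step T times.
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(x, t)                        # predicted noise at step t
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise       # one denoising step: x_{t-1}
    return x

betas = torch.linspace(1e-4, 0.02, 1000)
x0 = sample(lambda x, t: torch.zeros_like(x), betas, (1, 784))  # smoke test only
```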
  • 89. Relation with VAEs Diffusion models can be considered a special form of VAE. However, in diffusion models: ● The encoder is fixed ● The latent variables have the same dimension as the data ● The decoder is run multiple times in an autoregressive fashion. VAE: Encoder q_θ(z|x), Decoder p_θ(x|z), latent z. Diffusion: forward “encoding” → fixed; reverse “decoding” → trainable.
  • 90. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ■ Forward diffusion process ■ Reverse denoising process ■ Latent diffusion models ○ Auto-regressive
  • 91. Latent Diffusion #LDM (Munich, Heidelberg) Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image synthesis with latent diffusion models." CVPR 2022. [talk]
  • 92. Latent Diffusion #LDM (Munich, Heidelberg) Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image synthesis with latent diffusion models." CVPR 2022. [talk] [Figure: “The girl with pearl earring by Johannes Vermeer.” → Text CLIP Encoder (768-d) → Denoising UNet sampler over a 4-channel latent → denoised latent → VQ-VAE Decoder.]
  • 93. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ○ Auto-regressive Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
  • 94. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ○ Auto-regressive Models (AR) ■ PixelRNN ■ PixelCNN ■ Transformer-based solutions Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
  • 95. Joint probability distribution Model the joint probability distribution of data x as a product of element-wise conditional distributions, each element x_i conditioned on the previous elements x_1, …, x_{i−1}. Example: ○ An image x of size (n, n) is understood as a sequence of pixels x_1, …, x_{n·n} ● Apply the probability chain rule (written out below), where x_i is the i-th pixel in the image. 95 #PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016.
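The chain-rule factorization the slide refers to (the equation itself was an image in the deck; this is its standard form from the PixelRNN paper):

```latex
p(x) = \prod_{i=1}^{n^{2}} p\left(x_i \mid x_1, \ldots, x_{i-1}\right)
```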
  • 96. PixelRNN used a deep neural network (an RNN) to model p(x_i | x_1, …, x_{i−1}). 96 #PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016. Joint probability distribution Recurrent layer (RNN)
  • 97. PixelRNN 97 #PixelRNN Van Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. ICML 2016. Why aren’t all completions identical? (a.k.a. how can AR offer a generative behaviour?)
  • 98. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ○ Auto-regressive Models (AR) ■ PixelRNN ■ PixelCNN ■ Transformer-based solutions Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
  • 99. 99 Auto-Regressive (AR) An auto-regressive approach, where the previous outputs become inputs in future steps, also allows modelling Pθ(X). [Figure: sliding windows in which each prediction x̂[t+1], x̂[t+2], x̂[t+3], … is appended to the input window x[t−L], …, x̂[t] for the next step; see the rollout sketch below.]
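A generic AR rollout in PyTorch; the window length and the `model` are illustrative assumptions, and any network mapping a length-L window to the next element would work:

```python
import torch

@torch.no_grad()
def generate(model, context, n_steps, window=16):
    # Each predicted element is appended to the sequence and becomes part of
    # the input window for the next prediction.
    x = context                                  # (B, L) seed sequence
    for _ in range(n_steps):
        next_x = model(x[:, -window:])           # predict x̂[t+1] from window
        x = torch.cat([x, next_x], dim=1)        # feed the prediction back in
    return x

model = torch.nn.Linear(16, 1)                   # toy stand-in predictor
out = generate(model, torch.randn(2, 16), n_steps=10)
print(out.shape)                                 # torch.Size([2, 26])
```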
  • 100. Wavenet 100 Wavenet used dilated convolutions to produce synthetic audio, sample by sample, conditioned on a receptive field of size T: #Wavenet Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv 2016. [blog]
  • 101. 101 Training AR with Teacher Forcing #TeacherForcing Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280. [Figure: an AR model (e.g. RNN, CNN, MLP…) predicts x̂[t+1] from the known window x[t−L], …, x[t]; the known label x[t+1] supervises the prediction; see the training sketch below.]
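Teacher forcing in code: during training, the model always conditions on the ground-truth history, never on its own (possibly wrong) predictions. A toy sketch, in which the linear model, window size and random data are all illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                     # toy next-sample predictor
opt = torch.optim.Adam(model.parameters())
seq = torch.randn(8, 100)                    # stand-in batch of sequences

for t in range(16, seq.size(1)):
    window = seq[:, t - 16 : t]              # ground-truth inputs, not x̂
    target = seq[:, t : t + 1]               # known next sample (the label)
    loss = nn.functional.mse_loss(model(window), target)
    opt.zero_grad(); loss.backward(); opt.step()
```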
  • 102. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ○ Auto-regressive Models (AR) ■ PixelRNN ■ PixelCNN ■ Transformer-based solutions Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
  • 103. The Transformer #Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention is all you need. NeurIPS 2017. The Transformer removed the recurrence mechanism thanks to self-attention...
  • 104. The Transformer #Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. Attention is all you need. NeurIPS 2017. …which allows parallelizing across multiple machines using teacher forcing.
  • 105. The Transformer Figure: Jay Alammar, “The illustrated Transformer” (2018)
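The core operation behind that parallelism, scaled dot-product self-attention, in a few lines of PyTorch (single head, no causal mask; in an AR decoder, a causal mask would additionally block attention to future positions):

```python
import math
import torch

def self_attention(x, W_q, W_k, W_v):
    # Every position attends to every other in one matrix product:
    # no recurrence, so the whole sequence is processed in parallel.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return torch.softmax(scores, dim=-1) @ V

x = torch.randn(10, 64)                          # 10 tokens, dimension 64
W_q, W_k, W_v = (torch.randn(64, 64) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)    # torch.Size([10, 64])
```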
  • 106. Zero-shot learning #GPT-2 Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, Ilya Sutskever, “Better Language Models and Their Implications”. OpenAI Blog 2019. “GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text.” Zero-shot task performances (GPT-2 was never trained for these tasks)
  • 107. #iGPT Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. Generative Pretraining from Pixels. ICML 2020. [blog] iGPT - 1.5B params (similar to GPT-2) Training: next-pixel prediction or masked pixel prediction. Inference: autoregressive generation of z_q’s with a transformer (global attention). Input resolutions = {32², 48², 96², 192²} × 3. Model resolutions = {32², 48²}.
  • 108. #iGPT Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., & Sutskever, I. Generative Pretraining from Pixels. ICML 2020. [blog] iGPT - 1.5B params (similar to GPT-2)
  • 109. #DALL·E (OpenAI) Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. "Zero-shot text-to-image generation." ICML 2021. Two-Stage Approaches: DALL·E (12B params) Stage 1: Discrete variational autoencoder (dVAE). Stage 2: Autoregressive transformer
  • 110. 110 #Parti (Google) Yu, Jiahui, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan et al. "Scaling autoregressive models for content-rich text-to-image generation." TMLR 2023. [blog] Scaling up to 20B params Two-Stage Approaches: Parti (20B params)
  • 111. Learn more about AR models Nal Kalchbrenner, Mediterranean Machine Learning Summer School 2022.
  • 112. Outline 1. Motivation 2. Discriminative vs Generative Models 3. Sampling 4. Architectures ○ Generative Adversarial Networks (GANs) ○ Variational Autoencoders (VAEs) ○ Denoising Diffusion Models (DDM) ○ Auto-regressive Models (AR) Figure source: Lilian Weng, What are diffusion models ?, Lil’Log 2021.
  • 114. Recommended books Interview of David Foster for Machine Learning Street Talk (2023)
  • 115. Deep Generative Learning for All (a.k.a. The GenAI Hype) Xavier Giro-i-Nieto @DocXavi xavigiro.upc@gmail.com Associate Professor (on leave) Universitat Politècnica de Catalunya Institut de Robòtica Industrial ELLIS Unit Barcelona