Do Deep Generative Models* Know
What They Don't Know?
Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, Balaji Lakshminarayanan
(DeepMind)
ICLR 2019
*Fake news, no GANs
Presented by: Julius Hietala
TL;DR
Normalizing flows, VAEs, and PixelCNNs aren't reliable enough to detect out-of-distribution data*
*in some interesting cases
Outline
• Paper introduction
• Some notes
• How do normalizing flows work?
• Paper experiments
• Paper findings
• Conclusions
• Discussion
Paper introduction
• Density estimation is used in many applications (anomaly detection, transfer learning, etc.)
• These applications have spawned interest in deep generative models
• Currently popular choices are VAEs, GANs, autoregressive models, and invertible latent-variable models
• The latter two are interesting because they allow exact likelihood computation
• Main question of the paper: can these models be used for anomaly detection?
Some notes
• The authors report results for VAEs, PixelCNNs, and normalizing flows
• Only normalizing flows are discussed and studied in depth
• Is their analysis applicable to all the different model types?
How do normalizing flows work?
• Change of variables (f maps data x to latent z; g = f⁻¹ maps z back to x):
  • p_x(x) = p_z(z) · |dz/dx|
  • ⟹ p_x(x) = p_z(f(x)) · |df/dx|
*Illustration stolen from here: https://www.youtube.com/watch?v=P4Ta-TZPVi0
How do normalizing flows work?
• In multiple dimensions this is p_x(x) = p_z(f(x)) · |det(∂f/∂x)|
• We want to determine p_x(x)
• We can choose p_z(z) as we wish (usually a Gaussian)
• We can choose f (invertible, g = f⁻¹)
• Challenges?
How do normalizing flows work?
• Calculating det(∂f/∂x) (the Jacobian determinant) could be hard
• Designing f to be invertible might be a challenge
• Flow-based models are designed so that both of these are easy
• Jacobian determinant:
  • Make the Jacobian triangular so that only the diagonal terms matter
  • Make the diagonal elements easy to calculate
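The triangular trick is easy to verify numerically. The toy sketch below (not from the paper) checks that the determinant of a triangular matrix is just the product of its diagonal entries, which is why a triangular Jacobian makes the log-determinant cheap:

```python
import numpy as np

# For a triangular matrix the determinant equals the product of the
# diagonal, so a triangular Jacobian makes log|det J| an O(D) sum.
rng = np.random.default_rng(0)
J = np.tril(rng.normal(size=(5, 5)))  # a lower-triangular "Jacobian"
assert np.allclose(np.linalg.det(J), np.prod(np.diag(J)))
```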
How do normalizing flows work?
• Example from RealNVP (https://arxiv.org/pdf/1605.08803.pdf): an affine coupling layer
  y_{1:d} = x_{1:d}
  y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d})
  *s and t are neural networks
• Even with multiple composed steps of "flow", the Jacobian determinant remains tractable since det(AB) = det(A) · det(B)
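The coupling idea can be sketched in a few lines of NumPy. This is a hypothetical toy version (one-layer stand-ins for the s and t networks instead of deep CNNs), meant only to show the mechanics: the forward pass, the exact inverse, and the cheap log-determinant:

```python
import numpy as np

# Toy RealNVP-style affine coupling layer. The s and t "networks" are
# hypothetical one-layer stand-ins; real models use deep CNNs.
rng = np.random.default_rng(0)
D, d = 4, 2                                   # input dim, split point
Ws = rng.normal(size=(d, D - d)) * 0.1
Wt = rng.normal(size=(d, D - d)) * 0.1

def s(x1): return np.tanh(x1 @ Ws)            # log-scale "network"
def t(x1): return x1 @ Wt                     # translation "network"

def forward(x):
    """Map x -> y and return log|det J| (a sum of the log-scales)."""
    x1, x2 = x[:d], x[d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)
    log_det = np.sum(s(x1))                   # Jacobian is triangular
    return np.concatenate([x1, y2]), log_det

def inverse(y):
    """Exact inverse: undo the affine transform of the second half."""
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2])

x = rng.normal(size=D)
y, log_det = forward(x)
assert np.allclose(inverse(y), x)             # invertible by construction
```

Stacking such layers (alternating which half is transformed) keeps both invertibility and the tractable log-determinant, since the log-determinants of the composed steps simply add.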
How do normalizing flows work?
• So we are able to determine p_x(x)
• For generation, we sample from p_x(x): sample from p_z(z) and "flow" the sample back through g = f⁻¹
• For likelihood estimation (anomaly detection etc.), we just "flow" x through the model to get the likelihood p_x(x) = p_z(f(x)) · |det(∂f/∂x)|
• Models are trained simply by maximizing the log-likelihood: θ* = argmax_θ log p_x(x; θ)
• Glow demo: https://openai.com/blog/glow/
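As a concrete 1D example of the likelihood formula: if we (hypothetically) fix f(x) = log(x) with a standard-Gaussian base p_z, the change-of-variables formula recovers the log-normal density exactly:

```python
import numpy as np

# 1D change-of-variables sketch with a fixed, hypothetical flow
# f(x) = log(x), g(z) = exp(z), and a standard-Gaussian base p_z.
def log_px(x):
    z = np.log(x)                                    # z = f(x)
    log_pz = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)   # log p_z(f(x))
    log_det = -np.log(x)                             # log|df/dx| = -log x
    return log_pz + log_det                          # log-normal log-density
```

In a real flow, f has learnable parameters θ and training maximizes this quantity over the data.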
Paper experiments
• Train the model (Glow) on one data set (in-distribution), then compute likelihoods both for the training data (in-distribution) and for another data set that was not used in training (out-of-distribution)
• Data set pairs (in-distribution vs. out-of-distribution):
  • FashionMNIST vs. MNIST
  • CIFAR-10 vs. SVHN
  • CelebA vs. SVHN
  • ImageNet vs. CIFAR-10/CIFAR-100/SVHN
Paper findings
• Likelihood histograms for each pair (result figures omitted):
  • FashionMNIST vs. MNIST
  • CIFAR-10 vs. SVHN
  • CelebA vs. SVHN
  • ImageNet vs. CIFAR-10/CIFAR-100/SVHN
  • Other model types
• In each pair, the out-of-distribution data set was assigned higher likelihoods than the in-distribution data
Paper findings
• The observations presented above are the main contributions of the paper; the next points need a grain of salt
• The authors try to explain the phenomenon, but the explanation raised many questions from the reviewers
• Change-of-variables formula* term analysis:
  *p_x(x) = p_z(f(x)) · |det(∂f/∂x)|
Paper findings
• They make the model "constant volume" (CV), i.e. det(∂f/∂x) is constant
Paper findings
• Explanation of the phenomenon, making a lot of assumptions:
  • Training distribution x ~ p*, "adversarial distribution" x ~ q, generative model p(x; θ)
  • q will have higher likelihood than p* if E_q[log p(x; θ)] − E_p*[log p(x; θ)] > 0
  • Assumptions:
    • Second-order expansion around x₀
    • E_q[x] = E_p*[x] = x₀ (some empirical support in the example case)
    • The latent distribution is Gaussian
    • A constant-volume flow is used
    • q = SVHN, p* = CIFAR-10
Paper findings
• For q = SVHN, p* = CIFAR-10, the assumptions given, and the empirical variances of the data,
  E_q[log p(x; θ)] − E_p*[log p(x; θ)] > 0
  simplifies to
  (1 / (2σ_ψ²)) · (α₁² · 12.3 + α₂² · 6.5 + α₃² · 14.5) ≥ 0, where α_c = Σ_{k=1}^{K} Σ_{j=1}^{C} u_{k,c,j}
• E_q[log p(x; θ)] − E_p*[log p(x; θ)] is thus always greater than or equal to zero, since α_c² ≥ 0
• This predicts that SVHN will be more likely than CIFAR-10
Paper findings
• They then hypothesize that artificially reducing the variance of the data will increase the likelihood
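This intuition is easy to reproduce in a toy setting (not the paper's Glow experiment): fit a Gaussian to high-variance data, then score lower-variance data centred at the same mean; the lower-variance data gets a higher average likelihood, mirroring the SVHN-vs-CIFAR-10 effect:

```python
import numpy as np

# Toy analogue of the variance effect: a density fit to high-variance
# data assigns a HIGHER average likelihood to lower-variance samples.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=100_000)   # "in-distribution" data
mu, sigma = train.mean(), train.std()        # fitted Gaussian "model"

def mean_loglik(x):
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

ood = rng.normal(0.0, 0.5, size=100_000)     # lower-variance "OOD" data
assert mean_loglik(ood) > mean_loglik(train)
```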
Conclusions
• Cause for pause when using generative models for anomaly detection
• A second-order analysis is provided (only applicable to a certain type of flow, and relying on many assumptions)
• The authors urge further study of the subject
Discussion
• How valid/applicable is their analysis?
• How come samples do not look like the OOD images if they
have higher likelihood?
