My slides on deep generative models, mainly normalizing flows for density estimation, presented at a deep learning seminar at Aalto University in fall 2019.
Slides for "Do Deep Generative Models Know What They Don't Know?"
1. Do Deep Generative Models* Know
What They Don't Know?
Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, Balaji Lakshminarayanan
(DeepMind)
ICLR 2019
*Fake news, no GANs
Presented by: Julius Hietala
3. TL;DR
Normalizing flows, VAEs, and PixelCNNs aren't reliable enough to detect out-of-distribution data*
*in some interesting cases
4. Outline
• Paper introduction
• Some notes
• How normalizing flows work
• Paper experiments
• Paper findings
• Conclusions
• Discussion
9. Paper introduction
• Density estimation is used in many applications (anomaly detection, transfer learning, etc.)
• These applications have spawned interest in deep generative models
• Currently popular choices are VAEs, GANs, autoregressive models, and invertible latent variable models
• The latter two are interesting because they allow exact likelihood computation
• Main question of the paper: can these models be used for anomaly detection?
12. Some notes
• The authors report results for VAEs, PixelCNNs, and normalizing flows
• Only normalizing flows are discussed and studied in depth
• Is their analysis applicable to all the different types of models?
16. How normalizing flows work
• In multiple dimensions this is
  $p_x(x) = p_z(f(x)) \left| \det \frac{\partial f}{\partial x} \right|$
  (see the sketch after this slide)
• We want to determine $p_x(x)$
• We can choose $p_z(z)$ as we wish (usually a Gaussian)
• We can choose $f$ (invertible, $g = f^{-1}$)
• Challenges?
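A minimal numeric sketch of the change-of-variables formula, assuming a hypothetical invertible affine map $f(x) = Ax + b$ and a standard-normal base $p_z$; real flows use learned nonlinear maps, this only checks the formula itself.

```python
# Sketch: change of variables for f(x) = A x + b, p_z = N(0, I).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.0, 1.5]])  # invertible: det(A) = 3 != 0
b = np.array([1.0, -1.0])

def log_px(x):
    """log p_x(x) = log p_z(f(x)) + log|det df/dx|."""
    z = A @ x + b                               # f(x)
    log_pz = multivariate_normal.logpdf(z, mean=np.zeros(2))
    log_det = np.log(abs(np.linalg.det(A)))     # Jacobian of f is simply A
    return log_pz + log_det

# Sanity check: x = A^{-1}(z - b) with z ~ N(0, I) is Gaussian with
# mean -A^{-1} b and covariance A^{-1} A^{-T}.
A_inv = np.linalg.inv(A)
reference = multivariate_normal(mean=-A_inv @ b, cov=A_inv @ A_inv.T)
x = rng.standard_normal(2)
print(log_px(x), reference.logpdf(x))           # the two values match
```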
20. How normalizing flows work
• Calculating the Jacobian determinant $\det \frac{\partial f}{\partial x}$ could be hard
• Designing $f$ to be invertible might be a challenge
• Flow-based models are designed so that both of these are easy
• Jacobian determinant (see the check below):
  • Make it triangular so that only the diagonal terms matter
  • Make the diagonal elements easy to calculate
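A quick illustrative check (not from the slides) of why triangular Jacobians are cheap: the determinant of a triangular matrix is the product of its diagonal, so the log-det costs O(D) instead of the O(D³) of a general determinant.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
J = np.tril(rng.standard_normal((D, D)))           # lower-triangular "Jacobian"

log_det_generic = np.log(abs(np.linalg.det(J)))    # generic O(D^3) route
log_det_diag = np.sum(np.log(abs(np.diag(J))))     # O(D) diagonal shortcut
print(np.allclose(log_det_generic, log_det_diag))  # True
```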
21. How normalizing flows work
• Example from RealNVP (https://arxiv.org/pdf/1605.08803.pdf), the affine coupling layer:
  $y_{1:d} = x_{1:d}$
  $y_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d})$
*s and t are neural networks
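A minimal NumPy sketch of such a coupling layer, with hypothetical one-layer networks standing in for s and t; real models use deep nets and stack many of these layers.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 4, 2                                    # split x into x[:d] and x[d:]
Ws, Wt = rng.standard_normal((2, D - d, d)) * 0.1

def s(x1): return np.tanh(Ws @ x1)             # toy scale network
def t(x1): return Wt @ x1                      # toy translation network

def forward(x):
    """y[:d] = x[:d];  y[d:] = x[d:] * exp(s(x[:d])) + t(x[:d])."""
    y = x.copy()
    y[d:] = x[d:] * np.exp(s(x[:d])) + t(x[:d])
    log_det = np.sum(s(x[:d]))                 # Jacobian is triangular with
    return y, log_det                          # diag = exp(s): log|det| = sum(s)

def inverse(y):
    """Note: inversion never needs to invert s or t themselves."""
    x = y.copy()
    x[d:] = (y[d:] - t(y[:d])) * np.exp(-s(y[:d]))
    return x

x = rng.standard_normal(D)
y, log_det = forward(x)
print(np.allclose(inverse(y), x))              # True: the layer is invertible
```

This is why invertibility comes for free: s and t can be arbitrarily complicated networks, since only their outputs appear in the (trivially invertible) affine update.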
23. How normalizing flows work
• Even with multiple levels of these steps of "flow", the Jacobian determinant remains tractable, since
  $\det(AB) = \det(A)\det(B)$
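In other words, the chain rule multiplies the layer Jacobians, so the log-dets of stacked layers simply add. A quick illustrative NumPy check of the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3, 3))
lhs = np.linalg.det(A @ B)
rhs = np.linalg.det(A) * np.linalg.det(B)
print(np.allclose(lhs, rhs))   # True
```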
28. How normalizing flows work
• So we are able to determine $p_x(x)$
• For generation, we would just sample from $p_x(x)$ (sample from $p_z(z)$ and "flow" the sample back in reverse)
• For likelihood estimation (anomaly detection and similar applications) we just "flow" $x$ through the model to get the likelihood given by
  $p_x(x) = p_z(f(x)) \left| \det \frac{\partial f}{\partial x} \right|$
• Models are optimized simply by maximizing the (log-)likelihood
  $\theta^* = \operatorname{argmax}_\theta \; \log p_x(x; \theta)$
  (a training sketch follows below)
• Glow demo: https://openai.com/blog/glow/
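A minimal PyTorch sketch of this maximum-likelihood training, assuming a single toy coupling layer on 2-D data; Glow itself stacks many coupling layers plus 1×1 convolutions and actnorm, so this only illustrates the objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D, d = 2, 1
base = torch.distributions.Normal(0.0, 1.0)     # p_z = N(0, I)

s_net = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, D - d))
t_net = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, D - d))
opt = torch.optim.Adam(list(s_net.parameters()) + list(t_net.parameters()),
                       lr=1e-2)

def log_px(x):
    """log p_x(x) = log p_z(f(x)) + log|det df/dx| for one coupling layer."""
    s, t = s_net(x[:, :d]), t_net(x[:, :d])
    z = torch.cat([x[:, :d], x[:, d:] * torch.exp(s) + t], dim=1)
    return base.log_prob(z).sum(dim=1) + s.sum(dim=1)

# Toy training data: a correlated 2-D Gaussian standing in for images.
x_train = torch.randn(1024, D) @ torch.tensor([[1.0, 0.8], [0.0, 0.6]])

for step in range(500):
    opt.zero_grad()
    loss = -log_px(x_train).mean()              # minimize negative log-lik.
    loss.backward()
    opt.step()
print(f"final NLL: {loss.item():.3f}")
```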
30. Paper experiments
• Train the model (Glow) on one data set (in-distribution); afterwards, determine likelihoods for the training data (in-distribution) and for another data set that was not used in training (out-of-distribution)
• Data set / distribution pairs (a scoring sketch follows below):
  • FashionMNIST vs. MNIST
  • CIFAR-10 vs. SVHN
  • CelebA vs. SVHN
  • ImageNet vs. CIFAR-10/CIFAR-100/SVHN
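A sketch of the evaluation step, under the assumption that some trained flow exposes a log_px function (e.g. the toy one above; all names here are hypothetical): score in-distribution and out-of-distribution batches and compare the likelihoods.

```python
import torch

def ood_report(log_px, x_in, x_out):
    """Summarize log-likelihoods for in-dist vs. out-of-dist batches."""
    with torch.no_grad():
        ll_in, ll_out = log_px(x_in), log_px(x_out)
    print(f"in-dist mean log-lik: {ll_in.mean():8.2f}")
    print(f"OOD     mean log-lik: {ll_out.mean():8.2f}")
    # Fraction of OOD points scored *above* the median in-dist point;
    # near zero would be ideal for likelihood-based anomaly detection.
    # The paper's surprising finding is that for pairs such as
    # CIFAR-10 vs. SVHN this fraction is large.
    frac = (ll_out > ll_in.median()).float().mean()
    print(f"OOD above in-dist median: {frac:.1%}")
```

Usage would be e.g. ood_report(log_px, x_cifar_test, x_svhn) for the CIFAR-10 vs. SVHN pair.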
42. Paper findings
• The observations presented were the main contributions of the paper; take the next points with a grain of salt
• They try to explain the phenomenon, but the explanation raised many questions from the reviewers
• Change-of-variables formula* term analysis:
  *$p_x(x) = p_z(f(x)) \left| \det \frac{\partial f}{\partial x} \right|$
43. Paper findings
• They make the model "constant volume" (CV), i.e. $\det \frac{\partial f}{\partial x}$ is constant (see the check below)
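A toy NumPy check of why this is achievable, under my reading that the CV model uses additive (scale-free) coupling: the Jacobian is then triangular with ones on the diagonal, so log|det| = 0 for every input (and the 1×1-convolution log-dets in CV-Glow are likewise constants independent of x).

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 4, 2
Wt = rng.standard_normal((D - d, d))

def additive_coupling(x):
    y = x.copy()
    y[d:] = x[d:] + Wt @ x[:d]      # translation only, no elementwise scale
    return y

# Numerical Jacobian at a random point: its determinant is exactly 1.
x = rng.standard_normal(D)
eps = 1e-6
J = np.stack([(additive_coupling(x + eps * e) - additive_coupling(x)) / eps
              for e in np.eye(D)], axis=1)
print(np.linalg.det(J))             # ~1.0 regardless of x
```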
44. Paper findings
• Explanation of the phenomenon, making a lot of assumptions:
  • Training distribution $x \sim p^*$ and "adversarial distribution" $x \sim q$; generative model $p(x; \theta)$
  • $q$ will have a higher likelihood than $p^*$ if
    $\mathbb{E}_q[\log p(x;\theta)] - \mathbb{E}_{p^*}[\log p(x;\theta)] > 0$
  • Assumptions:
    • Second-order expansion around $x_0$
    • Assuming $\mathbb{E}_q[x] = \mathbb{E}_{p^*}[x] = x_0$ (some empirical support in the example case)
    • Latent distribution is Gaussian
    • Using constant volume
    • $q$ = SVHN, $p^*$ = CIFAR-10
45. Paper findings
• For $q$ = SVHN, $p^*$ = CIFAR-10, the assumptions given, and the empirical variances of the data,
  $\mathbb{E}_q[\log p(x;\theta)] - \mathbb{E}_{p^*}[\log p(x;\theta)] > 0$
  simplifies to
  $\frac{1}{2\sigma^2}\left( \alpha_1^2 \cdot 12.3 + \alpha_2^2 \cdot 6.5 + \alpha_3^2 \cdot 14.5 \right) \ge 0$, where $\alpha_c = \sum_{k=1}^{K} \sum_{j=1}^{C} u_{k,j,c}$
• $\mathbb{E}_q[\log p(x;\theta)] - \mathbb{E}_{p^*}[\log p(x;\theta)]$ is thus always greater than or equal to zero, since $\alpha_c^2 \ge 0$
• This predicts that SVHN will be more likely than CIFAR-10
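A sketch of the second-order step behind this result, following my reading of the paper's analysis ($\Sigma_q$ and $\Sigma_{p^*}$ denote the data covariances):

```latex
% Expand \log p(x;\theta) to second order around x_0 and take expectations
% under q and p^*; with E_q[x] = E_{p^*}[x] = x_0 the first-order terms
% cancel and only the covariance terms survive:
\begin{align*}
\mathbb{E}_q\!\left[\log p(x;\theta)\right]
  - \mathbb{E}_{p^*}\!\left[\log p(x;\theta)\right]
  \approx \tfrac{1}{2}\,\operatorname{Tr}\!\left\{
      \nabla^2_{x_0} \log p(x_0;\theta)\,
      \left(\Sigma_q - \Sigma_{p^*}\right)\right\}
\end{align*}
% For a constant-volume flow with a Gaussian latent the curvature is
% negative (proportional to -1/\sigma^2), so a distribution with smaller
% per-channel variance than the training data (SVHN vs. CIFAR-10) gets a
% higher expected likelihood.
```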
46. Paper findings
• They then hypothesize that artificially reducing the variance of the data will increase the likelihood
47. Conclusions
• Cause to pause when using generative models for anomaly detection
• A second-order analysis is provided (only applicable to a certain type of flow, and with many assumptions)
• The authors urge further study on the subject