4. Outline
• Autoencoder
• Autoencoder and variational autoencoder
• VQVAE (VQVAE2)
• Autoregressive model
• PixelCNN
• GAN
• CycleGAN, RecycleGAN
• CycleGAN with guess discriminator
• Flow-based Model
• GLOW
5. Autoencoder
Latent variables: the variables that can express the output of the phenomenon.
Applications:
1. Denoising
2. Dimensionality reduction
3. Feature extraction
4. Segmentation
5. Energy measurement
…
7. Variational Autoencoder
Mean and variance
The “mean” and “variance” define a distribution.
If we constrain the latent variables to follow a normal distribution, we can create new latent
variables by sampling from that distribution.
The mean and variance should be as close to 0 and 1 as possible.
Strategy
1. The latent variables should follow a normal distribution.
2. Perturb the scattered discrete points with random noise, hoping to fill the regions of latent
space that have no samples (this can in fact be viewed as sampling from the normal
distribution). A minimal sketch of this strategy is shown below.
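A minimal PyTorch sketch of this strategy (the encoder outputs a mean and a log-variance, a latent is sampled with the reparameterization trick, and a KL term pushes the latent distribution toward N(0, 1)); the layer sizes and the fake batch are made up for illustration.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mean = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        # Reparameterization: sample from N(mean, var) via N(0, 1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        return self.dec(z), mean, logvar

def vae_loss(x, x_hat, mean, logvar):
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL(N(mean, var) || N(0, 1)): pushes the mean toward 0 and the variance toward 1
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon + kl

x = torch.rand(8, 784)                      # a fake batch of flattened images
model = TinyVAE()
x_hat, mean, logvar = model(x)
loss = vae_loss(x, x_hat, mean, logvar)
```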
12. The improvement of VQ
https://arxiv.org/pdf/1803.03382.pdf
The issues with VQ:
1. Only the codes selected by the encoder are updated.
2. Not all the codes in the codebook get used.
The exponential moving average (EMA) method
1. For each selected code, find all the embeddings assigned to it, compute the centroid of those
embeddings, and move the code toward that centroid. The per-code sample counts and embedding
sums are smoothed with an EMA (decay lambda = 0.999). This helps maintain the codebook.
2. Decomposing the vectors into small pieces helps use them more efficiently
=> vector slicing helps use the codes more efficiently.
A rough sketch of the EMA update is given below.
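A rough sketch of the EMA codebook update described above (the per-code counts and embedding sums are smoothed with decay 0.999, and each code moves toward the centroid of the encoder outputs assigned to it). Variable names and shapes are illustrative, not taken from any reference implementation.

```python
import torch
import torch.nn.functional as F

def ema_codebook_update(codebook, ema_count, ema_sum, z_e, decay=0.999, eps=1e-5):
    """codebook: (K, D) codes; z_e: (N, D) encoder outputs."""
    # Assign each encoder output to its nearest code
    dists = torch.cdist(z_e, codebook)                         # (N, K)
    idx = dists.argmin(dim=1)                                  # (N,)
    one_hot = F.one_hot(idx, codebook.size(0)).float()         # (N, K)

    # EMA of how many samples hit each code, and of the sum of those samples
    ema_count = decay * ema_count + (1 - decay) * one_hot.sum(dim=0)
    ema_sum = decay * ema_sum + (1 - decay) * (one_hot.t() @ z_e)   # (K, D)

    # Laplace smoothing so rarely used codes do not divide by zero
    n = ema_count.sum()
    count = (ema_count + eps) / (n + codebook.size(0) * eps) * n

    # Each code moves toward the centroid of its assigned embeddings
    codebook = ema_sum / count.unsqueeze(1)
    return codebook, ema_count, ema_sum

K, D, N = 64, 32, 256
codebook = torch.randn(K, D)
ema_count, ema_sum = torch.zeros(K), codebook.clone()
z_e = torch.randn(N, D)
codebook, ema_count, ema_sum = ema_codebook_update(codebook, ema_count, ema_sum, z_e)
```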
15. Can we use a CNN to handle the autoregressive model?
• The problem the autoregressive model faces is:
• “the model cannot get information from the future, but it can use any information from the past”.
So …
Maybe we can just make a toy setup in which the model only uses the previous information
(though this would not reflect the real causal ordering for image generation).
16. PixelCNN
[Figure: a 3×3 kernel and the same kernel with a mask applied (masked kernel)]
PixelCNN, which uses masked kernels, was proposed together with PixelRNN (which is based on LSTM).
However, PixelRNN performs better than PixelCNN.
Some attributed this to the LSTM “gates”, which let the RNN handle more complex dependencies.
A minimal sketch of a masked convolution is shown below.
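A minimal sketch of the masked-kernel idea (a type-A mask zeroes the center weight and everything after it in raster order, so each output pixel depends only on previously generated pixels); the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so the output at (i, j) never sees
    pixels at or after (i, j) in raster order (mask type 'A')."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.weight.shape[-2:]
        mask = torch.ones_like(self.weight)
        mask[..., kH // 2, kW // 2 + (mask_type == "B"):] = 0   # center and to its right
        mask[..., kH // 2 + 1:, :] = 0                          # rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                           # re-apply the mask each step
        return super().forward(x)

conv = MaskedConv2d("A", in_channels=1, out_channels=8, kernel_size=3, padding=1)
out = conv(torch.randn(1, 1, 28, 28))                           # (1, 8, 28, 28)
```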
17. Blind spots of pixel CNN
• https://towardsdatascience.com/blind-spot-problem-in-pixelcnn-8c71592a14a
[Figure: a 5×5 feature map with pixels numbered 1–25]
A blind spot means that a pixel is generated without considering some of the previous pixels.
[Figure: feature map, 3×3 kernel, and masked kernel, as in slide 16]
When running with the masked kernels:
If the feature map is processed with zero padding, the first output a is influenced only by the
bias terms of the network (alternatively, non-zero padding could be considered).
Moving to the next position produces b, and so on across the feature map (pixels a, b, c, …).
Blind spot: the output q is generated without considering j, n, and o.
18. The blind-spot explanation
Omitting this direction creates the blind spots, and the number of blind-spot pixels increases
with the number of layers.
What about making new filters that still cover this direction?
Suppose a 3×3 kernel and 2 or 3 layers.
19. Horizontal and vertical stack
The feature map can be separated into two parts according to the rows:
1. The row containing the pixel being predicted
2. The rows that do not contain the predicted pixel
https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders
masked: this is equivalent to using a 2×3 kernel.
This mask is applied because the information in the current row will come from the horizontal stack.
20. Construct the horizontal and vertical stacks at the same time
Vertical stack + horizontal stack → new feature map.
The horizontal stack provides the final predicted output.
Pad after the convolution, then crop: without the padding and crop, this feature map would see
“future” pixels.
21. What happens when the vertical and horizontal stacks merge
[Figure: a padded feature map with pixels numbered 1–18; the vertical-stack kernel and the
horizontal-stack kernel are summed]
If we want to compute position 22, the blind spot disappears
(while we still use the same kernel shapes). A simplified sketch of the two stacks follows.
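A simplified one-layer sketch of the two stacks, assuming a k×k kernel: the vertical stack sees only rows strictly above the current pixel, the horizontal stack sees only pixels strictly to its left, and adding a projection of the vertical stack into the horizontal stack merges the two so the blind spot disappears. This is a simplification of the gated PixelCNN layer, not the exact published implementation.

```python
import torch
import torch.nn as nn

class CausalStacks(nn.Module):
    """Simplified vertical + horizontal stack for one layer."""
    def __init__(self, channels, k=3):
        super().__init__()
        # (k//2) x k kernel over the rows above, realized by extra top padding + crop
        self.v_conv = nn.Conv2d(channels, channels, (k // 2, k),
                                padding=(k // 2, k // 2))
        # 1 x (k//2) kernel over the pixels strictly to the left
        self.h_conv = nn.Conv2d(channels, channels, (1, k // 2),
                                padding=(0, k // 2))
        self.v_to_h = nn.Conv2d(channels, channels, 1)   # merge vertical into horizontal

    def forward(self, x):
        H, W = x.shape[-2:]
        v = self.v_conv(x)[..., :H, :]     # crop so row i only sees rows above i
        h = self.h_conv(x)[..., :, :W]     # crop so column j only sees columns left of j
        return v, h + self.v_to_h(v)       # merged horizontal stack: no blind spot

x = torch.randn(1, 16, 8, 8)
v, h = CausalStacks(16)(x)                 # both have the same shape as x
```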
22. The concept of conditional gated CNN
• The input is weighted and normalized by certain functions.
• Whether the gate opens can also be decided by the input.
• x is masked using the same mask as before; this affects Wf and Wg.
• The description can also be given as a one-hot encoding, h.
• x is thought of as being generated given h.
• h must be properly transformed so that its shape matches W*x.
• h can also be embedded with a neural network m, such that s = m(h).
• V*s can be computed directly without masking; the original paper uses a 1×1 kernel.
Gate mechanism (applied in both the vertical and horizontal stacks):
The feature map passes through two functions. The tanh normalizes the feature map, while the
sigmoid acts as the gate. A sketch of this gated activation follows.
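A small sketch of the gate mechanism, y = tanh(Wf∗x + Vf·s) ⊙ σ(Wg∗x + Vg·s) with s = m(h). The plain convolutions below stand in for the masked Wf and Wg of the real model, and the layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x + V_f s) * sigmoid(W_g * x + V_g s), with s = m(h)."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv_f = nn.Conv2d(channels, channels, 3, padding=1)  # stands in for masked W_f
        self.conv_g = nn.Conv2d(channels, channels, 3, padding=1)  # stands in for masked W_g
        self.cond_f = nn.Linear(cond_dim, channels)                # V_f (1x1-style projection)
        self.cond_g = nn.Linear(cond_dim, channels)                # V_g

    def forward(self, x, h):
        s_f = self.cond_f(h)[:, :, None, None]    # broadcast the conditioning over H, W
        s_g = self.cond_g(h)[:, :, None, None]
        return torch.tanh(self.conv_f(x) + s_f) * torch.sigmoid(self.conv_g(x) + s_g)

x = torch.randn(2, 16, 8, 8)
h = torch.randn(2, 10)                             # e.g. a one-hot class description
y = GatedActivation(16, 10)(x, h)                  # same shape as x
```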
23. The whole postulated architecture
[Figure: one block of the architecture, taking the vertical-stack input v and the horizontal-stack
input h and producing v′ and h′; the horizontal stack feeds the output]
26. Generative adversarial network
• The problem of the generative model
• “Structured things” seem to follow some “distribution”
• Human sentence: n + v + adj + n
• The generative model needs to learn the “distribution”
• The types of generative-model learning
• Select the most related object from a database
• Just learn the similarity => like Deep Blue
• Learn the distribution
[Figure: input → transform to vector → database → output, versus input → transform to vector →
distribution → output]
27. Generative adversarial network
• How to learn a distribution is the problem
Give a random vector to the generative model, which generates samples; these generated samples
have a certain pattern. The discriminator measures the differences between the generated samples
and the real samples, and tells the generator how to modify its distribution.
A minimal training-loop sketch follows.
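A minimal sketch of this loop (the discriminator measures the difference between real and generated samples; its feedback tells the generator how to shift its distribution). The 2-D toy "data" and network sizes are invented for illustration.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # random vector -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(64, 2) + 3.0            # stand-in for real samples with a pattern
    z = torch.randn(64, 8)                     # random vectors fed to the generator
    fake = G(z)

    # Discriminator: tell real from generated
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: move its distribution so D thinks the samples are real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```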
28. CycleGAN
• The naïve GAN generates samples depending only on the random vector
• The user cannot decide what will be generated
• We can use paired data to constrain the output
• But not every dataset has paired data
• CycleGAN solves the unpaired-data problem (a sketch of the cycle-consistency loss follows)
[Figure: generator G and discriminator D trained with paired data; image source: H.Y. Lee, NTU]
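A sketch of the constraint CycleGAN adds on top of the adversarial losses: translate X→Y→X and Y→X→Y and penalize the reconstruction error, so no paired data is required. G_x and G_y below are placeholder networks, not real image translators.

```python
import torch
import torch.nn as nn

G_y = nn.Linear(64, 64)     # maps domain X -> domain Y (placeholder for an image translator)
G_x = nn.Linear(64, 64)     # maps domain Y -> domain X
l1 = nn.L1Loss()

x = torch.randn(4, 64)      # unpaired samples from domain X
y = torch.randn(4, 64)      # unpaired samples from domain Y

# Cycle consistency: x -> G_y -> G_x should return to x, and vice versa
cycle_loss = l1(G_x(G_y(x)), x) + l1(G_y(G_x(y)), y)
# total loss = adversarial losses (from D_x, D_y) + lambda * cycle_loss
```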
29. Spatial cycle consistency is not enough
• Mode collapse is a serious problem in CycleGAN.
• In (a), the generated Obama images change in only a few pixels, yet they can reconstruct varied
Trump images.
• In (b), a similar situation occurs.
• Self-adversarial attack.
[Figure labels: the generated images are similar, while the reconstructions are different]
31. The Recycle-GAN
[Figure, CycleGAN part: Xt → Gy → Y′t → Gx → X′t with cycle consistency; discriminators Dy and Dx
each output true or false]
Video X, Video Y: the predictors Px and Py use the first t frames to predict frame t+1.
[Figure, Recycle-GAN part: Xt → Gy → Y′t, Py predicts Y′t+1, and Gx maps it to X′t+1; the cycle
consistency between Xt+1 and X′t+1 is the recycle loss. Px is maybe not needed.]
33. The guess discriminator
[Figure: the CycleGAN setup from slide 31 (Xt → Gy → Y′t → Gx → X′t, with Dy and Dx)]
https://arxiv.org/pdf/1908.01517.pdf
Dguess: which one is the reconstructed sample?
The guess discriminator receives both the input and the reconstructed sample.
At this step the input and the reconstructed sample are placed in a random order, and Dguess must
tell which of the two is the original input (a rough sketch follows).
In practice, however, simply adding noise already works quite well.
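A rough sketch of the guess-discriminator step described above: the input and the reconstructed sample are concatenated in a random order, and Dguess has to guess which slot holds the original input. The tiny networks and dimensions are placeholders.

```python
import torch
import torch.nn as nn

D_guess = nn.Sequential(nn.Linear(2 * 64, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()

x = torch.randn(4, 64)                 # input sample
x_rec = torch.randn(4, 64)             # reconstructed sample, e.g. G_x(G_y(x))

# Shuffle the order of (input, reconstruction) per example
flip = torch.rand(4) < 0.5             # True: the reconstruction goes in the first slot
first = torch.where(flip[:, None], x_rec, x)
second = torch.where(flip[:, None], x, x_rec)
pair = torch.cat([first, second], dim=1)

# D_guess predicts which slot is the real input (label 1 = first slot is the input)
label = (~flip).float()[:, None]
guess_loss = bce(D_guess(pair), label)
```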
34. How to choose the types of GAN
• “Are GANs Created Equal? A Large-Scale Study” (2017)
• https://arxiv.org/abs/1711.10337 (Google Brain)
Conclusion: after spending a lot of money on experiments, the original version is still the best.
37. Optical flow
D. Putcha et al, “Functional correlates of optic flow motion processing in Parkinson’s disease,” 2014
Optic flow stimuli illustration. Optic flow motion stimuli (A) simulate forward and backward
motion using dot fields that expand or contract while rotating about a central focus.
Random motion (B) simulates non-coherent motion using dots moving at the same speeds used
in optic flow, but with random directions of movement. In the illustrations, the length of the
arrows corresponds to dot speed, indicating that dot speed increases with distance from the
center.
https://mlatgt.blog/2018/09/06/learning-rigidity-and-scene-flow-estimation/
http://hammerbchen.blogspot.com/2011/10/art-of-optical-flow.html
41. Affine transformation
http://silverwind1982.pixnet.net/blog/post/160691705-affine-transformation
https://math.stackexchange.com/questions/884666/what-are-differences-between-affine-space-and-vector-space
Affine space
If a point can be mapped to an affine space, this mapping is an “affine transformation”.
The transformation is linear apart from the translation term.
[Figure: a point mapped from affine space 1 to affine space 2 by an affine transformation]
$$\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} X_2 \\ Y_2 \end{pmatrix} + \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$$

$$\begin{pmatrix} X_1 \\ Y_1 \\ 1 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} & B_1 \\ A_{21} & A_{22} & B_2 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_2 \\ Y_2 \\ 1 \end{pmatrix} \quad \text{(homogeneous coordinates)}$$
The A block handles scaling and the B block handles translation.
Inserting trigonometric functions also yields “rotation”:

$$\begin{pmatrix} X_1 \\ Y_1 \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta & B_1 \\ \sin\theta & \cos\theta & B_2 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_2 \\ Y_2 \\ 1 \end{pmatrix}$$
This transformation is also invertible.
Objective: can a neural network learn the properties of these affine transformations?
(A small numpy check follows.)
If Z > 1, the vector (X, Y, Z) will pass through another plane (which is not a vector subspace of
the first) and give coordinates (x, y), but the information in z is lost.
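A small numpy check of the homogeneous-coordinate form above: rotation plus translation is a single 3×3 matrix multiplication, and the transformation is invertible. The numbers are arbitrary.

```python
import numpy as np

theta = np.deg2rad(30)
B1, B2 = 2.0, -1.0

# Rotation + translation in homogeneous coordinates
A = np.array([[np.cos(theta), -np.sin(theta), B1],
              [np.sin(theta),  np.cos(theta), B2],
              [0.0,            0.0,           1.0]])

p2 = np.array([1.0, 0.0, 1.0])          # point (X2, Y2) = (1, 0)
p1 = A @ p2                             # transformed point (X1, Y1, 1)

# The transform is invertible: A^-1 maps p1 back to p2
p2_back = np.linalg.inv(A) @ p1
assert np.allclose(p2_back, p2)
```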
42. non-volume preserving
You can find more detail from Dr. Hung-yi Lee’s Youtube channel
https://www.youtube.com/watch?v=uXY18nzdSsM
Volume preserving
The “determinant” of the Jacobian indicates the “volume” change.
The characteristics:
1. The diagonal of the Jacobian matrix between the functions is 1.
2. The relation between the functions is tractable.
What does “non-volume preserving” (NVP) do?
In tasks based on the concepts of optical-flow methods, preserving the volume is not so important.
But the “tractable” characteristic is essential.
RealNVP: modify the functions while keeping the “invertible” characteristic (a coupling-layer
sketch follows).
[Figure: the Jacobian has an identity block and an “I don't care” block]
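A minimal sketch of the RealNVP idea: an affine coupling layer rescales and shifts half of the variables using a function of the other half, so the mapping stays exactly invertible and the log-determinant of the Jacobian is just the sum of the log scales. The sizes and the inner network are illustrative.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, 64), nn.ReLU(), nn.Linear(64, 2 * half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                 # x1 passes through unchanged
        log_s, t = self.net(x1).chunk(2, dim=1)    # scale and shift depend only on x1
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)                 # tractable log|det Jacobian|
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)          # exactly invertible
        return torch.cat([y1, x2], dim=1)

x = torch.randn(4, 8)
layer = AffineCoupling(8)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)
```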
48. The process of Glow
When building a flow model, keep three main principles:
1. The computations (Jacobian determinant) must be tractable.
2. Find a way to shuffle (permute) the variables.
3. If there are channels, there must be dependency between the channels (so the affine
transformation can be carried out).
Multi-scale structure (from RealNVP, 2017): Y → scale block → … → scale block → factor out → Z.
In Glow, each scale block consists of actnorm → 1×1 invertible convolution → affine coupling.
What the loss means:
1. In a flow, every point is meaningful, so the likelihood is computed at every point.
2. “Normalizing flow” means shaping the output into some target distribution, and the one we
usually use is the normal distribution. So z (and y, through the flow) is fit to the normal
distribution by maximum likelihood (MLE).
3. At each transformation we do not want too large a “volume change” (the log-determinant term).
z can follow any distribution you like, but most people choose the normal distribution.
The corresponding log-likelihood is written out below.
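The three points above correspond term by term to the change-of-variables log-likelihood that a flow maximizes. Writing $h_0 = y$, $h_i = f_i(h_{i-1})$ and $z = h_K$, and taking $p_Z$ as the chosen base distribution (usually a standard normal):

$$\log p_Y(y) = \log p_Z(z) + \sum_{i=1}^{K} \log\left|\det \frac{\partial f_i}{\partial h_{i-1}}\right|$$

The first term is the MLE fit of z to the chosen distribution; the second accumulates the log-determinants, i.e. the “volume change” of each transformation.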
49. Actnorm layer
This layer tries to scale the input into “acceptable” information for the next layer.
• “Acceptable” means the information has mean 0 and variance 1, respectively.
• The concept is similar to batch normalization (BN), but here the mean and variance are taken
directly from the input data.
• This is a special case of “data-dependent initialization”.
• The reason is presumably that BN is less tractable. A simplified sketch is given below.
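A simplified sketch of an actnorm-style layer with data-dependent initialization: on the first batch the scale and bias are set from that batch's statistics so the output starts with per-channel mean 0 and variance 1, and afterwards they are trained as ordinary parameters. This is an illustration of the idea, not the exact Glow code.

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.initialized = False

    def forward(self, x):
        if not self.initialized:
            # Data-dependent initialization from the first batch:
            # afterwards the output has per-channel mean 0 and variance 1
            with torch.no_grad():
                mean = x.mean(dim=(0, 2, 3), keepdim=True)
                std = x.std(dim=(0, 2, 3), keepdim=True) + 1e-6
                self.bias.copy_(-mean / std)
                self.scale.copy_(1.0 / std)
            self.initialized = True
        return x * self.scale + self.bias

x = torch.randn(8, 16, 4, 4) * 3.0 + 5.0
out = ActNorm(16)(x)    # roughly zero mean, unit variance per channel after initialization
```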
Data dependent initialization
https://arxiv.org/pdf/1511.06856.pdf (ICLR2016)
[Figure: a stack of layers 1 … k; several input batches are fed in and the input and output of
layer k are inspected]
Control the weights of this layer so that its output stays within the range where the activation
function is effective (it suffices to control the mean and variance of the sampled distribution).
Between-layer normalization
The earlier layers were originally assumed to be linear (affine layers), but in practice they are
not. Further adjustment is therefore still needed: feed in several different batches, estimate how
much the different samples vary, and then further adjust the mean and variance (r_k in Algorithm 2).
Within layer initialization