4. Outline
• Autoencoder
• Autoencoder and variational autoencoder
• VQVAE (VQVAE2)
• Autoregressive model
• PixelCNN
• GAN
• CycleGAN, RecycleGAN
• CycleGAN with guess discriminator
• Flow-based Model
• GLOW
5. Autoencoder
Latent variables: the variables that can express the output of the phenomenon.
Applications:
1. Denoising
2. Dimensionality reduction
3. Feature extraction
4. Segmentation
5. Energy measurement
…
7. Variational Autoencoder
Mean and variance
The “mean” and “variance” define a distribution.
If we constrain the latent variables to follow a normal distribution, we can create new latent
variables by sampling from that distribution.
The mean and variance should be as close to 0 and 1 as possible.
Strategy
1. The latent variables should follow a normal distribution.
2. Perturb the scattered discrete points with random noise, hoping to fill the regions of latent
space that have no samples (this can in fact be viewed as sampling from the normal
distribution). A minimal sketch of this strategy is shown below.
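A minimal PyTorch sketch of this strategy (the encoder outputs a mean and a log-variance, a latent is sampled with the reparameterization trick, and a KL term pushes the latent distribution toward N(0, 1)); the layer sizes and the fake batch are made up for illustration.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mean = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        # Reparameterization: sample from N(mean, var) via N(0, 1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        return self.dec(z), mean, logvar

def vae_loss(x, x_hat, mean, logvar):
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL(N(mean, var) || N(0, 1)): pushes the mean toward 0 and the variance toward 1
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon + kl

x = torch.rand(8, 784)                      # a fake batch of flattened images
model = TinyVAE()
x_hat, mean, logvar = model(x)
loss = vae_loss(x, x_hat, mean, logvar)
```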
12. The improvement of VQ
https://arxiv.org/pdf/1803.03382.pdf
The issues with VQ:
1. Only the codes selected by the encoder are updated.
2. Not all the codes in the codebook get used.
The exponential moving average (EMA) method
1. For each selected code, find all the embeddings assigned to it, compute the centroid of those
embeddings, and move the code toward that centroid. The per-code sample counts and embedding
sums are smoothed with an EMA (decay lambda = 0.999). This helps maintain the codebook.
2. Decomposing the vectors into small pieces helps use them more efficiently
=> vector slicing helps use the codes more efficiently.
A rough sketch of the EMA update is given below.
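A rough sketch of the EMA codebook update described above (the per-code counts and embedding sums are smoothed with decay 0.999, and each code moves toward the centroid of the encoder outputs assigned to it). Variable names and shapes are illustrative, not taken from any reference implementation.

```python
import torch
import torch.nn.functional as F

def ema_codebook_update(codebook, ema_count, ema_sum, z_e, decay=0.999, eps=1e-5):
    """codebook: (K, D) codes; z_e: (N, D) encoder outputs."""
    # Assign each encoder output to its nearest code
    dists = torch.cdist(z_e, codebook)                         # (N, K)
    idx = dists.argmin(dim=1)                                  # (N,)
    one_hot = F.one_hot(idx, codebook.size(0)).float()         # (N, K)

    # EMA of how many samples hit each code, and of the sum of those samples
    ema_count = decay * ema_count + (1 - decay) * one_hot.sum(dim=0)
    ema_sum = decay * ema_sum + (1 - decay) * (one_hot.t() @ z_e)   # (K, D)

    # Laplace smoothing so rarely used codes do not divide by zero
    n = ema_count.sum()
    count = (ema_count + eps) / (n + codebook.size(0) * eps) * n

    # Each code moves toward the centroid of its assigned embeddings
    codebook = ema_sum / count.unsqueeze(1)
    return codebook, ema_count, ema_sum

K, D, N = 64, 32, 256
codebook = torch.randn(K, D)
ema_count, ema_sum = torch.zeros(K), codebook.clone()
z_e = torch.randn(N, D)
codebook, ema_count, ema_sum = ema_codebook_update(codebook, ema_count, ema_sum, z_e)
```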
15. Can we use a CNN to handle the autoregressive model?
• The problem the autoregressive model faces is:
• “the model cannot get information from the future, but it can use any information from the past”.
So …
Maybe we can just make a toy setup in which the model only uses the previous information
(though this would not reflect the real causal ordering for image generation).
16. PixelCNN
[Figure: a 3×3 kernel and the same kernel with a mask applied (masked kernel)]
PixelCNN, which uses masked kernels, was proposed together with PixelRNN (which is based on LSTM).
However, PixelRNN performs better than PixelCNN.
Some attributed this to the LSTM “gates”, which let the RNN handle more complex dependencies.
A minimal sketch of a masked convolution is shown below.
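A minimal sketch of the masked-kernel idea (a type-A mask zeroes the center weight and everything after it in raster order, so each output pixel depends only on previously generated pixels); the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so the output at (i, j) never sees
    pixels at or after (i, j) in raster order (mask type 'A')."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.weight.shape[-2:]
        mask = torch.ones_like(self.weight)
        mask[..., kH // 2, kW // 2 + (mask_type == "B"):] = 0   # center and to its right
        mask[..., kH // 2 + 1:, :] = 0                          # rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                           # re-apply the mask each step
        return super().forward(x)

conv = MaskedConv2d("A", in_channels=1, out_channels=8, kernel_size=3, padding=1)
out = conv(torch.randn(1, 1, 28, 28))                           # (1, 8, 28, 28)
```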
17. Blind spots of pixel CNN
• https://towardsdatascience.com/blind-spot-problem-in-pixelcnn-8c71592a14a
[Figure: a 5×5 feature map with pixels numbered 1–25]
A blind spot means that a pixel is generated without considering some of the previous pixels.
[Figure: feature map, 3×3 kernel, and masked kernel, as in slide 16]
When running with the masked kernels:
If the feature map is processed with zero padding, the first output a is influenced only by the
bias terms of the network (alternatively, non-zero padding could be considered).
Moving to the next position produces b, and so on across the feature map (pixels a, b, c, …).
Blind spot: the output q is generated without considering j, n, and o.
18. The blind-spot explanation
Omitting this direction creates the blind spots, and the number of blind-spot pixels increases
with the number of layers.
What about making new filters that still cover this direction?
Suppose a 3×3 kernel and 2 or 3 layers.
19. Horizontal and vertical stack
The feature map can be separated into two parts according to the rows:
1. The row containing the pixel being predicted
2. The rows that do not contain the predicted pixel
https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders
masked: this is equivalent to using a 2×3 kernel.
This mask is applied because the information in the current row will come from the horizontal stack.
20. Construct the horizontal and vertical stacks at the same time
Vertical stack + horizontal stack → new feature map.
The horizontal stack provides the final predicted output.
Pad after the convolution, then crop: without the padding and crop, this feature map would see
“future” pixels.
21. What happens when the vertical and horizontal stacks merge
[Figure: a padded feature map with pixels numbered 1–18; the vertical-stack kernel and the
horizontal-stack kernel are summed]
If we want to compute position 22, the blind spot disappears
(while we still use the same kernel shapes). A simplified sketch of the two stacks follows.
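A simplified one-layer sketch of the two stacks, assuming a k×k kernel: the vertical stack sees only rows strictly above the current pixel, the horizontal stack sees only pixels strictly to its left, and adding a projection of the vertical stack into the horizontal stack merges the two so the blind spot disappears. This is a simplification of the gated PixelCNN layer, not the exact published implementation.

```python
import torch
import torch.nn as nn

class CausalStacks(nn.Module):
    """Simplified vertical + horizontal stack for one layer."""
    def __init__(self, channels, k=3):
        super().__init__()
        # (k//2) x k kernel over the rows above, realized by extra top padding + crop
        self.v_conv = nn.Conv2d(channels, channels, (k // 2, k),
                                padding=(k // 2, k // 2))
        # 1 x (k//2) kernel over the pixels strictly to the left
        self.h_conv = nn.Conv2d(channels, channels, (1, k // 2),
                                padding=(0, k // 2))
        self.v_to_h = nn.Conv2d(channels, channels, 1)   # merge vertical into horizontal

    def forward(self, x):
        H, W = x.shape[-2:]
        v = self.v_conv(x)[..., :H, :]     # crop so row i only sees rows above i
        h = self.h_conv(x)[..., :, :W]     # crop so column j only sees columns left of j
        return v, h + self.v_to_h(v)       # merged horizontal stack: no blind spot

x = torch.randn(1, 16, 8, 8)
v, h = CausalStacks(16)(x)                 # both have the same shape as x
```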
22. The concept of conditional gated CNN
• The input is weighted and normalized by certain functions.
• Whether the gate opens can also be decided by the input.
• x is masked using the same mask as before; this affects Wf and Wg.
• The description can also be given as a one-hot encoding, h.
• x is thought of as being generated given h.
• h must be properly transformed so that its shape matches W*x.
• h can also be embedded with a neural network m, such that s = m(h).
• V*s can be computed directly without masking; the original paper uses a 1×1 kernel.
Gate mechanism (applied in both the vertical and horizontal stacks):
The feature map passes through two functions. The tanh normalizes the feature map, while the
sigmoid acts as the gate. A sketch of this gated activation follows.
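A small sketch of the gate mechanism, y = tanh(Wf∗x + Vf·s) ⊙ σ(Wg∗x + Vg·s) with s = m(h). The plain convolutions below stand in for the masked Wf and Wg of the real model, and the layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x + V_f s) * sigmoid(W_g * x + V_g s), with s = m(h)."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv_f = nn.Conv2d(channels, channels, 3, padding=1)  # stands in for masked W_f
        self.conv_g = nn.Conv2d(channels, channels, 3, padding=1)  # stands in for masked W_g
        self.cond_f = nn.Linear(cond_dim, channels)                # V_f (1x1-style projection)
        self.cond_g = nn.Linear(cond_dim, channels)                # V_g

    def forward(self, x, h):
        s_f = self.cond_f(h)[:, :, None, None]    # broadcast the conditioning over H, W
        s_g = self.cond_g(h)[:, :, None, None]
        return torch.tanh(self.conv_f(x) + s_f) * torch.sigmoid(self.conv_g(x) + s_g)

x = torch.randn(2, 16, 8, 8)
h = torch.randn(2, 10)                             # e.g. a one-hot class description
y = GatedActivation(16, 10)(x, h)                  # same shape as x
```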
23. The whole postulated architecture
[Figure: one block of the architecture, taking the vertical-stack input v and the horizontal-stack
input h and producing v′ and h′; the horizontal stack feeds the output]
26. Generative adversarial network
• The problem of the generative model
• “Structured things” seem to follow some “distribution”
• Human sentence: n + v + adj + n
• The generative model needs to learn the “distribution”
• The types of generative-model learning
• Select the most related object from a database
• Just learn the similarity => like Deep Blue
• Learn the distribution
[Figure: input → transform to vector → database → output, versus input → transform to vector →
distribution → output]
27. Generative adversarial network
• How to learn a distribution is the problem
Give a random vector to the generative model, which generates samples; these generated samples
have a certain pattern. The discriminator measures the differences between the generated samples
and the real samples, and tells the generator how to modify its distribution.
A minimal training-loop sketch follows.
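A minimal sketch of this loop (the discriminator measures the difference between real and generated samples; its feedback tells the generator how to shift its distribution). The 2-D toy "data" and network sizes are invented for illustration.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # random vector -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(64, 2) + 3.0            # stand-in for real samples with a pattern
    z = torch.randn(64, 8)                     # random vectors fed to the generator
    fake = G(z)

    # Discriminator: tell real from generated
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: move its distribution so D thinks the samples are real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```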
28. CycleGAN
• The naïve GAN generates samples depending only on the random vector
• The user cannot decide what will be generated
• We can use paired data to constrain the output
• But not every dataset has paired data
• CycleGAN solves the unpaired-data problem (a sketch of the cycle-consistency loss follows)
[Figure: generator G and discriminator D trained with paired data; image source: H.Y. Lee, NTU]
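A sketch of the constraint CycleGAN adds on top of the adversarial losses: translate X→Y→X and Y→X→Y and penalize the reconstruction error, so no paired data is required. G_x and G_y below are placeholder networks, not real image translators.

```python
import torch
import torch.nn as nn

G_y = nn.Linear(64, 64)     # maps domain X -> domain Y (placeholder for an image translator)
G_x = nn.Linear(64, 64)     # maps domain Y -> domain X
l1 = nn.L1Loss()

x = torch.randn(4, 64)      # unpaired samples from domain X
y = torch.randn(4, 64)      # unpaired samples from domain Y

# Cycle consistency: x -> G_y -> G_x should return to x, and vice versa
cycle_loss = l1(G_x(G_y(x)), x) + l1(G_y(G_x(y)), y)
# total loss = adversarial losses (from D_x, D_y) + lambda * cycle_loss
```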
29. Spatial cycle consistency is not enough
• Mode collapse is a serious problem in CycleGAN.
• In (a), the generated Obama images change in only a few pixels, yet they can reconstruct varied
Trump images.
• In (b), a similar situation occurs.
• Self-adversarial attack.
[Figure labels: the generated images are similar, while the reconstructions are different]
31. The Recycle-GAN
[Figure, CycleGAN part: Xt → Gy → Y′t → Gx → X′t with cycle consistency; discriminators Dy and Dx
each output true or false]
Video X, Video Y: the predictors Px and Py use the first t frames to predict frame t+1.
[Figure, Recycle-GAN part: Xt → Gy → Y′t, Py predicts Y′t+1, and Gx maps it to X′t+1; the cycle
consistency between Xt+1 and X′t+1 is the recycle loss. Px is maybe not needed.]
33. The guess discriminator
[Figure: the CycleGAN setup from slide 31 (Xt → Gy → Y′t → Gx → X′t, with Dy and Dx)]
https://arxiv.org/pdf/1908.01517.pdf
Dguess: which one is the reconstructed sample?
The guess discriminator receives both the input and the reconstructed sample.
At this step the input and the reconstructed sample are placed in a random order, and Dguess must
tell which of the two is the original input (a rough sketch follows).
In practice, however, simply adding noise already works quite well.
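A rough sketch of the guess-discriminator step described above: the input and the reconstructed sample are concatenated in a random order, and Dguess has to guess which slot holds the original input. The tiny networks and dimensions are placeholders.

```python
import torch
import torch.nn as nn

D_guess = nn.Sequential(nn.Linear(2 * 64, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()

x = torch.randn(4, 64)                 # input sample
x_rec = torch.randn(4, 64)             # reconstructed sample, e.g. G_x(G_y(x))

# Shuffle the order of (input, reconstruction) per example
flip = torch.rand(4) < 0.5             # True: the reconstruction goes in the first slot
first = torch.where(flip[:, None], x_rec, x)
second = torch.where(flip[:, None], x, x_rec)
pair = torch.cat([first, second], dim=1)

# D_guess predicts which slot is the real input (label 1 = first slot is the input)
label = (~flip).float()[:, None]
guess_loss = bce(D_guess(pair), label)
```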
34. How to choose the types of GAN
• “Are GANs Created Equal? A Large-Scale Study” (2017)
• https://arxiv.org/abs/1711.10337 (Google Brain)
Conclusion: after spending a lot of money on experiments, the original version is still the best.
37. Optical flow
D. Putcha et al, “Functional correlates of optic flow motion processing in Parkinson’s disease,” 2014
Optic flow stimuli illustration. Optic flow motion stimuli (A) simulate forward and backward
motion using dot fields that expand or contract while rotating about a central focus.
Random motion (B) simulates non-coherent motion using dots moving at the same speeds used
in optic flow, but with random directions of movement. In the illustrations, the length of the
arrows corresponds to dot speed, indicating that dot speed increases with distance from the
center.
https://mlatgt.blog/2018/09/06/learning-rigidity-and-scene-flow-estimation/
http://hammerbchen.blogspot.com/2011/10/art-of-optical-flow.html
41. Affine transformation
http://silverwind1982.pixnet.net/blog/post/160691705-affine-transformation
https://math.stackexchange.com/questions/884666/what-are-differences-between-affine-space-and-vector-space
Affine space
If a point can be mapped to an affine space, this mapping is an “affine transformation”.
The transformation is linear apart from the translation term.
[Figure: a point mapped from affine space 1 to affine space 2 by an affine transformation]
$$\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} X_2 \\ Y_2 \end{pmatrix} + \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$$

$$\begin{pmatrix} X_1 \\ Y_1 \\ 1 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} & B_1 \\ A_{21} & A_{22} & B_2 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_2 \\ Y_2 \\ 1 \end{pmatrix} \quad \text{(homogeneous coordinates)}$$
The A block handles scaling and the B block handles translation.
Inserting trigonometric functions also yields “rotation”:

$$\begin{pmatrix} X_1 \\ Y_1 \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta & B_1 \\ \sin\theta & \cos\theta & B_2 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_2 \\ Y_2 \\ 1 \end{pmatrix}$$
This transformation is also invertible.
Objective: can a neural network learn the properties of these affine transformations?
(A small numpy check follows.)
If Z > 1, the vector (X, Y, Z) will pass through another plane (which is not a vector subspace of
the first) and give coordinates (x, y), but the information in z is lost.
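A small numpy check of the homogeneous-coordinate form above: rotation plus translation is a single 3×3 matrix multiplication, and the transformation is invertible. The numbers are arbitrary.

```python
import numpy as np

theta = np.deg2rad(30)
B1, B2 = 2.0, -1.0

# Rotation + translation in homogeneous coordinates
A = np.array([[np.cos(theta), -np.sin(theta), B1],
              [np.sin(theta),  np.cos(theta), B2],
              [0.0,            0.0,           1.0]])

p2 = np.array([1.0, 0.0, 1.0])          # point (X2, Y2) = (1, 0)
p1 = A @ p2                             # transformed point (X1, Y1, 1)

# The transform is invertible: A^-1 maps p1 back to p2
p2_back = np.linalg.inv(A) @ p1
assert np.allclose(p2_back, p2)
```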
42. non-volume preserving
You can find more detail from Dr. Hung-yi Lee’s Youtube channel
https://www.youtube.com/watch?v=uXY18nzdSsM
Volume preserving
The “determinant” of the Jacobian indicates the “volume” change.
The characteristics:
1. The diagonal of the Jacobian matrix between the functions is 1.
2. The relation between the functions is tractable.
What does “non-volume preserving” (NVP) do?
In tasks based on the concepts of optical-flow methods, preserving the volume is not so important.
But the “tractable” characteristic is essential.
RealNVP: modify the functions while keeping the “invertible” characteristic (a coupling-layer
sketch follows).
[Figure: the Jacobian has an identity block and an “I don't care” block]
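A minimal sketch of the RealNVP idea: an affine coupling layer rescales and shifts half of the variables using a function of the other half, so the mapping stays exactly invertible and the log-determinant of the Jacobian is just the sum of the log scales. The sizes and the inner network are illustrative.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, 64), nn.ReLU(), nn.Linear(64, 2 * half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                 # x1 passes through unchanged
        log_s, t = self.net(x1).chunk(2, dim=1)    # scale and shift depend only on x1
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)                 # tractable log|det Jacobian|
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)          # exactly invertible
        return torch.cat([y1, x2], dim=1)

x = torch.randn(4, 8)
layer = AffineCoupling(8)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)
```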
48. The process of Glow
When building a flow model, keep three main principles:
1. The computations (Jacobian determinant) must be tractable.
2. Find a way to shuffle (permute) the variables.
3. If there are channels, there must be dependency between the channels (so the affine
transformation can be carried out).
Multi-scale structure (from RealNVP, 2017): Y → scale block → … → scale block → factor out → Z.
In Glow, each scale block consists of actnorm → 1×1 invertible convolution → affine coupling.
What the loss means:
1. In a flow, every point is meaningful, so the likelihood is computed at every point.
2. “Normalizing flow” means shaping the output into some target distribution, and the one we
usually use is the normal distribution. So z (and y, through the flow) is fit to the normal
distribution by maximum likelihood (MLE).
3. At each transformation we do not want too large a “volume change” (the log-determinant term).
z can follow any distribution you like, but most people choose the normal distribution.
The corresponding log-likelihood is written out below.
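The three points above correspond term by term to the change-of-variables log-likelihood that a flow maximizes. Writing $h_0 = y$, $h_i = f_i(h_{i-1})$ and $z = h_K$, and taking $p_Z$ as the chosen base distribution (usually a standard normal):

$$\log p_Y(y) = \log p_Z(z) + \sum_{i=1}^{K} \log\left|\det \frac{\partial f_i}{\partial h_{i-1}}\right|$$

The first term is the MLE fit of z to the chosen distribution; the second accumulates the log-determinants, i.e. the “volume change” of each transformation.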
49. Actnorm layer
This layer tries to scale the input into “acceptable” information for the next layer.
• “Acceptable” means the information has mean 0 and variance 1, respectively.
• The concept is similar to batch normalization (BN), but here the mean and variance are taken
directly from the input data.
• This is a special case of “data-dependent initialization”.
• The reason is presumably that BN is less tractable. A simplified sketch is given below.
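A simplified sketch of an actnorm-style layer with data-dependent initialization: on the first batch the scale and bias are set from that batch's statistics so the output starts with per-channel mean 0 and variance 1, and afterwards they are trained as ordinary parameters. This is an illustration of the idea, not the exact Glow code.

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.initialized = False

    def forward(self, x):
        if not self.initialized:
            # Data-dependent initialization from the first batch:
            # afterwards the output has per-channel mean 0 and variance 1
            with torch.no_grad():
                mean = x.mean(dim=(0, 2, 3), keepdim=True)
                std = x.std(dim=(0, 2, 3), keepdim=True) + 1e-6
                self.bias.copy_(-mean / std)
                self.scale.copy_(1.0 / std)
            self.initialized = True
        return x * self.scale + self.bias

x = torch.randn(8, 16, 4, 4) * 3.0 + 5.0
out = ActNorm(16)(x)    # roughly zero mean, unit variance per channel after initialization
```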
Data dependent initialization
https://arxiv.org/pdf/1511.06856.pdf (ICLR2016)
[Figure: a stack of layers 1 … k; several input batches are fed in and the input and output of
layer k are inspected]
Control the weights of this layer so that its output stays within the range where the activation
function is effective (it suffices to control the mean and variance of the sampled distribution).
Between-layer normalization
The earlier layers were originally assumed to be linear (affine layers), but in practice they are
not. Further adjustment is therefore still needed: feed in several different batches, estimate how
much the different samples vary, and then further adjust the mean and variance (r_k in Algorithm 2).
Within layer initialization