20190118 auto encoder-explanation-allen_lee

Auto Encoder
How It Works
Allen Lee
Slides:
https://docs.google.com/presentation/d/1ARiXGO7hZprrIglbuXf2yI4dSbnJHmsXoqmSoPyQORM/
Sample code:
https://colab.research.google.com/github/allenyllee/ML_exercise/blob/master/Autoencoder/Autoencoder(tensorflow).ipynb

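● The encoder and the decoder are combined into a single neural network
● The encoder compresses the data; the decoder reconstructs it
● Training minimizes the MSE between the reconstructed data and the original data
● The encoder produces a lower-dimensional code, which can serve as a compact representation of the original data
● PCA is essentially an autoencoder with a single hidden layer (remove the nonlinear activation function and reuse the same weights for the encoder and the decoder)
Auto Encoder architecture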
ML Lecture 16: Unsupervised Learning - Auto-encoder - YouTube
https://www.youtube.com/watch?v=Tk5B4seA-AU
Hung-yi Lee
http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html
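
The PCA remark above can be illustrated with a minimal NumPy sketch. It assumes a linear encoder and decoder that share one weight matrix; the toy data and the names `encode`, `decode`, `W_pca` and `W_rand` are illustrative and do not come from the slides or the linked notebook. With the top-k principal directions as tied weights, the reconstruction MSE is no larger than with any other orthonormal tied weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 5 dimensions, centered as PCA requires.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

def encode(X, W):
    """Linear encoder: project the data onto k directions (no activation)."""
    return X @ W

def decode(code, W):
    """Linear decoder that reuses (ties) the encoder weights via W.T."""
    return code @ W.T

def mse(X, X_hat):
    return float(np.mean((X - X_hat) ** 2))

k = 2
# PCA solution: the top-k right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_pca = Vt[:k].T                                   # shape (5, k)

# Some other orthonormal tied weights, for comparison.
W_rand, _ = np.linalg.qr(rng.normal(size=(5, k)))  # shape (5, k)

mse_pca = mse(X, decode(encode(X, W_pca), W_pca))
mse_rand = mse(X, decode(encode(X, W_rand), W_rand))
print(f"PCA weights: {mse_pca:.4f}  random weights: {mse_rand:.4f}")
```

A trained autoencoder would reach such weights by gradient descent on the MSE; the SVD is used here only to keep the sketch short.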

Loss function
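● For binary inputs or inputs between 0 and 1, use cross entropy,
and put a sigmoid on the final output so that the output is constrained to the range 0~1
● For real-valued inputs, use the squared difference (squared distance),
and use a linear activation function on the final output (in effect, no activation) so that the output is not constrained to the range 0~1
● For other input types (e.g. integers), the decoder output can be treated as the mean μ of a Gaussian distribution, from which an integer is sampled as the final output and used to compute the squared distance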
Hugo Larochelle
http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
Neural networks [6.1] : Autoencoder - definition
https://youtu.be/FzS3tMl4Nsc
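
The first two cases above can be sketched as follows. This is an illustrative NumPy version, not the code from the linked notebook; the function names and the small `eps` used to avoid `log(0)` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(x, logits, eps=1e-12):
    """For inputs in [0, 1]: a sigmoid squashes the output into (0, 1)."""
    x_hat = sigmoid(logits)
    return float(-np.mean(x * np.log(x_hat + eps)
                          + (1.0 - x) * np.log(1.0 - x_hat + eps)))

def squared_error_loss(x, outputs):
    """For real-valued inputs: the output is linear, so it is used as-is."""
    return float(np.mean((x - outputs) ** 2))

x_binary = np.array([0.0, 1.0, 1.0])
good_logits = np.array([-5.0, 5.0, 5.0])   # sigmoid output close to x_binary
bad_logits = np.array([5.0, -5.0, -5.0])   # sigmoid output far from x_binary
print(cross_entropy_loss(x_binary, good_logits),
      cross_entropy_loss(x_binary, bad_logits))
```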

Deep Auto Encoder
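● The architecture is the same, only with more layers
● The encoder and the decoder do not have to be symmetric (symmetry can also be enforced by reusing the parameters of each layer, which halves the number of parameters)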
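
A minimal sketch of the parameter reuse mentioned above, assuming a symmetric deep autoencoder whose decoder applies the transposed encoder matrices. The layer sizes, the tanh activation and the omission of biases are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [784, 256, 30]   # hypothetical sizes: input -> hidden -> code

# Only the encoder matrices are stored; the decoder reuses their transposes.
weights = [rng.normal(scale=0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward_tied(x, weights):
    h = x
    for W in weights:              # encoder: 784 -> 256 -> 30
        h = np.tanh(h @ W)
    code = h
    for W in reversed(weights):    # decoder: 30 -> 256 -> 784
        h = np.tanh(h @ W.T)
    return code, h

n_tied = sum(W.size for W in weights)
n_untied = 2 * n_tied              # separate decoder matrices of the same shapes
code, x_hat = forward_tied(rng.normal(size=(4, 784)), weights)
print(code.shape, x_hat.shape, n_tied, n_untied)
```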

Deep Auto Encoder
● A Deep Auto Encoder gives better reconstructions
● The code produced by a Deep Auto Encoder's dimensionality reduction is more representative

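Text Retrieval
● Convert a document into a bag-of-words (BoW) vector, then feed it to an auto encoder to obtain a lower-dimensional code
● The codes of documents on similar topics cluster together in the code space, so vector operations on the codes measure semantic similarity better
● LSA cannot produce comparable results (LSA is a plain SVD matrix factorization, unlike LDA)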
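
The first step above and the similarity computation can be sketched as follows. The three toy documents and the helper names are illustrative. In the setup the slide describes, the cosine similarity would be computed on the auto encoder codes; it is applied here to the raw BoW vectors only to keep the sketch self-contained.

```python
from collections import Counter

import numpy as np

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

docs = ["the cat chased the mouse",
        "a cat and a mouse",
        "stocks fell on monday"]
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [bow_vector(d, vocab) for d in docs]

sim_same_topic = cosine_similarity(vectors[0], vectors[1])
sim_other_topic = cosine_similarity(vectors[0], vectors[2])
print(sim_same_topic, sim_other_topic)
```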

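Word2Vec Word Vectors
● Word2Vec is a three-layer neural network whose hidden layer has fewer neurons, which is what achieves the compression
● What we want is the transformation matrix from the input to the hidden layer
● The workflow is shown in the figure below:
● Training:
○ The input is the word to be compressed, the output target is a word that appears near it, and a softmax outputs the probabilities
○ As in the figure below, "dog" is followed by "run", so we want the output probability of "run" to be as large as possible (subsampling is used to reduce the influence of frequent words)
○ For unrelated words such as "fly", we want the probability to be as close to 0 as possible (negative sampling is used to speed up training)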
類神經網路 -- word2vec (part 1 : Overview) « MARK CHANG'S BLOG
http://cpmarkchang.logdown.com/posts/773062-neural-network-word2vec-part-1-overview
[1301.3781] Efficient Estimation of Word Representations in Vector Space
https://arxiv.org/abs/1301.3781
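
A minimal sketch of the negative sampling objective for one training pair, assuming the common skip-gram formulation in which the full softmax is replaced by a few sigmoid terms. The tiny vocabulary, the 2-dimensional vectors and the name `sgns_loss` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_loss(W_in, W_out, center, context, negatives):
    """Loss for one (center, context) pair with sampled negative words."""
    h = W_in[center]                                   # hidden layer = one row of W_in
    positive = -np.log(sigmoid(W_out[context] @ h))    # push the true context up
    negative = -np.sum(np.log(sigmoid(-(W_out[negatives] @ h))))  # push negatives down
    return float(positive + negative)

vocab = {"dog": 0, "run": 1, "fly": 2}
W_in = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])    # input -> hidden matrix
W_out = np.array([[0.0, 0.0], [3.0, 0.0], [-3.0, 0.0]])   # hidden -> output matrix

# "run" follows "dog"; "fly" is an unrelated (negative) word.
loss_good = sgns_loss(W_in, W_out, vocab["dog"], vocab["run"], [vocab["fly"]])
loss_bad = sgns_loss(W_in, W_out, vocab["dog"], vocab["fly"], [vocab["run"]])
print(loss_good, loss_bad)
```

After training, the rows of `W_in` are the word vectors.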

Doc2Vec
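● Doc2Vec adds a Doc_id on top of Word2Vec; the training method is the same, predicting the words in the context
● When a new document comes in, training is simply run once more and the hidden-layer values are read out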
How does doc2vec represent feature vector of a document? Can anyone explain mathematically how the process is done? - Quora
https://www.quora.com/How-does-doc2vec-represent-feature-vector-of-a-document-Can-anyone-explain-mathematically-how-the-process-is-done
[Distributed representations of sentences and documents | the morning paper]
(https://blog.acolyer.org/2016/06/01/distributed-representations-of-sentences-and-documents/)

Sparse Auto Encoder
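● When training an ordinary auto encoder, most neurons in the hidden layer turn out to be activated, meaning that each neuron responds to most inputs. What we want instead is for each neuron to respond only to specific inputs: for any given input, only a few relevant neurons are activated and all the others stay close to 0 (sparse). The average neuron activation should therefore be small.
● Add a KL divergence term to the loss function to constrain the average neuron output, together with L2 regularization to keep the model weights small so that the model does not become too complex. (Note: L2 regularization was observed to make the results much worse.)
● The figure on the left shows, for each of the 100 neurons, the image that activates it most strongly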
Sparse autoencoder
https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf
Tensorflow Day17 Sparse Autoencoder - iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天
https://ithelp.ithome.com.tw/articles/10188255
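
The KL divergence sparsity term can be sketched as below, following the formulation in the Stanford CS294A notes cited above: each hidden unit's average activation is compared with a small target value. The target `rho=0.05`, the clipping constant and the function name are illustrative assumptions.

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho=0.05):
    """Sum over hidden units of KL(rho || rho_hat_j).

    hidden: array of shape (batch, units) with activations in (0, 1).
    rho_hat_j is the average activation of unit j over the batch.
    """
    rho_hat = np.clip(hidden.mean(axis=0), 1e-8, 1.0 - 1e-8)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return float(np.sum(kl))

sparse_acts = np.full((10, 4), 0.05)   # average activation equals the target
dense_acts = np.full((10, 4), 0.5)     # most units are strongly active
print(kl_sparsity_penalty(sparse_acts), kl_sparsity_penalty(dense_acts))
```

The penalty would be added to the reconstruction loss with a weighting factor chosen by the user.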

Sparse Auto Encoder
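● The figure below shows handwritten digits encoded as 30-dimensional vectors, with a bar chart of the value in each dimension
● Row 1: handwritten digits; row 2: not sparse; row 3: sparse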

Sparse Auto Encoder
● The figure on the right shows the result of decoding each dimension of the sparse code

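What is KL divergence
● First, recall the definition of entropy:
$$H(A) = -\sum_i p_A(v_i) \log p_A(v_i)$$
● The definition of KL divergence:
$$D_{KL}(A \| B) = \sum_i p_A(v_i) \log p_A(v_i) - \sum_i p_A(v_i) \log p_B(v_i)$$
The first term is the negative entropy of A, and the second term is the expected code length (cross entropy) when B is used to encode A. KL divergence is also called relative entropy or information gain.
● KL divergence is commonly used to measure the gap between two distributions. For example, if we want the model distribution P(model) to match the data distribution P(D), the goal is to minimize the KL divergence:
$$D_{KL}(P(D) \| P(model))$$
● Since the dataset is usually fixed, the entropy of P(D) can be treated as a constant and only the second term has to be minimized, so we define the cross entropy:
$$H(A, B) = -\sum_i p_A(v_i) \log p_B(v_i)$$
● Cross entropy is the loss function most commonly used for classification problems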
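
The relationship between entropy, cross entropy and KL divergence can be checked numerically. This is a minimal sketch that assumes discrete distributions with strictly positive probabilities; the two example distributions are made up for illustration.

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # e.g. the data distribution
q = np.array([0.5, 0.3, 0.2])   # e.g. the model distribution

# KL divergence = cross entropy - entropy, so with p fixed,
# minimizing the cross entropy also minimizes the KL divergence.
print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```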

Image Search
● Comparing images directly in pixel space to find similar ones does not find semantically similar images

Image Search
● Comparing the codes obtained from an auto encoder does find semantically similar images, and requires no labels at all

Pre-training DNN
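● Used before the actual training of a network to find a better set of initial parameters
● If the dimension of the hidden layer is larger than that of the input, a regularization term (L1 or L2 regularization) must be added to make the layer sparse, otherwise the network learns nothing
● (Empirically, a larger first hidden layer works better because it can extract more useful features)
● With a large amount of unlabeled data but only a small amount of labeled data, first train the parameters of the early layers on the unlabeled data, then fine-tune with the labeled data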

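How to train a larger hidden layer
● We usually want the first hidden layer to be larger than the input, so that the information is not compressed and more features can be extracted.
● However, a larger hidden layer can end up learning nothing (the information is simply copied from input to output). The following methods avoid this:
○ Regularize the weights (L1 or L2 regularization) so that the hidden layer becomes sparse
○ De-noising autoencoder
○ Contractive autoencoder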
Hugo Larochelle
Neural networks [6.5] : Autoencoder - undercomplete vs. overcomplete hidden layer
https://youtu.be/5rLgoM2Pkso

De-noising Auto Encoder (DAE)
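● Noise is added to the original data before it is fed to the auto encoder, which must reconstruct the original. The network thus learns not only to reconstruct but also to denoise, which makes the result more robust
● Common kinds of noise:
○ Randomly setting values to zero (punch holes, then fill them back in)
○ Noise with mean 0 and standard deviation 0.5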
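
The two kinds of noise listed above can be sketched as follows. The second one is assumed to be additive Gaussian noise, and the drop probability of 0.3 is an illustrative value that does not come from the slides.

```python
import numpy as np

def mask_noise(x, drop_prob, rng):
    """Randomly set entries to zero; the DAE must fill the holes back in."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def gaussian_noise(x, std, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return x + rng.normal(0.0, std, size=x.shape)

rng = np.random.default_rng(0)
x = np.ones((1000, 20))                  # stand-in for a batch of clean inputs
x_masked = mask_noise(x, drop_prob=0.3, rng=rng)
x_noisy = gaussian_noise(x, std=0.5, rng=rng)

# Training pairs are (corrupted input, clean target): the loss compares
# the reconstruction of x_masked or x_noisy against the clean x.
print(x_masked.mean(), x_noisy.std())
```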

De-noising Auto Encoder (DAE)
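● Assume the input data lie on a manifold in a high-dimensional space
● Adding noise to the data is equivalent to sampling near the manifold
● The DAE maps the sampled data near the manifold back onto the manifold
● Noisy data that are too far away, however, still cannot be mapped onto the manifold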

Contractive Auto Encoder(CAE)
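● Similar to the DAE, but the goal becomes making the encoded code insensitive to small changes in the input
● The quantity being minimized is the change of the hidden units with respect to the input, which is measured by the Jacobian
● It uses the gradient at the samples themselves and needs no added noise, so it stays closer to the original distribution and generalizes better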
[1104.4153] Learning invariant features through local space contraction
https://arxiv.org/abs/1104.4153
[深度学习]Contractive Autoencoder - 落痕月极的博客- CSDN博客
https://blog.csdn.net/LuohenYJ/article/details/78394060
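
A minimal sketch of the Jacobian term, assuming a single-layer sigmoid encoder h = sigmoid(Wx + b), for which the squared Frobenius norm of the Jacobian has a closed form. The function names and shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for a sigmoid encoder.

    The Jacobian is diag(h * (1 - h)) @ W, so its squared norm is
    sum_j (h_j * (1 - h_j))**2 * sum_i W[j, i]**2.
    """
    h = encode(x, W, b)
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 4 inputs -> 3 hidden units
b = rng.normal(size=3)
x = rng.normal(size=4)
print(contractive_penalty(x, W, b))
```

In training, this penalty would be added to the reconstruction error with a user-chosen weight.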

Contractive Auto Encoder(CAE)
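● In the figure, the middle image is a 2 and the images on either side are slightly rotated 2s; on the high-dimensional manifold, these three data points lie along one direction (yellow arrow)
● When the change in the hidden units lies along the yellow arrow, the variation in this direction is necessary for reconstruction, so minimizing the reconstruction error keeps the gradient in this direction
● When the change in the hidden units lies along the blue arrow, the variation in this direction contributes little to reconstruction, so minimizing the Jacobian discards the gradient in this direction
Hugo Larochelle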
Neural networks [6.7] : Autoencoder - contractive autoencoder
https://youtu.be/79sYlJ8Cvlc

Auto Encoder for CNN
● A CNN uses convolution and pooling operations, so the auto encoder needs the corresponding operations: deconvolution and unpooling

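● Ways to do unpooling:
1. Remember the positions that pooling took its values from, restore the values to those positions when unpooling, and fill everything else with 0
2. Simply copy each value N times, without remembering the earlier positions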
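
Both unpooling variants can be sketched for 2x2 max pooling on a single 2-D feature map. This is an illustrative NumPy version that assumes even height and width; the function names are made up for this sketch.

```python
import numpy as np

def max_pool_2x2(x):
    """Return the pooled map and the position (0..3) of each maximum."""
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    return blocks.max(axis=2), blocks.argmax(axis=2)

def unpool_with_positions(pooled, argmax):
    """Variant 1: put each value back where it came from, zeros elsewhere."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, argmax] = pooled
    return (blocks.reshape(ph, pw, 2, 2)
                  .transpose(0, 2, 1, 3)
                  .reshape(ph * 2, pw * 2))

def unpool_by_copy(pooled):
    """Variant 2: copy each value into its whole 2x2 block."""
    return np.kron(pooled, np.ones((2, 2), dtype=pooled.dtype))

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 5.],
              [0., 0., 7., 0.],
              [6., 0., 0., 0.]])
pooled, argmax = max_pool_2x2(x)
print(unpool_with_positions(pooled, argmax))
print(unpool_by_copy(pooled))
```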

● Deconvolution is in fact just convolution
● Reverse the original convolution weights and add padding, and it becomes the reverse operation
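
A 1-D sketch of this statement, assuming "deconvolution" means the transposed convolution used in convolutional auto encoders. It restores the input shape, not the input values. The helper names are illustrative.

```python
import numpy as np

def conv_matrix(w, n):
    """Matrix form of a 'valid' sliding-window filter over n inputs."""
    k = len(w)
    C = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        C[i, i:i + k] = w
    return C

def deconv(y, w):
    """Transposed convolution: zero-pad y, then slide the reversed kernel."""
    k = len(w)
    padded = np.pad(y, k - 1)                  # k - 1 zeros on each side
    return np.correlate(padded, w[::-1], mode="valid")

w = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.0, -1.0, 2.0, 0.5, 3.0])
C = conv_matrix(w, len(x))

y = C @ x                  # forward pass: length 6 -> length 4
x_shape_back = deconv(y, w)  # backward-shaped pass: length 4 -> length 6
print(y, x_shape_back)
```

Multiplying by `C.T` and running an ordinary sliding-window filter with the reversed kernel on the padded signal give the same result, which is the sense in which deconvolution is just convolution.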

Generate image by decoder
● Pick a region in the code space and sample from it, then feed the samples to the decoder to generate images

Generate image by decoder
● With L2 regularization added, the resulting codes are close to 0, so it is convenient to sample around 0

VAE (Variational Auto-Encoder)
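● The problem with an AE is that its decoder D is only guaranteed to restore x from the code generated from x; for an arbitrary code, there is no guarantee that it produces a meaningful x
● A VAE makes the codes generated by E follow a specified distribution, e.g. a standard normal distribution (mean 0, standard deviation 1), so that D restores x from samples of this distribution
○ E outputs two vectors: a mean vector and a standard deviation vector
○ With the mean and the standard deviation, a Gaussian distribution can be constructed and sampled from
○ The sampled latent vector is fed into D; training makes the generated image as close as possible to the original image and the distribution of the latent vector as close as possible to the standard normal distribution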
2014, [[1312.6114] Auto-Encoding Variational Bayes]
(https://arxiv.org/abs/1312.6114)
[Variational Autoencoders Explained]
(http://kvfrans.com/variational-autoencoders-explained/)
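
The three sub-steps above can be sketched as follows, assuming a diagonal Gaussian and the closed-form KL divergence to a standard normal. The encoder and decoder networks are left out; `mu`, `sigma`, `x` and `x_hat` are stand-ins for their outputs, and the squared-error reconstruction term is one common choice, not necessarily the one used in the linked notebook.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Sample z from N(mu, sigma^2) using z = mu + sigma * noise."""
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return float(0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0))

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction error plus the KL term that shapes the code distribution."""
    return float(np.sum((x - x_hat) ** 2)) + kl_to_standard_normal(mu, sigma)

rng = np.random.default_rng(0)
mu = np.full(10000, 3.0)       # stand-in for the encoder's mean vector
sigma = np.full(10000, 2.0)    # stand-in for the encoder's std vector
z = sample_latent(mu, sigma, rng)
print(z.mean(), z.std())
```

To generate a new image after training, one would sample z from the standard normal distribution and pass it to the decoder.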

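Auto Encoder compared with GAN
● The optimization objective of an AE is to make x and D(E(x)) as close as possible at the pixel level, which results in images that are fairly uniform but blurry
● In the figure below, the left side shows images generated by an AE and the right side shows images generated by a GAN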
[漫談生成模型，從AE到CVAE-GAN - 幫趣]
(http://bangqu.com/7yE1f6.html#utm_source=Facebook_PicSee&utm_medium=Social)

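Relationship between GAN and VAE
● Unlike the VAE idea of reconstructing x, the idea of a GAN is to use a discriminator D to judge whether the x' produced by G is real or fake
○ Because D judges individual images, there is no averaging effect on the details, so the images produced by a GAN are sharper in detail but prone to obvious errors at the global level
● The top-right figure shows the architecture of a class-conditional VAE, which is additionally given the class information c to encode or generate the corresponding class
● The middle-right figure shows a class-conditional GAN, which is additionally given the class information c for generation and discrimination
● The bottom-right figure shows the generated results: the VAE results are blurrier but look normal, while the GAN results are sharper but look odd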

CVAE-GAN
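● CVAE-GAN combines the ideas of VAE and GAN
○ For the z generated from x, G should be able to reconstruct an x' that is close to x at the pixel level; this is the VAE idea
○ The x' generated by G should be judged a real image by D; this is the GAN idea
○ The x' generated by G should be classified as class c by C
● The bottom-right figure shows a face-swapping effect, produced by keeping z fixed and changing c: z retains the semantic content of the facial expression, while c represents the faces of different celebrities.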
2017, [[1703.10155v1] CVAE-GAN: Fine-Grained Image Generation through Asymmetric Training]
(https://arxiv.org/abs/1703.10155v1)

VaDE
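● Very similar to a VAE, except that K classes are fixed in advance and K sets of mean and variance are produced, one per class; training maximizes the ELBO
● The first term of the ELBO is the reconstruction term; the second term is the KL divergence between the prior distribution and the posterior distribution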
Variational Deep Embedding:An Unsupervised and Generative Approach to Clustering
https://arxiv.org/pdf/1611.05148.pdf

VaDE
● Compared with VAE and other methods, it achieves very good clustering, with fewer misassigned samples

VaDE
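● Taking the clustering of handwritten digits as an example: if the data are split into 7 or 14 clusters instead of 10, then, as the figure below shows, digits with similar shapes are put into the same cluster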
