From VAE toward a New Theory of Deep Learning
杜岳華
Deep Learning is a kind of Representational Learning
picture source (https://www.deeplearningbook.org/)
Representational Learning
[figure: a feature extractor f maps inputs x1…x5 to features z1…z4, and a classifier g maps the features to the label "woman"]
Representational Learning
[figure: example image classified as "woman"]
Representation Learning: A Review and New Perspectives (https://arxiv.org/abs/1206.5538)
Autoencoder
[figure: an encoder maps inputs x1…x5 to latent codes z1, z2; a decoder reconstructs x1…x5 from z1, z2]
Restricted Boltzmann Machines
An unsupervised greedy way to extract features
Invented by:
Smolensky, Paul (1986). Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory.
Applications:
Dimensionality reduction: Hinton, G. E.; Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science.
Classification: Larochelle, H.; Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. ICML '08.
Collaborative filtering: Salakhutdinov, R.; Mnih, A.; Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. ICML '07.
Feature learning: Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). An analysis of single-layer networks in unsupervised feature learning. International Conference on Artificial Intelligence and Statistics (AISTATS).
Restricted Boltzmann Machines
[figure: bipartite graph between visible units x1…x5 and hidden units z1, z2]
A Beginner's Guide to Restricted Boltzmann Machines (RBMs) (https://skymind.ai/wiki/restricted-boltzmann-machine)
Deep Belief Network [Hinton]
A greedy layerwise unsupervised pre-training method
[figure: stacked RBMs with weight matrices W1, W2]
Deep Belief Network
[figure: the stacked network unrolled into an autoencoder using W1, W2 and their transposes W1^T, W2^T]
We need a generative model!
Discriminative model: p(Y|X)
Generative model: p(X, Y)
Disentangle explanatory generative factors
"to disentangle as many factors as possible, discarding as little information about the data as is practical"
[figure: autoencoder mapping inputs x1…x5 to latent factors z1, z2 and reconstructing x1…x5]
Variational Autoencoder
A generative model
[figure: plate notation, latent variable z generating observation x, repeated N times]
We hope to learn the generative factors by an unsupervised method.
The factor
[figure: scatter of data points (x_i, y_i) fitted by the line ŷ_i = a·x_i + b, with mean ŷ and variance σ²]
The factor
[figure: plate notation, x generating y over N samples, governed by θ]
y = θ_0 + θ_1·x, with parameters θ = (θ_0, θ_1)
To learn latent random variables
[figure: plate notation, latent z generating x over N samples, governed by θ]
Introduce Bayes' theorem
p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x)
p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz
p_θ(x) is intractable.
Variational inference: use q_ϕ(z|x) to approximate p_θ(z|x).
Kullback–Leibler divergence
Relative entropy, a measure of the dissimilarity between two distributions.
Use q(X) to approximate the theoretical distribution p(X):
D_KL(p(X) || q(X)) = −Σ_i p(x_i) log [q(x_i) / p(x_i)]
1. Asymmetry
2. Not a distance metric
3. D_KL(p(X) || q(X)) ≥ 0
4. D_KL(p(X) || q(X)) = 0 ⇔ p(X) and q(X) are equal
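As a quick numeric check of these properties, here is a minimal NumPy sketch (mine, not from the slides) that evaluates D_KL for two discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(x_i) * log(p(x_i) / q(x_i)), in nats.

    Terms with p(x_i) = 0 contribute nothing (0 * log 0 := 0).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p, q = [0.8, 0.2], [0.5, 0.5]
print(kl_divergence(p, q))  # 0.1927... >= 0
print(kl_divergence(q, p))  # 0.2231...: asymmetric, so not a distance
print(kl_divergence(p, p))  # 0.0: zero exactly when the distributions are equal
```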
Formulation
p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x)
argmin_ϕ D_KL(q_ϕ(z|x) || p_θ(z|x))
[figure: plate notation of the generative model with parameters θ and of the variational approximation with parameters ϕ]
Architecture
Encoder: q_ϕ(z|x), z = f(x)
Decoder: p_θ(x|z), x = g(z)
[figure: x → encoder f → z → decoder g → x]
Evidence Lower Bound (ELBO) method
D_KL(q_ϕ(z|x) || p_θ(z|x))
= ∫ q(z|x) log [q(z|x) / p(z|x)] dz
= ∫ q(z|x) log [q(z|x) p(x) / p(x, z)] dz
= ∫ q(z|x) log [q(z|x) / p(x, z)] dz + ∫ q(z|x) log p(x) dz
= ∫ q(z|x) (log q(z|x) − log p(x, z)) dz + log p(x)
= −E_{q(z|x)}[log p(x, z) − log q(z|x)] + log p(x)
Evidence Lower Bound (ELBO) method
Let L(θ, ϕ, x) = E_{q_ϕ(z|x)}[log p_θ(x, z) − log q_ϕ(z|x)]
L is called the (variational) lower bound or evidence lower bound.
D_KL(q_ϕ(z|x) || p_θ(z|x)) = −L(θ, ϕ, x) + log p(x)
Since log p(x) is fixed, pushing L(θ, ϕ, x) ↗ up drives D_KL(q_ϕ(z|x) || p_θ(z|x)) ↙ down.
Evidence Lower Bound (ELBO) method
Encoder: q_ϕ(z|x), Decoder: p_θ(x|z)
Since p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x) is intractable,
argmin_ϕ D_KL(q_ϕ(z|x) || p_θ(z|x))
⇓
argmax_{θ,ϕ} L(θ, ϕ, x)
Hypothesis: Gaussian mixture as latent representation
[figure: latent dimensions z1, z2 modeled as Gaussians with means μ_z1, μ_z2 and standard deviations σ_z1, σ_z2]
Encoder and decoder
[figure: the encoder maps data into the (z1, z2) latent space; the decoder maps latent points back to data space]
How to solve?
Mean-field variational approximation
Sampling by Markov chain Monte Carlo
More?
Sampling by MCMC
picture source (https://www.youtube.com/watch?v=OTO1DygELpY)
Stochastic gradient descent?
L(θ, ϕ, x) = E_{q_ϕ(z|x)}[log p_θ(x, z) − log q_ϕ(z|x)]
∇_ϕ L(θ, ϕ, x) = ∇_ϕ E_{q_ϕ(z|x)}[log p_θ(x, z) − log q_ϕ(z|x)]
The expectation is taken under q_ϕ(z|x), which itself depends on ϕ, so the gradient cannot simply be moved inside the expectation.
Reparameterization trick
[figure: left, the network samples z directly from q_ϕ(z|x), so the stochastic node blocks backpropagation into the encoder; right, sample ε from N(0, I) and set z = μ + σ * ε (the * and + nodes), making z a deterministic, differentiable function of the encoder outputs]
Tutorial on Variational Autoencoders (https://arxiv.org/abs/1606.05908)
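A minimal NumPy sketch of the trick (an illustration, not the paper's code); mu and sigma stand in for the encoder's outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z_direct(mu, sigma):
    # Sampling z ~ N(mu, sigma^2) directly: the random draw is a black
    # box, so gradients cannot flow back into mu and sigma.
    return rng.normal(mu, sigma)

def sample_z_reparam(mu, sigma):
    # Reparameterization: push the randomness into eps ~ N(0, I).
    # z = mu + sigma * eps has the same distribution, but is now a
    # deterministic, differentiable function of (mu, sigma).
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

mu, sigma = np.array([0.0, 1.0]), np.array([1.0, 0.5])
print(sample_z_direct(mu, sigma))
print(sample_z_reparam(mu, sigma))
```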
Stochastic gradient variational Bayes (SGVB)
See Algorithm 1 in Auto-Encoding Variational Bayes.
Example: variational autoencoder
[figure: encoder and decoder between data space and the (z1, z2) latent space]
Experiments
(a) Learned Frey Face manifold (b) Learned MNIST manifold
β-Variational Autoencoder
Achieves disentangled, explainable generative factors
Figure 6 in β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
What is the difference between VAE and β-VAE?
VAE:
argmax L(θ, ϕ, x) = E_{q_ϕ(z|x)}[log p_θ(x|z)] − D_KL(q_ϕ(z|x) || p_θ(z))
β-VAE:
argmax L(θ, ϕ, x) = E_{q_ϕ(z|x)}[log p_θ(x|z)] − β·D_KL(q_ϕ(z|x) || p_θ(z))
Derivation of the VAE form:
L(θ, ϕ, x) = E_{q_ϕ(z|x)}[log p_θ(x, z) − log q_ϕ(z|x)]
= ∫ q_ϕ(z|x) (log p_θ(x, z) − log q_ϕ(z|x)) dz
= ∫ q_ϕ(z|x) (log [p_θ(x, z) / p_θ(z)] − log [q_ϕ(z|x) / p_θ(z)]) dz
= E_{q_ϕ(z|x)}[log p_θ(x|z)] − D_KL(q_ϕ(z|x) || p_θ(z))
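In code, the only change is the weight on the KL term. A sketch under the usual Gaussian assumptions (q_ϕ(z|x) = N(μ, σ²), p_θ(z) = N(0, I)); the closed-form KL is standard, but the variable names and toy numbers are mine:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)) per example:
    -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

def beta_vae_loss(recon_nll, mu, logvar, beta=1.0):
    # beta = 1 recovers the plain VAE objective (the negative ELBO);
    # beta > 1 penalizes the KL term more, constraining the capacity
    # of z and encouraging disentangled factors.
    return recon_nll + beta * gaussian_kl(mu, logvar)

mu = np.array([[0.3, -0.1]])      # encoder mean, one example, 2-d z
logvar = np.array([[-0.2, 0.1]])  # encoder log-variance
recon_nll = np.array([42.0])      # -E_q[log p(x|z)], e.g. a BCE term
print(beta_vae_loss(recon_nll, mu, logvar, beta=4.0))
```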
Why?
A higher β encourages learning a disentangled representation.
E_{q_ϕ(z|x)}[log p_θ(x|z)]: encourages learning good representations.
D_KL(q_ϕ(z|x) || p_θ(z)): constrains the capacity of z.
The information bottleneck method
argmax I(Z; Y) − β·I(X; Z)
I(Z; Y): maximize the mutual information between Z and Y.
I(X; Z): discard information in X that is irrelevant to Y.
Learning is about forgetting irrelevant details.
Experiments
Understanding disentangling in β-VAE (https://arxiv.org/abs/1804.03599)
Information Bottleneck Theory
Basic information theory
Entropy
Information entropy, Shannon entropy.
Measures the uncertainty of a random variable.
H(X) = E(I(X)) = −Σ_{i=1}^n p(x_i) log p(x_i)
1. Nonnegativity: H(X) ≥ 0
2. Symmetry: H(X, Y) = H(Y, X)
3. If X and Y are independent random variables: H(X|Y) = H(X)
Entropy
Forecast: 100% rain, 0% sunny: −1·log₂1 − 0·log₂0 = 0 + 0 = 0
Forecast: 80% rain, 20% sunny: −0.8·log₂0.8 − 0.2·log₂0.2 = 0.258 + 0.464 = 0.722
Forecast: 50% rain, 50% sunny: −0.5·log₂0.5 − 0.5·log₂0.5 = 0.5 + 0.5 = 1
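A small sketch (mine, not the slides') that reproduces the three forecasts above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum_i p(x_i) log2 p(x_i),
    with the convention 0 * log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0]))  # 0.0:   the forecast carries no uncertainty
print(entropy([0.8, 0.2]))  # 0.722 bits
print(entropy([0.5, 0.5]))  # 1.0:   maximal uncertainty for two outcomes
```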
Entropy
[figure: binary entropy function, H(X) versus Pr(X=1), peaking at 1 bit when Pr(X=1) = 0.5]
picture source (https://en.wikipedia.org/wiki/Entropy_(information_theory))
Conditional entropy
Measures how much information is needed to describe the outcome of a random variable Y given that the value of another random variable X is known.
H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)
= −Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x)
= −Σ_{x∈X, y∈Y} p(x, y) log [p(x, y) / p(x)]
Mutual information
Measures how much information is obtained about one random variable by observing the other.
I(X; Y) = H(X) − H(X|Y)
= H(Y) − H(Y|X)
= H(X) + H(Y) − H(X, Y)
= Σ_{x,y} p(x, y) log [p(x, y) / (p(x) p(y))]
1. Nonnegativity: I(X; Y) ≥ 0
2. Symmetry: I(X; Y) = I(Y; X)
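These identities are easy to verify from a joint distribution table; a sketch with toy joints of my own choosing:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from a joint table pxy[i, j]."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy)

independent = np.outer([0.5, 0.5], [0.5, 0.5])  # X, Y independent
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])    # Y = X deterministically
print(mutual_information(independent))  # 0.0 bits: observing Y says nothing
print(mutual_information(coupled))      # 1.0 bit = H(X): Y reveals X fully
```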
Relation to Kullback–Leibler divergence
I(X; Y) = D_KL(p(X, Y) || p(X) p(Y))
Relation
[figure: Venn diagram relating H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), and I(X; Y)]
picture source (https://en.wikipedia.org/wiki/Mutual_information)
Cross entropy
Measures how much two distributions differ.
H(q, p) = H(q) + D_KL(q || p) = −Σ_x q(x) log p(x)
NOTE: the notation is easily confused with joint entropy.
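The decomposition can be checked numerically; a short sketch keeping the slides' H(q, p) argument order (the toy numbers are mine):

```python
import numpy as np

def entropy(q):
    q = np.asarray(q, dtype=float)
    return -np.sum(q[q > 0] * np.log2(q[q > 0]))

def kl(q, p):
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    m = q > 0
    return np.sum(q[m] * np.log2(q[m] / p[m]))

def cross_entropy(q, p):
    """H(q, p) = -sum_x q(x) log2 p(x)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    m = q > 0
    return -np.sum(q[m] * np.log2(p[m]))

q, p = [0.8, 0.2], [0.5, 0.5]
print(cross_entropy(q, p))    # 1.0 bit
print(entropy(q) + kl(q, p))  # 1.0 bit: H(q) + D_KL(q||p) matches
```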
Difference between mutual information and cross entropy
Mutual information: measures the information shared between two random variables.
Cross entropy: measures the difference between two distributions.
Data processing inequality (DPI)
Let X → Y → Z be a Markov chain. Then I(X; Y) ≥ I(X; Z).
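A numeric illustration of the DPI (a toy construction of my own): pass a fair bit X through two binary symmetric channels in a row and compare the mutual informations.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(pxy):
    return entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)

def bsc(eps):
    """Binary symmetric channel: flips the input bit with probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

p_x = np.array([0.5, 0.5])                   # fair input bit X
p_xy = p_x[:, None] * bsc(0.1)               # joint of X and Y = BSC(0.1)(X)
p_xz = p_x[:, None] * (bsc(0.1) @ bsc(0.2))  # joint of X and Z = BSC(0.2)(Y)
print(mi(p_xy), mi(p_xz))  # ~0.531 >= ~0.173: I(X;Y) >= I(X;Z)
```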
The neural network generates a successive Markov chain
Treat each whole layer as a single random variable T_i.
Decoder chain: I(X; Y) ≥ I(T_1; Y) ≥ I(T_2; Y) ≥ … ≥ I(T_m; Y) ≥ I(Ŷ; Y)
Encoder chain: H(X) ≥ I(X; T_1) ≥ I(X; T_2) ≥ … ≥ I(X; T_m) ≥ I(X; Ŷ)
Codebook and volume
Let
X: signal source with a fixed probability measure p(x)
X̂: quantized codebook
p(x̂|x): a soft partition of X, with probability p(x̂) = Σ_x p(x) p(x̂|x)
What determines the quality of a quantization?
Rate: the average number of bits per message needed to encode the signal.
The information to transmit from X to X̂ is bounded from below by I(X; X̂).
Rate distortion theory
Bernd Girod: EE368b Image and Video Compression, Rate Distortion Theory no. 1
Lossy compression: lower the bit-rate R by allowing some acceptable distortion D of the signal.
[figure: rate R versus distortion D curve; lossless coding corresponds to D = 0]
Rate distortion theory
Bernd Girod: EE368b Image and Video Compression, Rate Distortion Theory no. 2
Types of lossy compression problems:
Given a maximum rate R, minimize the distortion D.
Given a distortion D, minimize the rate R.
These are equivalent constrained optimization problems, often unwieldy because of the constraint.
Rate distortion theory
Define the rate distortion function as
R(D) = min I(X; X̂)  w.r.t.  E[d(x, x̂)] ≤ D
Apply a Lagrange multiplier:
F(p(x̂|x)) = I(X; X̂) + β·E[d(x, x̂)]
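This Lagrangian can be minimized by alternating updates (the Blahut–Arimoto iteration). A minimal sketch of that iteration, assuming a finite alphabet; the binary-source example with Hamming distortion is my own:

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Minimize F = I(X; Xhat) + beta * E[d(x, xhat)] over p(xhat|x).
    p_x: (n,) source distribution; d: (n, m) distortion matrix."""
    q = np.full(d.shape[1], 1.0 / d.shape[1])      # init marginal q(xhat)
    for _ in range(n_iter):
        w = q * np.exp(-beta * d)                  # unnormalized p(xhat|x)
        p_cond = w / w.sum(axis=1, keepdims=True)  # normalize per source x
        q = p_x @ p_cond                           # re-estimate q(xhat)
    joint = p_cond * p_x[:, None]
    rate = np.sum(joint * np.log2(p_cond / q))     # I(X; Xhat) in bits
    distortion = np.sum(joint * d)                 # E[d(x, xhat)]
    return rate, distortion

# Binary source, Hamming distortion; sweeping beta traces out R(D).
p_x = np.array([0.5, 0.5])
d = 1.0 - np.eye(2)
for beta in (1.0, 3.0, 10.0):
    print(beta, blahut_arimoto(p_x, d, beta))
```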
Information bottleneck method
If X → X̂ → Y is a Markov chain, then I(X; X̂) ≥ I(X; Y).
Information bottleneck:
argmin L(p(x̂|x)) = I(X; X̂) − β·I(X̂; Y)
We want this quantization to capture as much information about Y as possible: a tradeoff between compressing the representation and preserving meaningful information.
Information bottleneck method
[figure: inputs x1…x5 passed through bottleneck variables z1…z4 back to x1…x5]
Opening the Black Box of Deep Neural Networks via Information
Issues
1. The SGD layer dynamics in the information plane.
2. The effect of the training sample size on the layers.
3. What is the benefit of the hidden layers?
4. What is the final location of the hidden layers?
5. Do the hidden layers form optimal IB representations?
Setup
standard DNN settings
tanh as activation function
sigmoid function in the final layer
train with SGD and cross-entropy loss
7 fully connected hidden layers with widths 12-10-7-5-4-3-2
Information plane
Given P(X; Y), plot the point (I(X; T), I(T; Y)) — the encoder and decoder coordinates — on the information plane.
Applied to the Markov chain of a k-layer DNN, the connected points form a unique information path.
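The paper estimates these quantities by discretizing the tanh activations into bins and treating each binned activation vector as one discrete symbol. A rough sketch of such an estimator (the bin count and toy data are my assumptions, not the paper's exact protocol):

```python
import numpy as np

def entropy_of_rows(a):
    """H in bits of an array of discrete symbols (1-D) or symbol rows (2-D)."""
    _, counts = np.unique(a, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_plane_point(x_ids, T, y_ids, n_bins=30):
    """Estimate (I(X;T), I(T;Y)) via I(A;B) = H(A) + H(B) - H(A,B),
    after binning the tanh activations T into n_bins levels."""
    edges = np.linspace(-1, 1, n_bins)  # tanh outputs lie in [-1, 1]
    t_ids = np.digitize(T, edges)       # (n_samples, layer_width) bin ids
    h_t = entropy_of_rows(t_ids)
    i_xt = h_t + entropy_of_rows(x_ids) - entropy_of_rows(np.column_stack([t_ids, x_ids]))
    i_ty = h_t + entropy_of_rows(y_ids) - entropy_of_rows(np.column_stack([t_ids, y_ids]))
    return i_xt, i_ty

# Toy usage: 8 distinct inputs, a 3-unit hidden layer, binary labels.
rng = np.random.default_rng(0)
x_ids = np.arange(8)
T = np.tanh(rng.standard_normal((8, 3)))
y_ids = x_ids % 2
print(information_plane_point(x_ids, T, y_ids))
```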
The dynamics of training by stochastic gradient descent
50 different randomized initializations with different randomized training samples
Snapshots at initialization, 400 epochs, and 9000 epochs
The optimization process in the Information Plane (https://www.youtube.com/watch?v=P1A1yNsxMjc)
The two optimization phases in the Information Plane
5% - 45% - 85% training samples
Empirical risk minimization (ERM) phase (fast):
I_Y increases;
layers learn the label information while preserving the DPI order.
Representation compression phase (slow):
I_X decreases until convergence;
layers lose irrelevant information (compression).
The drift and diffusion phases of SGD optimization
Layer weights' gradient distributions
The drift and diffusion phases of SGD optimization
Drift phase:
large gradient mean, small variance (high SNR)
increases I_Y and reduces the empirical error
the ERM phase
Diffusion phase:
small gradient mean, large fluctuations (low SNR)
the gradients behave like Gaussian noise; the weights evolve like a Wiener process
the compression phase
maximizes the entropy of the weight distribution by adding noise, known as stochastic relaxation
compression by diffusion
Attempts to interpret single weights or even single neurons in such networks can be meaningless.
The computational benefit of the hidden layers
Train 6 different architectures with 1-6 hidden layers.
The computational benefit of the hidden layers
1. Adding hidden layers dramatically reduces the number of training epochs needed for good generalization.
2. The compression phase of each layer is shorter when it starts from a previously compressed layer.
3. The compression is faster for the deeper (narrower and closer to the output) layers.
4. Even wide hidden layers eventually compress in the diffusion phase; adding extra width does not help.
Convergence of the layers to the Information Bottleneck bound
Evolution of the layers with training sample size
[figure: information plane, I(X;T) versus I(T;Y), for training data fractions from 4% to 84%]
With increasing training size, the layers' true label information I_Y (generalization) is pushed up and gets closer to the theoretical IB bound for the rule distribution.
Are our findings general enough?
Hinton's comment
After hearing Tishby's talk, Hinton emailed Tishby:
"I have to listen to it another 10,000 times to really understand it, but it's very rare nowadays to hear a talk with a really original idea in it that may be the answer to a really major puzzle."
Caution!
No, information bottleneck (probably) doesn't open the "black-box" of deep neural networks
(https://severelytheoretical.wordpress.com/2017/09/28/no-information-bottlenec
black-box-of-deep-neural-networks/)
Tishby's 'Opening the Black Box of Deep Neural Networks via Information' received
(https://www.reddit.com/r/MachineLearning/comments/72eau7/d_tishbys_opening
On the Information Bottleneck Theory of Deep Learning [Harvard University] [ICLR]
(https://openreview.net/forum?id=ry_WPG-A-)
Thank you for your attention
Reference
18. Information Theory of Deep Learning. Naftali Tishby (https://www.youtube.com/watch?v=bLqJHjXihK8)