1. RBM from Scratch
Hadi Sinaee
Sharif University of Technology
Department of Computer Engineering
May 17, 2015
Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 1 / 20
Outline
1 Unsupervised Learning
2 Likelihood
3 Optimization
4 Having Latent Variables
5 Markov Chain and Gibbs Sampling
6 Restricted Boltzmann Machines
Unsupervised Learning
• Unsupervised learning means learning an unknown distribution q based on sample data.
• This includes finding new representations of the data that foster learning and generalization.
• If the structure of the graphical model and the family of energy functions parameterized by θ are known, unsupervised learning of a data distribution with an MRF means adjusting the parameters θ.
• p(x|θ) denotes this dependence.
Likelihood
• Training data S = {x_1, x_2, ..., x_N}, sampled i.i.d. from the true distribution q. The standard way of finding the parameters is maximum likelihood (ML).
• Applying this to an MRF → finding the MRF parameters θ that maximize the probability of S under the MRF distribution p:

l : Θ → R,  l(θ|S) = Σ_{i=1}^N ln p(x_i|θ)

For the Gibbs distribution of an MRF the maximum cannot be found analytically! So we use numerical approximation.
KL Divergence
The KL divergence between the true distribution and the MRF distribution:

KL(q||p) = Σ_x q(x) ln q(x) − Σ_x q(x) ln p(x)

• The KL divergence comprises the entropy of q and an expectation over q. Only the latter depends on the parameters subject to optimization.
• Maximizing the likelihood → minimizing the KL divergence.
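The equivalence on this slide can be checked numerically. A minimal sketch, with made-up distributions q, p1, p2 (not from the slides): since KL(q||p) = −H(q) − E_q[ln p] and H(q) does not depend on the model, the candidate with the higher expected log-likelihood has the lower KL divergence from q.

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])    # "true" distribution (illustrative)
p1 = np.array([0.4, 0.4, 0.2])   # candidate model 1 (illustrative)
p2 = np.array([0.5, 0.25, 0.25]) # candidate model 2 (illustrative)

def kl(q, p):
    # KL(q||p) = sum_x q(x) (ln q(x) - ln p(x))
    return np.sum(q * (np.log(q) - np.log(p)))

def expected_loglik(q, p):
    # E_q[ln p(x)], the model-dependent part of -KL(q||p)
    return np.sum(q * np.log(p))

# Higher expected log-likelihood <=> lower KL divergence from q.
better = p2 if expected_loglik(q, p2) > expected_loglik(q, p1) else p1
worse = p1 if better is p2 else p2
assert kl(q, better) < kl(q, worse)
```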
Optimization: Gradient Ascent
Iteratively update the parameters from θ^t to θ^{t+1} based on the log-likelihood:

θ^{t+1} = θ^t + η ∂/∂θ^t (Σ_{i=1}^N l(θ|x_i))
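The update rule above can be sketched generically. The objective below is a stand-in concave function, not an MRF likelihood, and the step size and iteration count are arbitrary choices:

```python
# Gradient ascent: theta_{t+1} = theta_t + eta * dl/dtheta
def grad_ascent(grad, theta0, eta=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        theta = theta + eta * grad(theta)
    return theta

# Toy objective l(theta) = -(theta - 2)^2, gradient -2(theta - 2);
# the maximum is at theta = 2.
theta_star = grad_ascent(lambda th: -2.0 * (th - 2.0), theta0=0.0)
assert abs(theta_star - 2.0) < 1e-6
```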
Having Latent Variables
• We want to model an m-dimensional probability distribution q (e.g., an image with m pixels).
• X = (V, H) is the set of all variables:
V = (V_1, V_2, ..., V_m) → visible units
H = (H_1, H_2, ..., H_n) → hidden units, n = |X| − m
• E.g., V = the set of all pixels, H = the set of relationships between the V units.
• Our Gibbs distribution of the visible units:

p(v) = (1/Z) Σ_h e^{−E(v,h)},  Z = Σ_{v,h} e^{−E(v,h)}
Log-Likelihood of an MRF
• Log-likelihood for one sample:

l(θ|v) = ln Σ_h e^{−E(v,h)} − ln Σ_{v,h} e^{−E(v,h)}

• Its gradient is:

∂l(θ|v)/∂θ = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ

• It is the difference of two expectations:
→ one over the conditional distribution of the hidden units,
→ one over the model distribution.
• We would have to sum over all possible values of (v,h) for this computation!! Instead, we approximate these expectations.
Markov Chain
• Stationary distribution: a distribution π for which π^T = π^T P holds, where P is the transition matrix with elements p_ij.
• Detailed balance condition: a sufficient condition for π to be the stationary distribution w.r.t. the transition probabilities p_ij, i, j ∈ Ω:

π(i) p_ij = π(j) p_ji,  ∀ i, j ∈ Ω
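Both conditions are easy to verify numerically for a tiny chain. A sketch with an illustrative 2-state chain (the numbers are made up): detailed balance holds entry-wise, and the stationary condition π^T = π^T P follows.

```python
import numpy as np

# A 2-state chain chosen so that detailed balance pi(i) p_ij = pi(j) p_ji holds.
pi = np.array([0.25, 0.75])
P = np.array([[0.7, 0.3],
              [0.1, 0.9]])

assert np.allclose(P.sum(axis=1), 1.0)               # rows sum to 1: valid P
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])  # detailed balance
assert np.allclose(pi @ P, pi)                       # hence pi^T = pi^T P
```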
Gibbs Sampling
• An MRF X = (X_1, X_2, ..., X_N) for a graph G = (V, E) where V = {1, 2, ..., N}.
• Each X_i, i ∈ V, takes values in a finite set Λ.
• Time-varying states X = {X^(k) | k ∈ N}, X^(k) = (X^(k)_1, X^(k)_2, ..., X^(k)_N).
• π(x) = (1/Z) e^{−ε(x)} is the joint probability distribution of X.
Gibbs Algorithm
Step 1: At each iteration, pick a random variable X_i, i ∈ V, with probability q(i); q is a strictly positive distribution over V.
Step 2: Sample a new value for X_i from its conditional distribution given the state of all other variables, i.e., π(X_i | (x_v)_{v∈V−{i}}) = π(X_i | (x_w)_{w∈N_i}), where N_i is the neighborhood of i.
Step 3: Keep doing this!
• Therefore the transition probabilities for the MRF X are defined as follows, for two states x and y:

p_xy = q(i) π(y_i | (x_v)_{v∈V−{i}}),  if ∃ i ∈ V such that ∀ v ∈ V, v ≠ i: x_v = y_v (x and y differ at most in the i-th element)
p_xy = 0,  otherwise
p_xx = Σ_{i∈V} q(i) π(x_i | (x_v)_{v∈V−{i}})
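The three steps above can be sketched for a tiny binary MRF. The coupling matrix and energy below are made-up examples, q(i) is taken uniform, and the single-site conditional is computed directly from the energy difference, which is only convenient because the model is tiny:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative symmetric couplings for a 3-unit binary MRF.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, -1.0],
              [0.0, -1.0, 0.0]])

def energy(x):
    return -0.5 * x @ W @ x

def gibbs_step(x):
    i = rng.integers(len(x))            # Step 1: pick a site (uniform q(i))
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0.0, 1.0
    # Step 2: pi(x_i = 1 | rest) = e^{-E1} / (e^{-E0} + e^{-E1})
    p1 = 1.0 / (1.0 + np.exp(energy(x1) - energy(x0)))
    x[i] = 1.0 if rng.random() < p1 else 0.0
    return x

x = np.zeros(3)
for _ in range(1000):                   # Step 3: keep doing this
    x = gibbs_step(x)
assert set(x.tolist()) <= {0.0, 1.0}
```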
Convergence of Gibbs Sampling
• π is strictly positive → the conditional distributions of single variables are strictly positive.
Every single variable X_i can take every state x_i ∈ Λ in a single transition step → every state can reach any other state in a finite number of steps → the Markov chain is irreducible.
• p_xx > 0 and p_xy > 0, together with the detailed balance condition → the Markov chain is aperiodic.
Convergence
Aperiodicity and irreducibility guarantee that the chain converges to the stationary distribution π.
RBM
• p(v,h) = (1/Z) e^{−E(v,h)} with

E(v,h) = −Σ_{i=1}^n Σ_{j=1}^m w_ij h_i v_j − Σ_{j=1}^m b_j v_j − Σ_{i=1}^n c_i h_i

• c_i and b_j are real-valued bias terms for the hidden and visible units.
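The energy function above is a single bilinear form plus bias terms. A minimal sketch with arbitrary example sizes and parameter values (none taken from the slides):

```python
import numpy as np

n, m = 2, 3                        # hidden, visible units (illustrative)
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.6]])   # shape (n, m), entries w_ij (illustrative)
b = np.array([0.1, 0.0, -0.1])     # visible biases b_j
c = np.array([0.2, -0.3])          # hidden biases c_i

def energy(v, h):
    # E(v,h) = -h^T W v - b.v - c.h
    return -h @ W @ v - b @ v - c @ h

v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
# -(0.3 - 0.6) - (0.1 - 0.1) - (-0.3) = 0.3 + 0.0 + 0.3 = 0.6
assert np.isclose(energy(v, h), 0.6)
```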
• The hidden variables are independent of each other given the visible units, and vice versa:

p(h|v) = Π_{i=1}^n p(h_i|v)  and  p(v|h) = Π_{j=1}^m p(v_j|h)

• Marginal distribution of the observations:

p(v) = (1/Z) Π_{j=1}^m e^{b_j v_j} Π_{i=1}^n (1 + e^{c_i + Σ_{j=1}^m w_ij v_j})

• Conditional probability distributions of the components:

p(H_i = 1|v) = σ(Σ_{j=1}^m w_ij v_j + c_i)
p(V_j = 1|h) = σ(Σ_{i=1}^n w_ij h_i + b_j)
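Because the conditionals factorize, sampling one layer given the other is a single sigmoid plus an independent coin flip per unit. A sketch with illustrative parameter values (sizes and initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W = rng.normal(scale=0.1, size=(2, 3))   # w_ij: n=2 hidden, m=3 visible
b = np.zeros(3)                          # visible biases b_j
c = np.zeros(2)                          # hidden biases c_i

def sample_h_given_v(v):
    p = sigmoid(W @ v + c)               # p(H_i = 1 | v), independent per i
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h):
    p = sigmoid(W.T @ h + b)             # p(V_j = 1 | h), independent per j
    return (rng.random(p.shape) < p).astype(float), p

v = np.array([1.0, 0.0, 1.0])
h, ph = sample_h_given_v(v)
v2, pv = sample_v_given_h(h)
assert h.shape == (2,) and v2.shape == (3,)
```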
Gradient of the Log-Likelihood
• Recap: the log-likelihood gradient of an MRF for a single data point:

∂l(θ|v)/∂θ = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ

The first term is tractable; e.g., for w_ij:

−Σ_h p(h|v) ∂E(v,h)/∂w_ij = Σ_h p(h|v) h_i v_j = p(H_i = 1|v) v_j = σ(Σ_{j=1}^m w_ij v_j + c_i) v_j

using p(h|v) = Π_{k=1}^n p(h_k|v) and summing out all h_k with k ≠ i.
• We can do the same for the second term, writing it as

Σ_v p(v) Σ_h p(h|v) ∂E(v,h)/∂θ  or  Σ_h p(h) Σ_v p(v|h) ∂E(v,h)/∂θ

• It is still intractable (in terms of the smaller layer, i.e., 2^m or 2^n terms).
Computing the Derivatives of the Log-Likelihood
• w.r.t. w_ij:

∂l(θ|v)/∂w_ij = p(H_i = 1|v) v_j − Σ_v p(v) p(H_i = 1|v) v_j

• w.r.t. b_j:

∂l(θ|v)/∂b_j = v_j − Σ_v p(v) v_j

• w.r.t. c_i:

∂l(θ|v)/∂c_i = p(H_i = 1|v) − Σ_v p(v) p(H_i = 1|v)

• To avoid the summation over all possible values of v, we approximate the expectation.
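These derivatives can be sketched directly. The positive (data-dependent) phase is exact; the Σ_v p(v)(...) negative phase is here approximated by a single model sample v_model, however obtained (e.g., by Gibbs sampling). All parameter values are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grads(v_data, v_model, W, b, c):
    ph_data = sigmoid(W @ v_data + c)    # p(H_i = 1 | v_data)
    ph_model = sigmoid(W @ v_model + c)  # p(H_i = 1 | v_model)
    dW = np.outer(ph_data, v_data) - np.outer(ph_model, v_model)
    db = v_data - v_model
    dc = ph_data - ph_model
    return dW, db, dc

W = np.zeros((2, 3)); b = np.zeros(3); c = np.zeros(2)
dW, db, dc = grads(np.array([1., 0., 1.]), np.array([0., 0., 1.]), W, b, c)
assert dW.shape == (2, 3) and db.shape == (3,) and dc.shape == (2,)
```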
Contrastive Divergence
• Use a Gibbs chain run for only k steps (usually k = 1).
• Starting from a training sample v^(0), this yields the sample v^(k) after k steps.
• Each step t consists of sampling h^(t) from p(h|v^(t)) and sampling v^(t+1) from p(v|h^(t)).
• Using these samples, the gradient approximation is given by

CD_k(θ, v^(0)) = −Σ_h p(h|v^(0)) ∂E(v^(0), h)/∂θ + Σ_h p(h|v^(k)) ∂E(v^(k), h)/∂θ
k-CD for Batch
(algorithm listing shown as a figure in the original slides)
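The batch algorithm on this slide survives only as a figure title, so the following is a rough sketch of what a k-CD batch update might look like, assuming binary units, the gradient formulas from the previous slides, and arbitrary hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V, W, b, c, k=1, eta=0.1):
    """One CD-k parameter update on a batch V of binary rows, shape (batch, m)."""
    ph0 = sigmoid(V @ W.T + c)            # p(H_i = 1 | v^(0)) for each row
    Vk = V
    for _ in range(k):                    # k steps of block Gibbs sampling
        H = (rng.random(ph0.shape) < sigmoid(Vk @ W.T + c)).astype(float)
        Vk = (rng.random(V.shape) < sigmoid(H @ W + b)).astype(float)
    phk = sigmoid(Vk @ W.T + c)           # p(H_i = 1 | v^(k))
    W += eta * (ph0.T @ V - phk.T @ Vk) / len(V)   # positive - negative phase
    b += eta * (V - Vk).mean(axis=0)
    c += eta * (ph0 - phk).mean(axis=0)
    return W, b, c

n, m = 3, 4                               # hidden, visible units (arbitrary)
W = rng.normal(scale=0.01, size=(n, m))
b = np.zeros(m)
c = np.zeros(n)
V = rng.integers(0, 2, size=(8, m)).astype(float)
W, b, c = cd_k_update(V, W, b, c, k=1)
assert W.shape == (n, m) and b.shape == (m,) and c.shape == (n,)
```

In practice this update would be repeated over many batches and epochs; here one update suffices to show the shapes and the positive/negative-phase structure.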
Other CD Variants
• Persistent CD (PCD) relies on the previous chain state for each parameter update: the v^(k) of the previous step initializes the chain of the next step.
• Fast PCD introduces a set of parameters used only for sampling, not for the model, to increase mixing speed.
• Parallel Tempering
Run k (usually k = 1) Gibbs sampling steps in each tempered Markov chain, yielding samples (v_1, h_1), ..., (v_M, h_M); then choose two chains with consecutive temperatures and exchange their particles (v_r, h_r) and (v_{r−1}, h_{r−1}) with a certain probability.
Results
(figure: samples from the trained model; left: hidden sampling, right: visible sampling)