1. RBM from Scratch
Hadi Sinaee
Sharif University of Technology
Department of Computer Engineering
May 17, 2015
Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 1 / 20
Outline
1 Unsupervised Learning
2 Likelihood
3 Optimization
4 Having Latent Variables
5 Markov Chain and Gibbs Sampling
6 Restricted Boltzmann Machines
Unsupervised Learning
• Unsupervised learning means learning an unknown distribution q based on sample data.
• This includes finding new representations of the data that foster learning and generalization.
• If the structure of the graphical model and the family of energy functions parameterized by θ are known, unsupervised learning of a data distribution with an MRF means adjusting the parameters θ.
• p(x|θ) denotes this dependence.
Likelihood
• Training data S = {x_1, x_2, ..., x_N}, sampled i.i.d. from the true distribution q. The standard way of finding the parameters is maximum likelihood (ML).
• Applying this to an MRF → finding the MRF parameters θ that maximize the probability of S under the MRF distribution p:

l : Θ → R,  l(θ|S) = Σ_{i=1}^N ln p(x_i|θ)

For the Gibbs distribution of an MRF the maximum cannot be found analytically! So we use numerical approximation.
KL Divergence
The KL divergence between the true distribution and the MRF distribution:

KL(q||p) = Σ_x q(x) ln q(x) − Σ_x q(x) ln p(x)

• The KL divergence comprises the entropy of q and an expectation over q. Only the latter depends on the parameters subject to optimization.
• Maximizing the likelihood → minimizing the KL divergence.
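The equivalence on this slide can be checked numerically. A minimal sketch, with made-up distributions q, p1, p2 (not from the slides): since KL(q||p) = −H(q) − E_q[ln p] and H(q) does not depend on the model, the candidate with the higher expected log-likelihood has the lower KL divergence from q.

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])    # "true" distribution (illustrative)
p1 = np.array([0.4, 0.4, 0.2])   # candidate model 1 (illustrative)
p2 = np.array([0.5, 0.25, 0.25]) # candidate model 2 (illustrative)

def kl(q, p):
    # KL(q||p) = sum_x q(x) (ln q(x) - ln p(x))
    return np.sum(q * (np.log(q) - np.log(p)))

def expected_loglik(q, p):
    # E_q[ln p(x)], the model-dependent part of -KL(q||p)
    return np.sum(q * np.log(p))

# Higher expected log-likelihood <=> lower KL divergence from q.
better = p2 if expected_loglik(q, p2) > expected_loglik(q, p1) else p1
worse = p1 if better is p2 else p2
assert kl(q, better) < kl(q, worse)
```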
Optimization: Gradient Ascent
Iteratively update the parameters from θ^t to θ^{t+1} based on the log-likelihood:

θ^{t+1} = θ^t + η ∂/∂θ^t (Σ_{i=1}^N l(θ|x_i))
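The update rule above can be sketched generically. The objective below is a stand-in concave function, not an MRF likelihood, and the step size and iteration count are arbitrary choices:

```python
# Gradient ascent: theta_{t+1} = theta_t + eta * dl/dtheta
def grad_ascent(grad, theta0, eta=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        theta = theta + eta * grad(theta)
    return theta

# Toy objective l(theta) = -(theta - 2)^2, gradient -2(theta - 2);
# the maximum is at theta = 2.
theta_star = grad_ascent(lambda th: -2.0 * (th - 2.0), theta0=0.0)
assert abs(theta_star - 2.0) < 1e-6
```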
Having Latent Variables
• We want to model an m-dimensional probability distribution q (e.g., an image with m pixels).
• X = (V, H) is the set of all variables:
V = (V_1, V_2, ..., V_m) → visible units
H = (H_1, H_2, ..., H_n) → hidden units, n = |X| − m
• E.g., V = the set of all pixels, H = the set of relationships between the V units.
• Our Gibbs distribution of the visible units:

p(v) = (1/Z) Σ_h e^{−E(v,h)},  Z = Σ_{v,h} e^{−E(v,h)}
Log-Likelihood of an MRF
• Log-likelihood for one sample:

l(θ|v) = ln Σ_h e^{−E(v,h)} − ln Σ_{v,h} e^{−E(v,h)}

• Its gradient is:

∂l(θ|v)/∂θ = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ

• It is the difference of two expectations:
→ one over the conditional distribution of the hidden units,
→ one over the model distribution.
• We would have to sum over all possible values of (v,h) for this computation!! Instead, we approximate these expectations.
Markov Chain
• Stationary distribution: a distribution π for which π^T = π^T P holds, where P is the transition matrix with elements p_ij.
• Detailed balance condition: a sufficient condition for π to be the stationary distribution w.r.t. the transition probabilities p_ij, i, j ∈ Ω:

π(i) p_ij = π(j) p_ji,  ∀ i, j ∈ Ω
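Both conditions are easy to verify numerically for a tiny chain. A sketch with an illustrative 2-state chain (the numbers are made up): detailed balance holds entry-wise, and the stationary condition π^T = π^T P follows.

```python
import numpy as np

# A 2-state chain chosen so that detailed balance pi(i) p_ij = pi(j) p_ji holds.
pi = np.array([0.25, 0.75])
P = np.array([[0.7, 0.3],
              [0.1, 0.9]])

assert np.allclose(P.sum(axis=1), 1.0)               # rows sum to 1: valid P
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])  # detailed balance
assert np.allclose(pi @ P, pi)                       # hence pi^T = pi^T P
```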
Gibbs Sampling
• An MRF X = (X_1, X_2, ..., X_N) for a graph G = (V, E) where V = {1, 2, ..., N}.
• Each X_i, i ∈ V, takes values in a finite set Λ.
• Time-varying states X = {X^(k) | k ∈ N}, X^(k) = (X^(k)_1, X^(k)_2, ..., X^(k)_N).
• π(x) = (1/Z) e^{−ε(x)} is the joint probability distribution of X.
Gibbs Algorithm
Step 1: At each iteration, pick a random variable X_i, i ∈ V, with probability q(i); q is a strictly positive distribution over V.
Step 2: Sample a new value for X_i from its conditional distribution given the state of all other variables, i.e., π(X_i | (x_v)_{v∈V−{i}}) = π(X_i | (x_w)_{w∈N_i}), where N_i is the neighborhood of i.
Step 3: Keep doing this!
• Therefore the transition probabilities for the MRF X are defined as follows, for two states x and y:

p_xy = q(i) π(y_i | (x_v)_{v∈V−{i}}),  if ∃ i ∈ V such that ∀ v ∈ V, v ≠ i: x_v = y_v (x and y differ at most in the i-th element)
p_xy = 0,  otherwise
p_xx = Σ_{i∈V} q(i) π(x_i | (x_v)_{v∈V−{i}})
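The three steps above can be sketched for a tiny binary MRF. The coupling matrix and energy below are made-up examples, q(i) is taken uniform, and the single-site conditional is computed directly from the energy difference, which is only convenient because the model is tiny:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative symmetric couplings for a 3-unit binary MRF.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, -1.0],
              [0.0, -1.0, 0.0]])

def energy(x):
    return -0.5 * x @ W @ x

def gibbs_step(x):
    i = rng.integers(len(x))            # Step 1: pick a site (uniform q(i))
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0.0, 1.0
    # Step 2: pi(x_i = 1 | rest) = e^{-E1} / (e^{-E0} + e^{-E1})
    p1 = 1.0 / (1.0 + np.exp(energy(x1) - energy(x0)))
    x[i] = 1.0 if rng.random() < p1 else 0.0
    return x

x = np.zeros(3)
for _ in range(1000):                   # Step 3: keep doing this
    x = gibbs_step(x)
assert set(x.tolist()) <= {0.0, 1.0}
```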
Convergence of Gibbs Sampling
• π is strictly positive → the conditional distributions of single variables are strictly positive.
Every single variable X_i can take every state x_i ∈ Λ in a single transition step → every state can reach any other state in a finite number of steps → the Markov chain is irreducible.
• p_xx > 0 and p_xy > 0, together with the detailed balance condition → the Markov chain is aperiodic.
Convergence
Aperiodicity and irreducibility guarantee that the chain converges to the stationary distribution π.
RBM
• p(v,h) = (1/Z) e^{−E(v,h)} with

E(v,h) = −Σ_{i=1}^n Σ_{j=1}^m w_ij h_i v_j − Σ_{j=1}^m b_j v_j − Σ_{i=1}^n c_i h_i

• c_i and b_j are real-valued bias terms for the hidden and visible units.
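The energy function above is a single bilinear form plus bias terms. A minimal sketch with arbitrary example sizes and parameter values (none taken from the slides):

```python
import numpy as np

n, m = 2, 3                        # hidden, visible units (illustrative)
W = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.6]])   # shape (n, m), entries w_ij (illustrative)
b = np.array([0.1, 0.0, -0.1])     # visible biases b_j
c = np.array([0.2, -0.3])          # hidden biases c_i

def energy(v, h):
    # E(v,h) = -h^T W v - b.v - c.h
    return -h @ W @ v - b @ v - c @ h

v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
# -(0.3 - 0.6) - (0.1 - 0.1) - (-0.3) = 0.3 + 0.0 + 0.3 = 0.6
assert np.isclose(energy(v, h), 0.6)
```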
• The hidden variables are independent of each other given the visible units, and vice versa:

p(h|v) = Π_{i=1}^n p(h_i|v)  and  p(v|h) = Π_{j=1}^m p(v_j|h)

• Marginal distribution of the observations:

p(v) = (1/Z) Π_{j=1}^m e^{b_j v_j} Π_{i=1}^n (1 + e^{c_i + Σ_{j=1}^m w_ij v_j})

• Conditional probability distributions of the components:

p(H_i = 1|v) = σ(Σ_{j=1}^m w_ij v_j + c_i)
p(V_j = 1|h) = σ(Σ_{i=1}^n w_ij h_i + b_j)
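Because the conditionals factorize, sampling one layer given the other is a single sigmoid plus an independent coin flip per unit. A sketch with illustrative parameter values (sizes and initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W = rng.normal(scale=0.1, size=(2, 3))   # w_ij: n=2 hidden, m=3 visible
b = np.zeros(3)                          # visible biases b_j
c = np.zeros(2)                          # hidden biases c_i

def sample_h_given_v(v):
    p = sigmoid(W @ v + c)               # p(H_i = 1 | v), independent per i
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h):
    p = sigmoid(W.T @ h + b)             # p(V_j = 1 | h), independent per j
    return (rng.random(p.shape) < p).astype(float), p

v = np.array([1.0, 0.0, 1.0])
h, ph = sample_h_given_v(v)
v2, pv = sample_v_given_h(h)
assert h.shape == (2,) and v2.shape == (3,)
```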
Gradient of the Log-Likelihood
• Recap: the log-likelihood gradient of an MRF for a single data point:

∂l(θ|v)/∂θ = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ

The first term is tractable; e.g., for w_ij:

−Σ_h p(h|v) ∂E(v,h)/∂w_ij = Σ_h p(h|v) h_i v_j = p(H_i = 1|v) v_j = σ(Σ_{j=1}^m w_ij v_j + c_i) v_j

using p(h|v) = Π_{k=1}^n p(h_k|v) and summing out all h_k with k ≠ i.
• We can do the same for the second term, writing it as

Σ_v p(v) Σ_h p(h|v) ∂E(v,h)/∂θ  or  Σ_h p(h) Σ_v p(v|h) ∂E(v,h)/∂θ

• It is still intractable (in terms of the smaller layer, i.e., 2^m or 2^n terms).
Computing the Derivatives of the Log-Likelihood
• w.r.t. w_ij:

∂l(θ|v)/∂w_ij = p(H_i = 1|v) v_j − Σ_v p(v) p(H_i = 1|v) v_j

• w.r.t. b_j:

∂l(θ|v)/∂b_j = v_j − Σ_v p(v) v_j

• w.r.t. c_i:

∂l(θ|v)/∂c_i = p(H_i = 1|v) − Σ_v p(v) p(H_i = 1|v)

• To avoid the summation over all possible values of v, we approximate the expectation.
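These derivatives can be sketched directly. The positive (data-dependent) phase is exact; the Σ_v p(v)(...) negative phase is here approximated by a single model sample v_model, however obtained (e.g., by Gibbs sampling). All parameter values are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grads(v_data, v_model, W, b, c):
    ph_data = sigmoid(W @ v_data + c)    # p(H_i = 1 | v_data)
    ph_model = sigmoid(W @ v_model + c)  # p(H_i = 1 | v_model)
    dW = np.outer(ph_data, v_data) - np.outer(ph_model, v_model)
    db = v_data - v_model
    dc = ph_data - ph_model
    return dW, db, dc

W = np.zeros((2, 3)); b = np.zeros(3); c = np.zeros(2)
dW, db, dc = grads(np.array([1., 0., 1.]), np.array([0., 0., 1.]), W, b, c)
assert dW.shape == (2, 3) and db.shape == (3,) and dc.shape == (2,)
```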
Contrastive Divergence
• Use a Gibbs chain run for only k steps (usually k = 1).
• Starting from a training sample v^(0), this yields the sample v^(k) after k steps.
• Each step t consists of sampling h^(t) from p(h|v^(t)) and sampling v^(t+1) from p(v|h^(t)).
• Using these samples, the gradient approximation is given by

CD_k(θ, v^(0)) = −Σ_h p(h|v^(0)) ∂E(v^(0), h)/∂θ + Σ_h p(h|v^(k)) ∂E(v^(k), h)/∂θ
k-CD for Batch
(algorithm listing shown as a figure in the original slides)
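The batch algorithm on this slide survives only as a figure title, so the following is a rough sketch of what a k-CD batch update might look like, assuming binary units, the gradient formulas from the previous slides, and arbitrary hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V, W, b, c, k=1, eta=0.1):
    """One CD-k parameter update on a batch V of binary rows, shape (batch, m)."""
    ph0 = sigmoid(V @ W.T + c)            # p(H_i = 1 | v^(0)) for each row
    Vk = V
    for _ in range(k):                    # k steps of block Gibbs sampling
        H = (rng.random(ph0.shape) < sigmoid(Vk @ W.T + c)).astype(float)
        Vk = (rng.random(V.shape) < sigmoid(H @ W + b)).astype(float)
    phk = sigmoid(Vk @ W.T + c)           # p(H_i = 1 | v^(k))
    W += eta * (ph0.T @ V - phk.T @ Vk) / len(V)   # positive - negative phase
    b += eta * (V - Vk).mean(axis=0)
    c += eta * (ph0 - phk).mean(axis=0)
    return W, b, c

n, m = 3, 4                               # hidden, visible units (arbitrary)
W = rng.normal(scale=0.01, size=(n, m))
b = np.zeros(m)
c = np.zeros(n)
V = rng.integers(0, 2, size=(8, m)).astype(float)
W, b, c = cd_k_update(V, W, b, c, k=1)
assert W.shape == (n, m) and b.shape == (m,) and c.shape == (n,)
```

In practice this update would be repeated over many batches and epochs; here one update suffices to show the shapes and the positive/negative-phase structure.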
Other CD Variants
• Persistent CD (PCD) relies on the previous chain state for each parameter update: the v^(k) of the previous step initializes the chain of the next step.
• Fast PCD introduces a set of parameters used only for sampling, not for the model, to increase mixing speed.
• Parallel Tempering
Run k (usually k = 1) Gibbs sampling steps in each tempered Markov chain, yielding samples (v_1, h_1), ..., (v_M, h_M); then choose two chains with consecutive temperatures and exchange their particles (v_r, h_r) and (v_{r−1}, h_{r−1}) with a certain probability.
Results
(figure: samples from the trained model; left: hidden sampling, right: visible sampling)