RBM from Scratch
Hadi Sinaee
Sharif University of Technology
Department of Computer Engineering
May 17, 2015
Outline
1 Unsupervised Learning
2 Likelihood
3 Optimization
4 Having Latent Variables
5 Markov Chain and Gibbs Sampling
6 Restricted Boltzmann Machines
Unsupervised Learning: Markov Random Fields
Unsupervised Learning
• Unsupervised learning means learning an unknown distribution q based on sample data.
• This includes finding new representations of the data that foster learning and generalization.
• If the structure of the graphical model and the family of energy functions, parameterized by θ, are known, then unsupervised learning of a data distribution with an MRF means adjusting the parameters θ.
• p(x | θ) denotes this dependence.
Likelihood: Likelihood of an MRF
Likelihood
• Training data S = {x_1, x_2, ..., x_N}, sampled i.i.d. from the true distribution q. The standard way of finding the parameters is maximum likelihood (ML).
• Applying this to an MRF → finding the MRF parameters θ that maximize the probability of S under the MRF distribution p:

L : \Theta \to \mathbb{R}, \qquad l(\theta \mid S) = \sum_{i=1}^{N} \ln p(x_i \mid \theta)

For the Gibbs distribution of an MRF the maximum cannot be found analytically, so a numerical approximation is used.
Likelihood: KL divergence
Likelihood
KL divergence between the true distribution and the MRF distribution:

KL(q \,\|\, p) = \sum_{x} q(x)\ln q(x) \;-\; \sum_{x} q(x)\ln p(x)

• The KL divergence decomposes into the negative entropy of q and an expectation over q. Only the latter term depends on the parameters subject to optimization.
• Maximizing the likelihood → minimizing the KL divergence.
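To make the last bullet concrete (this step is not spelled out on the slide), with q fixed the parameter-dependent part of the KL divergence is exactly the negative expected log-likelihood, so the two optimization problems share the same optimum:

\mathrm{KL}(q \,\|\, p) = \underbrace{\sum_x q(x)\ln q(x)}_{-H(q),\ \text{independent of }\theta} \;-\; \sum_x q(x)\ln p(x \mid \theta)
\;\Rightarrow\; \arg\min_\theta \mathrm{KL}(q \,\|\, p) = \arg\max_\theta \mathbb{E}_{q}\big[\ln p(x \mid \theta)\big] \approx \arg\max_\theta \tfrac{1}{N}\, l(\theta \mid S),

where the last step replaces the expectation over q by the empirical average over the training set S.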
Optimization: Gradient Ascent
Optimization
Iteratively update the parameters from θ^(t) to θ^(t+1) based on the log-likelihood:

\theta^{(t+1)} = \theta^{(t)} + \eta \, \frac{\partial}{\partial \theta^{(t)}} \left( \sum_{i=1}^{N} l(\theta \mid x_i) \right)
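A minimal sketch of this update rule in Python/NumPy (not from the slides; grad_log_likelihood is a placeholder for whatever gradient estimator the later slides provide):

    import numpy as np

    def gradient_ascent(theta, data, grad_log_likelihood, eta=0.01, n_steps=100):
        """Generic gradient ascent on the log-likelihood.

        theta: 1-D parameter vector, data: array of training samples,
        grad_log_likelihood: callable returning d l(theta | data) / d theta.
        """
        theta = theta.copy()
        for _ in range(n_steps):
            theta += eta * grad_log_likelihood(theta, data)  # ascend, not descend
        return theta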
Having Latent Variables: Latent Variables
Having Latent Variables
• We want to model an m-dimensional probability distribution q (e.g. an image with m pixels).
• X = (V, H) is the set of all variables:
V = (V_1, V_2, ..., V_m) → visible units
H = (H_1, H_2, ..., H_n) → hidden units, n = |X| − m
• e.g. V = the set of all pixels, H = the set of relationships between the units of V.
• The Gibbs distribution of the visible units:

p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}, \qquad Z = \sum_{v,h} e^{-E(v,h)}
Having Latent Variables: Log-Likelihood of an MRF
• Log-likelihood for one sample:

l(\theta \mid v) = \ln \sum_{h} e^{-E(v,h)} \;-\; \ln \underbrace{\sum_{v,h} e^{-E(v,h)}}_{Z}

• Its gradient is:

\nabla_\theta\, l = -\sum_{h} p(h \mid v)\,\frac{\partial E(v,h)}{\partial \theta} \;+\; \sum_{v,h} p(v,h)\,\frac{\partial E(v,h)}{\partial \theta}

• It is the difference of two expectations:
→ one over the conditional distribution of the hidden units,
→ one over the model distribution.
• Computing this exactly requires summing over all possible values of (v, h)! Instead, we approximate this expectation.
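The step from l(θ | v) to its gradient is omitted on the slide; it is just the derivative of a log-sum, shown here for the first term (the second term follows identically with the sum over (v, h)):

\frac{\partial}{\partial\theta}\,\ln\sum_h e^{-E(v,h)}
= \frac{\sum_h e^{-E(v,h)}\left(-\frac{\partial E(v,h)}{\partial\theta}\right)}{\sum_{h'} e^{-E(v,h')}}
= -\sum_h p(h \mid v)\,\frac{\partial E(v,h)}{\partial\theta}.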
Markov Chain and Gibbs Sampling: Markov Chain
Markov Chain
• Stationary distribution: a distribution π for which π^T = π^T P holds, where P is the transition matrix with elements p_ij.
• Detailed balance condition: a sufficient condition for π to be the stationary distribution w.r.t. the transition probabilities p_ij, i, j ∈ Ω:

\pi(i)\, p_{ij} = \pi(j)\, p_{ji}, \qquad \forall i, j \in \Omega
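As a small numerical illustration (my example, not from the slides), both properties can be checked directly for a hand-picked two-state chain:

    import numpy as np

    # Transition matrix of a 2-state chain: P[i, j] = P(next = j | current = i).
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

    # Its stationary distribution solves pi^T = pi^T P; here pi = (2/3, 1/3).
    pi = np.array([2/3, 1/3])

    print(np.allclose(pi @ P, pi))                       # stationarity: True
    print(np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0]))  # detailed balance: True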
Markov Chain and Gibbs Sampling: Gibbs Sampling
Gibbs Sampling
• An MRF X = (X_1, X_2, ..., X_N) on a graph G = (V, E) where V = {1, 2, ..., N}.
• Each X_i, i ∈ V takes values in a finite set Λ.
• The chain's state over time: X = {X^{(k)} | k ∈ ℕ}, with X^{(k)} = (X_1^{(k)}, X_2^{(k)}, ..., X_N^{(k)}).
• \pi(x) = \frac{1}{Z} e^{-\varepsilon(x)} is the joint probability distribution of X.
Markov Chain and Gibbs Sampling: Gibbs Algorithm
Gibbs Sampling
Step 1: At each iteration, pick a random variable X_i, i ∈ V with probability q(i); q is a strictly positive distribution over V.
Step 2: Sample a new value for X_i from its conditional distribution given the state of all other variables, i.e.

\pi\!\left(X_i \mid (x_v)_{v \in V\setminus\{i\}}\right) = \pi\!\left(X_i \mid (x_w)_{w \in N_i}\right)

Step 3: Keep doing this!
• Therefore the transition probabilities between two MRF states x and y are defined as

p_{xy} = q(i)\,\pi\!\left(y_i \mid (x_v)_{v \in V\setminus\{i\}}\right), \quad \text{if } \exists\, i \in V:\ \forall v \in V,\ v \neq i:\ x_v = y_v \quad (\text{x and y differ at most in the } i\text{-th element}),

p_{xy} = 0, \quad \text{otherwise},

p_{xx} = \sum_{i \in V} q(i)\,\pi\!\left(x_i \mid (x_v)_{v \in V\setminus\{i\}}\right).
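A minimal sketch of one such update in Python/NumPy for binary variables (Λ = {0, 1}); energy is a placeholder for an arbitrary energy function ε, and q is taken uniform (my simplification, not from the slides):

    import numpy as np

    def gibbs_step(x, energy, rng):
        """One random-scan Gibbs update for a binary configuration x."""
        i = rng.integers(len(x))          # Step 1: pick a site (q uniform here)
        x0, x1 = x.copy(), x.copy()
        x0[i], x1[i] = 0, 1
        # pi(X_i = 1 | rest) = e^{-E(x1)} / (e^{-E(x0)} + e^{-E(x1)})
        p1 = 1.0 / (1.0 + np.exp(energy(x1) - energy(x0)))
        x[i] = rng.random() < p1          # Step 2: resample X_i from its conditional
        return x

    # Step 3: keep doing this, e.g.
    # rng = np.random.default_rng(0)
    # x = rng.integers(0, 2, size=10)
    # for _ in range(1000): x = gibbs_step(x, my_energy, rng)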
Markov Chain and Gibbs Sampling: Convergence of Gibbs
Convergence of Gibbs
• π is strictly positive → the conditional distributions of the single variables are strictly positive.
Every single variable X_i can then take every state x_i ∈ Λ in a single transition step → every state can be reached from any other in a finite number of steps → the Markov chain is irreducible.
• p_xx > 0 implies the chain is aperiodic, and the detailed balance condition (which the transition probabilities p_xy satisfy w.r.t. π) makes π its stationary distribution.
Convergence
Aperiodicity and irreducibility guarantee that the chain converges to the stationary distribution π.
Restricted Boltzmann Machines: RBM
RBM
• p(v, h) = \frac{1}{Z} e^{-E(v,h)} with

E(v,h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij}\, h_i v_j \;-\; \sum_{j=1}^{m} b_j v_j \;-\; \sum_{i=1}^{n} c_i h_i.

• c_i and b_j are real-valued bias terms for the hidden and visible units, respectively.
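As a concrete reading of the indices (i runs over the n hidden units, j over the m visible units), a small sketch with hypothetical names W (an n×m weight matrix), b (length-m visible biases), c (length-n hidden biases); these names are mine, not from the slides:

    import numpy as np

    def rbm_energy(v, h, W, b, c):
        """E(v, h) = -h^T W v - b^T v - c^T h for binary vectors v (length m), h (length n)."""
        return -(h @ W @ v) - (b @ v) - (c @ h)

    # Unnormalized probability p(v, h) is proportional to exp(-E(v, h));
    # the partition function Z is intractable in general.
    def unnormalized_p(v, h, W, b, c):
        return np.exp(-rbm_energy(v, h, W, b, c))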
Restricted Boltzmann Machines: RBM
• The hidden variables are independent of each other given the visible units (and vice versa):

p(h \mid v) = \prod_{i=1}^{n} p(h_i \mid v) \quad \text{and} \quad p(v \mid h) = \prod_{j=1}^{m} p(v_j \mid h)

• Marginal distribution of the observations (visible units):

p(v) = \frac{1}{Z} \prod_{j=1}^{m} e^{b_j v_j} \prod_{i=1}^{n} \left(1 + e^{\,c_i + \sum_{j=1}^{m} w_{ij} v_j}\right)

• Conditional probability distributions of the individual units:

p(H_i = 1 \mid v) = \sigma\!\left(\sum_{j=1}^{m} w_{ij} v_j + c_i\right)
p(V_j = 1 \mid h) = \sigma\!\left(\sum_{i=1}^{n} w_{ij} h_i + b_j\right)
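These two conditionals are the only quantities an RBM sampler needs. A sketch using the same hypothetical W, b, c as above, with σ the logistic sigmoid:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def p_h_given_v(v, W, c):
        """Vector of p(H_i = 1 | v) = sigmoid(W v + c), one entry per hidden unit."""
        return sigmoid(W @ v + c)

    def p_v_given_h(h, W, b):
        """Vector of p(V_j = 1 | h) = sigmoid(W^T h + b), one entry per visible unit."""
        return sigmoid(W.T @ h + b)

    # Sampling a hidden vector given v (used in the Gibbs / CD steps below):
    # rng = np.random.default_rng(0)
    # h = (rng.random(W.shape[0]) < p_h_given_v(v, W, c)).astype(float)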
Restricted Boltzmann Machines: Gradient of the Log-Likelihood
Gradient of the log-likelihood
• Recap: the gradient of the log-likelihood of an MRF for a single data point:

\nabla_\theta\, l(\theta \mid v) = -\sum_{h} p(h \mid v)\,\frac{\partial E(v,h)}{\partial \theta} \;+\; \sum_{v,h} p(v,h)\,\frac{\partial E(v,h)}{\partial \theta}

The first term is tractable, e.g. for w_{ij}:

-\sum_{h} p(h \mid v)\,\frac{\partial E(v,h)}{\partial w_{ij}}
= \sum_{h_i}\sum_{h_{-i}} \underbrace{\prod_{k=1}^{n} p(h_k \mid v)}_{p(h \mid v)}\, h_i v_j
= \underbrace{p(H_i = 1 \mid v)}_{\sigma\left(\sum_{j=1}^{m} w_{ij} v_j + c_i\right)}\, v_j

• We can do the same for the second term, writing it as

\sum_{v} p(v)\sum_{h} p(h \mid v)\,\frac{\partial E(v,h)}{\partial \theta} \quad \text{or} \quad \sum_{h} p(h)\sum_{v} p(v \mid h)\,\frac{\partial E(v,h)}{\partial \theta}

• It remains intractable: the sum is exponential in the size of the smaller layer, i.e. 2^m or 2^n terms.
Restricted Boltzmann Machines: Gradient of the Log-Likelihood
Computing the derivatives of the log-likelihood
• w.r.t. w_{ij}:

\frac{\partial l(\theta \mid v)}{\partial w_{ij}} = p(H_i = 1 \mid v)\, v_j - \sum_{v} p(v)\, p(H_i = 1 \mid v)\, v_j

• w.r.t. b_j:

\frac{\partial l(\theta \mid v)}{\partial b_j} = v_j - \sum_{v} p(v)\, v_j

• w.r.t. c_i:

\frac{\partial l(\theta \mid v)}{\partial c_i} = p(H_i = 1 \mid v) - \sum_{v} p(v)\, p(H_i = 1 \mid v)

• To avoid the summation over all possible values of v, we approximate the expectation.
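Once a model-side sample is available (however it is obtained), the three estimates above reduce to outer products and differences. A sketch (hypothetical names W, b, c as in the earlier snippets) that replaces the model expectation Σ_v p(v)(...) by a single sample v_model:

    import numpy as np

    def rbm_gradients(v_data, v_model, W, b, c):
        """Gradient estimates of l(theta | v_data), with the intractable model
        expectation replaced by the single model-side sample v_model."""
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
        ph_data = sigmoid(W @ v_data + c)    # p(H_i = 1 | v_data)
        ph_model = sigmoid(W @ v_model + c)  # p(H_i = 1 | v_model)
        dW = np.outer(ph_data, v_data) - np.outer(ph_model, v_model)  # d l / d w_ij
        db = v_data - v_model                                         # d l / d b_j
        dc = ph_data - ph_model                                       # d l / d c_i
        return dW, db, dc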
Restricted Boltzmann Machines: Approximating the RBM Log-Likelihood
Contrastive Divergence
• Use a Gibbs chain run for only k steps (usually k = 1).
• Starting from a training sample v^(0), it yields the sample v^(k) after k steps.
• Each step t consists of sampling h^(t) from p(h | v^(t)) and then sampling v^(t+1) from p(v | h^(t)).
• Using these samples, the gradient approximation is given by

CD_k(\theta, v^{(0)}) = -\sum_{h} p(h \mid v^{(0)})\,\frac{\partial E(v^{(0)}, h)}{\partial \theta} \;+\; \sum_{h} p(h \mid v^{(k)})\,\frac{\partial E(v^{(k)}, h)}{\partial \theta}
Restricted Boltzmann Machines: Contrastive Divergence
k-CD for Batch
[The slide shows the k-step CD pseudo-code applied to a batch of training samples; see the sketch below.]
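Since the algorithm box on this slide did not survive extraction, here is a minimal, self-contained Python/NumPy sketch of batch CD-k under the same assumptions as the earlier snippets (binary units, hypothetical parameter names W, b, c); it follows the usual k-step CD recipe rather than the exact pseudo-code shown in the talk:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd_k_update(batch, W, b, c, k=1, eta=0.1, rng=None):
        """One batch update of a binary RBM with k-step contrastive divergence.

        batch: array of shape (batch_size, m) with binary visible vectors.
        W: (n, m) weights, b: (m,) visible biases, c: (n,) hidden biases.
        Returns updated parameters (gradient ascent on the CD_k approximation).
        """
        rng = rng or np.random.default_rng()
        dW = np.zeros_like(W); db = np.zeros_like(b); dc = np.zeros_like(c)

        for v0 in batch:
            # Positive phase: p(h | v^0) for the data sample.
            ph0 = sigmoid(W @ v0 + c)
            # k steps of blocked Gibbs sampling: h ~ p(h|v), then v ~ p(v|h).
            v = v0.copy()
            for _ in range(k):
                h = (rng.random(len(c)) < sigmoid(W @ v + c)).astype(float)
                v = (rng.random(len(b)) < sigmoid(W.T @ h + b)).astype(float)
            # Negative phase: p(h | v^k) for the model-side sample.
            phk = sigmoid(W @ v + c)
            # Accumulate the CD_k gradient estimates (cf. the derivatives above).
            dW += np.outer(ph0, v0) - np.outer(phk, v)
            db += v0 - v
            dc += ph0 - phk

        scale = eta / len(batch)
        return W + scale * dW, b + scale * db, c + scale * dc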
Restricted Boltzmann Machines: Other Derivatives
• Persistent CD (PCD) relies on keeping the chain from the previous parameter update: the v^(k) of the previous step initializes the Gibbs chain of the next step.
• Fast PCD introduces a set of parameters used only for sampling, not for the model, in order to increase mixing speed.
• Parallel Tempering: we run k (usually k = 1) Gibbs sampling steps in each tempered Markov chain, yielding samples (v_1, h_1), ..., (v_M, h_M); then we choose two chains with consecutive temperatures and exchange their particles (v_r, h_r) and (v_{r-1}, h_{r-1}) with a certain probability.
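For comparison with the CD-k sketch above, the only change PCD makes is that the negative-phase Gibbs chain is never restarted from the data. A rough single-chain sketch (my simplification; implementations often keep one persistent chain per batch element), again with the hypothetical W, b, c:

    import numpy as np

    def pcd_update(batch, v_chain, W, b, c, k=1, eta=0.1, rng=None):
        """One PCD update: like cd_k_update, but the Gibbs chain starts from the
        persistent state v_chain carried over from the previous update, and its
        final state is returned for reuse in the next update."""
        rng = rng or np.random.default_rng()
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
        # Advance the persistent chain k steps.
        v = v_chain.copy()
        for _ in range(k):
            h = (rng.random(len(c)) < sigmoid(W @ v + c)).astype(float)
            v = (rng.random(len(b)) < sigmoid(W.T @ h + b)).astype(float)
        phk = sigmoid(W @ v + c)
        # Positive phase averaged over the batch, negative phase from the chain.
        ph0 = sigmoid(batch @ W.T + c)                       # shape (batch_size, n)
        dW = ph0.T @ batch / len(batch) - np.outer(phk, v)
        db = batch.mean(axis=0) - v
        dc = ph0.mean(axis=0) - phk
        return W + eta * dW, b + eta * db, c + eta * dc, v   # v is the new persistent state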
Restricted Boltzmann Machines: Other Derivatives
Results
[Figure: samples from the trained RBM; left: hidden sampling, right: visible sampling.]
More Related Content

What's hot

Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...QUT_SEF
 
A Unifying Review of Gaussian Linear Models (Roweis 1999)
A Unifying Review of Gaussian Linear Models (Roweis 1999)A Unifying Review of Gaussian Linear Models (Roweis 1999)
A Unifying Review of Gaussian Linear Models (Roweis 1999)Feynman Liang
 
Elliptic Curve Cryptography: Arithmetic behind
Elliptic Curve Cryptography: Arithmetic behindElliptic Curve Cryptography: Arithmetic behind
Elliptic Curve Cryptography: Arithmetic behindAyan Sengupta
 
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information SystemsRuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information SystemsRuleML
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraData Science Milan
 
powerpoint
powerpointpowerpoint
powerpointbutest
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...Nesma
 
Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Tomasz Kusmierczyk
 
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market DataBoosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market DataJay (Jianqiang) Wang
 
A Non--convex optimization approach to Correlation Clustering
A Non--convex optimization approach to Correlation ClusteringA Non--convex optimization approach to Correlation Clustering
A Non--convex optimization approach to Correlation ClusteringMortezaHChehreghani
 
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...Mokhtar SELLAMI
 
High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...Vissarion Fisikopoulos
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Craig Chao
 
Stochastic optimization from mirror descent to recent algorithms
Stochastic optimization from mirror descent to recent algorithmsStochastic optimization from mirror descent to recent algorithms
Stochastic optimization from mirror descent to recent algorithmsSeonho Park
 
Cryptography Baby Step Giant Step
Cryptography Baby Step Giant StepCryptography Baby Step Giant Step
Cryptography Baby Step Giant StepSAUVIK BISWAS
 

What's hot (19)

Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
 
Lect6 csp
Lect6 cspLect6 csp
Lect6 csp
 
A Unifying Review of Gaussian Linear Models (Roweis 1999)
A Unifying Review of Gaussian Linear Models (Roweis 1999)A Unifying Review of Gaussian Linear Models (Roweis 1999)
A Unifying Review of Gaussian Linear Models (Roweis 1999)
 
Elliptic Curve Cryptography: Arithmetic behind
Elliptic Curve Cryptography: Arithmetic behindElliptic Curve Cryptography: Arithmetic behind
Elliptic Curve Cryptography: Arithmetic behind
 
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information SystemsRuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del Pra
 
powerpoint
powerpointpowerpoint
powerpoint
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
 
Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Introduction to modern Variational Inference.
Introduction to modern Variational Inference.
 
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market DataBoosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data
 
A Non--convex optimization approach to Correlation Clustering
A Non--convex optimization approach to Correlation ClusteringA Non--convex optimization approach to Correlation Clustering
A Non--convex optimization approach to Correlation Clustering
 
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
 
High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
 
Stochastic optimization from mirror descent to recent algorithms
Stochastic optimization from mirror descent to recent algorithmsStochastic optimization from mirror descent to recent algorithms
Stochastic optimization from mirror descent to recent algorithms
 
Cryptography Baby Step Giant Step
Cryptography Baby Step Giant StepCryptography Baby Step Giant Step
Cryptography Baby Step Giant Step
 
20180722 pyro
20180722 pyro20180722 pyro
20180722 pyro
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 

Viewers also liked

The Art Of Backpropagation
The Art Of BackpropagationThe Art Of Backpropagation
The Art Of BackpropagationJennifer Prendki
 
Introduction to Neural networks (under graduate course) Lecture 8 of 9
Introduction to Neural networks (under graduate course) Lecture 8 of 9Introduction to Neural networks (under graduate course) Lecture 8 of 9
Introduction to Neural networks (under graduate course) Lecture 8 of 9Randa Elanwar
 
Intro to Excel Basics: Part II
Intro to Excel Basics: Part IIIntro to Excel Basics: Part II
Intro to Excel Basics: Part IISi Krishan
 
From neural networks to deep learning
From neural networks to deep learningFrom neural networks to deep learning
From neural networks to deep learningViet-Trung TRAN
 
Learning RBM(Restricted Boltzmann Machine in Practice)
Learning RBM(Restricted Boltzmann Machine in Practice)Learning RBM(Restricted Boltzmann Machine in Practice)
Learning RBM(Restricted Boltzmann Machine in Practice)Mad Scientists
 
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...Indraneel Pole
 
Hebbian Learning
Hebbian LearningHebbian Learning
Hebbian LearningESCOM
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 

Viewers also liked (10)

The Art Of Backpropagation
The Art Of BackpropagationThe Art Of Backpropagation
The Art Of Backpropagation
 
Introduction to Neural networks (under graduate course) Lecture 8 of 9
Introduction to Neural networks (under graduate course) Lecture 8 of 9Introduction to Neural networks (under graduate course) Lecture 8 of 9
Introduction to Neural networks (under graduate course) Lecture 8 of 9
 
Intro to Excel Basics: Part II
Intro to Excel Basics: Part IIIntro to Excel Basics: Part II
Intro to Excel Basics: Part II
 
restrictedboltzmannmachines
restrictedboltzmannmachinesrestrictedboltzmannmachines
restrictedboltzmannmachines
 
DNN and RBM
DNN and RBMDNN and RBM
DNN and RBM
 
From neural networks to deep learning
From neural networks to deep learningFrom neural networks to deep learning
From neural networks to deep learning
 
Learning RBM(Restricted Boltzmann Machine in Practice)
Learning RBM(Restricted Boltzmann Machine in Practice)Learning RBM(Restricted Boltzmann Machine in Practice)
Learning RBM(Restricted Boltzmann Machine in Practice)
 
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
 
Hebbian Learning
Hebbian LearningHebbian Learning
Hebbian Learning
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 

Similar to RBM from Scratch

Hessian Matrices in Statistics
Hessian Matrices in StatisticsHessian Matrices in Statistics
Hessian Matrices in StatisticsFerris Jumah
 
More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?Dhafer Malouche
 
Joint contrastive learning with infinite possibilities
Joint contrastive learning with infinite possibilitiesJoint contrastive learning with infinite possibilities
Joint contrastive learning with infinite possibilitiestaeseon ryu
 
A baseline for_few_shot_image_classification
A baseline for_few_shot_image_classificationA baseline for_few_shot_image_classification
A baseline for_few_shot_image_classificationDongHeeKim39
 
Model Selection and Validation
Model Selection and ValidationModel Selection and Validation
Model Selection and Validationgmorishita
 
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論Naoki Hayashi
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix DatasetBen Mabey
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacesbutest
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacesbutest
 
Firefly exact MCMC for Big Data
Firefly exact MCMC for Big DataFirefly exact MCMC for Big Data
Firefly exact MCMC for Big DataGianvito Siciliano
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesAnne-Marie Tousch
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013Sanjeev Mishra
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster ValidityAndres Mendez-Vazquez
 
Nimrita koul Machine Learning
Nimrita koul  Machine LearningNimrita koul  Machine Learning
Nimrita koul Machine LearningNimrita Koul
 
Ikdd co ds2017presentation_v2
Ikdd co ds2017presentation_v2Ikdd co ds2017presentation_v2
Ikdd co ds2017presentation_v2Ram Mohan
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative ModelsKenta Oono
 
Talwalkar mlconf (1)
Talwalkar mlconf (1)Talwalkar mlconf (1)
Talwalkar mlconf (1)MLconf
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...Jeongmin Cha
 

Similar to RBM from Scratch (20)

Hessian Matrices in Statistics
Hessian Matrices in StatisticsHessian Matrices in Statistics
Hessian Matrices in Statistics
 
More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?
 
Joint contrastive learning with infinite possibilities
Joint contrastive learning with infinite possibilitiesJoint contrastive learning with infinite possibilities
Joint contrastive learning with infinite possibilities
 
A baseline for_few_shot_image_classification
A baseline for_few_shot_image_classificationA baseline for_few_shot_image_classification
A baseline for_few_shot_image_classification
 
Model Selection and Validation
Model Selection and ValidationModel Selection and Validation
Model Selection and Validation
 
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
 
Machine learning
Machine learningMachine learning
Machine learning
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix Dataset
 
An introduction to R
An introduction to RAn introduction to R
An introduction to R
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
Firefly exact MCMC for Big Data
Firefly exact MCMC for Big DataFirefly exact MCMC for Big Data
Firefly exact MCMC for Big Data
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the Trenches
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
 
Nimrita koul Machine Learning
Nimrita koul  Machine LearningNimrita koul  Machine Learning
Nimrita koul Machine Learning
 
Ikdd co ds2017presentation_v2
Ikdd co ds2017presentation_v2Ikdd co ds2017presentation_v2
Ikdd co ds2017presentation_v2
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative Models
 
Talwalkar mlconf (1)
Talwalkar mlconf (1)Talwalkar mlconf (1)
Talwalkar mlconf (1)
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...
 

Recently uploaded

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 

Recently uploaded (20)

Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 

RBM from Scratch

  • 1. RBM from Scratch Hadi Sinaee Sharif University of Technology Department of Computer Engineering May 17, 2015 Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 1 / 20
  • 2. Outline 1 Unsupervised Learning 2 Liklihood 3 Optimization 4 Having Latent Variables 5 Markov Chain and Gibbs Sampling 6 Restricted Boltzmann Machines Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 2 / 20
  • 3-9. Unsupervised Learning (Markov Random Fields). Unsupervised learning means learning an unknown distribution q based on sample data. This includes finding new representations of the data that foster learning and generalization. If the structure of the graphical model and the family of energy functions parameterized by θ are known, unsupervised learning of a data distribution with an MRF means adjusting the parameters θ; p(x|θ) denotes this dependence.
  • 10-15. Likelihood of an MRF. Training data S = {x_1, x_2, ..., x_N}, i.i.d. samples from the true distribution q. The standard way of finding the parameters is maximum likelihood (ML). Applying this to an MRF means finding the MRF parameters θ that maximize the probability of S under the MRF distribution p. The log-likelihood is l : Θ → R, l(θ|S) = Σ_{i=1}^{N} ln p(x_i|θ). For the Gibbs distribution of an MRF the maximum cannot be found analytically, so a numerical approximation is used.
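A minimal sketch of evaluating this log-likelihood for a toy model: a single Bernoulli parameter θ with p(x|θ) = θ^x (1 − θ)^(1 − x). The model and data below are hypothetical and only illustrate the sum of log-probabilities, not the MRF case.

```python
import numpy as np

def log_likelihood(theta, samples):
    """l(theta | S) = sum_i ln p(x_i | theta) for a toy Bernoulli model."""
    samples = np.asarray(samples)
    return np.sum(samples * np.log(theta) + (1 - samples) * np.log(1 - theta))

S = [1, 0, 1, 1, 0, 1]            # hypothetical i.i.d. binary training data
print(log_likelihood(0.5, S))     # l(0.5 | S)
print(log_likelihood(4 / 6, S))   # the sample mean (the ML estimate) scores higher
```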
  • 16-17. Likelihood and KL divergence. The KL divergence between the true distribution and the MRF distribution is KL(q‖p) = Σ_x [q(x) ln q(x) − q(x) ln p(x)]. It comprises the entropy of q and an expectation over q; only the latter depends on the parameters subject to optimization. Hence maximizing the likelihood is equivalent to minimizing the KL divergence.
  • 18. Optimization by gradient ascent. Iteratively update the parameters from θ^(t) to θ^(t+1) based on the log-likelihood: θ^(t+1) = θ^(t) + η ∂/∂θ^(t) Σ_{i=1}^{N} l(θ|x_i).
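A minimal sketch of this update rule, continuing the hypothetical Bernoulli example above (for that model the gradient of Σ_i ln p(x_i|θ) is Σ_i [x_i/θ − (1 − x_i)/(1 − θ)]); the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

S = np.array([1, 0, 1, 1, 0, 1])    # hypothetical training data
theta, eta = 0.5, 0.05               # initial parameter and learning rate

for t in range(200):
    grad = np.sum(S / theta - (1 - S) / (1 - theta))   # d l(theta|S) / d theta
    theta = theta + eta * grad                          # gradient-ascent step
    theta = np.clip(theta, 1e-6, 1 - 1e-6)              # keep theta a valid probability

print(theta)   # converges toward the sample mean, here 4/6
```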
  • 19-22. Having latent variables. We want to model an m-dimensional probability distribution q (e.g., an image with m pixels). X = (V, H) is the set of all variables, with V = (V_1, V_2, ..., V_m) the visible units and H = (H_1, H_2, ..., H_n) the hidden units, n = |X| − m. For example, V is the set of all pixels and H captures relationships between the visible units. The Gibbs distribution marginalized onto the visible units is p(v) = (1/Z) Σ_h e^(−E(v,h)), with Z = Σ_{v,h} e^(−E(v,h)).
  • 23-28. Log-likelihood of an MRF with latent variables. For one sample, l(θ|v) = ln Σ_h e^(−E(v,h)) − ln Σ_{v,h} e^(−E(v,h)). Its gradient is ∇_θ l(θ|v) = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ. This is the difference of two expectations of the energy gradient: one under the conditional distribution of the hidden units given the data, and one under the model distribution. Evaluating them exactly requires summing over all possible values of (v, h), so instead the expectations are approximated.
  • 29. Markov chains. Stationary distribution: a distribution π for which π^T = π^T P holds, where P is the transition matrix with elements p_ij. Detailed balance condition: a sufficient condition for π to be the stationary distribution w.r.t. the transition probabilities p_ij, i, j ∈ Ω, namely π(i) p_ij = π(j) p_ji for all i, j ∈ Ω.
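A small numeric check of these two definitions, using a hypothetical two-state chain: the transition matrix below satisfies detailed balance with π = (0.75, 0.25), and π^T P = π^T confirms that π is stationary.

```python
import numpy as np

# Hypothetical 2-state transition matrix P (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])

# Detailed balance: pi(i) * p_ij == pi(j) * p_ji for all i, j
flows = pi[:, None] * P
print(np.allclose(flows, flows.T))   # True

# Stationarity: pi^T P == pi^T
print(np.allclose(pi @ P, pi))       # True
```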
  • 30-33. Gibbs sampling: setup. Consider an MRF X = (X_1, X_2, ..., X_N) on a graph G = (V, E) with V = {1, 2, ..., N}. Each X_i, i ∈ V, takes values in a finite set Λ. The chain produces time-varying states X = {X^(k) | k ∈ N}, where X^(k) = (X^(k)_1, X^(k)_2, ..., X^(k)_N). The joint probability distribution of X is π(x) = (1/Z) e^(−ε(x)).
  • 34-44. Gibbs sampling: algorithm. Step 1: at each iteration pick a random variable X_i, i ∈ V, with probability q(i), where q is a strictly positive distribution over V. Step 2: sample a new value for X_i from its conditional distribution given the state of all other variables, i.e. π(X_i | (x_v)_{v∈V∖{i}}) = π(X_i | (x_w)_{w∈N_i}), which by the Markov property depends only on the neighbours N_i. Step 3: keep repeating. The transition probabilities of the resulting chain between two states x and y are: p_xy = q(i) π(y_i | (x_v)_{v∈V∖{i}}) if y differs from x only in the i-th component (i.e., there exists i ∈ V such that x_v = y_v for all v ≠ i); p_xy = 0 otherwise; and p_xx = Σ_{i∈V} q(i) π(x_i | (x_v)_{v∈V∖{i}}). A sketch of one such transition is given below.
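A minimal sketch of this single-site Gibbs step for a hypothetical binary pairwise MRF (an Ising-style energy ε(x) = −J Σ_{(i,j)∈E} x_i x_j with x_i ∈ {0, 1}); the graph, the coupling J, and the site-selection distribution q are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pairwise binary MRF: 4 sites on a ring, coupling J
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
neighbors = {i: [] for i in range(4)}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)
J = 0.8

def conditional_prob_one(x, i):
    """pi(X_i = 1 | neighbours) for the energy eps(x) = -J * sum_edges x_i x_j."""
    field = J * sum(x[w] for w in neighbors[i])
    return 1.0 / (1.0 + np.exp(-field))       # sigmoid of the local field

def gibbs_step(x, q):
    """One transition: pick site i ~ q, then resample X_i from its conditional."""
    i = rng.choice(len(x), p=q)
    x = x.copy()
    x[i] = rng.random() < conditional_prob_one(x, i)
    return x

x = np.zeros(4, dtype=int)                    # arbitrary initial state
q = np.full(4, 0.25)                          # strictly positive site distribution
for _ in range(1000):                         # run the chain
    x = gibbs_step(x, q)
print(x)
```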
  • 45-51. Convergence of Gibbs sampling. π is strictly positive, so the conditional distributions of the single variables are strictly positive too. Hence every variable X_i can take every state x_i ∈ Λ in a single transition step, every state can reach any other state in a finite number of steps, and the Markov chain is irreducible. Moreover p_xx > 0 and p_xy > 0 hold and the detailed balance condition is satisfied, so the Markov chain is aperiodic. Convergence: aperiodicity and irreducibility guarantee that the chain converges to the stationary distribution π.
  • 52. Restricted Boltzmann Machines. p(v, h) = (1/Z) e^(−E(v,h)) with energy E(v, h) = −Σ_{i=1}^{n} Σ_{j=1}^{m} w_ij h_i v_j − Σ_{j=1}^{m} b_j v_j − Σ_{i=1}^{n} c_i h_i. Here c_i and b_j are real-valued bias terms for the hidden and visible units, respectively.
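A minimal sketch of this energy function in vectorized form, where W is the n×m weight matrix and b, c the visible and hidden biases; the dimensions and parameter values below are illustrative assumptions.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -sum_ij w_ij h_i v_j - sum_j b_j v_j - sum_i c_i h_i."""
    return -(h @ W @ v) - b @ v - c @ h

# Hypothetical tiny RBM: m = 3 visible units, n = 2 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))
b = np.zeros(3)
c = np.zeros(2)

v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 0.0])
print(rbm_energy(v, h, W, b, c))
```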
  • 53-55. RBM structure. The hidden variables are independent of each other given the visible units, and vice versa: p(h|v) = Π_{i=1}^{n} p(h_i|v) and p(v|h) = Π_{j=1}^{m} p(v_j|h). The distribution of the observations (marginalizing over h) is p(v) = (1/Z) Π_{j=1}^{m} e^(b_j v_j) Π_{i=1}^{n} (1 + e^(c_i + Σ_{j=1}^{m} w_ij v_j)). The conditional distributions of the individual units are p(H_i = 1|v) = σ(Σ_{j=1}^{m} w_ij v_j + c_i) and p(V_j = 1|h) = σ(Σ_{i=1}^{n} w_ij h_i + b_j), where σ is the logistic sigmoid.
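A minimal sketch of these conditional distributions in vectorized form (W, b, c shaped as in the hypothetical energy sketch above); sample_bernoulli is an illustrative helper for drawing binary unit states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """p(H_i = 1 | v) = sigma(sum_j w_ij v_j + c_i), for all i at once."""
    return sigmoid(W @ v + c)

def p_v_given_h(h, W, b):
    """p(V_j = 1 | h) = sigma(sum_i w_ij h_i + b_j), for all j at once."""
    return sigmoid(W.T @ h + b)

def sample_bernoulli(p, rng):
    """Draw binary states from the given activation probabilities."""
    return (rng.random(p.shape) < p).astype(float)
```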
  • 56-63. Gradient of the log-likelihood for an RBM. Recap: the gradient of the log-likelihood of an MRF for a single data point is ∇_θ l(θ|v) = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ. For an RBM the first term is tractable; e.g. for w_ij, −Σ_h p(h|v) ∂E(v,h)/∂w_ij = Σ_h p(h|v) h_i v_j = Σ_{h_i} h_i v_j Σ_{h_{-i}} Π_{k=1}^{n} p(h_k|v) = p(H_i = 1|v) v_j = σ(Σ_{j=1}^{m} w_ij v_j + c_i) v_j. The second term can be treated the same way by writing it as Σ_v p(v) Σ_h p(h|v) ∂E(v,h)/∂θ or as Σ_h p(h) Σ_v p(v|h) ∂E(v,h)/∂θ, but it remains intractable: the outer sum runs over all configurations of one layer, i.e. 2^m or 2^n terms (whichever layer is smaller).
  • 64-67. Derivatives of the log-likelihood. With respect to w_ij: ∂l(θ|v)/∂w_ij = p(H_i = 1|v) v_j − Σ_v p(v) p(H_i = 1|v) v_j. With respect to b_j: ∂l(θ|v)/∂b_j = v_j − Σ_v p(v) v_j. With respect to c_i: ∂l(θ|v)/∂c_i = p(H_i = 1|v) − Σ_v p(v) p(H_i = 1|v). To avoid the summation over all possible values of v, the expectation under the model distribution is approximated (see the sketch below).
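A minimal sketch that computes the w_ij derivative exactly for the tiny hypothetical RBM above by brute-force enumeration of all 2^m visible configurations (feasible only because m is small here); it uses the unnormalized p(v) formula from slide 54 and the conditional p(H_i = 1|v) from slide 55.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unnormalized_p_v(v, W, b, c):
    """prod_j e^(b_j v_j) * prod_i (1 + e^(c_i + sum_j w_ij v_j)), i.e. Z * p(v)."""
    return np.exp(b @ v) * np.prod(1.0 + np.exp(c + W @ v))

def exact_grad_W(v0, W, b, c):
    """dl(theta|v0)/dW = p(H=1|v0) v0^T - sum_v p(v) p(H=1|v) v^T."""
    m = len(b)
    states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=m)]
    weights = np.array([unnormalized_p_v(v, W, b, c) for v in states])
    p_v = weights / weights.sum()                        # exact model marginal p(v)

    positive = np.outer(sigmoid(W @ v0 + c), v0)         # data-dependent term
    negative = sum(p * np.outer(sigmoid(W @ v + c), v)   # model-expectation term
                   for p, v in zip(p_v, states))
    return positive - negative

rng = np.random.default_rng(0)
W, b, c = rng.normal(scale=0.1, size=(2, 3)), np.zeros(3), np.zeros(2)
print(exact_grad_W(np.array([1.0, 0.0, 1.0]), W, b, c))
```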
  • 68-72. Contrastive Divergence. Run a Gibbs chain for only k steps (usually k = 1). Starting from a training sample v^(0), this yields the sample v^(k) after k steps; each step t consists of sampling h^(t) from p(h|v^(t)) and then sampling v^(t+1) from p(v|h^(t)). Using these samples, the gradient is approximated by CD_k(θ, v^(0)) = −Σ_h p(h|v^(0)) ∂E(v^(0), h)/∂θ + Σ_h p(h|v^(k)) ∂E(v^(k), h)/∂θ.
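A minimal sketch of a CD-k parameter update for the hypothetical RBM used above (single training sample, k block-Gibbs steps); the learning rate and sampling helpers are illustrative assumptions rather than the slide's exact algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, b, c, k=1, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-k step: approximate the gradient with a k-step Gibbs chain started at v0."""
    ph0 = sigmoid(W @ v0 + c)                     # p(H = 1 | v0), positive phase
    vk = v0.copy()
    for _ in range(k):                            # block Gibbs: h ~ p(h|v), then v ~ p(v|h)
        h = (rng.random(len(c)) < sigmoid(W @ vk + c)).astype(float)
        vk = (rng.random(len(b)) < sigmoid(W.T @ h + b)).astype(float)
    phk = sigmoid(W @ vk + c)                     # p(H = 1 | vk), negative phase

    W += lr * (np.outer(ph0, v0) - np.outer(phk, vk))
    b += lr * (v0 - vk)
    c += lr * (ph0 - phk)
    return W, b, c
```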
  • 73. k-step CD for a batch of training data (algorithm pseudocode shown on the slide; not captured in this text export).
  • 74-78. Other CD variants. Persistent CD (PCD) reuses the previous chain between parameter updates: the v^(k) of the previous step initializes the chain of the next step. Fast PCD introduces an additional set of parameters used only for sampling, not for the model, to increase speed. Parallel Tempering runs k (usually k = 1) Gibbs sampling steps in each of M tempered Markov chains, yielding samples (v_1, h_1), ..., (v_M, h_M); then two chains at consecutive temperatures are chosen and their particles (v_r, h_r) and (v_{r-1}, h_{r-1}) are exchanged with a swap probability (formula not captured in this text export).
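A minimal sketch of the PCD idea on top of the CD-k sketch above: the only change is that the negative-phase chain state persists across updates instead of being restarted at the training sample each time. This is an illustrative reading of the slide, not its exact algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v0, v_persistent, W, b, c, k=1, lr=0.1, rng=np.random.default_rng(0)):
    """One PCD step: positive phase from the data v0, negative phase from the persistent chain."""
    ph0 = sigmoid(W @ v0 + c)
    vk = v_persistent.copy()                      # continue the previous chain, not v0
    for _ in range(k):
        h = (rng.random(len(c)) < sigmoid(W @ vk + c)).astype(float)
        vk = (rng.random(len(b)) < sigmoid(W.T @ h + b)).astype(float)
    phk = sigmoid(W @ vk + c)

    W += lr * (np.outer(ph0, v0) - np.outer(phk, vk))
    b += lr * (v0 - vk)
    c += lr * (ph0 - phk)
    return W, b, c, vk                            # vk is carried over to the next update
```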
  • 79. Results (figures on the slide; not captured in this text export). Left: hidden sampling; right: visible sampling.