Derivation of equations for over-replicated softmax model
Tomonari MASADA @ Nagasaki University
May 31, 2013
1 Joint probability distribution
• We define constants as follows:
– D : the number of documents
– N : the length of a document, i.e., the number of word tokens
– W : the dictionary size, i.e., the number of different words
– J : the number of hidden units in the first hidden layer
– M : the number of hidden units in the second hidden layer
• Let $V$ denote the set of visible binary units, with $v_{nw} = 1$ if the $w$th word appears as the $n$th token.
• Let $h^{(1)}$ denote the set of hidden binary units in the first hidden layer.
• Let $H^{(2)}$ denote the set of hidden binary units in the second hidden layer. This is an $M \times W$ matrix with $h^{(2)}_{mw} = 1$ if the $m$th hidden softmax unit takes on the $w$th value.
The energy of the joint configuration $\{V, h^{(1)}, H^{(2)}\}$ is defined as:

$$
E(V, h^{(1)}, H^{(2)}; \theta) = -\sum_{n=1}^{N}\sum_{j=1}^{J}\sum_{w=1}^{W} W^{(1)}_{njw} h^{(1)}_j v_{nw} - \sum_{m=1}^{M}\sum_{j=1}^{J}\sum_{w=1}^{W} W^{(2)}_{mjw} h^{(1)}_j h^{(2)}_{mw} - \sum_{n=1}^{N}\sum_{w=1}^{W} v_{nw} b^{(1)}_{nw} - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j - \sum_{m=1}^{M}\sum_{w=1}^{W} h^{(2)}_{mw} b^{(2)}_{mw} \tag{1}
$$

where $\theta = \{W^{(1)}, W^{(2)}, a, b^{(1)}, b^{(2)}\}$ are the model parameters.
We ignore the order of the word tokens by letting $W^{(1)}_{njw}$ take the same value for all $n$. In a similar manner, we let $W^{(2)}_{mjw}$ take the same value for all $m$. Further, we tie the first- and second-layer weights. Consequently, we have $W^{(1)}_{njw} = W^{(2)}_{mjw} = W_{jw}$ and $b^{(1)}_{nw} = b^{(2)}_{mw} = b_w$, and the energy simplifies to:
$$
E(V, h^{(1)}, H^{(2)}; \theta) = -\sum_{n=1}^{N}\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j v_{nw} - \sum_{m=1}^{M}\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j h^{(2)}_{mw} - \sum_{n=1}^{N}\sum_{w=1}^{W} v_{nw} b_w - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j - \sum_{m=1}^{M}\sum_{w=1}^{W} h^{(2)}_{mw} b_w
$$
$$
= -\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j (\hat{v}_w + \hat{h}^{(2)}_w) - \sum_{w=1}^{W} (\hat{v}_w + \hat{h}^{(2)}_w) b_w - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j, \tag{2}
$$

where $\hat{v}_w = \sum_n v_{nw}$ and $\hat{h}^{(2)}_w = \sum_m h^{(2)}_{mw}$.
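To fix notation before moving on, here is a small NumPy sketch of the simplified energy in Eq. (2). The function and argument names are my own, and `W` in the code is the $J \times W$ tied weight matrix (the second $W$ being the dictionary size):

```python
import numpy as np

def energy(W, a, b, h1, v_hat, h2_hat, M, N):
    """Simplified energy of Eq. (2).

    W      : (J, W) tied weights W_jw     a : (J,) first-layer biases
    b      : (W,) tied biases             h1 : (J,) binary states h^(1)_j
    v_hat  : (W,) word counts             h2_hat : (W,) softmax counts
    """
    s = v_hat + h2_hat
    return -(h1 @ W @ s                   # sum_{j,w} W_jw h1_j (v_hat_w + h2_hat_w)
             + s @ b                      # sum_w (v_hat_w + h2_hat_w) b_w
             + (M + N) * (h1 @ a))        # (M+N) sum_j h1_j a_j
```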
The joint probability distribution is defined as:

$$
p(V, h^{(1)}, H^{(2)}; \theta) = \frac{\exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{Z(\theta, N)}, \tag{3}
$$

where $Z(\theta, N) = \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)$.
2 Conditional distributions over hidden and visible units
The conditional distribution over a visible unit is

$$
p(v_n \mid V^{\setminus n}, h^{(1)}, H^{(2)}; \theta) = \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{\sum_{v_n \in \{e_1, \ldots, e_W\}} p(V, h^{(1)}, H^{(2)}; \theta)} = \frac{\exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{v_n} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}
$$
$$
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{v}_w b_w) \exp(\hat{h}^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}}{\sum_{v_n} \prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{v}_w b_w) \exp(\hat{h}^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}}
$$
$$
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{nw}) \cdot \prod_w \exp(v_{nw} b_w)}{\sum_{v_n} \prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{nw}) \cdot \prod_w \exp(v_{nw} b_w)} \tag{4}
$$

This result shows $p(v_n \mid V^{\setminus n}, h^{(1)}, H^{(2)}; \theta) = p(v_n \mid h^{(1)}; \theta)$. For $v_{nw} = 1$, we obtain

$$
p(v_{nw} = 1 \mid h^{(1)}; \theta) = \frac{\prod_j \exp(W_{jw} h^{(1)}_j) \cdot \exp(b_w)}{\sum_{w'} \prod_j \exp(W_{jw'} h^{(1)}_j) \cdot \exp(b_{w'})} \tag{5}
$$
The conditional distribution over a hidden unit of the first hidden layer is

$$
p(h^{(1)}_j \mid V, h^{(1)\setminus j}, H^{(2)}; \theta) = \frac{\prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \exp(h^{(1)}_j a_j)^{M+N}}{\sum_{h^{(1)}_j \in \{0,1\}} \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \exp(h^{(1)}_j a_j)^{M+N}}, \tag{6}
$$

where the factors not involving $h^{(1)}_j$ have been canceled. This result shows $p(h^{(1)}_j \mid V, h^{(1)\setminus j}, H^{(2)}; \theta) = p(h^{(1)}_j \mid V, H^{(2)}; \theta)$. For $h^{(1)}_j = 1$, we obtain

$$
p(h^{(1)}_j = 1 \mid V, H^{(2)}; \theta) = \frac{\prod_w \exp(W_{jw} \hat{v}_w) \exp(W_{jw} \hat{h}^{(2)}_w) \cdot \exp(a_j)^{M+N}}{1 + \prod_w \exp(W_{jw} \hat{v}_w) \exp(W_{jw} \hat{h}^{(2)}_w) \cdot \exp(a_j)^{M+N}} = \sigma\Bigl(\sum_w W_{jw}(\hat{v}_w + \hat{h}^{(2)}_w) + (M+N) a_j\Bigr). \tag{7}
$$
The conditional distribution over a hidden unit of the second hidden layer is

$$
p(h^{(2)}_m \mid V, h^{(1)}, H^{(2)\setminus m}; \theta) = \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{h}^{(2)}_w b_w)}{\sum_{h^{(2)}_m} \prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{h}^{(2)}_w b_w)} \tag{8}
$$

This result shows $p(h^{(2)}_m \mid V, h^{(1)}, H^{(2)\setminus m}; \theta) = p(h^{(2)}_m \mid h^{(1)}; \theta)$. For $h^{(2)}_{mw} = 1$, we obtain

$$
p(h^{(2)}_{mw} = 1 \mid h^{(1)}; \theta) = \frac{\prod_j \exp(W_{jw} h^{(1)}_j) \cdot \exp(b_w)}{\sum_{w'} \prod_j \exp(W_{jw'} h^{(1)}_j) \cdot \exp(b_{w'})} \tag{9}
$$

Note that Eq. (9) has the same form as Eq. (5). The above distributions can be used for sampling $V$, $h^{(1)}$, and $H^{(2)}$.
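As a concrete companion to Eqs. (5), (7), and (9), here is a minimal block-Gibbs sketch in NumPy. The function and variable names (`gibbs_sweep`, `v_hat`, `H2`, and so on) are my own illustration, not part of the original note; since Eq. (4) shows each token is conditionally independent of the others given $h^{(1)}$, all $N$ tokens can be resampled at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gibbs_sweep(W, a, b, v_hat, h1, H2, M, N):
    """One block-Gibbs sweep over h^(1), H^(2), and V.

    W : (J, W) tied weights (second W = dictionary size), a : (J,), b : (W,),
    v_hat : (W,) word counts, h1 : (J,) binary, H2 : (M, W) one-hot rows.
    """
    h2_hat = H2.sum(axis=0)                            # \hat{h}^{(2)}_w
    # Eq. (7): sample the first hidden layer
    p_h1 = sigmoid(W @ (v_hat + h2_hat) + (M + N) * a)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(int)
    # Eqs. (5) and (9) share the same softmax over the dictionary
    p_w = softmax(h1 @ W + b)
    H2 = rng.multinomial(1, p_w, size=M)               # Eq. (9): one draw per softmax unit
    v_hat = rng.multinomial(N, p_w)                    # Eq. (5): resample all N tokens at once
    return v_hat, h1, H2
```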
When we set $h^{(2)}_{mw} = \frac{\sum_n v_{nw}}{\sum_n \sum_w v_{nw}} = \frac{\hat{v}_w}{N}$ for all $m$,¹ we have $\hat{h}^{(2)}_w = M\hat{v}_w/N$, and thus

$$
p(h^{(1)}_j = 1 \mid V, H^{(2)}; \theta) = \sigma\Bigl(\sum_w W_{jw}\bigl(\hat{v}_w + \tfrac{M}{N}\hat{v}_w\bigr) + (M+N) a_j\Bigr) = \sigma\Bigl(\bigl(1 + \tfrac{M}{N}\bigr)\sum_w W_{jw}\hat{v}_w + (M+N) a_j\Bigr). \tag{10}
$$

This can be used in pretraining.

¹ cf. Sec. 2.2 of Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine by Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton.
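Under this substitution, Eq. (10) needs only the word counts; a one-line sketch, reusing `sigmoid` from the previous snippet (names again my own):

```python
def pretrain_h1_prob(W, a, v_hat, M, N):
    """Eq. (10): p(h1_j = 1 | V) when every softmax unit copies the
    empirical word distribution v_hat / N of the document."""
    return sigmoid((1.0 + M / N) * (W @ v_hat) + (M + N) * a)
```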
3 Derivatives of log-likelihood
When we have $D$ documents $V_1, \ldots, V_D$, the log-likelihood can be written as

$$
\ln \prod_{d=1}^{D} p(V_d; \theta) = \sum_d \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr) - D \ln \sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr). \tag{11}
$$
$$
\frac{\partial \ln \prod_{d=1}^{D} p(V_d; \theta)}{\partial W_{jw}} = \sum_d \frac{\partial}{\partial W_{jw}} \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr) - D \frac{\partial}{\partial W_{jw}} \ln \sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)
$$
$$
= \sum_d \frac{\frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)} - D \frac{\frac{\partial}{\partial W_{jw}} \sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)} \tag{12}
$$
$$
\frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr) = \frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}} \sum_{H^{(2)}} \prod_{j'} \prod_{w'} \exp(W_{j'w'} h^{(1)}_{j'} \hat{v}_{dw'}) \exp(W_{j'w'} h^{(1)}_{j'} \hat{h}^{(2)}_{w'}) \cdot \prod_{w'} \exp(\hat{v}_{dw'} b_{w'}) \exp(\hat{h}^{(2)}_{w'} b_{w'}) \cdot \prod_{j'} \exp(h^{(1)}_{j'} a_{j'})^{M+N}
$$
$$
= \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w) \prod_{j'} \prod_{w'} \exp(W_{j'w'} h^{(1)}_{j'} \hat{v}_{dw'}) \exp(W_{j'w'} h^{(1)}_{j'} \hat{h}^{(2)}_{w'}) \cdot \prod_{w'} \exp(\hat{v}_{dw'} b_{w'}) \exp(\hat{h}^{(2)}_{w'} b_{w'}) \cdot \prod_{j'} \exp(h^{(1)}_{j'} a_{j'})^{M+N}
$$
$$
= \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w) \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr) \tag{13}
$$
In a similar manner, we obtain

$$
\frac{\partial}{\partial W_{jw}} \sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr) = \sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j \hat{v}_w + h^{(1)}_j \hat{h}^{(2)}_w) \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr), \tag{14}
$$

where $\hat{v}_w$ now refers to the word counts of the summed-over $V$.
Therefore,

$$
\frac{\partial \ln \prod_{d=1}^{D} p(V_d; \theta)}{\partial W_{jw}} = \sum_d \frac{\sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w) \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V_d, h^{(1)}, H^{(2)}; \theta)\bigr)} - D \frac{\sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j \hat{v}_w + h^{(1)}_j \hat{h}^{(2)}_w) \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}
$$
$$
= \sum_d \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w)\, p(h^{(1)}, H^{(2)} \mid V_d; \theta) - D \sum_{V} \sum_{h^{(1)}} \sum_{H^{(2)}} (h^{(1)}_j \hat{v}_w + h^{(1)}_j \hat{h}^{(2)}_w)\, p(V, h^{(1)}, H^{(2)}; \theta), \tag{15}
$$

where the second term on the right-hand side can be approximated by Gibbs sampling (cf. Eqs. (7) and (9)), and the first term by the variational inference described in the next section.
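To make Eq. (15) operational, the following sketch combines the two approximations: mean-field statistics $(\mu, \nu)$ from the next section for the data-dependent term, and Gibbs samples (Eqs. (7) and (9)) for the model term. Everything here, including the name `grad_W` and the use of one shared sample set for all documents, is an illustrative assumption rather than the note's prescribed procedure.

```python
import numpy as np

def grad_W(mu, nu, v_hat_data, samples, M, D):
    """Stochastic estimate of Eq. (15) for the full matrix W.

    mu         : (D, J) variational means mu_dj for h^(1) (Section 4)
    nu         : (D, W) variational parameters nu_dw for H^(2)
    v_hat_data : (D, W) observed word counts \\hat{v}_dw
    samples    : list of (h1, v_hat, h2_hat) triples from Gibbs sampling
    """
    # First term: E_q[h1_j (v_hat_dw + h2_hat_w)] = mu_dj (v_hat_dw + M nu_dw)
    data_term = mu.T @ (v_hat_data + M * nu)             # shape (J, W)
    # Second term: Monte Carlo average over samples from the model
    model_term = np.zeros_like(data_term)
    for h1, v_hat, h2_hat in samples:
        model_term += np.outer(h1, v_hat + h2_hat)
    model_term *= D / len(samples)
    return data_term - model_term
```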
4 Variational inference
For a particular document $V$, we have the following based on Jensen's inequality:

$$
\ln p(V; \theta) = \ln \sum_{h^{(1)}} \sum_{H^{(2)}} p(V, h^{(1)}, H^{(2)}; \theta) = \ln \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \cdot \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{q(h^{(1)}, H^{(2)} \mid V)}
$$
$$
\geq \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{q(h^{(1)}, H^{(2)} \mid V)}
$$
$$
= \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(V, h^{(1)}, H^{(2)}; \theta) - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
$$
$$
= \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \bigl( p(h^{(1)}, H^{(2)} \mid V; \theta)\, p(V; \theta) \bigr) - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
$$
$$
= \ln p(V; \theta) + \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta) - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) \tag{16}
$$
We denote the lower bound in Eq. (16) by $\mathcal{L}$. We assume

$$
q(h^{(1)}, H^{(2)} \mid V_d) = q(h^{(1)} \mid V_d)\, q(H^{(2)} \mid V_d) = \prod_j q_d(h^{(1)}_j) \cdot \prod_m \prod_w q_d(h^{(2)}_{mw}) \tag{17}
$$

with $q_d(h^{(1)}_j = 1) = \mu_{dj}$ and $q_d(h^{(2)}_{mw} = 1) = \nu_{dw}$ for the $d$th document. Note that $\sum_w \nu_{dw} = 1$ holds, because $\sum_w h^{(2)}_{mw} = 1$. We omit the subscript $d$ from now on.
The first term of the lower bound in Eq. (16), i.e., $\ln p(V; \theta)$, can be regarded as a constant. The second term of the lower bound in Eq. (16) can be rewritten as follows:

$$
\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta) = \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \frac{\exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}{\sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)}
$$
$$
= -\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta) - \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr)
$$
$$
= -\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta) - \ln \sum_{h^{(1)}} \sum_{H^{(2)}} \exp\bigl(-E(V, h^{(1)}, H^{(2)}; \theta)\bigr) \tag{18}
$$
The second term of the right-hand side of Eq. (18) is a constant with respect to the hidden units. The first term can be rewritten as follows (dropping the term $\sum_w \hat{v}_w b_w$ of the energy, which is also a constant):

$$
-\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta)
$$
$$
= \sum_{h^{(1)}} \sum_{H^{(2)}} \prod_j q(h^{(1)}_j) \cdot \prod_m \prod_w q(h^{(2)}_{mw}) \Bigl( \sum_{j=1}^{J} \sum_{w=1}^{W} W_{jw} h^{(1)}_j (\hat{v}_w + \hat{h}^{(2)}_w) + \sum_{w=1}^{W} \hat{h}^{(2)}_w b_w + (M+N) \sum_{j=1}^{J} h^{(1)}_j a_j \Bigr)
$$
$$
= \sum_{h^{(1)}} \sum_{H^{(2)}} \prod_j q(h^{(1)}_j) \cdot \prod_m \prod_w q(h^{(2)}_{mw}) \sum_{j=1}^{J} \sum_{w=1}^{W} W_{jw} h^{(1)}_j \hat{h}^{(2)}_w + \sum_{h^{(1)}} \prod_j q(h^{(1)}_j) \Bigl( \sum_{j=1}^{J} \sum_{w=1}^{W} W_{jw} h^{(1)}_j \hat{v}_w + (M+N) \sum_{j=1}^{J} h^{(1)}_j a_j \Bigr) + \sum_{H^{(2)}} \prod_m \prod_w q(h^{(2)}_{mw}) \sum_{w=1}^{W} \hat{h}^{(2)}_w b_w
$$
$$
= \sum_j \sum_w M \mu_j \nu_w W_{jw} + \sum_j \mu_j \Bigl( \sum_w W_{jw} \hat{v}_w + (M+N) a_j \Bigr) + M \sum_w \nu_w b_w \tag{19}
$$
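For reference, Eq. (19) transcribes directly into code. This sketch (names as in the earlier snippets) evaluates the expected negative energy that forms the data-dependent part of the bound:

```python
def expected_neg_energy(W, a, b, mu, nu, v_hat, M, N):
    """E_q[-E] of Eq. (19), up to the constant sum_w v_hat_w b_w."""
    return (M * (mu @ W @ nu)                    # sum_{j,w} M mu_j nu_w W_jw
            + mu @ (W @ v_hat + (M + N) * a)     # sum_j mu_j (sum_w W_jw v_hat_w + (M+N) a_j)
            + M * (nu @ b))                      # M sum_w nu_w b_w
```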
Differentiating Eq. (19) with respect to $\mu_j$ and $\nu_w$, therefore,

$$
\frac{\partial}{\partial \mu_j} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta) = \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j
$$
$$
\frac{\partial}{\partial \nu_w} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta) = \sum_j M \mu_j W_{jw} + M b_w \tag{20}
$$
The third term of the lower bound in Eq. (16) is

$$
-\sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) = -\sum_j \bigl( \mu_j \ln \mu_j + (1 - \mu_j) \ln(1 - \mu_j) \bigr) - M \sum_w \nu_w \ln \nu_w. \tag{21}
$$
Therefore,

$$
-\frac{\partial}{\partial \mu_j} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) = -\ln \mu_j + \ln(1 - \mu_j)
$$
$$
-\frac{\partial}{\partial \nu_w} \sum_{h^{(1)}} \sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) = -M \ln \nu_w - M \tag{22}
$$
Consequently,

$$
\frac{\partial \mathcal{L}}{\partial \mu_j} = \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j - \ln \mu_j + \ln(1 - \mu_j)
$$
$$
\frac{\partial \mathcal{L}}{\partial \nu_w} = \sum_j M \mu_j W_{jw} + M b_w - M \ln \nu_w - M \tag{23}
$$
By solving $\partial \mathcal{L} / \partial \mu_j = 0$, we obtain the following:

$$
\ln \mu_j - \ln(1 - \mu_j) = \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j
$$
$$
\frac{\mu_j}{1 - \mu_j} = \exp\Bigl( \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j \Bigr)
$$
$$
\therefore\ \mu_j = \sigma\Bigl( \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j \Bigr). \tag{24}
$$
By solving $\partial \mathcal{L} / \partial \nu_w = 0$, we obtain the following:

$$
M \ln \nu_w + M = \sum_j M \mu_j W_{jw} + M b_w
$$
$$
\nu_w \propto \exp\Bigl( \sum_j \mu_j W_{jw} + b_w \Bigr)
$$
$$
\nu_w = \frac{\exp\bigl( \sum_j \mu_j W_{jw} + b_w \bigr)}{\sum_{w'} \exp\bigl( \sum_j \mu_j W_{jw'} + b_{w'} \bigr)} \tag{25}
$$

We can use Eqs. (24) and (25) for updating the variational posterior parameters $\mu$ and $\nu$.
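A sketch of the resulting coordinate-ascent loop for a single document, alternating Eqs. (24) and (25); the initialization, fixed iteration count, and absence of a convergence test are my own simplifications:

```python
import numpy as np

def mean_field(W, a, b, v_hat, M, N, n_iter=50):
    """Alternate the fixed-point updates (24) and (25) for one document."""
    J, W_dict = W.shape
    nu = np.full(W_dict, 1.0 / W_dict)   # uniform start for nu_w
    mu = np.full(J, 0.5)
    for _ in range(n_iter):
        # Eq. (24): mu_j = sigma(M sum_w nu_w W_jw + sum_w W_jw v_hat_w + (M+N) a_j)
        mu = 1.0 / (1.0 + np.exp(-(M * (W @ nu) + W @ v_hat + (M + N) * a)))
        # Eq. (25): nu_w proportional to exp(sum_j mu_j W_jw + b_w)
        logits = mu @ W + b
        e = np.exp(logits - logits.max())
        nu = e / e.sum()
    return mu, nu
```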
5 Learning procedure
Please refer to the following paper:
Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton. Fast Inference and Learning for Modeling
Documents with a Deep Boltzmann Machine.