A Note on Over-replicated Softmax Model
Derivation of equations for the over-replicated softmax model
Tomonari MASADA @ Nagasaki University
May 31, 2013
1 Joint probability distribution
• We define constants as follows:
– D : the number of documents
– N : the length of the document
– W : the dictionary size, i.e., the number of different words
– J : the number of hidden units in the first hidden layer
– M : the number of hidden units in the second hidden layer
• Let $V$ denote the set of visible binary units, with $v_{nw} = 1$ if the $w$th word appears as the $n$th token.
• Let $h^{(1)}$ denote the set of hidden binary units in the first hidden layer.
• Let $H^{(2)}$ denote the set of hidden binary units in the second hidden layer. This is an $M \times W$ matrix with $h^{(2)}_{mw} = 1$ if the $m$th hidden softmax unit takes on the $w$th value.
The energy of the joint configuration $\{V, h^{(1)}, H^{(2)}\}$ is defined as:
\[
E(V, h^{(1)}, H^{(2)}; \theta) = -\sum_{n=1}^{N}\sum_{j=1}^{J}\sum_{w=1}^{W} W^{(1)}_{njw} h^{(1)}_j v_{nw} - \sum_{m=1}^{M}\sum_{j=1}^{J}\sum_{w=1}^{W} W^{(2)}_{mjw} h^{(1)}_j h^{(2)}_{mw}
\]
\[
\qquad - \sum_{n=1}^{N}\sum_{w=1}^{W} v_{nw} b^{(1)}_{nw} - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j - \sum_{m=1}^{M}\sum_{w=1}^{W} h^{(2)}_{mw} b^{(2)}_{mw}, \qquad (1)
\]
where $\theta = \{W^{(1)}, W^{(2)}, a, b^{(1)}, b^{(2)}\}$ are the model parameters.
We ignore the order of the word tokens by letting $W^{(1)}_{njw}$ take the same value for all $n$. In a similar manner, we let $W^{(2)}_{mjw}$ take the same value for all $m$. Further, we tie the first- and second-layer weights. Consequently, we have $W^{(1)}_{njw} = W^{(2)}_{mjw} = W_{jw}$ and $b^{(1)}_{nw} = b^{(2)}_{mw} = b_w$, and the energy is simplified to:
\[
E(V, h^{(1)}, H^{(2)}; \theta) = -\sum_{n=1}^{N}\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j v_{nw} - \sum_{m=1}^{M}\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j h^{(2)}_{mw}
- \sum_{n=1}^{N}\sum_{w=1}^{W} v_{nw} b_w - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j - \sum_{m=1}^{M}\sum_{w=1}^{W} h^{(2)}_{mw} b_w
\]
\[
= -\sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j (\hat{v}_w + \hat{h}^{(2)}_w) - \sum_{w=1}^{W} (\hat{v}_w + \hat{h}^{(2)}_w) b_w - (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j, \qquad (2)
\]
where $\hat{v}_w = \sum_n v_{nw}$ and $\hat{h}^{(2)}_w = \sum_m h^{(2)}_{mw}$.
The joint probability distribution is defined as:
\[
p(V, h^{(1)}, H^{(2)}; \theta) = \frac{\exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}{Z(\theta, N)}, \qquad (3)
\]
where $Z(\theta, N) = \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}$.
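For concreteness, the following is a minimal NumPy sketch of the tied-weight energy in Eq. (2) and of the unnormalized log-probability that appears in the numerator of Eq. (3). The function and array names (energy, v_hat, h1, h2_hat, and so on) are illustrative and not part of this note; the partition function $Z(\theta, N)$ is intractable and is not computed here.

import numpy as np

def energy(v_hat, h1, h2_hat, W, a, b, M, N):
    # E(V, h1, H2; theta) under the tied weights of Eq. (2).
    #   v_hat  : (W,) word counts, v_hat[w] = sum_n v_{nw}
    #   h1     : (J,) binary units of the first hidden layer
    #   h2_hat : (W,) softmax counts, h2_hat[w] = sum_m h2_{mw}
    #   W      : (J, W) tied weight matrix W_{jw};  a : (J,);  b : (W,)
    s = v_hat + h2_hat
    return -(h1 @ W @ s) - (s @ b) - (M + N) * (h1 @ a)

def unnormalized_log_prob(v_hat, h1, h2_hat, W, a, b, M, N):
    # Logarithm of the numerator of Eq. (3); Z(theta, N) is left out.
    return -energy(v_hat, h1, h2_hat, W, a, b, M, N)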
2 Conditional distributions over hidden and visible units
The conditional distribution over a visible unit is
\[
p(v_n \mid V_{\setminus n}, h^{(1)}, H^{(2)}; \theta)
= \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{\sum_{v_n \in \{e_1, \ldots, e_W\}} p(V, h^{(1)}, H^{(2)}; \theta)}
= \frac{\exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}{\sum_{v_n} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}
\]
\[
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{v}_w b_w) \exp(\hat{h}^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}}
{\sum_{v_n} \prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{v}_w b_w) \exp(\hat{h}^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}}
\]
\[
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{nw}) \cdot \prod_w \exp(v_{nw} b_w)}
{\sum_{v_n} \prod_j \prod_w \exp(W_{jw} h^{(1)}_j v_{nw}) \cdot \prod_w \exp(v_{nw} b_w)} \qquad (4)
\]
This result shows $p(v_n \mid V_{\setminus n}, h^{(1)}, H^{(2)}; \theta) = p(v_n \mid h^{(1)}; \theta)$. For $v_{nw} = 1$, we obtain
\[
p(v_{nw} = 1 \mid h^{(1)}; \theta)
= \frac{\prod_j \exp(W_{jw} h^{(1)}_j) \cdot \exp(b_w)}{\sum_{w'} \prod_j \exp(W_{jw'} h^{(1)}_j) \cdot \exp(b_{w'})} \qquad (5)
\]
The conditional distribution over a hidden unit of the first hidden layer is
\[
p(h^{(1)}_j \mid V, h^{(1)}_{\setminus j}, H^{(2)}; \theta)
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}}
{\sum_{h^{(1)}_j \in \{0,1\}} \prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_w) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N}} \qquad (6)
\]
This result shows $p(h^{(1)}_j \mid V, h^{(1)}_{\setminus j}, H^{(2)}; \theta) = p(h^{(1)}_j \mid V, H^{(2)}; \theta)$. For $h^{(1)}_j = 1$, we obtain
\[
p(h^{(1)}_j = 1 \mid V, H^{(2)}; \theta)
= \frac{\prod_w \exp(W_{jw} \hat{v}_w) \exp(W_{jw} \hat{h}^{(2)}_w) \cdot \exp(a_j)^{M+N}}
{1 + \prod_w \exp(W_{jw} \hat{v}_w) \exp(W_{jw} \hat{h}^{(2)}_w) \cdot \exp(a_j)^{M+N}}
= \sigma\Big(\sum_w W_{jw} (\hat{v}_w + \hat{h}^{(2)}_w) + (M+N) a_j\Big). \qquad (7)
\]
The conditional distribution over a hidden unit of the second hidden layer is
\[
p(h^{(2)}_m \mid V, h^{(1)}, H^{(2)}_{\setminus m}; \theta)
= \frac{\prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{h}^{(2)}_w b_w)}
{\sum_{h^{(2)}_m} \prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w) \cdot \prod_w \exp(\hat{h}^{(2)}_w b_w)} \qquad (8)
\]
This result shows $p(h^{(2)}_m \mid V, h^{(1)}, H^{(2)}_{\setminus m}; \theta) = p(h^{(2)}_m \mid h^{(1)}; \theta)$. For $h^{(2)}_{mw} = 1$, we obtain
\[
p(h^{(2)}_{mw} = 1 \mid h^{(1)}; \theta)
= \frac{\prod_j \exp(W_{jw} h^{(1)}_j) \cdot \exp(b_w)}{\sum_{w'} \prod_j \exp(W_{jw'} h^{(1)}_j) \cdot \exp(b_{w'})} \qquad (9)
\]
The above distributions can be used for sampling $V$, $h^{(1)}$, and $H^{(2)}$.
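As a rough illustration of how Eqs. (5), (7), and (9) drive one block Gibbs sweep, here is a hedged NumPy sketch. The function and variable names (gibbs_step, v_hat, h2_hat, and so on) are mine, not the note's. Because Eqs. (5) and (9) give the same softmax over words for every token and for every second-layer softmax unit, the counts $\hat{v}_w$ and $\hat{h}^{(2)}_w$ can be drawn as multinomial totals.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gibbs_step(v_hat, h1, h2_hat, W, a, b, M, N):
    # Eq. (7): sample every first-layer unit given the current counts.
    p_h1 = sigmoid(W @ (v_hat + h2_hat) + (M + N) * a)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    # Eqs. (9) and (5): all M second-layer units and all N tokens share the
    # same softmax over words, so their counts are multinomial draws.
    p_w = softmax(h1 @ W + b)
    h2_hat = rng.multinomial(M, p_w).astype(float)
    v_hat = rng.multinomial(N, p_w).astype(float)
    return v_hat, h1, h2_hat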
When we set $h^{(2)}_{mw} = \frac{\sum_n v_{nw}}{\sum_n \sum_w v_{nw}} = \frac{\hat{v}_w}{N}$ for all $m$ (cf. Sec. 2.2 of Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine by Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton),
\[
p(h^{(1)}_j = 1 \mid V, H^{(2)}; \theta)
= \sigma\Big(\sum_w W_{jw}\Big(\hat{v}_w + \frac{M \hat{v}_w}{N}\Big) + (M+N) a_j\Big)
= \sigma\Big(\Big(1 + \frac{M}{N}\Big)\sum_w W_{jw} \hat{v}_w + (M+N) a_j\Big). \qquad (10)
\]
This can be used in pretraining.
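A small sketch of the pretraining activation in Eq. (10), under the same illustrative naming as above (the function name pretraining_hidden_prob is an assumption, not from the note):

import numpy as np

def pretraining_hidden_prob(v_hat, W, a, M, N):
    # Eq. (10): sigma((1 + M/N) * sum_w W_{jw} v_hat_w + (M + N) a_j) for all j.
    x = (1.0 + M / N) * (W @ v_hat) + (M + N) * a
    return 1.0 / (1.0 + np.exp(-x))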
3 Derivatives of log-likelihood
When we have $D$ documents $V_1, \ldots, V_D$, the log-likelihood can be written as
\[
\ln \prod_{d=1}^{D} P(V_d; \theta)
= \sum_d \ln \sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\}
- D \ln \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}. \qquad (11)
\]
Differentiating with respect to $W_{jw}$,
\[
\frac{\partial \ln \prod_{d=1}^{D} P(V_d; \theta)}{\partial W_{jw}}
= \sum_d \frac{\partial}{\partial W_{jw}} \ln \sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\}
- D \frac{\partial}{\partial W_{jw}} \ln \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}
\]
\[
= \sum_d \frac{\frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\}}{\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\}}
- D \frac{\frac{\partial}{\partial W_{jw}} \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}{\sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}} \qquad (12)
\]
\[
\frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\}
= \frac{\partial}{\partial W_{jw}} \sum_{h^{(1)}}\sum_{H^{(2)}} \Big\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_{dw}) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w)
\cdot \prod_w \exp(\hat{v}_{dw} b_w) \exp(\hat{h}^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N} \Big\}
\]
\[
= \sum_{h^{(1)}}\sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w) \Big\{ \prod_j \prod_w \exp(W_{jw} h^{(1)}_j \hat{v}_{dw}) \exp(W_{jw} h^{(1)}_j \hat{h}^{(2)}_w)
\cdot \prod_w \exp(\hat{v}_{dw} b_w) \exp(\hat{h}^{(2)}_w b_w) \cdot \prod_j \exp(h^{(1)}_j a_j)^{M+N} \Big\}
\]
\[
= \sum_{h^{(1)}}\sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w) \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\} \qquad (13)
\]
In a similar manner, we obtain
\[
\frac{\partial}{\partial W_{jw}} \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}
= \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} (h^{(1)}_j \hat{v}_w + h^{(1)}_j \hat{h}^{(2)}_w) \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}, \qquad (14)
\]
where $\hat{v}_w$ now refers to the $V$ being summed over.
Therefore,
\[
\frac{\partial \ln \prod_{d=1}^{D} P(V_d; \theta)}{\partial W_{jw}}
= \sum_d \frac{\sum_{h^{(1)}}\sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w) \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\}}{\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V_d, h^{(1)}, H^{(2)}; \theta)\}}
- D \frac{\sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} (h^{(1)}_j \hat{v}_w + h^{(1)}_j \hat{h}^{(2)}_w) \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}{\sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}
\]
\[
= \sum_d \sum_{h^{(1)}}\sum_{H^{(2)}} (h^{(1)}_j \hat{v}_{dw} + h^{(1)}_j \hat{h}^{(2)}_w)\, p(h^{(1)}, H^{(2)} \mid V_d; \theta)
- D \sum_{V}\sum_{h^{(1)}}\sum_{H^{(2)}} (h^{(1)}_j \hat{v}_w + h^{(1)}_j \hat{h}^{(2)}_w)\, p(V, h^{(1)}, H^{(2)}; \theta), \qquad (15)
\]
where the second term on the right hand side can be approximated by Gibbs sampling (cf. Eqs. (7) and
(9)), and the first term by the variational inference described in the next section.
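To make Eq. (15) concrete, the following is a hedged NumPy sketch of the resulting gradient estimate, averaged over the $D$ documents. The positive phase uses the factorized posterior of the next section, under which $E_q[h^{(1)}_j(\hat{v}_{dw} + \hat{h}^{(2)}_w)] = \mu_{dj}(\hat{v}_{dw} + M\nu_{dw})$; the negative phase averages over model samples such as those produced by the gibbs_step sketch above. All names here are illustrative assumptions, not the note's notation.

import numpy as np

def grad_W(mu, nu, v_hat_data, samples, M):
    # mu         : (D, J)  variational means of the first hidden layer
    # nu         : (D, W)  variational word distributions of the second layer
    # v_hat_data : (D, W)  word counts of the D training documents
    # samples    : list of (v_hat, h1, h2_hat) triples drawn from the model
    D = mu.shape[0]
    # Positive phase: E_q[h1_j (v_hat_w + h2_hat_w)] = mu_j (v_hat_w + M nu_w).
    positive = mu.T @ (v_hat_data + M * nu) / D
    # Negative phase: model expectation approximated by the Gibbs samples.
    negative = sum(np.outer(h1, v_hat + h2_hat)
                   for v_hat, h1, h2_hat in samples) / len(samples)
    return positive - negative  # (J, W); ascend this to increase the likelihood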
4 Variational inference
For a particular document $V$, we have the following based on Jensen's inequality:
\[
\ln p(V; \theta) = \ln \sum_{h^{(1)}}\sum_{H^{(2)}} p(V, h^{(1)}, H^{(2)}; \theta)
= \ln \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \cdot \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{q(h^{(1)}, H^{(2)} \mid V)}
\]
\[
\geq \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \frac{p(V, h^{(1)}, H^{(2)}; \theta)}{q(h^{(1)}, H^{(2)} \mid V)}
\]
\[
= \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(V, h^{(1)}, H^{(2)}; \theta)
- \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
\]
\[
= \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \big\{ p(h^{(1)}, H^{(2)} \mid V; \theta)\, p(V; \theta) \big\}
- \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
\]
\[
= \ln p(V; \theta) + \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta)
- \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V) \qquad (16)
\]
We denote the lower bound in Eq. (16) as L. We assume
\[
q(h^{(1)}, H^{(2)} \mid V_d) = q(h^{(1)} \mid V_d)\, q(H^{(2)} \mid V_d)
= \prod_j q_d(h^{(1)}_j) \cdot \prod_m \prod_w q_d(h^{(2)}_{mw}) \qquad (17)
\]
with $q_d(h^{(1)}_j = 1) = \mu_{dj}$ and $q_d(h^{(2)}_{mw} = 1) = \nu_{dw}$ for the $d$th document. Note that $\sum_w \nu_{dw} = 1$ holds, because $\sum_w h^{(2)}_{mw} = 1$. We omit the subscript $d$ from now on.
The first term of the lower bound in Eq. (16), i.e., $\ln p(V; \theta)$, can be regarded as a constant. The second term of the lower bound in Eq. (16) can be rewritten as follows:
\[
\sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta)
= \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \frac{\exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}{\sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}}
\]
\[
= -\sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta)
- \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln \sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\}
\]
\[
= -\sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta)
- \ln \sum_{h^{(1)}}\sum_{H^{(2)}} \exp\{-E(V, h^{(1)}, H^{(2)}; \theta)\} \qquad (18)
\]
The second term of the right hand side of Eq. (18) is a constant with respect to the hidden units. The first term can be rewritten as follows, where we also drop the term $\sum_w \hat{v}_w b_w$ of Eq. (2) because it does not depend on the hidden units:
\[
-\sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V)\, E(V, h^{(1)}, H^{(2)}; \theta)
\]
\[
= \sum_{h^{(1)}}\sum_{H^{(2)}} \prod_j q_d(h^{(1)}_j) \cdot \prod_m \prod_w q_d(h^{(2)}_{mw})
\Big\{ \sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j (\hat{v}_w + \hat{h}^{(2)}_w) + \sum_{w=1}^{W} \hat{h}^{(2)}_w b_w + (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j \Big\}
\]
\[
= \sum_{h^{(1)}}\sum_{H^{(2)}} \prod_j q_d(h^{(1)}_j) \cdot \prod_m \prod_w q_d(h^{(2)}_{mw}) \sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j \hat{h}^{(2)}_w
+ \sum_{h^{(1)}} \prod_j q_d(h^{(1)}_j) \Big\{ \sum_{j=1}^{J}\sum_{w=1}^{W} W_{jw} h^{(1)}_j \hat{v}_w + (M+N)\sum_{j=1}^{J} h^{(1)}_j a_j \Big\}
+ \sum_{H^{(2)}} \prod_m \prod_w q_d(h^{(2)}_{mw}) \sum_{w=1}^{W} \hat{h}^{(2)}_w b_w
\]
\[
= \sum_j \sum_w M \mu_j \nu_w W_{jw} + \sum_j \mu_j \Big\{ \sum_w W_{jw} \hat{v}_w + (M+N) a_j \Big\} + M \sum_w \nu_w b_w \qquad (19)
\]
Therefore,
\[
\frac{\partial}{\partial \mu_j} \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta)
= \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j
\]
\[
\frac{\partial}{\partial \nu_w} \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln p(h^{(1)}, H^{(2)} \mid V; \theta)
= \sum_j M \mu_j W_{jw} + M b_w \qquad (20)
\]
The third term of the lower bound in Eq. (16) is
\[
-\sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
= -\sum_j \big\{ \mu_j \ln \mu_j + (1-\mu_j)\ln(1-\mu_j) \big\} - M \sum_w \nu_w \ln \nu_w. \qquad (21)
\]
Therefore,
\[
-\frac{\partial}{\partial \mu_j} \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
= -\ln \mu_j + \ln(1-\mu_j)
\]
\[
-\frac{\partial}{\partial \nu_w} \sum_{h^{(1)}}\sum_{H^{(2)}} q(h^{(1)}, H^{(2)} \mid V) \ln q(h^{(1)}, H^{(2)} \mid V)
= -M \ln \nu_w - M \qquad (22)
\]
Consequently,
\[
\frac{\partial L}{\partial \mu_j} = \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j - \ln \mu_j + \ln(1-\mu_j)
\]
\[
\frac{\partial L}{\partial \nu_w} = \sum_j M \mu_j W_{jw} + M b_w - M \ln \nu_w - M \qquad (23)
\]
By solving $\frac{\partial L}{\partial \mu_j} = 0$, we obtain the following:
\[
\ln \mu_j - \ln(1-\mu_j) = \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j
\]
\[
\frac{\mu_j}{1-\mu_j} = \exp\Big\{ \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j \Big\}
\]
\[
\therefore \mu_j = \sigma\Big( \sum_w M \nu_w W_{jw} + \sum_w W_{jw} \hat{v}_w + (M+N) a_j \Big). \qquad (24)
\]
By solving $\frac{\partial L}{\partial \nu_w} = 0$ subject to the constraint $\sum_w \nu_w = 1$, we obtain the following:
\[
M \ln \nu_w + M = \sum_j M \mu_j W_{jw} + M b_w
\]
\[
\nu_w \propto \exp\Big\{ \sum_j \mu_j W_{jw} + b_w \Big\}
\]
\[
\nu_w = \frac{\exp\{\sum_j \mu_j W_{jw} + b_w\}}{\sum_{w'} \exp\{\sum_j \mu_j W_{jw'} + b_{w'}\}} \qquad (25)
\]
We can use Eqs. (24) and (25) for updating the variational posterior parameters $\mu$ and $\nu$.
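A minimal sketch of the resulting mean-field fixed-point iteration for one document, under the same illustrative array names as the earlier sketches (the function name mean_field_update, the initialization, and the iteration count are assumptions, not from the note):

import numpy as np

def mean_field_update(v_hat, W, a, b, M, N, n_iter=20):
    J, n_words = W.shape
    mu = np.full(J, 0.5)                  # initial mu
    nu = np.full(n_words, 1.0 / n_words)  # initial nu (uniform over words)
    for _ in range(n_iter):
        # Eq. (24): update mu given nu.
        mu = 1.0 / (1.0 + np.exp(-(M * (W @ nu) + W @ v_hat + (M + N) * a)))
        # Eq. (25): update nu given mu (softmax over the dictionary).
        logits = mu @ W + b
        e = np.exp(logits - logits.max())
        nu = e / e.sum()
    return mu, nu

By the standard mean-field argument, alternating Eqs. (24) and (25) in this way does not decrease the lower bound $L$ for fixed $\theta$.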
5 Learning procedure
Please refer to the following paper:
Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton. Fast Inference and Learning for Modeling
Documents with a Deep Boltzmann Machine.