A Note on TopicRNN
Tomonari MASADA @ Nagasaki University
July 13, 2017
1 Model
TopicRNN is a generative model proposed by [1]. Its generative story for a particular document $y_{1:T}$ is as follows (a short code sketch of the sampling process is given after the list).
1. Draw a topic vector $\theta \sim \mathcal{N}(0, I)$.
2. Given the preceding words $y_{1:t-1}$, for the $t$-th word $y_t$ in the document,
   (a) Compute the hidden state $h_t = f_W(x_t, h_{t-1})$, where we let $x_t = y_{t-1}$.
   (b) Draw the stop word indicator $l_t \sim \mathrm{Bernoulli}\big(\sigma(\Gamma^\top h_t)\big)$, with $\sigma$ the sigmoid function.
   (c) Draw the word $y_t \sim p(y_t \mid h_t, \theta, l_t, B)$, where
   \[
   p(y_t = i \mid h_t, \theta, l_t, B) \propto \exp\big(v_i^\top h_t + (1 - l_t)\, b_i^\top \theta\big).
   \]
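To make the generative story concrete, here is a minimal sampling sketch (not the authors' implementation). The transition function f_W, the word vectors V (rows $v_i$), the topic-word matrix B (rows $b_i$), Gamma, and the initial state h0 and input x0 are assumed given; all names are illustrative only.

import numpy as np

def generate_document(T, K, f_W, V, B, Gamma, h0, x0, rng=None):
    # Sample a document of length T following the generative story above (sketch).
    rng = np.random.default_rng() if rng is None else rng
    C = V.shape[0]                                   # vocabulary size
    theta = rng.standard_normal(K)                   # 1. theta ~ N(0, I)
    h, x, words = h0, x0, []
    for t in range(T):
        h = f_W(x, h)                                # 2(a). h_t = f_W(x_t, h_{t-1})
        p_stop = 1.0 / (1.0 + np.exp(-(Gamma @ h)))  # sigmoid(Gamma^T h_t)
        l = rng.binomial(1, p_stop)                  # 2(b). stop word indicator l_t
        logits = V @ h + (1 - l) * (B @ theta)       # 2(c). v_i^T h_t + (1 - l_t) b_i^T theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        y = rng.choice(C, p=probs)                   # draw y_t from the softmax
        words.append(y)
        x = y                                        # next input x_{t+1} = y_t
    return theta, words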
2 Lower bound
The log marginal likelihood of the word sequence $y_{1:T}$ and the stop word indicators $l_{1:T}$ is
\[
\log p(y_{1:T}, l_{1:T} \mid h_{1:T})
= \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta . \tag{1}
\]
A lower bound can be obtained as follows:
\begin{align*}
\log p(y_{1:T}, l_{1:T} \mid h_{1:T})
&= \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)\, d\theta \\
&= \log \int q(\theta)\, \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \\
&\geq \int q(\theta) \log \frac{p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta; W)\, p(l_t \mid h_t; \Gamma)}{q(\theta)}\, d\theta \quad \text{(by Jensen's inequality)} \\
&= \int q(\theta) \log p(\theta)\, d\theta
 + \sum_{t=1}^{T} \int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
 + \sum_{t=1}^{T} \int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta
 - \int q(\theta) \log q(\theta)\, d\theta \\
&\equiv \mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta), \tag{2}
\end{align*}
where $\Theta$ denotes the model parameters.
3 Approximate posterior
The approximate posterior $q(\theta)$ is implemented as an inference network, i.e., a feed-forward neural network. Each expectation in Eq. (2) is approximated with samples from $q(\theta \mid X_c)$, where $X_c$ denotes the term-frequency representation of $y_{1:T}$ with stop words excluded. The density of the approximate posterior $q(\theta \mid X_c)$ is specified as follows:
\begin{align*}
q(\theta \mid X_c) &= \mathcal{N}\big(\theta;\, \mu(X_c),\, \mathrm{diag}(\sigma^2(X_c))\big), \tag{3} \\
\mu(X_c) &= W_1 g(X_c) + a_1, \tag{4} \\
\log \sigma(X_c) &= W_2 g(X_c) + a_2, \tag{5}
\end{align*}
where $g(\cdot)$ denotes the feed-forward neural network. Eq. (3) gives the reparameterization of $\theta_k$ as $\theta_k = \mu_k(X_c) + \epsilon_k \sigma_k(X_c)$ for $k = 1, \ldots, K$, where $\epsilon_k$ is a sample from the standard normal distribution $\mathcal{N}(0, 1)$.
4 Monte Carlo integration
We can now rewrite each term of the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ in Eq. (2) as below, where the $\theta^{(s)}$ for $s = 1, \ldots, S$ denote samples drawn from the approximate posterior $q(\theta \mid X_c)$.
The first term:
\[
\int q(\theta) \log p(\theta)\, d\theta
\approx \frac{1}{S} \sum_{s=1}^{S} \log p\big(\theta^{(s)}\big)
= \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left\{ \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{\big(\theta_k^{(s)}\big)^2}{2} \right) \right\}
= -\frac{K \log(2\pi)}{2} - \frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s} \big(\theta_k^{(s)}\big)^2}{S} \tag{6}
\]
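As a small numerical sketch of Eq. (6) (an illustration, not reference code), with the samples stored in an array theta of shape (S, K):

import numpy as np

def log_prior_term(theta):
    # theta: samples theta^(s) from q(theta | Xc), shape (S, K)
    S, K = theta.shape
    # Eq. (6): -K log(2 pi)/2 - (1/2) sum_k mean_s (theta_k^(s))^2
    return -0.5 * K * np.log(2.0 * np.pi) - 0.5 * np.sum(np.mean(theta ** 2, axis=0))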
Each addend of the second term:
\begin{align*}
\int q(\theta) \log p(y_t \mid h_t, l_t, \theta; W)\, d\theta
&\approx \frac{1}{S} \sum_{s=1}^{S} \log \frac{\exp\big(v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \theta^{(s)}\big)}{\sum_{j=1}^{C} \exp\big(v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)}\big)} \\
&= v_{y_t}^\top h_t + (1 - l_t)\, b_{y_t}^\top \frac{\sum_{s=1}^{S} \theta^{(s)}}{S}
- \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\big(v_j^\top h_t + (1 - l_t)\, b_j^\top \theta^{(s)}\big) \tag{7}
\end{align*}
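Eq. (7) can be sketched with a log-sum-exp for numerical stability; V (rows $v_j$) and B (rows $b_j$) are assumed given, and all names are illustrative.

import numpy as np
from scipy.special import logsumexp

def word_likelihood_term(h_t, y_t, l_t, V, B, theta):
    # h_t: hidden state; y_t: observed word index; l_t: stop word indicator (0 or 1)
    # V: (C, H) word vectors, B: (C, K) topic-word matrix, theta: (S, K) samples
    logits = V @ h_t + (1 - l_t) * (theta @ B.T)          # shape (S, C)
    # Eq. (7): average over samples of the log-softmax probability of y_t
    return np.mean(logits[:, y_t] - logsumexp(logits, axis=1))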
Each addend of the third term does not depend on $\theta$, so the integral is exact and no Monte Carlo approximation is needed:
\[
\int q(\theta) \log p(l_t \mid h_t; \Gamma)\, d\theta
= l_t \log \sigma(\Gamma^\top h_t) + (1 - l_t) \log\big(1 - \sigma(\Gamma^\top h_t)\big) \tag{8}
\]
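For completeness, a tiny sketch of Eq. (8) (illustrative names):

import numpy as np

def stop_word_term(h_t, l_t, Gamma):
    p = 1.0 / (1.0 + np.exp(-(Gamma @ h_t)))              # sigmoid(Gamma^T h_t)
    return l_t * np.log(p) + (1 - l_t) * np.log(1 - p)    # Eq. (8)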
The fourth term:
\begin{align*}
\int q(\theta) \log q(\theta)\, d\theta
&\approx \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \log \left\{ \frac{1}{\sqrt{2\pi \sigma_k^2(X_c)}} \exp\left( -\frac{\big(\theta_k^{(s)} - \mu_k(X_c)\big)^2}{2 \sigma_k^2(X_c)} \right) \right\} \\
&= -\frac{K \log(2\pi)}{2} - \sum_{k=1}^{K} \log \sigma_k(X_c)
- \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \frac{\big(\theta_k^{(s)} - \mu_k(X_c)\big)^2}{2 \sigma_k^2(X_c)} \tag{9}
\end{align*}
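A matching sketch of Eq. (9), with mu and sigma taken to be the inference network outputs for Xc (illustrative names):

import numpy as np

def log_q_term(theta, mu, sigma):
    # theta: (S, K) samples; mu, sigma: (K,) outputs of the inference network
    K = mu.shape[0]
    return (-0.5 * K * np.log(2.0 * np.pi)
            - np.sum(np.log(sigma))
            - np.mean(np.sum((theta - mu) ** 2 / (2.0 * sigma ** 2), axis=1)))   # Eq. (9)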
5 Objective to be maximized
Each of the $S$ samples (i.e., $\theta^{(s)}$ for $s = 1, \ldots, S$) is obtained as $\theta^{(s)} = \mu(X_c) + \epsilon^{(s)} \circ \sigma(X_c)$ via the reparameterization, where the $\epsilon_k^{(s)}$ are drawn from the standard normal distribution and $\circ$ denotes element-wise multiplication. Consequently, the lower bound $\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)$ to be maximized is obtained as follows:
\begin{align*}
\mathcal{L}(y_{1:T}, l_{1:T} \mid q(\theta), \Theta)
&= -\frac{1}{2} \sum_{k=1}^{K} \frac{\sum_{s} \big(\mu_k(X_c) + \epsilon_k^{(s)} \sigma_k(X_c)\big)^2}{S}
+ \sum_{t=1}^{T} v_{y_t}^\top h_t
+ \frac{1}{S} \sum_{s=1}^{S} \sum_{t=1}^{T} (1 - l_t)\, b_{y_t}^\top \big(\mu(X_c) + \epsilon^{(s)} \circ \sigma(X_c)\big) \\
&\quad - \sum_{t=1}^{T} \frac{1}{S} \sum_{s=1}^{S} \log \sum_{j=1}^{C} \exp\Big( v_j^\top h_t + (1 - l_t)\, b_j^\top \big(\mu(X_c) + \epsilon^{(s)} \circ \sigma(X_c)\big) \Big) \\
&\quad + \sum_{t=1}^{T} \Big\{ l_t \log \sigma(\Gamma^\top h_t) + (1 - l_t) \log\big(1 - \sigma(\Gamma^\top h_t)\big) \Big\}
+ \sum_{k=1}^{K} \log \sigma_k(X_c) + \mathrm{const.} \tag{10}
\end{align*}
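The pieces above can be assembled into a single Monte Carlo estimate of Eq. (10). The following is a sketch under the same assumptions as before (illustrative names; the hidden states $h_t$ are assumed precomputed by the RNN), not the authors' implementation.

import numpy as np
from scipy.special import logsumexp

def elbo(H, y, l, V, B, Gamma, mu, sigma, S=1, rng=None):
    # H: (T, Hdim) hidden states, y: (T,) word indices, l: (T,) stop word indicators
    # V: (C, Hdim), B: (C, K), Gamma: (Hdim,), mu/sigma: (K,) inference network outputs
    rng = np.random.default_rng() if rng is None else rng
    K = mu.shape[0]
    eps = rng.standard_normal((S, K))
    theta = mu + eps * sigma                               # reparameterized samples, (S, K)

    # Eq. (6) minus Eq. (9); the K log(2 pi)/2 constants cancel
    neg_kl = (-0.5 * np.sum(np.mean(theta ** 2, axis=0))
              + np.sum(np.log(sigma))
              + np.mean(np.sum((theta - mu) ** 2 / (2.0 * sigma ** 2), axis=1)))

    word_term, stop_term = 0.0, 0.0
    for t in range(len(y)):
        logits = V @ H[t] + (1 - l[t]) * (theta @ B.T)     # (S, C)
        word_term += np.mean(logits[:, y[t]] - logsumexp(logits, axis=1))   # Eq. (7)
        p = 1.0 / (1.0 + np.exp(-(Gamma @ H[t])))
        stop_term += l[t] * np.log(p) + (1 - l[t]) * np.log(1 - p)          # Eq. (8)

    return neg_kl + word_term + stop_term                  # Eq. (10) up to an additive constant

In practice, gradients of this estimate with respect to the model and inference network parameters are taken through the reparameterized samples, which is why Eq. (10) is written in terms of $\mu(X_c) + \epsilon^{(s)} \circ \sigma(X_c)$.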
References
[1] Adji Bousso Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A Recurrent Neural
Network with Long-Range Semantic Dependency. ICLR, 2017.