A Note on the Derivation of the Variational Inference Updates for DILN
Tomonari MASADA @ Nagasaki University
August 30, 2013
1
Let $M$, $N_m$, and $T$ be the number of documents, the number of word tokens appearing in the $m$th document, and the truncation level, respectively. $X_{mn}$ denotes the word appearing as the $n$th token of the $m$th document, and $C_{mn}$ denotes the latent topic for the $n$th token of the $m$th document. The definitions of the other symbols can be found in the original paper [2].
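The joint distribution can be written as follows:
$$
\begin{aligned}
&p(X, Z, C, w, \eta, V, \alpha, \beta, m, K) \\
&\quad = p(X \mid C, \eta)\, p(Z \mid V, w, \beta)\, p(C \mid Z)\, p(w \mid m, K)\, p(\eta)\, p(V \mid \alpha)\, p(\alpha)\, p(\beta)\, p(m)\, p(K).
\end{aligned}
\tag{1}
$$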
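A lower bound on the log evidence can be obtained by using Jensen's inequality as follows:
$$
\begin{aligned}
\ln p(X)
&= \ln \int \sum_C p(X, Z, C, w, \eta, V, \alpha, \beta, m, K)\, dZ\, dw\, d\eta\, dV\, d\alpha\, d\beta\, dm\, dK \\
&= \ln \int \sum_C q(Z) q(C) q(w) q(\eta) q(V) q(\alpha) q(\beta) q(m) q(K) \\
&\qquad \cdot \frac{p(X \mid C, \eta)\, p(Z \mid V, w, \beta)\, p(C \mid Z)\, p(w \mid m, K)\, p(\eta)\, p(V \mid \alpha)\, p(\alpha)\, p(\beta)\, p(m)\, p(K)}{q(Z) q(C) q(w) q(\eta) q(V) q(\alpha) q(\beta) q(m) q(K)}\, dZ\, dw\, d\eta\, dV\, d\alpha\, d\beta\, dm\, dK \\
&\geq \int \sum_C q(Z) q(C) q(w) q(\eta) q(V) q(\alpha) q(\beta) q(m) q(K) \\
&\qquad \cdot \ln \frac{p(X \mid C, \eta)\, p(Z \mid V, w, \beta)\, p(C \mid Z)\, p(w \mid m, K)\, p(\eta)\, p(V \mid \alpha)\, p(\alpha)\, p(\beta)\, p(m)\, p(K)}{q(Z) q(C) q(w) q(\eta) q(V) q(\alpha) q(\beta) q(m) q(K)}\, dZ\, dw\, d\eta\, dV\, d\alpha\, d\beta\, dm\, dK \\
&= \int \sum_C q(C) q(\eta) \ln p(X \mid C, \eta)\, d\eta + \int q(Z) q(V) q(w) q(\beta) \ln p(Z \mid V, w, \beta)\, dZ\, dV\, dw\, d\beta \\
&\quad + \int \sum_C q(C) q(Z) \ln p(C \mid Z)\, dZ + \int q(w) q(m) q(K) \ln p(w \mid m, K)\, dw\, dm\, dK \\
&\quad + \int q(\eta) \ln p(\eta)\, d\eta + \int q(V) q(\alpha) \ln p(V \mid \alpha)\, dV\, d\alpha + \int q(\alpha) \ln p(\alpha)\, d\alpha \\
&\quad + \int q(\beta) \ln p(\beta)\, d\beta + \int q(m) \ln p(m)\, dm + \int q(K) \ln p(K)\, dK \\
&\quad - \int q(Z) \ln q(Z)\, dZ - \sum_C q(C) \ln q(C) - \int q(w) \ln q(w)\, dw \\
&\quad - \int q(\eta) \ln q(\eta)\, d\eta - \int q(V) \ln q(V)\, dV - \int q(\alpha) \ln q(\alpha)\, d\alpha \\
&\quad - \int q(\beta) \ln q(\beta)\, d\beta - \int q(m) \ln q(m)\, dm - \int q(K) \ln q(K)\, dK.
\end{aligned}
\tag{2}
$$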
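Since $q(V) = \delta_V$, $q(m) = \delta_m$, $q(K) = \delta_K$, $q(\alpha) = \delta_\alpha$, and $q(\beta) = \delta_\beta$ are point masses, we can rewrite the right-hand side of Eq. (2) as follows:
$$
\begin{aligned}
\ln p(X)
&\geq \int \sum_C q(C) q(\eta) \ln p(X \mid C, \eta)\, d\eta + \int q(Z) q(w) \ln p(Z \mid V, w, \beta)\, dZ\, dw \\
&\quad + \int \sum_C q(C) q(Z) \ln p(C \mid Z)\, dZ + \int q(w) \ln p(w \mid m, K)\, dw + \int q(\eta) \ln p(\eta)\, d\eta + \ln p(V \mid \alpha) \\
&\quad + \ln p(\alpha) + \ln p(\beta) + \ln p(m) + \ln p(K) \\
&\quad - \int q(Z) \ln q(Z)\, dZ - \sum_C q(C) \ln q(C) - \int q(w) \ln q(w)\, dw - \int q(\eta) \ln q(\eta)\, d\eta.
\end{aligned}
\tag{3}
$$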
2
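We examine each term on the right-hand side of Eq. (3).
$$
\begin{aligned}
\int \sum_C q(C) q(\eta) \ln p(X \mid C, \eta)\, d\eta
&= \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{mnk} \int \frac{\Gamma\left(\sum_d \gamma'_{kd}\right)}{\prod_d \Gamma(\gamma'_{kd})} \prod_{d=1}^{D} \eta_{kd}^{\gamma'_{kd} - 1} \ln \eta_{k X_{mn}}\, d\eta_k \\
&= \sum_{m=1}^{M} \sum_{n=1}^{N_m} \sum_{k=1}^{T} \phi_{mnk} \left\{ \psi(\gamma'_{k X_{mn}}) - \psi(\gamma'_k) \right\},
\end{aligned}
\tag{4}
$$
where $\gamma'_k \equiv \sum_{d=1}^{D} \gamma'_{kd}$.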
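$$
\begin{aligned}
&\int q(Z) q(w) \ln p(Z \mid V, w, \beta)\, dZ\, dw \\
&= \sum_m \sum_k \int q(Z_{mk}) q(w_{mk}) \ln \left\{ \frac{(e^{-w_{mk}})^{\beta p_k}}{\Gamma(\beta p_k)} Z_{mk}^{\beta p_k - 1} e^{-e^{-w_{mk}} Z_{mk}} \right\} dZ_{mk}\, dw_{mk} \\
&= -\sum_k \beta p_k \sum_m \int q(w_{mk}) w_{mk}\, dw_{mk} - M \sum_k \ln \Gamma(\beta p_k) \\
&\quad + \sum_k (\beta p_k - 1) \sum_m \int q(Z_{mk}) \ln Z_{mk}\, dZ_{mk} - \sum_m \sum_k \int q(Z_{mk}) q(w_{mk}) e^{-w_{mk}} Z_{mk}\, dZ_{mk}\, dw_{mk},
\end{aligned}
\tag{5}
$$
where $p_k \equiv V_k \prod_{j=1}^{k-1} (1 - V_j)$. The factor $M$ in the second term arises because $\ln \Gamma(\beta p_k)$ does not depend on $m$ and is summed over all $M$ documents. For the last term, we use
$$
\begin{aligned}
\int q(w_{mk}) e^{-w_{mk}}\, dw_{mk}
&= \int \frac{1}{\sqrt{2\pi v_{mk}}} \exp\left\{ -\frac{(w_{mk} - \mu_{mk})^2}{2 v_{mk}} - w_{mk} \right\} dw_{mk} \\
&= \int \frac{1}{\sqrt{2\pi v_{mk}}} \exp\left( -\frac{w_{mk}^2 - 2\mu_{mk} w_{mk} + 2 v_{mk} w_{mk} + \mu_{mk}^2}{2 v_{mk}} \right) dw_{mk} \\
&= \int \frac{1}{\sqrt{2\pi v_{mk}}} \exp\left\{ -\frac{(w_{mk} - \mu_{mk} + v_{mk})^2}{2 v_{mk}} - \mu_{mk} + \frac{v_{mk}}{2} \right\} dw_{mk}
= \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right).
\end{aligned}
\tag{6}
$$
Note that $v_{mk}$ is a variance, not a standard deviation. With $q(Z_{mk})$ a Gamma distribution with shape $a_{mk}$ and rate $b_{mk}$, we have $\int q(Z_{mk}) Z_{mk}\, dZ_{mk} = a_{mk} / b_{mk}$ and $\int q(Z_{mk}) \ln Z_{mk}\, dZ_{mk} = \psi(a_{mk}) - \ln b_{mk}$. Consequently, we have
$$
\begin{aligned}
&\int q(Z) q(w) \ln p(Z \mid V, w, \beta)\, dZ\, dw \\
&= -\sum_k \beta p_k \sum_m \mu_{mk} - M \sum_k \ln \Gamma(\beta p_k) + \sum_k (\beta p_k - 1) \sum_m \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} \\
&\quad - \sum_m \sum_k \frac{a_{mk}}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right).
\end{aligned}
\tag{7}
$$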
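As a quick sanity check on Eq. (6), the Gaussian expectation can be compared against a numerical quadrature. The following is a minimal sketch using only the Python standard library; the function names, the integration range, and the step count are illustrative assumptions and not part of the derivation.

```python
import math


def expected_exp_neg_w(mu, v):
    """Closed form of E[exp(-w)] for w ~ N(mu, v), as in Eq. (6)."""
    return math.exp(-mu + v / 2.0)


def expected_exp_neg_w_quadrature(mu, v, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of the same expectation.

    The integration range mu +/- half_width standard deviations is an
    arbitrary choice that is wide enough for moderate values of v.
    """
    sd = math.sqrt(v)
    lo = mu - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps):
        w = lo + (i + 0.5) * h
        density = math.exp(-(w - mu) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        total += density * math.exp(-w) * h
    return total
```

Both functions take the variance `v` rather than the standard deviation, matching the role of $v_{mk}$ in Eq. (6).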
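$$
\begin{aligned}
\int \sum_C q(C) q(Z) \ln p(C \mid Z)\, dZ
&= \sum_m \sum_n \int q(Z_m) \sum_k \phi_{mnk} \ln \frac{Z_{mk}}{\sum_{j=1}^{T} Z_{mj}}\, dZ_m \\
&= \sum_m \sum_k \left( \sum_n \phi_{mnk} \right) \int q(Z_{mk}) \ln Z_{mk}\, dZ_{mk} - \sum_m N_m \int q(Z_m) \ln \left( \sum_{j=1}^{T} Z_{mj} \right) dZ_m.
\end{aligned}
\tag{8}
$$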
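Since $\ln x \leq \frac{x}{\xi} - 1 + \ln \xi$ for any $\xi > 0$,
$$
\int q(Z_m) \ln \left( \sum_{j=1}^{T} Z_{mj} \right) dZ_m
\leq \int q(Z_m) \left( \frac{\sum_j Z_{mj}}{\xi_m} - 1 + \ln \xi_m \right) dZ_m
= \frac{1}{\xi_m} \sum_k \frac{a_{mk}}{b_{mk}} - 1 + \ln \xi_m.
\tag{9}
$$
Therefore, we obtain the following lower bound:
$$
\begin{aligned}
\int \sum_C q(C) q(Z) \ln p(C \mid Z)\, dZ
&\geq \sum_m \sum_k \left( \sum_n \phi_{mnk} \right) \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} \\
&\quad - \sum_m \frac{N_m}{\xi_m} \sum_k \frac{a_{mk}}{b_{mk}} + \sum_m N_m - \sum_m N_m \ln \xi_m.
\end{aligned}
\tag{10}
$$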
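The bound used in Eq. (9) can be checked numerically. The sketch below is an illustration only; the function names are assumptions. It also computes the value of $\xi_m$ that makes the right-hand side of Eq. (9) smallest, namely $\xi_m = \sum_k a_{mk} / b_{mk}$, which follows from setting the derivative of $\frac{1}{\xi_m} \sum_k \frac{a_{mk}}{b_{mk}} + \ln \xi_m$ with respect to $\xi_m$ to zero (a step that is not spelled out above).

```python
import math


def log_upper_bound(x, xi):
    """Right-hand side of ln x <= x / xi - 1 + ln xi, valid for xi > 0."""
    return x / xi - 1.0 + math.log(xi)


def tightest_xi(a_row, b_row):
    """xi_m minimizing the bound in Eq. (9): the sum of E[Z_mk] = a_mk / b_mk."""
    return sum(a / b for a, b in zip(a_row, b_row))
```

The bound is tight when `xi` equals `x`, so any other choice of `xi` only loosens the lower bound in Eq. (10).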
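$$
\begin{aligned}
\int q(w) \ln p(w \mid m, K)\, dw
&= \sum_m \int q(w_m) \ln p(w_m \mid m, K)\, dw_m \\
&= \sum_m \left[ -\frac{T}{2} \ln 2\pi - \frac{1}{2} \ln |K| - \frac{1}{2} \int q(w_m) (w_m - m)^\top K^{-1} (w_m - m)\, dw_m \right] \\
&= -\frac{MT \ln 2\pi}{2} - \frac{M \ln |K|}{2} - \frac{1}{2} \sum_m \Bigg\{ \sum_k (\mu_{mk}^2 + v_{mk}) K^{-1}_{kk} - 2 \sum_k m_k \mu_{mk} K^{-1}_{kk} + \sum_k m_k^2 K^{-1}_{kk} \\
&\qquad + \sum_k \sum_{j \neq k} (\mu_{mk} \mu_{mj} - 2 \mu_{mk} m_j + m_k m_j) K^{-1}_{kj} \Bigg\} \\
&= -\frac{MT \ln 2\pi}{2} - \frac{M \ln |K|}{2} - \frac{1}{2} \sum_m \left\{ \sum_k v_{mk} K^{-1}_{kk} + \sum_k \sum_j (\mu_{mk} - m_k)(\mu_{mj} - m_j) K^{-1}_{kj} \right\},
\end{aligned}
\tag{11}
$$
where $K^{-1}_{kj}$ denotes the $(k, j)$ entry of $K^{-1}$. The constant involves $T$ because each $w_m$ is a $T$-dimensional vector.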
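$$
\begin{aligned}
\int q(\eta) \ln p(\eta)\, d\eta
&= \sum_k \int \frac{\Gamma\left(\sum_d \gamma'_{kd}\right)}{\prod_d \Gamma(\gamma'_{kd})} \prod_{d=1}^{D} \eta_{kd}^{\gamma'_{kd} - 1} \left\{ \ln \Gamma(D\gamma) - D \ln \Gamma(\gamma) + \sum_{d'} (\gamma - 1) \ln \eta_{kd'} \right\} d\eta_k \\
&= T \ln \Gamma(D\gamma) - TD \ln \Gamma(\gamma) + (\gamma - 1) \sum_k \sum_d \left\{ \psi(\gamma'_{kd}) - \psi(\gamma'_k) \right\}
\end{aligned}
\tag{12}
$$
$$
\ln p(V \mid \alpha) = T \ln \Gamma(\alpha + 1) - T \ln \Gamma(\alpha) + (\alpha - 1) \sum_k \ln (1 - V_k)
\tag{13}
$$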
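$$
\int q(Z) \ln q(Z)\, dZ = -\sum_m \sum_k \left\{ \ln \Gamma(a_{mk}) - (a_{mk} - 1) \psi(a_{mk}) - \ln b_{mk} + a_{mk} \right\}
\tag{14}
$$
$$
\sum_C q(C) \ln q(C) = \sum_m \sum_n \sum_k \phi_{mnk} \ln \phi_{mnk}
\tag{15}
$$
$$
\int q(w) \ln q(w)\, dw = -\frac{MT (1 + \ln 2\pi)}{2} - \sum_m \sum_k \frac{\ln v_{mk}}{2}
\tag{16}
$$
$$
\int q(\eta) \ln q(\eta)\, d\eta = \sum_k \left[ \sum_d (\gamma'_{kd} - 1) \left\{ \psi(\gamma'_{kd}) - \psi(\gamma'_k) \right\} + \ln \Gamma(\gamma'_k) - \sum_d \ln \Gamma(\gamma'_{kd}) \right]
\tag{17}
$$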
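Setting $\frac{\partial L}{\partial b_{mk}} = 0$ gives
$$
0 = -b_{mk} \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) + \sum_{n=1}^{N_m} \phi_{mnk} \right\} + a_{mk} \left\{ \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m} \right\}.
\tag{22}
$$
Therefore,
$$
b_{mk} = a_{mk} \cdot \frac{\exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m}}{\beta V_k \prod_{j=1}^{k-1} (1 - V_j) + \sum_{n=1}^{N_m} \phi_{mnk}}.
\tag{23}
$$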
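$$
\begin{aligned}
\frac{\partial L}{\partial a_{mk}}
&= \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) - 1 \right\} \psi'(a_{mk}) - \frac{1}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) \\
&\quad + \left( \sum_{n=1}^{N_m} \phi_{mnk} \right) \psi'(a_{mk}) - \frac{N_m}{\xi_m} \frac{1}{b_{mk}} - (a_{mk} - 1) \psi'(a_{mk}) + 1 \\
&= \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) + \sum_{n=1}^{N_m} \phi_{mnk} - a_{mk} \right\} \psi'(a_{mk}) - \frac{1}{b_{mk}} \left\{ \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m} \right\} + 1
\end{aligned}
\tag{24}
$$
By using the result for $b_{mk}$, we obtain
$$
\begin{aligned}
\frac{\partial L}{\partial a_{mk}}
&= \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) + \sum_{n=1}^{N_m} \phi_{mnk} - a_{mk} \right\} \psi'(a_{mk}) - \frac{\beta V_k \prod_{j=1}^{k-1} (1 - V_j) + \sum_{n=1}^{N_m} \phi_{mnk}}{a_{mk}} + 1 \\
&= \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) + \sum_{n=1}^{N_m} \phi_{mnk} - a_{mk} \right\} \left\{ \psi'(a_{mk}) - \frac{1}{a_{mk}} \right\}.
\end{aligned}
$$
Since $\psi'(a) > 1/a$ for all $a > 0$, the second factor never vanishes, so the derivative is zero only when the first factor is zero:
$$
\therefore\ a_{mk} = \beta V_k \prod_{j=1}^{k-1} (1 - V_j) + \sum_{n=1}^{N_m} \phi_{mnk}, \qquad
b_{mk} = \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) + \frac{N_m}{\xi_m}.
\tag{25}
$$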
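The closed-form update in Eq. (25) can be checked by evaluating the terms of the lower bound that depend on $a_{mk}$ and $b_{mk}$ and confirming that perturbed values do not improve on it. The sketch below is illustrative only: the function names and the argument `beta_p` (standing for $\beta V_k \prod_{j<k} (1 - V_j)$) are assumptions, and `digamma` is a small hand-written approximation because the Python standard library does not provide one.

```python
import math


def digamma(x):
    """Digamma via psi(x) = psi(x + 1) - 1/x and an asymptotic series (x > 0)."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def update_q_z(beta_p, phi_sum, mu, v, n_tokens, xi):
    """Shape a_mk and rate b_mk from Eq. (25); phi_sum is sum_n phi_mnk."""
    a = beta_p + phi_sum
    b = math.exp(-mu + v / 2.0) + n_tokens / xi
    return a, b


def bound_terms_q_z(a, b, beta_p, phi_sum, mu, v, n_tokens, xi):
    """Terms of the bound that depend on (a_mk, b_mk): Eqs. (7), (10), (14)."""
    e_ln_z = digamma(a) - math.log(b)   # E[ln Z_mk]
    e_z = a / b                         # E[Z_mk]
    entropy = math.lgamma(a) - (a - 1.0) * digamma(a) - math.log(b) + a
    rate = math.exp(-mu + v / 2.0) + n_tokens / xi
    return (beta_p - 1.0 + phi_sum) * e_ln_z - e_z * rate + entropy
```

For example, `update_q_z(0.7, 3.2, 0.1, 0.4, 10, 5.0)` returns the pair that maximizes `bound_terms_q_z` for the same arguments.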
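3.3 Update $q(w_{mk})$
$$
\frac{\partial L}{\partial \mu_{mk}} = \frac{a_{mk}}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) - \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \right\} - \sum_{j=1}^{T} (\mu_{mj} - m_j) K^{-1}_{kj}
\tag{26}
$$
$$
\frac{\partial L}{\partial v_{mk}} = \frac{1}{2} \left\{ -\frac{a_{mk}}{b_{mk}} \exp\left( -\mu_{mk} + \frac{v_{mk}}{2} \right) - K^{-1}_{kk} + \frac{1}{v_{mk}} \right\}
\tag{27}
$$
The plus and minus signs on the right-hand side of the second line of Eq. (22) in the original paper [2] differ from those given above. We may use L-BFGS to update $\mu_{mk}$ and $v_{mk}$.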
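Because Eqs. (26) and (27) are meant to be passed to a gradient-based optimizer, it is worth verifying them against finite differences. The following sketch does so for a single document $m$. The function names and argument layout are assumptions; `beta_p[k]` stands for $\beta V_k \prod_{j<k} (1 - V_j)$, `K_inv` is assumed to be a symmetric matrix given as nested lists, and the optimizer itself (for example L-BFGS) is not included.

```python
import math


def bound_w(mu, v, a, b, beta_p, mean, K_inv):
    """Terms of the bound that depend on mu_m and v_m (Eqs. (7), (11), (16))."""
    T = len(mu)
    value = 0.0
    for k in range(T):
        value -= beta_p[k] * mu[k]
        value -= (a[k] / b[k]) * math.exp(-mu[k] + v[k] / 2.0)
        value += 0.5 * math.log(v[k]) - 0.5 * v[k] * K_inv[k][k]
        for j in range(T):
            value -= 0.5 * (mu[k] - mean[k]) * (mu[j] - mean[j]) * K_inv[k][j]
    return value


def grad_mu(mu, v, a, b, beta_p, mean, K_inv):
    """Gradient with respect to mu_mk, following Eq. (26)."""
    T = len(mu)
    return [
        (a[k] / b[k]) * math.exp(-mu[k] + v[k] / 2.0)
        - beta_p[k]
        - sum((mu[j] - mean[j]) * K_inv[k][j] for j in range(T))
        for k in range(T)
    ]


def grad_v(mu, v, a, b, K_inv):
    """Gradient with respect to v_mk, following Eq. (27)."""
    return [
        0.5 * (-(a[k] / b[k]) * math.exp(-mu[k] + v[k] / 2.0) - K_inv[k][k] + 1.0 / v[k])
        for k in range(len(v))
    ]
```

In practice $v_{mk}$ must stay positive, so an unconstrained optimizer would typically work with $\ln v_{mk}$ instead; that reparameterization is a design choice not covered by the derivation above.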
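3.4 Update $q(\eta_k)$
$$
\begin{aligned}
\frac{\partial L}{\partial \gamma'_{kd}}
&= \sum_m \sum_n I(X_{mn} = d) \phi_{mnk} \psi'(\gamma'_{kd}) - \sum_m \sum_n \phi_{mnk} \psi'(\gamma'_k) + (\gamma - 1) \psi'(\gamma'_{kd}) - (\gamma - 1) \sum_{d'} \psi'(\gamma'_k) \\
&\quad - \psi(\gamma'_{kd}) + \psi(\gamma'_k) - (\gamma'_{kd} - 1) \psi'(\gamma'_{kd}) + \sum_{d'} (\gamma'_{kd'} - 1) \psi'(\gamma'_k) - \psi(\gamma'_k) + \psi(\gamma'_{kd}) \\
&= \sum_m \sum_n I(X_{mn} = d) \phi_{mnk} \psi'(\gamma'_{kd}) - \sum_m \sum_n \phi_{mnk} \psi'(\gamma'_k) + (\gamma - \gamma'_{kd}) \psi'(\gamma'_{kd}) - \sum_{d'} (\gamma - \gamma'_{kd'}) \psi'(\gamma'_k) \\
&= \psi'(\gamma'_{kd}) \left\{ \sum_m \sum_n I(X_{mn} = d) \phi_{mnk} + \gamma - \gamma'_{kd} \right\} - \psi'(\gamma'_k) \sum_{d'} \left\{ \sum_m \sum_n I(X_{mn} = d') \phi_{mnk} + \gamma - \gamma'_{kd'} \right\}
\end{aligned}
$$
Here $d'$ is a summation index over the vocabulary, distinct from the fixed word $d$. Every brace vanishes when
$$
\therefore\ \gamma'_{kd} = \gamma + \sum_m \sum_n I(X_{mn} = d) \phi_{mnk}.
\tag{28}
$$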
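The update in Eq. (28) amounts to adding soft counts to the prior parameter $\gamma$. A minimal sketch follows; the data layout is an assumption made for illustration: `docs[m][n]` is the word index $X_{mn}$, and `phi[m][n][k]` is $\phi_{mnk}$.

```python
def update_gamma(docs, phi, gamma, n_topics, vocab_size):
    """gamma'_kd = gamma + sum_m sum_n I(X_mn = d) * phi_mnk, as in Eq. (28)."""
    gamma_prime = [[gamma] * vocab_size for _ in range(n_topics)]
    for doc, phi_doc in zip(docs, phi):
        for word, phi_token in zip(doc, phi_doc):
            # Each token contributes its topic responsibilities to its own word.
            for k in range(n_topics):
                gamma_prime[k][word] += phi_token[k]
    return gamma_prime
```

Iterating over tokens rather than over all $(k, d)$ pairs keeps the cost proportional to the number of tokens times $T$.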
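3.5 Update $q(V_k)$
Recall that $p_k \equiv V_k \prod_{j=1}^{k-1} (1 - V_j)$, so that $\frac{\partial p_k}{\partial V_k} = \prod_{j=1}^{k-1} (1 - V_j) = \frac{p_k}{V_k}$ and $\frac{\partial p_{\hat{k}}}{\partial V_k} = -\frac{p_{\hat{k}}}{1 - V_k}$ for $\hat{k} > k$. The terms with $\hat{k} > k$ therefore enter with the opposite sign, and $\ln \Gamma(\beta p_k)$ appears once per document, which gives the factor $M$ below.
$$
\begin{aligned}
\frac{\partial L}{\partial V_k}
&= -\frac{\alpha - 1}{1 - V_k} - \beta \prod_{j=1}^{k-1} (1 - V_j) \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} \\
&\quad + \frac{1}{1 - V_k} \sum_{\hat{k}=k+1}^{T} \left\{ \beta V_{\hat{k}} \prod_{j=1}^{\hat{k}-1} (1 - V_j) \right\} \sum_{m=1}^{M} \left\{ \mu_{m\hat{k}} - \psi(a_{m\hat{k}}) + \ln b_{m\hat{k}} \right\} \\
&\quad - M \beta \prod_{j=1}^{k-1} (1 - V_j)\, \psi\left( \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \right)
+ M \sum_{\hat{k}=k+1}^{T} \frac{1}{1 - V_k} \beta V_{\hat{k}} \prod_{j=1}^{\hat{k}-1} (1 - V_j)\, \psi\left( \beta V_{\hat{k}} \prod_{j=1}^{\hat{k}-1} (1 - V_j) \right) \\
&= -\frac{\alpha - 1}{1 - V_k} - \beta \prod_{j=1}^{k-1} (1 - V_j) \left[ \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} + M \psi\left( \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \right) \right] \\
&\quad + \beta \prod_{j=1}^{k-1} (1 - V_j) \sum_{\hat{k}=k+1}^{T} \left\{ V_{\hat{k}} \prod_{j=k+1}^{\hat{k}-1} (1 - V_j) \right\} \left[ \sum_{m=1}^{M} \left\{ \mu_{m\hat{k}} - \psi(a_{m\hat{k}}) + \ln b_{m\hat{k}} \right\} + M \psi\left( \beta V_{\hat{k}} \prod_{j=1}^{\hat{k}-1} (1 - V_j) \right) \right] \\
&= -\frac{\alpha - 1}{1 - V_k} - \frac{\beta p_k}{V_k} \left[ \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} + M \psi(\beta p_k) \right] \\
&\quad + \sum_{j=k+1}^{T} \frac{\beta p_j}{1 - V_k} \left[ \sum_{m=1}^{M} \left\{ \mu_{mj} - \psi(a_{mj}) + \ln b_{mj} \right\} + M \psi(\beta p_j) \right]
\end{aligned}
\tag{29}
$$
In the second equality we used $\frac{1}{1 - V_k} V_{\hat{k}} \prod_{j=1}^{\hat{k}-1} (1 - V_j) = \prod_{j=1}^{k-1} (1 - V_j) \cdot V_{\hat{k}} \prod_{j=k+1}^{\hat{k}-1} (1 - V_j)$ for $\hat{k} > k$.
I think that $V_k$ on the second line of Eq. (24) in the original paper [2] is not required.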
3.6 Update $q(K)$
With respect to $K$, we maximize the following function:
$$
L(K) = -\frac{M}{2} \ln |K| - \frac{1}{2} \sum_{m=1}^{M} \sum_{k=1}^{T} v_{mk} K^{-1}_{kk} - \frac{1}{2} \sum_{m=1}^{M} (\mu_m - m)^\top K^{-1} (\mu_m - m),
\tag{30}
$$
where the last term is equal to $-\frac{1}{2} \sum_{m=1}^{M} \sum_{k=1}^{T} \sum_{j=1}^{T} (\mu_{mk} - m_k)(\mu_{mj} - m_j) K^{-1}_{kj}$.
The derivative of the first term on the right-hand side of Eq. (30) is obtained from the following identity (cf. Eq. (51) of The Matrix Cookbook, http://orion.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf):
$$
\frac{\partial \ln |K|}{\partial K} = K^{-1}.
\tag{31}
$$
For the second term on the right-hand side of Eq. (30), it holds that $\sum_k v_{mk} K^{-1}_{kk} = \mathrm{Tr}[K^{-1} \mathrm{diag}(v_m)]$, where $\mathrm{diag}(v_m)$ is a diagonal matrix whose $k$th diagonal entry is $v_{mk}$. By using the following identity (cf. Eq. (16) in Old and New Matrix Algebra Useful for Statistics, http://research.microsoft.com/en-us/um/people/minka/papers/matrix/minka-matrix.pdf):
$$
\frac{\partial \mathrm{Tr}[A \Sigma^{-1} B]}{\partial \Sigma} = -\Sigma^{-1} B A \Sigma^{-1},
\tag{32}
$$
we obtain
$$
\frac{\partial \sum_m \sum_k v_{mk} K^{-1}_{kk}}{\partial K} = -K^{-1} \left\{ \sum_m \mathrm{diag}(v_m) \right\} K^{-1}.
$$
For the last term in Eq. (30), it holds that
$$
(\mu_m - m)^\top K^{-1} (\mu_m - m) = \mathrm{Tr}\left[ (\mu_m - m)^\top K^{-1} (\mu_m - m) \right].
\tag{33}
$$
Therefore, by using Eq. (32), we obtain
$$
\frac{\partial (\mu_m - m)^\top K^{-1} (\mu_m - m)}{\partial K} = -K^{-1} (\mu_m - m)(\mu_m - m)^\top K^{-1}.
$$
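Consequently, we have
$$
\frac{\partial L(K)}{\partial K} = -\frac{M}{2} K^{-1} + \frac{1}{2} K^{-1} \left\{ \sum_m \mathrm{diag}(v_m) \right\} K^{-1} + \frac{1}{2} K^{-1} \sum_m \left\{ (\mu_m - m)(\mu_m - m)^\top \right\} K^{-1}.
\tag{34}
$$
$\frac{\partial L(K)}{\partial K} = 0$ holds when
$$
K^{-1} = \frac{1}{M} K^{-1} \sum_m \left\{ \mathrm{diag}(v_m) + (\mu_m - m)(\mu_m - m)^\top \right\} K^{-1}.
\tag{35}
$$
By multiplying both sides of the above equation by $K$ from the left and from the right, we obtain
$$
K = \frac{1}{M} \sum_m \left\{ \mathrm{diag}(v_m) + (\mu_m - m)(\mu_m - m)^\top \right\}.
\tag{36}
$$
This derivation is identical to that for CTM [1].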
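3.7 Update $q(m)$
Only the quadratic term of Eq. (11) depends on the mean vector, and it is summed over all $M$ documents:
$$
\frac{\partial L}{\partial m_k} = \sum_{m=1}^{M} \sum_{j=1}^{T} (\mu_{mj} - m_j) K^{-1}_{kj}, \qquad
\therefore\ m_k = \frac{1}{M} \sum_{m=1}^{M} \mu_{mk}.
\tag{37}
$$
Since $K^{-1}$ is nonsingular, the gradient vanishes for every $k$ exactly when $\sum_{m=1}^{M} (\mu_{mj} - m_j) = 0$ for every $j$.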
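The two closed-form updates for the Gaussian parameters can be sketched in a few lines. This is an illustration under stated assumptions: `mu[m][k]` and `v[m][k]` hold $\mu_{mk}$ and $v_{mk}$ as nested lists, the function names are hypothetical, and the mean update is taken to be the average of $\mu_{mk}$ over documents, which is what setting the gradient summed over all documents to zero yields.

```python
def update_mean(mu):
    """m_k = (1/M) * sum_m mu[m][k], the average over documents."""
    n_docs = len(mu)
    return [sum(row[k] for row in mu) / n_docs for k in range(len(mu[0]))]


def update_covariance(mu, v, mean):
    """K = (1/M) * sum_m { diag(v_m) + (mu_m - m)(mu_m - m)^T }, as in Eq. (36)."""
    n_docs, n_topics = len(mu), len(mean)
    K = [[0.0] * n_topics for _ in range(n_topics)]
    for mu_m, v_m in zip(mu, v):
        diff = [mu_m[k] - mean[k] for k in range(n_topics)]
        for k in range(n_topics):
            K[k][k] += v_m[k] / n_docs          # diagonal variance part
            for j in range(n_topics):
                K[k][j] += diff[k] * diff[j] / n_docs  # outer-product part
    return K
```

A typical call order is `mean = update_mean(mu)` followed by `K = update_covariance(mu, v, mean)`, since Eq. (36) depends on the current mean.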
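3.8 Update $q(\alpha)$
With respect to $\alpha$, we maximize the following function:
$$
L(\alpha) = T \ln \Gamma(\alpha + 1) - T \ln \Gamma(\alpha) + (\alpha - 1) \sum_{k=1}^{T} \ln (1 - V_k).
\tag{38}
$$
We use the following bound (cf. Eqs. (120), (121), and (122) in Estimating a Dirichlet distribution, http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/), which holds with equality at $x = \hat{x}$:
$$
\frac{\Gamma(n + x)}{\Gamma(x)} \geq c x^a \quad \text{if } n \geq 1
\tag{39}
$$
$$
a = \left\{ \psi(n + \hat{x}) - \psi(\hat{x}) \right\} \hat{x}
\tag{40}
$$
$$
c = \frac{\Gamma(n + \hat{x})}{\Gamma(\hat{x})} \hat{x}^{-a}
\tag{41}
$$
Then, with $\hat{\alpha}$ denoting the current value of $\alpha$, we obtain:
$$
L(\alpha) \geq T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\} \hat{\alpha} \ln \alpha + (\alpha - 1) \sum_{k=1}^{T} \ln (1 - V_k) + \mathrm{const}.
\tag{42}
$$
We maximize this lower bound, which we again denote as $L(\alpha)$.
$$
\frac{\partial L(\alpha)}{\partial \alpha} = \frac{1}{\alpha} T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\} \hat{\alpha} + \sum_{k=1}^{T} \ln (1 - V_k)
\tag{43}
$$
$$
\therefore\ \alpha \leftarrow \hat{\alpha} \cdot \frac{T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\}}{-\sum_{k=1}^{T} \ln (1 - V_k)}
\tag{44}
$$
This is a multiplicative update.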
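When we apply a Gamma prior $p(\alpha) = \frac{b_0^{a_0}}{\Gamma(a_0)} \alpha^{a_0 - 1} e^{-b_0 \alpha}$ to $\alpha$, we have the following result, where $\hat{\alpha}$ denotes the current value of $\alpha$:
$$
\frac{\partial L(\alpha)}{\partial \alpha} = \frac{1}{\alpha} T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\} \hat{\alpha} + \sum_{k=1}^{T} \ln (1 - V_k) + (a_0 - 1) \frac{1}{\alpha} - b_0
\tag{45}
$$
Setting this derivative to zero and solving for $\alpha$ gives
$$
\therefore\ \alpha \leftarrow \frac{a_0 - 1 + \hat{\alpha}\, T \left\{ \psi(\hat{\alpha} + 1) - \psi(\hat{\alpha}) \right\}}{b_0 - \sum_{k=1}^{T} \ln (1 - V_k)}
\tag{46}
$$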
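The $\alpha$ update, with or without the Gamma prior, can be sketched as follows. The code solves Eq. (45) set to zero for $\alpha$ (and Eq. (43) when no prior is given). The function names and the optional `a0`, `b0` arguments are assumptions for illustration, `digamma` is a hand-written approximation, and every $V_k$ is assumed to be strictly less than 1 so that the logarithms are finite.

```python
import math


def digamma(x):
    """Digamma via psi(x) = psi(x + 1) - 1/x and an asymptotic series (x > 0)."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def update_alpha(alpha, V, a0=None, b0=None):
    """One update of alpha; a0 and b0 are the optional Gamma prior parameters."""
    T = len(V)
    log_term = sum(math.log(1.0 - vk) for vk in V)  # sum_k ln(1 - V_k), negative
    weight = alpha * T * (digamma(alpha + 1.0) - digamma(alpha))
    if a0 is None or b0 is None:
        return weight / (-log_term)
    return (a0 - 1.0 + weight) / (b0 - log_term)
```

Because $\psi(\alpha + 1) - \psi(\alpha) = 1/\alpha$, the quantity `weight` always equals $T$, so a single update already lands on the fixed point $T / \{-\sum_k \ln (1 - V_k)\}$ in the case without a prior.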
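3.9 Update $q(\beta)$
With respect to $\beta$, we maximize the following function $L(\beta)$, where the factor $M$ accounts for the fact that $\ln \Gamma(\beta p_k)$ appears once per document:
$$
\begin{aligned}
L(\beta)
&= -\sum_{k=1}^{T} \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \right\} \sum_{m=1}^{M} \mu_{mk} - M \sum_{k=1}^{T} \ln \Gamma\left( \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \right) \\
&\quad + \sum_{k=1}^{T} \left\{ \beta V_k \prod_{j=1}^{k-1} (1 - V_j) \right\} \sum_{m=1}^{M} \left\{ \psi(a_{mk}) - \ln b_{mk} \right\} \\
&= -\sum_{k=1}^{T} \beta p_k \sum_{m=1}^{M} \mu_{mk} - M \sum_{k=1}^{T} \ln \Gamma(\beta p_k) + \sum_{k=1}^{T} \beta p_k \sum_{m=1}^{M} \left\{ \psi(a_{mk}) - \ln b_{mk} \right\}
\end{aligned}
\tag{47}
$$
The first and the second derivatives are obtained as follows:
$$
\begin{aligned}
\frac{\partial L(\beta)}{\partial \beta} &= -\sum_{k=1}^{T} p_k \left[ M \psi(\beta p_k) + \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} \right] \\
\frac{\partial^2 L(\beta)}{\partial \beta^2} &= -M \sum_{k=1}^{T} p_k^2 \psi'(\beta p_k)
\end{aligned}
\tag{48}
$$
We can use Newton's method to update $\beta$.
When we apply a Gamma prior $p(\beta) = \frac{d_0^{c_0}}{\Gamma(c_0)} \beta^{c_0 - 1} e^{-d_0 \beta}$ to $\beta$, we have the following result:
$$
\begin{aligned}
\frac{\partial L(\beta)}{\partial \beta} &= -\sum_{k=1}^{T} p_k \left[ M \psi(\beta p_k) + \sum_{m=1}^{M} \left\{ \mu_{mk} - \psi(a_{mk}) + \ln b_{mk} \right\} \right] + (c_0 - 1) \frac{1}{\beta} - d_0 \\
\frac{\partial^2 L(\beta)}{\partial \beta^2} &= -M \sum_{k=1}^{T} p_k^2 \psi'(\beta p_k) - (c_0 - 1) \frac{1}{\beta^2}
\end{aligned}
\tag{49}
$$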
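A Newton iteration for $\beta$ without the Gamma prior might look like the sketch below. Several things here are assumptions for illustration: the function names, the argument `s[k]` standing for $\sum_m \{\mu_{mk} - \psi(a_{mk}) + \ln b_{mk}\}$, the hand-written `digamma` and `trigamma` approximations, the safeguard that halves $\beta$ when a Newton step would make it non-positive, and the convention that $\ln \Gamma(\beta p_k)$ is counted once per document (the factor `n_docs`).

```python
import math


def digamma(x):
    """Digamma via psi(x) = psi(x + 1) - 1/x and an asymptotic series (x > 0)."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def trigamma(x):
    """Trigamma via psi'(x) = psi'(x + 1) + 1/x^2 and an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))


def beta_gradient(beta, p, s, n_docs):
    """First derivative of L(beta) without a prior on beta."""
    return -sum(pk * (n_docs * digamma(beta * pk) + sk) for pk, sk in zip(p, s))


def beta_hessian(beta, p, n_docs):
    """Second derivative of L(beta) without a prior; always negative."""
    return -n_docs * sum(pk * pk * trigamma(beta * pk) for pk in p)


def newton_beta(beta, p, s, n_docs, max_iter=50, tol=1e-12):
    """Safeguarded Newton iteration for the maximizer of L(beta)."""
    for _ in range(max_iter):
        candidate = beta - beta_gradient(beta, p, s, n_docs) / beta_hessian(beta, p, n_docs)
        if candidate <= 0.0:
            candidate = beta / 2.0  # keep beta positive (an added safeguard)
        if abs(candidate - beta) < tol:
            return candidate
        beta = candidate
    return beta
```

Since the second derivative is negative for every $\beta > 0$, $L(\beta)$ is concave and the stationary point found by the iteration is its unique maximizer.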
References
[1] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, 2005.
[2] John Paisley, Chong Wang, and David Blei. The discrete infinite logistic normal distribution for
mixed-membership modeling. In AISTATS, 2011.