# LDA: The Gritty Details



**Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details**

Yi Wang (yiwang@tencent.com)

*This is an early draft. Your feedback is highly appreciated.*

## 1 Introduction

I started writing this chapter in November 2007, right after my first MapReduce implementation of the AD-LDA algorithm [7]. I had worked hard to understand LDA, but could not find any document that was comprehensive, complete, and contained all the necessary details. It is true that there are open-source implementations of LDA, for example the LDA Gibbs sampler in Java[^1] and GibbsLDA++[^2], but code does not reveal the math derivations. I read Griffiths' paper [4] and technical report [3], but realized that most derivations there are skipped. Luckily, the paper on Topics-over-Time [9], an extension of LDA, shows part of the derivation. That derivation is helpful for understanding LDA, but it is short and not self-contained. A primer [5] by Gregor Heinrich contains more details than the documents mentioned above; indeed, most of my initial understanding of LDA comes from [5]. As [5] focuses more on the Bayesian background of latent variable modeling, this article focuses on the derivation of the Gibbs sampling algorithm for LDA. You may notice that the Gibbs sampling updating rule derived in (38) differs slightly from the one presented in [3] and [5], but matches the one in [9].

[^1]: http://arbylon.net/projects/LdaGibbsSampler.java
[^2]: http://gibbslda++.sourceforge.net

## 2 LDA and Its Learning Problem

In the literature, the training documents of LDA are often denoted by a long, segmented vector $W$:

$$
W = \begin{pmatrix}
\{w_1, \ldots, w_{N_1}\}, & \text{words in the 1st document} \\
\{w_{N_1+1}, \ldots, w_{N_1+N_2}\}, & \text{words in the 2nd document} \\
\vdots & \vdots \\
\{w_{1+\sum_{j=1}^{D-1} N_j}, \ldots, w_{\sum_{j=1}^{D} N_j}\}, & \text{words in the $D$-th document}
\end{pmatrix},
\tag{1}
$$
where $N_d$ denotes the number of words in the $d$-th document.

> **Footnote 3.** Another representation of the training documents is a pair of vectors $d$ and $W$:
>
> $$
> \begin{pmatrix} d \\ W \end{pmatrix} = \begin{pmatrix} d_1, \ldots, d_N \\ w_1, \ldots, w_N \end{pmatrix},
> \tag{2}
> $$
>
> where $N = \sum_{j=1}^{D} N_j$, $d_i$ indexes a document, $w_i$ indexes a vocabulary word, and the tuple $[d_i, w_i]^T$ denotes an edge between the two vertices indexed by $d_i$ and $w_i$ respectively. This representation is comprehensive and is used in many implementations of LDA, e.g., [8].

In the rest of this article, we consider the hidden variables

$$
Z = \begin{pmatrix}
\{z_1, \ldots, z_{N_1}\}, & \text{topics of words in the 1st document} \\
\{z_{N_1+1}, \ldots, z_{N_1+N_2}\}, & \text{topics of words in the 2nd document} \\
\vdots & \vdots \\
\{z_{1+\sum_{j=1}^{D-1} N_j}, \ldots, z_{\sum_{j=1}^{D} N_j}\}, & \text{topics of words in the $D$-th document}
\end{pmatrix},
\tag{3}
$$

where each $z_i \in Z$ corresponds to a word $w_i \in W$ in (1).

## 3 Dirichlet and Multinomial

To derive the Gibbs sampling algorithm for LDA, it is important to be familiar with the conjugacy between the Dirichlet distribution and the multinomial distribution. The Dirichlet distribution is defined as

$$
\mathrm{Dir}(\mathbf{p}; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{t=1}^{|\boldsymbol{\alpha}|} p_t^{\alpha_t - 1},
\tag{4}
$$

where the normalizing constant is the multinomial beta function, which can be expressed in terms of the gamma function:

$$
B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{|\boldsymbol{\alpha}|} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{|\boldsymbol{\alpha}|} \alpha_i\right)}.
\tag{5}
$$

When the Dirichlet distribution is used in LDA to model the prior, the $\alpha_i$'s are positive integers. In this case, the gamma function degenerates to the factorial function:

$$
\Gamma(n) = (n - 1)! \,.
\tag{6}
$$

The multinomial distribution is defined as

$$
\mathrm{Mult}(\mathbf{x}; \mathbf{p}) = \frac{n!}{\prod_{i=1}^{K} x_i!} \prod_{i=1}^{K} p_i^{x_i},
\tag{7}
$$

where $x_i$ denotes the number of times that value $i$ appears in the samples drawn from the discrete distribution $\mathbf{p}$, $K = |\mathbf{x}| = |\mathbf{p}| = |\boldsymbol{\alpha}|$, and $n = \sum_{i=1}^{K} x_i$. The normalization constant comes from the product of combinatorial numbers.

We say that the Dirichlet distribution is the conjugate distribution of the multinomial distribution, because if the prior of $\mathbf{p}$ is $\mathrm{Dir}(\mathbf{p}; \boldsymbol{\alpha})$ and $\mathbf{x}$ is generated
> However, many papers (including this article) do not use this representation, because it misleads readers into considering two sets of random variables, $W$ and $d$. In fact, $d$ is the structure of $W$ and is highly dependent on $W$; $d$ is also the structure of the hidden variables $Z$, as shown by (3).
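As a quick numerical sanity check of (5) and (6) — my own illustration, not part of the original text — the multinomial beta function can be evaluated via log-gamma, and for integer parameters it reduces to a ratio of factorials:

```python
import math

def log_multinomial_beta(alpha):
    # B(alpha) = prod_i Gamma(alpha_i) / Gamma(sum_i alpha_i), as in (5),
    # computed in log space to avoid overflow for large counts.
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

# Property (6): Gamma(5) = 4!
print(math.gamma(5), math.factorial(4))

# For integer parameters, B([2, 3, 5]) = 1! * 2! * 4! / 9!
print(math.exp(log_multinomial_beta([2, 3, 5])))
```

Working in log space matters in practice: the raw gamma values of corpus-scale counts overflow double precision immediately.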
by $\mathrm{Mult}(\mathbf{x}; \mathbf{p})$, then the posterior distribution of $\mathbf{p}$, $p(\mathbf{p} \mid \mathbf{x}, \boldsymbol{\alpha})$, is a Dirichlet distribution:

$$
p(\mathbf{p} \mid \mathbf{x}, \boldsymbol{\alpha}) = \mathrm{Dir}(\mathbf{p}; \mathbf{x} + \boldsymbol{\alpha})
= \frac{1}{B(\mathbf{x} + \boldsymbol{\alpha})} \prod_{t=1}^{|\boldsymbol{\alpha}|} p_t^{x_t + \alpha_t - 1}.
\tag{8}
$$

Because (8) is a probability distribution function, integrating it over $\mathbf{p}$ must give 1:

$$
1 = \int \frac{1}{B(\mathbf{x} + \boldsymbol{\alpha})} \prod_{t=1}^{|\boldsymbol{\alpha}|} p_t^{x_t + \alpha_t - 1} \, d\mathbf{p}
= \frac{1}{B(\mathbf{x} + \boldsymbol{\alpha})} \int \prod_{t=1}^{|\boldsymbol{\alpha}|} p_t^{x_t + \alpha_t - 1} \, d\mathbf{p}.
\tag{9}
$$

This implies

$$
\int \prod_{t=1}^{|\boldsymbol{\alpha}|} p_t^{x_t + \alpha_t - 1} \, d\mathbf{p} = B(\mathbf{x} + \boldsymbol{\alpha}).
\tag{10}
$$

This property will be used in the following derivations.

## 4 Learning LDA by Gibbs Sampling

There have been three strategies to learn LDA: EM with variational inference [2], EM with expectation propagation [6], and Gibbs sampling [4]. In this article we focus on the Gibbs sampling method, whose performance is comparable with the other two but which is more tolerant of local optima.

*Figure 1: The procedure of learning LDA by Gibbs sampling.*

Gibbs sampling is one of the class of sampling methods known as Markov chain Monte Carlo. We use it to sample from the posterior distribution $p(Z \mid W)$, given the training data $W$ represented in the form of (1). As will be shown in the following text, given a sample of $Z$ we can infer the model parameters $\Phi$ and $\Theta$. This forms a learning algorithm whose general framework is shown in Fig. 1.

In order to sample from $p(Z \mid W)$ using the Gibbs sampling method, we need the full conditional posterior distribution $p(z_i \mid Z_{-i}, W)$, where $Z_{-i}$ denotes all
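To make the conjugacy concrete, here is a small numerical sketch of (8) (mine; the prior and the counts are made up): observing multinomial counts $\mathbf{x}$ simply adds them to the Dirichlet parameters, so the posterior mean is $(\mathbf{x} + \boldsymbol{\alpha}) / \sum_t (x_t + \alpha_t)$.

```python
import numpy as np

# Hypothetical prior and observed counts (illustration only).
alpha = np.array([2.0, 3.0, 5.0])   # Dirichlet prior parameters
x = np.array([10.0, 0.0, 4.0])      # multinomial counts drawn from p

# By (8), the posterior over p is Dir(p; x + alpha); its mean is the
# normalized parameter vector.
posterior = x + alpha
posterior_mean = posterior / posterior.sum()

print(posterior_mean)   # a valid discrete distribution over 3 outcomes
```

Note how the zero count for the second outcome is smoothed by its prior pseudo-count: this is exactly the role $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ play in the LDA estimates later in the article.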
$z_j$'s with $j \neq i$. In particular, Gibbs sampling does not require knowing the exact form of $p(z_i \mid Z_{-i}, W)$; instead, it is enough to have a function $f(\cdot)$ with

$$
f(z_i \mid Z_{-i}, W) \propto p(z_i \mid Z_{-i}, W).
\tag{11}
$$

The following mathematical derivation produces such a function $f(\cdot)$.

### The Joint Distribution of LDA

We start by deriving the joint distribution[^4]

$$
p(Z, W \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) = p(W \mid Z, \boldsymbol{\beta}) \, p(Z \mid \boldsymbol{\alpha}),
\tag{12}
$$

which is the basis of the derivation of both the Gibbs updating rule and the parameter estimation rule. As $p(W \mid Z, \boldsymbol{\beta})$ and $p(Z \mid \boldsymbol{\alpha})$ depend on $\Phi$ and $\Theta$ respectively, we derive them separately. According to the definition of LDA, we have

$$
p(W \mid Z, \boldsymbol{\beta}) = \int p(W \mid Z, \Phi) \, p(\Phi \mid \boldsymbol{\beta}) \, d\Phi,
\tag{13}
$$

where $p(\Phi \mid \boldsymbol{\beta})$ is a product of Dirichlet distributions:

$$
p(\Phi \mid \boldsymbol{\beta}) = \prod_{k=1}^{K} p(\phi_k \mid \boldsymbol{\beta})
= \prod_{k=1}^{K} \frac{1}{B(\boldsymbol{\beta})} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1},
\tag{14}
$$

and $p(W \mid Z, \Phi)$ is a product of multinomials:

$$
p(W \mid Z, \Phi) = \prod_{i=1}^{N} \phi_{z_i, w_i}
= \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\Psi(k,v)},
\tag{15}
$$

where $\Psi$ is a $K \times V$ count matrix and $\Psi(k, v)$ is the number of times that topic $k$ is assigned to word $v$. With $W$ and $Z$ defined by (1) and (3) respectively, we can compute $\Psi(k, v)$ by

$$
\Psi(k, v) = \sum_{i=1}^{N} \mathbb{I}\{w_i = v \wedge z_i = k\},
\tag{16}
$$

where $N$ is the corpus size in words. Given (15) and (14), (13) becomes

$$
p(W \mid Z, \boldsymbol{\beta}) = \prod_{k=1}^{K} \frac{1}{B(\boldsymbol{\beta})} \int \prod_{v=1}^{V} \phi_{k,v}^{\Psi(k,v) + \beta_v - 1} \, d\phi_k.
\tag{17}
$$

Using the property of the integration of a product, which we learned in our college days,

$$
\int \!\cdots\! \int \prod_{k=1}^{K} f_k(\phi_k) \, d\phi_1 \cdots d\phi_K = \prod_{k=1}^{K} \int f_k(\phi_k) \, d\phi_k,
\tag{18}
$$

[^4]: Notations used in this article are identical with those used in [5].
we have

$$
p(W \mid Z, \boldsymbol{\beta}) = \prod_{k=1}^{K} \frac{1}{B(\boldsymbol{\beta})} \int \prod_{v=1}^{V} \phi_{k,v}^{\Psi(k,v) + \beta_v - 1} \, d\phi_k.
\tag{19}
$$

Using property (10), we can compute the integration over each $\phi_k$ in closed form, so

$$
p(W \mid Z, \boldsymbol{\beta}) = \prod_{k=1}^{K} \frac{B(\Psi_k + \boldsymbol{\beta})}{B(\boldsymbol{\beta})},
\tag{20}
$$

where $\Psi_k$ denotes the $k$-th row of the matrix $\Psi$.

Now we derive $p(Z \mid \boldsymbol{\alpha})$ analogously to $p(W \mid Z, \boldsymbol{\beta})$. Similar to (14), we have

$$
p(\Theta \mid \boldsymbol{\alpha}) = \prod_{d=1}^{D} p(\theta_d \mid \boldsymbol{\alpha})
= \prod_{d=1}^{D} \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k - 1};
\tag{21}
$$

similar to (15), we have

$$
p(Z \mid \Theta) = \prod_{i=1}^{N} \theta_{d_i, z_i}
= \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{\Omega(d,k)};
\tag{22}
$$

and similar to (20), we have

$$
p(Z \mid \boldsymbol{\alpha})
= \int p(Z \mid \Theta) \, p(\Theta \mid \boldsymbol{\alpha}) \, d\Theta
= \prod_{d=1}^{D} \frac{1}{B(\boldsymbol{\alpha})} \int \prod_{k=1}^{K} \theta_{d,k}^{\Omega(d,k) + \alpha_k - 1} \, d\theta_d
= \prod_{d=1}^{D} \frac{B(\Omega_d + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})},
\tag{23}
$$

where $\Omega$ is a $D \times K$ count matrix, $\Omega(d, k)$ is the number of times that topic $k$ is assigned to words in document $d$, and $\Omega_d$ denotes the $d$-th row of $\Omega$. With input data in the form of (2), $\Omega(d, k)$ can be computed as

$$
\Omega(d, k) = \sum_{i=1}^{N} \mathbb{I}\{d_i = d \wedge z_i = k\},
\tag{24}
$$

where $N$ is the corpus size in words. Given (20) and (23), the joint distribution (12) becomes

$$
p(Z, W \mid \boldsymbol{\alpha}, \boldsymbol{\beta})
= p(W \mid Z, \boldsymbol{\beta}) \, p(Z \mid \boldsymbol{\alpha})
= \prod_{k=1}^{K} \frac{B(\Psi_k + \boldsymbol{\beta})}{B(\boldsymbol{\beta})}
\cdot \prod_{d=1}^{D} \frac{B(\Omega_d + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}.
\tag{25}
$$
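The count matrices of (16) and (24) are straightforward to accumulate in one pass over the corpus. The sketch below (my own; the toy corpus and variable names are made up) builds both from the vector representation (2):

```python
import numpy as np

def count_matrices(W, Z, d, K, V, D):
    # Psi[k, v]  = #{i : w_i = v and z_i = k}, as in (16)
    # Omega[m, k] = #{i : d_i = m and z_i = k}, as in (24)
    Psi = np.zeros((K, V), dtype=int)
    Omega = np.zeros((D, K), dtype=int)
    for wi, zi, di in zip(W, Z, d):
        Psi[zi, wi] += 1
        Omega[di, zi] += 1
    return Psi, Omega

# Toy corpus: 2 documents of 3 words each, vocabulary of 4 words, 2 topics.
W = [0, 1, 1, 2, 3, 0]   # word indices
d = [0, 0, 0, 1, 1, 1]   # document index of each word
Z = [0, 0, 1, 1, 1, 0]   # a (hypothetical) topic assignment
Psi, Omega = count_matrices(W, Z, d, K=2, V=4, D=2)
print(Psi)    # rows: topics, columns: vocabulary words
print(Omega)  # rows: documents, columns: topics
```

Both matrices sum to $N$, the corpus size in words, and each row of $\Omega$ sums to the length of the corresponding document.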
### The Gibbs Updating Rule

With (25), we can derive the Gibbs updating rule for LDA:

$$
p(z_i = k \mid Z_{-i}, W, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{p(z_i = k, Z_{-i}, W \mid \boldsymbol{\alpha}, \boldsymbol{\beta})}{p(Z_{-i}, W \mid \boldsymbol{\alpha}, \boldsymbol{\beta})}.
\tag{26}
$$

Because $z_i$ depends only on $w_i$, (26) becomes

$$
p(z_i \mid Z_{-i}, W, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{p(Z, W \mid \boldsymbol{\alpha}, \boldsymbol{\beta})}{p(Z_{-i}, W_{-i} \mid \boldsymbol{\alpha}, \boldsymbol{\beta})},
\quad \text{where } z_i = k.
\tag{27}
$$

The numerator in (27) is (25). The denominator has a similar form:

$$
p(Z_{-i}, W_{-i} \mid \boldsymbol{\alpha}, \boldsymbol{\beta})
= \prod_{k=1}^{K} \frac{B(\Psi^{-i}_k + \boldsymbol{\beta})}{B(\boldsymbol{\beta})}
\cdot \prod_{d=1}^{D} \frac{B(\Omega^{-i}_d + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})},
\tag{28}
$$

where $\Psi^{-i}_k$ denotes the $k$-th row of the $K \times V$ count matrix $\Psi^{-i}$, and $\Psi^{-i}(k, v)$ is the number of times that topic $k$ is assigned to word $v$, but with the $i$-th word and its topic assignment excluded. Similarly, $\Omega^{-i}_d$ is the $d$-th row of the count matrix $\Omega^{-i}$, and $\Omega^{-i}(d, k)$ is the number of words in document $d$ that are assigned topic $k$, but with the $i$-th word and its topic assignment excluded. In more detail,

$$
\Psi^{-i}(k, v) = \sum_{\substack{1 \le j \le N \\ j \neq i}} \mathbb{I}\{w_j = v \wedge z_j = k\},
\qquad
\Omega^{-i}(d, k) = \sum_{\substack{1 \le j \le N \\ j \neq i}} \mathbb{I}\{d_j = d \wedge z_j = k\},
\tag{29}
$$

where $d_j$ denotes the document to which the $j$-th word in the corpus belongs. Some properties can be derived directly from these definitions:

$$
\Psi(k, v) = \begin{cases}
\Psi^{-i}(k, v) + 1 & \text{if } v = w_i \text{ and } k = z_i; \\
\Psi^{-i}(k, v) & \text{in all other cases,}
\end{cases}
\qquad
\Omega(d, k) = \begin{cases}
\Omega^{-i}(d, k) + 1 & \text{if } d = d_i \text{ and } k = z_i; \\
\Omega^{-i}(d, k) & \text{in all other cases.}
\end{cases}
\tag{30}
$$

From here on we switch to count notation: $n(v; z_i) = \Psi(z_i, v)$ and $n(k; d_i) = \Omega(d_i, k)$, with $n^{-i}(\cdot \mid \cdot)$ denoting the corresponding counts that exclude the $i$-th word. From (30) we have

$$
\sum_{v=1}^{V} n(v; z_i) = 1 + \sum_{v=1}^{V} n^{-i}(v \mid z_i),
\qquad
\sum_{k=1}^{K} n(k; d_i) = 1 + \sum_{k=1}^{K} n^{-i}(k \mid d_i).
\tag{31}
$$

Both (30) and (31) are important for computing (27). Expanding the products in (25) and taking the fraction in (27), the factors unrelated to $z_i$ and $d_i$ cancel because of (30), and we get

$$
p(z_i \mid Z_{-i}, W, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{B(n(\cdot; z_i) + \boldsymbol{\beta})}{B(n^{-i}(\cdot \mid z_i) + \boldsymbol{\beta})}
\cdot \frac{B(n(\cdot; d_i) + \boldsymbol{\alpha})}{B(n^{-i}(\cdot \mid d_i) + \boldsymbol{\alpha})}.
\tag{32}
$$
This equation can be further simplified by expanding $B(\cdot)$. Considering

$$
B(\mathbf{x}) = \frac{\prod_{k=1}^{\dim \mathbf{x}} \Gamma(x_k)}{\Gamma\!\left(\sum_{k=1}^{\dim \mathbf{x}} x_k\right)},
\tag{33}
$$

(32) becomes

$$
\frac{\prod_{v=1}^{V} \Gamma\!\left(n(v; z_i) + \beta_v\right)}{\Gamma\!\left(\sum_{v=1}^{V} n(v; z_i) + \beta_v\right)}
\cdot \frac{\Gamma\!\left(\sum_{v=1}^{V} n^{-i}(v \mid z_i) + \beta_v\right)}{\prod_{v=1}^{V} \Gamma\!\left(n^{-i}(v \mid z_i) + \beta_v\right)}
\cdot \frac{\prod_{k=1}^{K} \Gamma\!\left(n(k; d_i) + \alpha_k\right)}{\Gamma\!\left(\sum_{k=1}^{K} n(k; d_i) + \alpha_k\right)}
\cdot \frac{\Gamma\!\left(\sum_{k=1}^{K} n^{-i}(k \mid d_i) + \alpha_k\right)}{\prod_{k=1}^{K} \Gamma\!\left(n^{-i}(k \mid d_i) + \alpha_k\right)}.
\tag{34}
$$

Using (30), we can cancel most factors in the products of (34) and get

$$
\frac{\Gamma\!\left(n(w_i; z_i) + \beta_{w_i}\right)}{\Gamma\!\left(n^{-i}(w_i \mid z_i) + \beta_{w_i}\right)}
\cdot \frac{\Gamma\!\left(\sum_{v=1}^{V} n^{-i}(v \mid z_i) + \beta_v\right)}{\Gamma\!\left(\sum_{v=1}^{V} n(v; z_i) + \beta_v\right)}
\cdot \frac{\Gamma\!\left(n(z_i; d_i) + \alpha_{z_i}\right)}{\Gamma\!\left(n^{-i}(z_i \mid d_i) + \alpha_{z_i}\right)}
\cdot \frac{\Gamma\!\left(\sum_{k=1}^{K} n^{-i}(k \mid d_i) + \alpha_k\right)}{\Gamma\!\left(\sum_{k=1}^{K} n(k; d_i) + \alpha_k\right)}.
\tag{35}
$$

Here it is important to recall that

$$
\Gamma(y) = (y - 1)! \quad \text{if $y$ is a positive integer},
\tag{36}
$$

so we can expand (35) and use (30) and (31) to cancel most factors in the factorials. This leads us to the Gibbs updating rule for LDA:

$$
p(z_i \mid Z_{-i}, W, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{n(w_i; z_i) + \beta_{w_i} - 1}{\sum_{v=1}^{V} \left( n(v; z_i) + \beta_v \right) - 1}
\cdot \frac{n(z_i; d_i) + \alpha_{z_i} - 1}{\sum_{k=1}^{K} \left( n(k; d_i) + \alpha_k \right) - 1}.
\tag{37}
$$

Note that the denominator of the second factor on the right-hand side of (37) does not depend on $z_i$, the argument of the function $p(z_i \mid Z_{-i}, W, \boldsymbol{\alpha}, \boldsymbol{\beta})$. Because the Gibbs sampling method requires only a function $f(z_i) \propto p(z_i)$, cf. (11), we can use the following updating rule in practice:

$$
p(z_i \mid Z_{-i}, W, \boldsymbol{\alpha}, \boldsymbol{\beta})
\propto \frac{n(w_i; z_i) + \beta_{w_i} - 1}{\sum_{v=1}^{V} \left( n(v; z_i) + \beta_v \right) - 1}
\cdot \left( n(z_i; d_i) + \alpha_{z_i} - 1 \right).
\tag{38}
$$

### Parameter Estimation

Given the sampled topics $Z$, as well as the inputs $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$, and $W$, we can estimate the model parameters $\Phi$ and $\Theta$. Because $n(v; k)$ counts the co-occurrences of word $v$ and topic $k$, it is proportional to the joint probability $p(w, z)$; similarly, $n(k; d)$ is proportional to $p(z, d)$. Therefore we can estimate the model parameters by simply normalizing these counts; this leads to the updating equation (43). A more precise understanding comes from [5].
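The "$-1$" terms in (38) are exactly the difference between the full counts and the counts that exclude word $i$, cf. (30) and (31). So in code, one typically decrements the counters for word $i$ first and then evaluates (38) with the decremented counts. A sketch (my own array names, matching the `NWZ`/`NZM`/`NZ` layout described later in the article):

```python
import numpy as np

def full_conditional(wi, di, NWZ, NZM, NZ, alpha, beta):
    # Normalized full conditional of (38). The counts passed in must
    # already EXCLUDE word i: decrementing them beforehand absorbs the
    # "-1" terms of (38).
    left = (NWZ[wi, :] + beta[wi]) / (NZ + beta.sum())  # word-topic factor
    right = NZM[:, di] + alpha                          # document-topic factor
    p = left * right
    return p / p.sum()

# Toy counts: V=2 words, K=2 topics, D=1 document (word i already removed).
NWZ = np.array([[3.0, 0.0], [0.0, 3.0]])  # NWZ[w, z]
NZM = np.array([[3.0], [3.0]])            # NZM[z, m]
NZ = NWZ.sum(axis=0)                      # NZ[z] = sum_w NWZ[w, z]
alpha = np.array([0.5, 0.5])
beta = np.array([0.5, 0.5])

p = full_conditional(0, 0, NWZ, NZM, NZ, alpha, beta)
print(p)   # topic 0 dominates, since word 0 only co-occurs with topic 0
```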
According to the definitions $\phi_{k,t} = p(w = t \mid z = k)$ and $\theta_{m,k} = p(z = k \mid d = m)$, we can interpret $\phi_{k,t}$ and $\theta_{m,k}$ as posterior probabilities:

$$
\phi_{k,t} = p(w = t \mid z = k, W, Z, \boldsymbol{\beta}),
\qquad
\theta_{m,k} = p(z = k \mid d = m, Z, \boldsymbol{\alpha}).
\tag{39}
$$
The following trick derives (39) using the joint distribution (25):

$$
\phi_{k,t} \cdot \theta_{m,k}
= p(w = t \mid z = k, W, Z, \boldsymbol{\beta}) \cdot p(z = k \mid d = m, Z, \boldsymbol{\alpha})
= p(w = t, z = k \mid W, Z, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{p(\hat{W}, \hat{Z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta})}{p(W, Z \mid \boldsymbol{\alpha}, \boldsymbol{\beta})},
\tag{40}
$$

where $\hat{W} = \{W, t\}$ and $\hat{Z} = \{Z, k\}$, with the appended word belonging to document $m$. Both the numerator and the denominator can be computed using (25). Analogously to what we did to compute (27), (40) becomes

$$
\phi_{k,t} \cdot \theta_{m,k}
= \frac{\Gamma\!\left(n(t; k) + 1 + \beta_t\right)}{\Gamma\!\left(n(t; k) + \beta_t\right)}
\cdot \frac{\Gamma\!\left(\sum_{t'=1}^{V} n(t'; k) + \beta_{t'}\right)}{\Gamma\!\left(\sum_{t'=1}^{V} n(t'; k) + \beta_{t'} + 1\right)}
\cdot \frac{\Gamma\!\left(n(k; m) + 1 + \alpha_k\right)}{\Gamma\!\left(n(k; m) + \alpha_k\right)}
\cdot \frac{\Gamma\!\left(\sum_{k'=1}^{K} n(k'; m) + \alpha_{k'}\right)}{\Gamma\!\left(\sum_{k'=1}^{K} n(k'; m) + \alpha_{k'} + 1\right)}.
\tag{41}
$$

Note that the "$+1$" terms in (41) are due to $|\hat{W}| = |W| + 1$ and $|\hat{Z}| = |Z| + 1$. Analogous to the simplification we did for (34), simplifying (41) results in

$$
\phi_{k,t} \cdot \theta_{m,k}
= \frac{n(t; k) + \beta_t}{\sum_{t'=1}^{V} n(t'; k) + \beta_{t'}}
\cdot \frac{n(k; m) + \alpha_k}{\sum_{k'=1}^{K} n(k'; m) + \alpha_{k'}}.
\tag{42}
$$

Compared with (38), (42) has no "$-1$" terms, because of the "$+1$" terms in (41). Because the two factors on the right-hand side of (42) correspond to $p(w = t \mid z = k, W, Z, \boldsymbol{\beta})$ and $p(z = k \mid d = m, Z, \boldsymbol{\alpha})$ respectively, according to the definitions of $\phi_{k,t}$ and $\theta_{m,k}$, we have

$$
\phi_{k,t} = \frac{n(t; k) + \beta_t}{\sum_{t'=1}^{V} n(t'; k) + \beta_{t'}},
\qquad
\theta_{m,k} = \frac{n(k; m) + \alpha_k}{\sum_{k'=1}^{K} n(k'; m) + \alpha_{k'}}.
\tag{43}
$$

### The Algorithm

With (38) known, we can realize the learning procedure shown in Fig. 1 with Algorithm 1. It is notable that this algorithm does not require the inputs $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, because it takes $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ as zero vectors. This simplification is reasonable because we can interpret the conjugacy of the Dirichlet and multinomial distributions [1] as saying that $\alpha_k$ and $\beta_t$ denote the numbers of times that topic $k$ and word $t$, respectively, are observed outside of the training data set. Taking $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ as zero vectors means we have no prior knowledge of these counts.
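The read-out (43) is just a smoothed normalization of the counters. A vectorized sketch (mine; the array names follow the `NWZ[w,z]`/`NZM[z,m]` layout described next, and the toy counts are made up):

```python
import numpy as np

def estimate_parameters(NWZ, NZM, alpha, beta):
    # Read-out (43): normalize smoothed counts into the model parameters.
    Phi = NWZ.T + beta                 # Phi[k, t] = n(t; k) + beta_t
    Phi /= Phi.sum(axis=1, keepdims=True)
    Theta = NZM.T + alpha              # Theta[m, k] = n(k; m) + alpha_k
    Theta /= Theta.sum(axis=1, keepdims=True)
    return Phi, Theta

NWZ = np.array([[2.0, 0.0], [1.0, 3.0]])   # V=2 words x K=2 topics
NZM = np.array([[3.0], [3.0]])             # K=2 topics x D=1 document
Phi, Theta = estimate_parameters(NWZ, NZM,
                                 alpha=np.array([0.5, 0.5]),
                                 beta=np.array([0.5, 0.5]))
print(Phi)    # each row of Phi is a distribution over the vocabulary
print(Theta)  # each row of Theta is a distribution over topics
```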
This simplification makes it easy to maintain the quantities used in (38). We use a 2D array `NWZ[w,z]` to maintain $n(w; z)$ and another, `NZM[z,m]`, for $n(z; m)$. To accelerate the computation, we use a 1D array `NZ[z]` to maintain $n(z) = \sum_{v=1}^{V} n(v; z)$. Note that we do not need an array `NM[m]` for $\sum_{k=1}^{K} n(k; m)$, which appears in (37) but is unnecessary in (38).
```
zero all count variables NWZ, NZM, NZ;
foreach document m in [1, D] do
    foreach word n in [1, N_m] in document m do
        sample topic index z[m,n] ~ Mult(1/K) for word w[m,n];
        increment document-topic count: NZM[z[m,n], m]++;
        increment topic-term count:     NWZ[w[m,n], z[m,n]]++;
        increment topic-term sum:       NZ[z[m,n]]++;
    end
end
while not finished do
    foreach document m in [1, D] do
        foreach word n in [1, N_m] in document m do
            NWZ[w[m,n], z[m,n]]--, NZ[z[m,n]]--, NZM[z[m,n], m]--;
            sample new topic index z~[m,n] according to (44);
            NWZ[w[m,n], z~[m,n]]++, NZ[z~[m,n]]++, NZM[z~[m,n], m]++;
        end
    end
    if converged and L sampling iterations since last read-out then
        read out parameter set Theta and Phi according to (43);
    end
end
```

*Algorithm 1: The Gibbs sampling algorithm that learns LDA.*

If $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are known non-zero vectors, we can easily modify the algorithm to make use of them: maintain `NWZ[w,z]` $= n(w; z) + \beta_w$, `NZM[z,m]` $= n(z; m) + \alpha_z$, and `NZ[z]` $= \sum_{v=1}^{V} \left( n(v; z) + \beta_v \right)$. Only the initialization part of the algorithm needs to be modified.

The input to this algorithm is not in the form of (1); instead, it is in a more convenient form: each document is represented by the set of words it contains, and the input is given as a set of documents. Denote the $n$-th word in the $m$-th document by $w_{m,n}$; the corresponding topic is denoted by $z_{m,n}$ and the corresponding document is $m$. With these correspondences known, the above derivations can be easily implemented in the algorithm.

The sampling operation in this algorithm does not use the full conditional posterior distribution (38) exactly; instead, it omits the "$-1$" terms:

$$
p(z_i) \propto \frac{n(w_i; z_i) + \beta_{w_i}}{\sum_{v=1}^{V} n(v; z_i) + \beta_v}
\cdot \left( n(z_i; d_i) + \alpha_{z_i} \right).
\tag{44}
$$

This makes it elegant to maintain the consistency between sampling new topics and updating the counts (`NWZ`, `NZ`, and `NZM`) using three lines of code: decrease the counts, sample, and increase the counts.[^5]
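Algorithm 1 can be sketched compactly in Python. This is my own condensed illustration, not the author's code: it uses symmetric scalar `alpha`/`beta` (rather than the zero vectors discussed above), made-up variable names, and a fixed number of sweeps instead of a convergence test.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.5, beta=0.5, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA, following Algorithm 1 (a sketch)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    NWZ = np.zeros((V, K))               # topic-term counts n(w, z)
    NZM = np.zeros((K, D))               # document-topic counts n(z, m)
    NZ = np.zeros(K)                     # topic totals n(z)
    assignments = []
    for m, doc in enumerate(docs):       # initialization sweep
        zs = rng.integers(K, size=len(doc))
        assignments.append(zs)
        for w, z in zip(doc, zs):
            NWZ[w, z] += 1; NZM[z, m] += 1; NZ[z] += 1
    for _ in range(n_iter):              # sampling sweeps
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assignments[m][n]
                NWZ[w, z] -= 1; NZM[z, m] -= 1; NZ[z] -= 1  # exclude word i
                p = (NWZ[w] + beta) / (NZ + V * beta) * (NZM[:, m] + alpha)
                z = rng.choice(K, p=p / p.sum())            # rule (44)
                assignments[m][n] = z
                NWZ[w, z] += 1; NZM[z, m] += 1; NZ[z] += 1
    Phi = (NWZ.T + beta) / (NZ[:, None] + V * beta)         # read-out (43)
    Theta = NZM.T + alpha
    Theta /= Theta.sum(axis=1, keepdims=True)
    return Phi, Theta

# Toy corpus with two clearly separated themes: vocabulary {0,1} vs {2,3}.
docs = [[0, 1, 0, 1, 0]] * 3 + [[2, 3, 2, 3, 2]] * 3
Phi, Theta = gibbs_lda(docs, K=2, V=4)
print(Phi)    # rows are topics: distributions over the 4 words
print(Theta)  # rows are documents: distributions over the 2 topics
```

Note the decrement-sample-increment pattern in the inner loop: it is exactly the three-line trick described above for keeping the counters consistent with (44).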
*Figure 2: The ground-truth $\Phi$ used to synthesize our testing data.*

## 5 Experiments

The synthetic image data mentioned in [4] is useful for testing the correctness of any implementation of an LDA learning algorithm. To synthesize this data set, we fix $\Phi = \{\phi_k\}_{k=1}^{K=10}$ as visualized in Fig. 2 (also Fig. 1 in [4]), set $\boldsymbol{\alpha} = [1]$, and set the number of words/pixels in each document/image to 100. Because every image has $5 \times 5$ pixels, the vocabulary size is 25.[^6]

Fig. 3 shows the result of running Gibbs sampling to learn an LDA model from this testing data set: the left pane shows the convergence of the likelihood, and the right pane shows how $\Phi$ is updated over the iterations of the algorithm. The 20 rows of images in the right pane visualize $\Phi$ as estimated at iterations 1, 2, 3, 5, 7, 10, 15, 20, 25, 50, 75, 100, 125, 150, 175, 200, 250, 300, 400, and 499. The figure shows that from iteration 100 on, the estimated $\Phi$ is almost identical to Fig. 1(a) of [4].

## 6 Acknowledgements

Thanks go to Gregor Heinrich and Xuerui Wang for helping me grasp some details in the derivation of Gibbs sampling for LDA.

## References

[1] C. Bishop. *Pattern Recognition and Machine Learning*. Springer, 2007.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. *JMLR*, 3:993–1022, 2003.

[3] T. Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. Technical report, Stanford University, 2004.

[4] T. Griffiths and M. Steyvers. Finding scientific topics. *PNAS*, 101:5228–5235, 2004.

[5] G. Heinrich. Parameter estimation for text analysis. Technical report, vsonix GmbH and University of Leipzig, Germany, 2009.

[6] T. Minka and J. Lafferty. Expectation propagation for the generative aspect model. In *UAI*, 2002. https://research.microsoft.com/~minka/papers/aspect/minka-aspect.pdf

[^5]: This trick illustrates the physical meaning of the "$-1$" terms in (38). I learned it from Heinrich's code at http://arbylon.net/projects/LdaGibbsSampler.java.
[^6]: All these settings are identical with those described in [4].
*Figure 3: The result of running the Gibbs sampling algorithm to learn an LDA model from synthetic data. Each row of 10 images in the right pane visualizes the $\Phi = \{\phi_k\}_{k=1}^{K=10}$ estimated at the iterations indicated by red circles in the left pane.*

[7] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In *NIPS*, 2007.
[8] M. Steyvers and T. Griffiths. Matlab topic modeling toolbox. Technical report, University of California, Berkeley, 2007.

[9] X. Wang and A. McCallum. Topics over time: A non-Markov continuous-time model of topical trends. In *KDD*, 2006.