
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowledge into Topic Models

Slides presented at an EMNLP 2015 reading group.



  1. Efficient Methods for Incorporating Knowledge into Topic Models [Yang, Downey and Boyd-Graber 2015] (2015/10/24, EMNLP 2015 Reading, @shuyo)
  2. Large-scale Topic Models
     • In academic papers
       – Up to 10^3 topics
     • Industrial applications
       – 10^5 to 10^6 topics!
       – Search engines, online ads, and so on
       – To capture infrequent topics
     • This paper handles up to 500 topics... really?
  3. (Standard) LDA [Blei+ 2003, Griffiths+ 2004]
     • "Conventional" Gibbs sampling (a sketch follows this slide):
       P(z = t | z_-, w) ∝ q_t := (n_{d,t} + α)(n_{w,t} + β) / (n_t + Vβ)
       – T: topic size
       – For U ~ Uniform(0, Σ_z q_z), find t s.t. Σ_{z=1}^{t−1} q_z < U < Σ_{z=1}^{t} q_z
     • For large T, it is computationally intensive
       – n_{w,t} is sparse
       – When T is very large, n_{d,t} is too, e.g. T = 10^6 > n_d
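A minimal sketch (not the paper's implementation) of this conventional sampling step, assuming the counts are dense NumPy arrays; `sample_topic` is a hypothetical name and `rng` is e.g. np.random.default_rng():

    import numpy as np

    def sample_topic(n_dt, n_wt, n_t, alpha, beta, V, rng):
        # q_t = (n_{d,t} + alpha)(n_{w,t} + beta) / (n_t + V*beta), for all T topics at once
        q = (n_dt + alpha) * (n_wt + beta) / (n_t + V * beta)
        u = rng.uniform(0.0, q.sum())                  # U ~ Uniform(0, sum_z q_z)
        return int(np.searchsorted(np.cumsum(q), u))   # smallest t whose cumulative sum exceeds U

Every resampled token touches all T entries of q, which is exactly the O(T) cost the slide complains about.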
  4. SparseLDA [Yao+ 2009]
     • Decompose the Gibbs weight into three buckets (sketch after this slide):
       P(z = t | z_-, w) ∝ s_t + r_t + q_t
       – s_t = αβ / (n_t + Vβ)                  (independent of w and d)
       – r_t = n_{d,t} β / (n_t + Vβ)           (depends on d only)
       – q_t = (n_{d,t} + α) n_{w,t} / (n_t + Vβ)
     • s = Σ_t s_t, r = Σ_t r_t, q = Σ_t q_t
     • For U ~ Uniform(0, s + r + q):
       – If 0 < U < s, find t s.t. Σ_{z=1}^{t−1} s_z < U < Σ_{z=1}^{t} s_z
       – If s < U < s + r, find t with n_{d,t} > 0 s.t. Σ_{z=1}^{t−1} r_z < U − s < Σ_{z=1}^{t} r_z
       – If s + r < U < s + r + q, find t with n_{w,t} > 0 s.t. Σ_{z=1}^{t−1} q_z < U − s − r < Σ_{z=1}^{t} q_z
     • Faster because n_{w,t} and n_{d,t} are sparse
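A sketch of the three-bucket sampler under the same assumptions, with the sparse counts n_{d,t} and n_{w,t} stored as {topic: count} dicts; in a real implementation the totals s, r, q and per-topic coefficients are cached and updated incrementally rather than recomputed per token:

    import numpy as np

    def sparse_sample(n_dt, n_wt, n_t, alpha, beta, V, rng):
        denom = n_t + V * beta                          # dense, length T
        s_t = alpha * beta / denom                      # smoothing bucket
        r_t = {t: c * beta / denom[t] for t, c in n_dt.items()}                        # document bucket
        q_t = {t: (n_dt.get(t, 0.0) + alpha) * c / denom[t] for t, c in n_wt.items()}  # word bucket
        s, r, q = s_t.sum(), sum(r_t.values()), sum(q_t.values())
        u = rng.uniform(0.0, s + r + q)
        if u < s:                                       # rare branch: scan all T topics
            return int(np.searchsorted(np.cumsum(s_t), u))
        u -= s
        if u < r:                                       # scan only topics with n_{d,t} > 0
            for t, w in r_t.items():
                u -= w
                if u <= 0.0:
                    return t
        u -= r
        last = None
        for t, w in q_t.items():                        # scan only topics with n_{w,t} > 0
            last = t
            u -= w
            if u <= 0.0:
                return t
        return last                                     # reachable only via rounding error

Since u < s only with probability s / (s + r + q) and s is usually tiny, most draws scan just the short sparse lists; that is where the speedup comes from.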
  5. Leveraging Prior Knowledge
     • The objective function of topic models does not correlate with human judgements
  6. Word correlation prior knowledge
     • Must-link
       – "quarterback" and "fumble" are both related to American football
     • Cannot-link
       – "fumble" and "bank" imply two different topics
  7. SC-LDA ("Sparse Constrained" LDA) [Yang+ 2015]
     • m ∈ M: prior knowledge
     • f_m(z, w, d): potential function of prior knowledge m about word w with topic z in document d
     • ψ(z, M) = Π_{z∈z} exp f_m(z, w, d)   (maybe the product runs over m ∈ M and all w with z in all d)
     • P(w, z | α, β, M) = P(w | z, β) P(z | α) ψ(z, M)   (maybe "∝" rather than "=")
     • A generic sketch of this factorization follows this slide.
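As a generic illustration (names are mine, not the paper's): the SC-LDA Gibbs weight is just the standard LDA weight multiplied by exp Σ_m f_m, with one potential callable per piece of prior knowledge:

    import math

    def sc_weight(t, w, d, lda_weight, potentials):
        # Multiply the standard LDA weight q_t by psi = exp(sum of the
        # potentials f_m(z=t, w, d)); `potentials` is a list of callables.
        return lda_weight * math.exp(sum(f(t, w, d) for f in potentials))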
  8. Inference for SC-LDA
     [slide figure (the sampling algorithm) not recoverable from the extraction]
  9. Word correlation prior knowledge for SC-LDA
     • f_m(z, w, d) = Σ_{u∈M_w^m} log max(λ, n_{u,z}) + Σ_{v∈M_w^c} log(1 / max(λ, n_{v,z}))
       – where M_w^m: must-links of w, M_w^c: cannot-links of w
     • P(z = t | z_-, w, M) ∝ [αβ / (n_t + Vβ) + n_{d,t}β / (n_t + Vβ) + (n_{d,t} + α)n_{w,t} / (n_t + Vβ)] · Π_{u∈M_w^m} max(λ, n_{u,t}) · Π_{v∈M_w^c} 1 / max(λ, n_{v,t})
     (a sketch of this correlation factor follows this slide)
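A sketch of that correlation factor, assuming must/cannot links are kept as per-word sets and the topic counts n_{u,t} as a {(word, topic): count} dict; all names are illustrative:

    def correlation_factor(w, t, must, cannot, n_uz, lam):
        # exp(f_m) = prod_{u in must[w]} max(lam, n_{u,t}) * prod_{v in cannot[w]} 1 / max(lam, n_{v,t})
        factor = 1.0
        for u in must.get(w, ()):       # pull w toward topics where its must-link partners are frequent
            factor *= max(lam, n_uz.get((u, t), 0))
        for v in cannot.get(w, ()):     # push w away from topics where its cannot-link partners are frequent
            factor /= max(lam, n_uz.get((v, t), 0))
        return factor

The Gibbs weight for topic t is then (s_t + r_t + q_t) * correlation_factor(w, t, ...), matching the formula above; λ > 0 keeps each factor bounded away from zero.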
  10. Factor Graph
     • The paper says prior knowledge is incorporated "by adding a factor graph to encode prior knowledge," but the factor graph is never actually drawn.
     • The potential function f_m(z, w, d) contains n_{w,z}, and φ_{w,z} ∝ n_{w,z} + β.
     • So the model seems to be the one in Fig. b:
     [figures Fig. a and Fig. b not recoverable from the extraction]
  11. Labeled LDA [Ramage+ 2009]
     • Supervised LDA for labeled documents
       – Equivalent to SC-LDA with the following potential function (sketch below):
         f_m(z, w, d) = 1 if z ∈ m_d, −∞ otherwise
         where m_d is the label set of document d
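A sketch of that potential, assuming labels maps each document id to its label set m_d:

    import math

    def labeled_lda_potential(t, d, labels):
        # f_m(z=t, w, d) = 1 if topic t is among the document's labels, else -inf;
        # exp(-inf) = 0, so the sampler effectively restricts z to labels[d].
        return 1.0 if t in labels[d] else -math.inf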
  12. Experiments
     • Baselines
       – Dirichlet Forest-LDA [Andrzejewski+ 2009]
       – Logic-LDA [Andrzejewski+ 2011]
       – MRF-LDA [Xie+ 2015]: encodes word correlations in LDA as an MRF
       – SparseLDA
     • Datasets:
       DATASET    DOCS       TYPES    TOKENS (APPROX.)  EXPERIMENT
       NIPS       1,500      12,419   1,900,000         Word correlation
       NYT-NEWS   3,000,000  102,660  100,000,000       Word correlation
       20NG       18,828     21,514   1,946,000         Labeled docs
  13. Generating Word Correlations
     • Must-links (a sketch of the procedure follows this slide)
       – Obtain each word's synsets from WordNet 3.0
       – Keep a pair only if the word2vec embedding similarity between the word and its synset member exceeds the threshold 0.2
     • Cannot-links
       – Nothing?
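A sketch of the must-link generation, using NLTK's WordNet interface and gensim word vectors; the paper's exact preprocessing is not given, so the details (vocabulary filtering, handling of multiword lemmas, and so on) are assumptions:

    from nltk.corpus import wordnet as wn          # requires nltk.download('wordnet')
    from gensim.models import KeyedVectors

    def must_links(vocab, vectors: KeyedVectors, threshold=0.2):
        links = set()
        for w in vocab:
            if w not in vectors:
                continue
            for syn in wn.synsets(w):              # all WordNet 3.0 synsets of w
                for cand in syn.lemma_names():     # words sharing a synset with w
                    if (cand != w and cand in vocab and cand in vectors
                            and vectors.similarity(w, cand) > threshold):
                        links.add((min(w, cand), max(w, cand)))   # unordered pair
        return links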
  14. Convergence Speed
     [figure] Average running time per iteration over 100 iterations, averaged over 5 seeds, on the 20NG dataset.
  15. Coherence [Mimno+ 2011]
     • C(t; V^t) = Σ_{m=2}^{M} Σ_{l=1}^{m−1} log[(F(v_m^t, v_l^t) + ε) / F(v_l^t)]
       – F(v): document frequency of word type v
       – F(v, v'): co-document frequency of word types v and v' (does this mean both words are "included" in the same document?)
       – ε is very small, e.g. 10^{−12} [Röder+ 2015]
     (a sketch of this score follows this slide)
     [results figure: coherence scores −39.1 vs. −36.6]
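A sketch of the coherence score as written above, assuming docs is a list of token sets and top_words is a topic's top-M word list in descending probability (each top word must occur in at least one document so that F(v_l) > 0):

    import math

    def coherence(top_words, docs, eps=1e-12):
        df = {v: sum(v in doc for doc in docs) for v in top_words}       # F(v)
        codf = {(v, u): sum(v in doc and u in doc for doc in docs)       # F(v, v')
                for v in top_words for u in top_words}
        return sum(math.log((codf[(top_words[m], top_words[l])] + eps)
                            / df[top_words[l]])
                   for m in range(1, len(top_words))                     # m = 2..M
                   for l in range(m))                                    # l = 1..m-1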
  16. References
     • [Yang+ 2015] Efficient methods for incorporating knowledge into topic models.
     • [Blei+ 2003] Latent Dirichlet allocation.
     • [Griffiths+ 2004] Finding scientific topics.
     • [Yao+ 2009] Efficient methods for topic model inference on streaming document collections.
     • [Ramage+ 2009] Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora.
     • [Andrzejewski+ 2009] Incorporating domain knowledge into topic modeling via Dirichlet forest priors.
     • [Andrzejewski+ 2011] A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic.
     • [Xie+ 2015] Incorporating word correlation knowledge into topic modeling.
     • [Mimno+ 2011] Optimizing semantic coherence in topic models.
     • [Röder+ 2015] Exploring the space of topic coherence measures.
