Efficient Methods for Incorporating Knowledge into Topic Models
[Yang, Downey and Boyd-Graber 2015]
EMNLP 2015 Reading
Large-scale Topic Model
• In academic papers
– Up to 10^3 topics
• Industrial applications
– 10^5~10^6 topics!
– Search engines, online ads, and so on
– To capture infrequent topics
• This paper handles up to 500 topics...
[Blei+ 2003, Griffiths+ 2004]
• "Conventional" Gibbs sampling
P(z = t | z_-, w) ∝ q_t := (n_{d,t} + α) · (n_{w,t} + β) / (n_t + Vβ)
– T: number of topics
– Draw U ~ 𝒰(0, Σ_{z=1}^{T} q_z), then find t s.t. Σ_{z=1}^{t-1} q_z < U < Σ_{z=1}^{t} q_z
• For large T, this linear scan is computationally intensive: O(T) per token
– n_{w,t} is sparse
– When T is very large, n_{d,t} is sparse too, e.g. T = 10^6 exceeds n_d, the number of tokens in document d
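To make the linear scan concrete, here is a minimal NumPy sketch of one collapsed-Gibbs draw; the count arrays and names (n_dt, n_wt, n_t) are our assumptions, not code from the paper.

```python
import numpy as np

def sample_topic(d, w, n_dt, n_wt, n_t, alpha, beta, V, rng):
    """One collapsed-Gibbs draw of z for token w in document d.

    q_t = (n_dt[d, t] + alpha) * (n_wt[w, t] + beta) / (n_t[t] + V * beta)
    """
    q = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + V * beta)
    u = rng.uniform(0.0, q.sum())    # U ~ Uniform(0, sum_z q_z)
    cum = 0.0
    for t, q_t in enumerate(q):      # linear scan over all T topics
        cum += q_t
        if u < cum:
            return t
    return len(q) - 1                # guard against floating-point round-off
```

This O(T) scan per token is exactly the cost that sparse samplers such as [Yao+ 2009] avoid by exploiting the sparsity of n_{w,t} and n_{d,t}.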
Word correlation prior knowledge for SC-LDA
• f_m(z, w, d) = Σ_{u ∈ M_w^m} log max(λ, n_{u,z}) + Σ_{v ∈ M_w^c} log(1 / max(λ, n_{v,z}))
– where M_w^m: must-links of w, M_w^c: cannot-links of w
• P(z = t | z_-, w, M) ∝ (n_{d,t} + α) · (n_{w,t} + β) / (n_t + Vβ) · Π_{u ∈ M_w^m} max(λ, n_{u,t}) / Π_{v ∈ M_w^c} max(λ, n_{v,t})
• The authors state that prior knowledge is incorporated "by adding a factor graph to encode prior knowledge," but the factor graph is never actually drawn.
• The potential function f_m(z, w, d) contains n_{w,z}, and φ_{w,z} ∝ n_{w,z} + β.
• So the above model appears to correspond to Fig. (b).
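A minimal sketch of how the extra factor exp(f_m) could enter the sampler, following the equation above; the link-set dictionaries and all other names are our assumptions.

```python
import numpy as np

def sc_lda_weights(d, w, n_dt, n_wt, n_t, alpha, beta, V, lam, must, cannot):
    """Unnormalized P(z = t | z_-, w, M) for all topics t at once.

    must[w] and cannot[w] hold the link sets M_w^m and M_w^c; lam is the
    clipping constant that keeps every factor strictly positive.
    """
    q = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + V * beta)
    for u in must.get(w, ()):              # boost topics where must-linked
        q = q * np.maximum(lam, n_wt[u])   # words already have mass
    for v in cannot.get(w, ()):            # penalize topics shared with
        q = q / np.maximum(lam, n_wt[v])   # cannot-linked words
    return q
```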
[Ramage+ 2009] Labeled LDA
• Supervised LDA for labeled documents
– It is equivalent to SC-LDA with the following potential function (see the sketch after the list below):
f_m(z, w, d) = 1 if z ∈ m_d, and 0 otherwise
where m_d specifies the label set of d
– Dirichlet Forest-LDA [Andrzejewski+ 2009]
– Logic-LDA [Andrzejewski+ 2011]
– MRF-LDA [Xie+ 2015]
• Encodes word correlations in LDA as an MRF
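Treating the Labeled LDA potential above as a 0/1 multiplicative mask over topics, the sampler restriction is one line; a hypothetical sketch with our own names:

```python
import numpy as np

def labeled_lda_weights(q, m_d, T):
    """Restrict the sampling weights q to the label set m_d of document d."""
    mask = np.zeros(T)
    mask[list(m_d)] = 1.0    # the factor is 1 iff z is in m_d, else 0
    return q * mask
```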
Experiments

DATASET     DOCS        TYPES     TOKENS (APPROX.)   NOTES
NIPS        1,500       12,419    1,900,000
NYT-NEWS    3,000,000   102,660   100,000,000
20NG        18,828      21,514    1,946,000          Labeled docs
Generate Word Correlation
– Obtain synsets from WordNet 3.0
– Keep a (word, synonym) pair only if the similarity between their word2vec embeddings is higher than a threshold of 0.2 (see the sketch below)
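A sketch of that construction, assuming NLTK's WordNet interface and a pretrained word2vec model loaded as gensim KeyedVectors; only the 0.2 threshold comes from the slide.

```python
from itertools import chain

from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

def build_word_correlations(vocab, kv, threshold=0.2):
    """Pairs (w, w2) where w2 shares a WordNet synset with w and the
    cosine similarity of their word2vec embeddings exceeds the threshold."""
    links = set()
    for w in vocab:
        if w not in kv:
            continue
        synonyms = set(chain.from_iterable(
            s.lemma_names() for s in wn.synsets(w)))
        for w2 in synonyms:
            if (w2 != w and w2 in vocab and w2 in kv
                    and kv.similarity(w, w2) > threshold):
                links.add((w, w2))
    return links
```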
(Figure: average running time per iteration over 100 iterations, averaged over 5 seeds, on the 20NG dataset.)
Coherence [Mimno+ 2011]
• C(t; V^t) = Σ_{m=2}^{M} Σ_{l=1}^{m-1} log( (F(v_m^t, v_l^t) + ε) / F(v_l^t) )
– F(v): document frequency of word type v
– F(v, v′): co-document frequency of word types v and v′
– ε is a very small smoothing constant that avoids log 0; its choice is explored in [Röder+ 2015]
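A direct transcription of the coherence formula, assuming the document and co-document frequencies are counted beforehand; all names, and the default ε value, are our assumptions.

```python
import math

def coherence(top_words, F, coF, eps=1e-12):
    """C(t; V^t) = sum_{m=2..M} sum_{l<m} log((F(v_m, v_l) + eps) / F(v_l)).

    top_words : [v_1, ..., v_M], the M most probable words of topic t
    F         : dict word -> document frequency
    coF       : dict frozenset({v, v2}) -> co-document frequency
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            v_m, v_l = top_words[m], top_words[l]
            joint = coF.get(frozenset((v_m, v_l)), 0)
            score += math.log((joint + eps) / F[v_l])
    return score
```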
• [Yang+ 2015] Efficient methods for incorporating knowledge into topic models.
• [Blei+ 2003] Latent Dirichlet allocation.
• [Griffiths+ 2004] Finding scientific topics.
• [Yao+ 2009] Efficient methods for topic model inference on streaming document collections.
• [Ramage+ 2009] Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora.
• [Andrzejewski+ 2009] Incorporating domain knowledge into topic modeling via Dirichlet forest priors.
• [Andrzejewski+ 2011] A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic.
• [Xie+ 2015] Incorporating word correlation knowledge into topic modeling.
• [Mimno+ 2011] Optimizing semantic coherence in topic models.
• [Röder+ 2015] Exploring the space of topic coherence measures.