Informed Neural Topic Model with Pre-trained Word Embeddings
Document Informed Neural Autoregressive Topic Models with Distributional Prior
Pankaj Gupta1,2, Yatin Chaudhary2, Florian Buettner2 & Hinrich Schütze1
1 CIS, University of Munich (LMU), Germany
2 Corporate Technology, Machine-Intelligence, Siemens AG Munich, Germany
pankaj.gupta@campus.lmu.de | pankaj.gupta@siemens.com
Introduction
A novel neural autoregressive topic model for short and long texts, empowered by:
• Context-awareness for learning better representations
• Distributional semantics, i.e., pre-trained word embeddings as prior knowledge (sketched in the equations below)
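As a worked equation for these two points, the autoregressive factorization below follows DocNADE [1]; the bidirectional variant is a sketch of how iDocNADE informs each prediction with the full surrounding context (the paper's exact parameterization may differ):

```latex
% DocNADE [1]: each word v_i is predicted from the preceding words only
\log p(\mathbf{v}) = \sum_{i=1}^{D} \log p(v_i \mid \mathbf{v}_{<i})

% iDocNADE (sketch): average forward and backward autoregressive
% log-likelihoods, so position i is informed by the full document
\log p(\mathbf{v}) \approx \frac{1}{2} \sum_{i=1}^{D}
  \Big[ \log \overrightarrow{p}(v_i \mid \mathbf{v}_{<i})
      + \log \overleftarrow{p}(v_i \mid \mathbf{v}_{>i}) \Big]
```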
Problem Statement / Motivation
1. “Need for context-awareness in representation learning”?
• To determine the actual meaning of ambiguous words
• To improve word and document representations
Figure 1: Need for context-awareness in learning representations
2. “Need for prior knowledge in limited-context settings”?
• Sparse word co-occurrences in short texts (e.g., headlines, tweets) or small corpora
• Difficult to learn good representations → generates incoherent topics
Figure 2: (left) Word embedding similarity; (right) topic examples
Figure 3: Contributions in this work
Evaluation and Analysis
• 8 short-text and 7 long-text datasets from news, Q&A, sentiment and industrial domains
• Generalization (perplexity), interpretability (topic coherence), text retrieval (IR) and classification (see the metric sketch after the tables below)
Table 1: Perplexity (PPL) and IR-precision (at retrieval fraction 0.02) scores for short and long texts
Table 2: IR-precision at different retrieval fractions
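For concreteness, a minimal sketch of the two headline metrics, assuming a model that returns per-document log-likelihoods and document vectors (function names and inputs here are illustrative, not from the released code): PPL exponentiates the negative per-word log-likelihood, and IR-precision at fraction f is the share of the top f·N retrieved documents (ranked by cosine similarity) that carry the query's label.

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """Per-word perplexity: exp(- sum_d log p(v^d) / sum_d |v^d|)."""
    return np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths))

def ir_precision(query_vec, query_label, doc_vecs, doc_labels, fraction=0.02):
    """Share of the top `fraction` nearest documents sharing the query's label."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    k = max(1, int(fraction * len(doc_labels)))
    top_k = np.argsort(-sims)[:k]          # indices of the k most similar docs
    return float(np.mean(doc_labels[top_k] == query_label))
```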
Methodology: Document Neural Autoregressive Topic Models
Figure 4: (left) DocNADE [1] (baseline); (right) iDocNADE (DocNADE + context-awareness)
Figure 5: (left) DocNADE [1] (baseline); (right) DocNADEe (DocNADE + word embeddings)
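A minimal NumPy sketch of the core computation behind Figures 4 and 5, assuming the embedding matrix E shares the hidden dimension (shapes and the mixing weight lam are illustrative; the released TensorFlow code linked below is the reference): DocNADE accumulates learned word columns W[:, v_k] over the preceding words, and DocNADEe additionally injects the matching columns of a fixed pre-trained embedding matrix E as prior knowledge.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_states(doc, W, c, E=None, lam=0.5):
    """h_i = g(c + sum_{k<i} W[:, v_k] [+ lam * E[:, v_k]]), g = sigmoid as in [1].

    doc : word indices v_1..v_D          W : learned matrix, shape (H, V)
    c   : hidden bias, shape (H,)        E : pre-trained embeddings, shape (H, V)
    lam : mixing weight for the DocNADEe embedding prior (hyperparameter)
    """
    acc, hs = np.zeros_like(c), []
    for v in doc:
        hs.append(sigmoid(c + acc))   # h_i sees only the words before position i
        acc += W[:, v]                # accumulate the learned representation
        if E is not None:
            acc += lam * E[:, v]      # DocNADEe: add distributional prior knowledge
    return np.stack(hs)               # shape (D, H)
```

Each h_i then parameterizes a softmax over the vocabulary for p(v_i | v_<i) (a binary-tree softmax in practice, for speed); iDocNADE runs the same accumulation in both directions.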
Table 3: (left) Topic coherence with the top 10 and 20 words; (right) qualitative example
Table 4: (left) Text classification (F1 and accuracy) scores for short texts
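The coherence score in Table 3 can be sketched as follows: topic k's top-N words are the largest entries of row k of W, and coherence averages a pairwise co-occurrence statistic over those words. The NPMI-style score below is a common simplified variant, not necessarily the exact measure used in the paper:

```python
import itertools
import numpy as np

def top_words(W, k, vocab, n=10):
    """Top-n words of topic k = largest entries in row k of W (shape H x V)."""
    return [vocab[j] for j in np.argsort(-W[k])[:n]]

def npmi_coherence(words, doc_sets, n_docs, eps=1e-12):
    """Mean NPMI over word pairs; doc_sets[w] = set of reference docs containing w."""
    scores = []
    for w1, w2 in itertools.combinations(words, 2):
        p1 = len(doc_sets[w1]) / n_docs
        p2 = len(doc_sets[w2]) / n_docs
        p12 = len(doc_sets[w1] & doc_sets[w2]) / n_docs
        p12 = min(max(p12, eps), 1.0 - eps)        # clamp to keep logs finite
        pmi = np.log(p12 / (p1 * p2 + eps))
        scores.append(pmi / -np.log(p12))          # normalize PMI into [-1, 1]
    return float(np.mean(scores))
```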
Conclusion & Key Takeaways
• Leverage full context + pre-trained word embeddings in a neural autoregressive topic model
• Gains of 5.2% (404 vs 426) in PPL, 2.8% (.74 vs .72) in topic coherence, 11.1% (.60 vs .54) in IR-precision, and 5.2% (.664 vs .631) in F1 for text categorization, on average over 15 datasets
• Demonstrate learning of better word/document representations for short and long texts
• Try it out: code available at https://github.com/pgcool/iDocNADEe
Our recent extension of this work: “textTOvec”
Pankaj Gupta, Yatin Chaudhary, Florian Buettner and Hinrich Schütze. textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior. To appear at ICLR 2019. TL;DR: a neural topic model with language structures.
References
[1] Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In Advances in Neural Information Processing Systems 25, pages 2708–2716. Curran Associates, Inc., 2012.