Document Informed Neural Autoregressive
Topic Models with Distributional Prior
Authors: Pankaj Gupta (1,2), Yatin Chaudhary (2), Florian Büttner (2) and Hinrich Schütze (1)
Presenter: Pankaj Gupta | AAAI-19, Honolulu, Hawaii, USA | January 2019
(1) CIS, University of Munich (LMU) | (2) Machine Intelligence, Siemens AG
Outline
Motivation
→ Context awareness in Topic Models
→ Distributional semantics in Neural Topic Model
→ Baseline: DocNADE
Proposed Models
→ DocNADE + context awareness, i.e., iDocNADE
→ DocNADE + word embeddings priors, i.e., DocNADEe
Evaluation
→ Generalization, Topic Coherence, Text retrieval and classification
This Work: CONTRIBUTIONS
Incorporate context information around words
→ determining the actual meaning of ambiguous words
→ improving word and document representations
Improve topic modeling for short-text and long-text documents
Incorporate external knowledge for each word
→ using distributional semantics, i.e., pre-trained word embeddings
→ improving document representations and topics
Motivation 1: Need for Full Contextual Information

Source Text → Sense/Topic
In biological brains, we study noisy neurons at cellular level → “biological neural network”
Like biological brains, study of noisy neurons in artificial neural networks → “artificial neural network”

Preceding Context + Following Context → Sense/Topic of “neurons”
Like biological brains, study of noisy (preceding context only) → “biological neural network” (wrong sense)
Like biological brains, study of noisy + in artificial neural networks → “artificial neural network” (correct sense)

Context information around words helps in determining their actual meaning!
Motivation 2: Need for Distributional Semantics or Prior Knowledge
➢ “Lack of context” in short-text documents, e.g., headlines, tweets, etc.
➢ “Lack of context” in a corpus of few documents

Small number of word co-occurrences → lack of context
→ difficult to learn good representations
→ generates incoherent topics

Example topics for ‘trading’:
Topic 1: price, wall, china, fall, shares (incoherent)
Topic 2: shares, price, profits, rises, earnings (coherent)

With 1-hot encodings, two documents about ‘trading’ may share no word overlap even though they belong to the same topic class.

TO THE RESCUE: use external/additional information, e.g., WORD EMBEDDINGS
(word embeddings encode semantic and syntactic relatedness of words in a vector space)
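To make this concrete, below is a minimal sketch contrasting 1-hot encodings with embedding vectors; the embedding values are made-up toy numbers standing in for real pre-trained vectors (e.g., GloVe), not any actual embedding model.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["shares", "profits", "china"]

# 1-hot encodings: distinct words are always orthogonal, so "shares" and
# "profits" look exactly as unrelated as "shares" and "china".
one_hot = np.eye(len(vocab))
print(cosine(one_hot[0], one_hot[1]))  # 0.0

# Toy embedding vectors (illustrative values only): semantically related
# words lie close together in the vector space.
emb = {"shares":  np.array([0.9, 0.1, 0.3]),
       "profits": np.array([0.8, 0.2, 0.4]),
       "china":   np.array([0.1, 0.9, 0.2])}
print(cosine(emb["shares"], emb["profits"]))  # high similarity (~0.98)
print(cosine(emb["shares"], emb["china"]))    # low similarity  (~0.27)
```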
Baseline: Neural Autoregressive Topic Model (DocNADE)
A probabilistic graphical model inspired by the RBM, RSM and NADE models
➢ learns topics over sequences of words, v
➢ learns distributed word representations based on word co-occurrences
➢ follows the encoder-decoder principle of unsupervised learning
➢ computes the joint distribution (and thus the log-likelihood) of a document v in language-modeling fashion via autoregressive conditionals, i.e., predicts the word v_i given the sequence of preceding words v_{<i}:

p(\mathbf{v}) = \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i})

Each autoregressive conditional, e.g., p(v_3 \mid \mathbf{v}_{<3}), is computed via a feed-forward neural network:

Encoding (embedding aggregation): \mathbf{h}_i(\mathbf{v}_{<i}) = g\big(\mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k}\big)
Decoding: p(v_i = w \mid \mathbf{v}_{<i}) = \frac{\exp(b_w + \mathbf{U}_{w,:}\,\mathbf{h}_i(\mathbf{v}_{<i}))}{\sum_{w'} \exp(b_{w'} + \mathbf{U}_{w',:}\,\mathbf{h}_i(\mathbf{v}_{<i}))}

where \mathbf{W} is the topic (word-representation) matrix, \mathbf{U} the decoding matrix, \mathbf{c}, \mathbf{b} the biases, and g(\cdot) a non-linearity.
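The following is a minimal NumPy sketch of these autoregressive conditionals with random stand-in parameters; it mirrors the encoding/decoding equations above but is not the authors' implementation (their code is linked at the end of the deck).

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 50                    # vocabulary size, hidden (topic) dimension
W = rng.normal(0.0, 0.01, (H, V))  # topic / word-representation matrix
U = rng.normal(0.0, 0.01, (V, H))  # decoding matrix
c, b = np.zeros(H), np.zeros(V)    # hidden and visible biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood(doc):
    """log p(v) = sum_i log p(v_i | v_<i), for a document given as word indices."""
    logp, agg = 0.0, c.copy()
    for v_i in doc:
        h_i = sigmoid(agg)                 # encoding: h_i(v_<i)
        logits = b + U @ h_i               # decoding: scores over the vocabulary
        logits -= logits.max()             # numerical stability for the softmax
        probs = np.exp(logits) / np.exp(logits).sum()
        logp += np.log(probs[v_i])         # autoregressive conditional p(v_i | v_<i)
        agg += W[:, v_i]                   # aggregate v_i's embedding for the next step
    return logp

print(log_likelihood([3, 42, 7, 99]))      # log-likelihood of a toy 4-word document
```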
Baseline: Neural Autoregressive Topic Model (DocNADE)
Limitations
➢ does not take into account the following words v>i in the sequence
➢ poor at modeling short-text documents due to limited context
➢ does not use pre-trained word embeddings (or external knowledge)
Proposed Neural Architectures, extending DocNADE to
→ incorporate full contextual information
→ incorporate pre-trained word embeddings
Proposed Variant 1: Contextualized DocNADE (iDocNADE)
➢ incorporates full contextual information around words in a document (preceding words v<i and following words v>i)
➢ boosts the likelihood of each word and, subsequently, the document likelihood
➢ improved representation learning
Need for full context: DocNADE → incomplete context around words; iDocNADE → full context around words
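A hedged sketch of iDocNADE's bidirectional likelihood, reusing log_likelihood() from the DocNADE sketch above: the document log-likelihood is taken as the mean of a forward pass (conditioning on v<i) and a backward pass (conditioning on v>i). The actual model maintains separate forward/backward biases with a shared topic matrix, which this simplification glosses over.

```python
def bidirectional_log_likelihood(doc):
    """iDocNADE-style sketch: average forward and backward autoregressive
    log-likelihoods, i.e. 1/2 * [sum_i log p(v_i|v_<i) + sum_i log p(v_i|v_>i)].
    Reversing the document makes each conditional depend on the following words."""
    forward = log_likelihood(doc)
    backward = log_likelihood(list(reversed(doc)))
    return 0.5 * (forward + backward)
```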
Proposed Variant 2: DocNADE + Embedding Priors ‘e’ (DocNADEe)
➢ introduces a weighted aggregation of pre-trained word embeddings at each autoregressive step k
➢ E: matrix of pre-trained embeddings, used as a fixed prior
➢ generates topics with embeddings
➢ learns a complementary textual representation
Encoding with the embedding prior: \mathbf{h}_i(\mathbf{v}_{<i}) = g\big(\mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k} + \lambda \sum_{k<i} \mathbf{E}_{:,v_k}\big), where \lambda is the mixture weight
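A sketch of this embedding prior, modifying the DocNADE sketch above: a fixed embedding matrix E (here randomly initialized as a stand-in for real pre-trained embeddings) is aggregated alongside W at every autoregressive step, scaled by the mixture weight λ.

```python
lam = 0.5                                   # mixture weight lambda (a hyperparameter)
E = rng.normal(0.0, 0.01, (H, V))           # stand-in for fixed pre-trained embeddings

def log_likelihood_e(doc):
    """DocNADEe-style sketch: DocNADE's hidden state additionally aggregates
    the fixed pre-trained embeddings E, weighted by the mixture weight lam."""
    logp, agg = 0.0, c.copy()
    for v_i in doc:
        h_i = sigmoid(agg)
        logits = b + U @ h_i
        logits -= logits.max()
        probs = np.exp(logits) / np.exp(logits).sum()
        logp += np.log(probs[v_i])
        agg += W[:, v_i] + lam * E[:, v_i]  # embedding-prior term added here
    return logp
```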
Proposed Variant 3: iDocNADE + Embedding Priors ‘e’ (iDocNADEe)
Variant 1 (iDocNADE) + Variant 2 (DocNADEe) → Variant 3 (iDocNADEe)
Evaluation: Datasets, Statistics and Properties
➢ 8 short-text and 7 long-text datasets
➢ short-text → fewer than 25 words per document
➢ includes a corpus of few documents (e.g., 20NSsmall)
➢ topics: 50 and 200 (hidden-layer size)
➢ Quantitative evaluation via:
- generalization (perplexity, PPL)
- interpretability (topic coherence)
- text/information retrieval (IR)
- text classification
Evaluation: Generalization via Perplexity (PPL)
Generalization (PPL) → lower is better
→ on short-text datasets: Gain (%) of 4.1%, 4.3% and 5.1%
→ on long-text datasets: Gain (%) of 5.3%, 4.8% and 5.5%
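Perplexity here is the exponentiated negative average per-word log-likelihood. A minimal sketch, reusing log_likelihood() from the DocNADE sketch above and assuming each document is a non-empty list of word indices:

```python
def perplexity(docs):
    """PPL = exp( -(1/N) * sum_t log p(v_t) / |v_t| ); lower is better."""
    avg = np.mean([log_likelihood(doc) / len(doc) for doc in docs])
    return float(np.exp(-avg))
```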
Evaluation: Applicability (Information Retrieval)
IR-precision → precision at retrieval fraction 0.02 → higher is better
→ on short-text datasets: Gain (%) of 5.6%, 7.4% and 11.1%
→ on long-text datasets: Gain (%) of 7.1%, 7.1% and 7.1%
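A sketch of this IR setup under stated assumptions: document representations (e.g., the models' hidden vectors) are compared by cosine similarity, and for each query we report the fraction of same-label documents among the top 2% retrieved. The array names and shapes here are illustrative, not the paper's evaluation code.

```python
def ir_precision(query_vecs, query_labels, doc_vecs, doc_labels, fraction=0.02):
    """Average label-match precision among the top `fraction` of the corpus,
    ranked by cosine similarity of document representations."""
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    k = max(1, int(fraction * len(doc_vecs)))
    precisions = []
    for q, y in zip(query_vecs, query_labels):
        sims = doc_unit @ (q / np.linalg.norm(q))   # cosine similarities
        top_k = np.argsort(-sims)[:k]               # indices of best matches
        precisions.append(np.mean(doc_labels[top_k] == y))
    return float(np.mean(precisions))
```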
Evaluation: Applicability (Information Retrieval)
[Figure: precision vs. retrieval-fraction curves on the TMNtitle and AGnewstitle datasets]
Evaluation: Interpretability (Topic Coherence)
→ assesses the meaningfulness of the underlying topics captured
→ coherence measure proposed by Roeder, Both, and Hinneburg (2015)
→ higher scores imply more coherent topics
→ reported on both short-text and long-text datasets
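One widely used implementation of the Roeder et al. (2015) coherence measures is gensim's CoherenceModel; below is a hedged sketch with placeholder topics and a placeholder corpus, not the paper's data or exact configuration.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Placeholder tokenized corpus and one candidate topic (illustrative only).
texts = [["shares", "price", "profits", "rises", "earnings"],
         ["price", "shares", "market", "profits", "earnings"]]
topics = [["shares", "price", "profits", "rises", "earnings"]]

dictionary = Dictionary(texts)
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                    coherence="c_v")
print(cm.get_coherence())  # higher scores imply more coherent topics
```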
Evaluation: Qualitative Topics (e.g., ‘religion’)
[Figure: example topics showing coherent topic words]
Conclusion: Take Away
➢ Leverages full contextual information in a neural autoregressive topic model
➢ Introduces distributional priors via pre-trained word embeddings
➢ Gains, on average over 15 datasets:
- 5.2% (404 vs 426) in perplexity
- 2.8% (.74 vs .72) in topic coherence
- 11.1% (.60 vs .54) in precision at retrieval fraction 0.02
- 5.2% (.664 vs .631) in F1 for text categorization
➢ Learns better word/document representations for short and long texts
➢ Unifies state-of-the-art topic models with textual representation learning

Try out: the code and data are available at https://github.com/pgcool/iDocNADEe
Thanks!
“textTOvec”: latest work, to appear at ICLR-19 | A Neural Topic Model with Language Structures