Document Informed Neural Autoregressive
Topic Models with Distributional Prior
Author(s): Pankaj Gupta1,2, Yatin Chaudhary2, Florian Büttner2 and Hinrich Schütze1
Presenter: Pankaj Gupta @ AAAI-19, Honolulu, Hawaii, USA, Jan 2019
1 CIS, University of Munich (LMU) | 2 Machine Intelligence, Siemens AG | Jan 2019
Outline
Motivation
→ Context awareness in Topic Models
→ Distributional semantics in Neural Topic Model
→ Baseline: DocNADE
Proposed Models
→ DocNADE + context awareness, i.e., iDocNADE
→ DocNADE + word embedding priors, i.e., DocNADEe
Evaluation
→ Generalization, Topic Coherence, Text retrieval and classification
This Work: CONTRIBUTIONS
Incorporate context information around words
→ determining actual meaning of ambiguous words
→ improving word and document representations
Improving Topic Modeling for short-text and long-text documents
Incorporate external knowledge for each word
→ using distributional semantics, i.e., word embeddings
→ improving document representations and topics
Motivation 1: Need for Full Contextual Information
Source Text → Sense/Topic
"In biological brains, we study noisy neurons at cellular level" → "biological neural network"
"Like biological brains, study of noisy neurons in artificial neural networks" → "artificial neural network"
Preceding Context | Following Context → Sense/Topic of "neurons"
"Like biological brains, study of noisy" | (following context not used) → "biological neural network"
"Like biological brains, study of noisy" | "in artificial neural networks" → "artificial neural network"
Context information around words helps in determining their actual meaning!
Motivation 2: Need for Distributional Semantics or Prior Knowledge
➢ "Lack of context" in short-text documents, e.g., headlines, tweets, etc.
➢ "Lack of context" in a corpus of few documents
Small number of word co-occurrences → lack of context → difficult to learn good representations → incoherent topics
Example topics for 'trading':
Topic 1 (incoherent): price, wall, china, fall, shares
Topic 2 (coherent): shares, price, profits, rises, earnings
TO THE RESCUE: use external/additional information, e.g., WORD EMBEDDINGS
(they encode semantic and syntactic relatedness of words in a vector space)
Example: two documents with no word overlap (e.g., under 1-hot encoding) can still belong to the same topic class, 'trading'; word embeddings capture this semantic relatedness.
Baseline: Neural Autoregressive Topic Model (DocNADE)
A probabilistic graphical model inspired by the RBM, RSM and NADE models
➢ learns topics over sequences of words, v
➢ learns distributed word representations based on word co-occurrences
➢ follows an encoder-decoder principle (unsupervised learning)
➢ computes the joint distribution (log-likelihood) of a document v in language-modeling fashion via autoregressive conditionals, i.e., predicts word vi given the sequence of preceding words v<i:
   p(v) = ∏i p(vi | v<i)
Each autoregressive conditional p(vi | v<i) is computed by a feed-forward neural network:
   Encoding: hi(v<i) = g(c + Σk<i W[:, vk])   (aggregation of the topic-matrix columns, i.e., word embeddings, of the preceding words; W is the topic matrix)
   Decoding: p(vi = w | v<i) = softmax(bw + U[w, :] hi(v<i))
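Below is a minimal NumPy sketch of this encode/decode loop: it aggregates topic-matrix columns of the preceding words to form hi(v<i) and applies a full softmax decoder (the actual DocNADE uses a tree/hierarchical softmax for efficiency). All parameter values are random toy values for illustration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 50                      # vocabulary size, number of topics (hidden units)
W = rng.normal(0, 0.01, (H, V))      # topic / word-representation matrix (encoder)
U = rng.normal(0, 0.01, (V, H))      # decoder matrix
b, c = np.zeros(V), np.zeros(H)      # visible and hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def docnade_log_likelihood(doc):
    """log p(v) = sum_i log p(v_i | v_<i) for a document given as a list of word indices."""
    log_p = 0.0
    h_pre = c.copy()                 # running sum c + sum_{k<i} W[:, v_k]
    for v_i in doc:
        h = sigmoid(h_pre)           # hidden state h_i(v_<i)
        p = softmax(b + U @ h)       # autoregressive conditional over the vocabulary
        log_p += np.log(p[v_i])
        h_pre += W[:, v_i]           # add the current word for the next position
    return log_p

doc = [3, 981, 17, 42, 3]            # toy document as word indices
print(docnade_log_likelihood(doc))
```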
Baseline: Neural Autoregressive Topic Model (DocNADE)
Limitations
➢ does not take into account the following words v>i in the sequence
➢ poor in modeling short-text documents due to limited context
➢ does not use pre-trained word embeddings (or external knowledge)
[Figure omitted: DocNADE's autoregressive conditional p(vi | v<i), encoding/decoding over the preceding words only; the following words v>i are not taken into account.]
Proposed Neural Architectures, extending DocNADE to
→ incorporate full contextual information
→ incorporate pre-trained word embeddings
Proposed Variant 1: Contextualized DocNADE (iDocNADE)
➢ incorporates full contextual information around words in a document (preceding and following words)
➢ boosts the likelihood of each word and, subsequently, the document likelihood
➢ improves representation learning
DocNADE: incomplete context around words (conditions only on the preceding words v<i)
iDocNADE: full context around words (conditions on the preceding words v<i and the following words v>i, via forward and backward autoregressive conditionals)
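A sketch of the bi-directional idea, under the assumption (consistent with the description above) that iDocNADE adds a backward pass conditioning each word on the following words v>i and averages the forward and backward log-likelihoods; parameters are shared toy values as in the previous sketch, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 1000, 50                       # vocabulary size, number of topics
W = rng.normal(0, 0.01, (H, V))       # topic matrix, shared by both directions
U = rng.normal(0, 0.01, (V, H))
b, c = np.zeros(V), np.zeros(H)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def directional_log_likelihood(doc):
    """Sum of log p(word | words seen so far) when scanning `doc` in the given order."""
    log_p, h_pre = 0.0, c.copy()
    for v_i in doc:
        p = softmax(b + U @ sigmoid(h_pre))
        log_p += np.log(p[v_i])
        h_pre += W[:, v_i]
    return log_p

def idocnade_log_likelihood(doc):
    # Forward pass conditions each word on the preceding words v_<i;
    # backward pass (document reversed) conditions each word on the following words v_>i.
    # The document log-likelihood is taken as the average of the two directions.
    return 0.5 * (directional_log_likelihood(doc) +
                  directional_log_likelihood(doc[::-1]))

print(idocnade_log_likelihood([3, 981, 17, 42, 3]))
```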
Proposed Variant 2: DocNADE + Embedding Priors 'e' (DocNADEe)
➢ introduces a weighted aggregation of pre-trained word embeddings at each autoregressive step k:
   hk(v<k) = g(c + Σq<k W[:, vq] + λ Σq<k E[:, vq]),   where λ is the mixture weight
➢ E: pre-trained embedding matrix, used as a fixed prior
➢ generates topics enriched with word embeddings
➢ learns a complementary textual representation
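A sketch of this embedding-prior aggregation; here E is a random stand-in for a pre-trained embedding matrix (e.g., GloVe vectors sized to the hidden dimension) and the mixture weight lam is an illustrative value, not the tuned hyperparameter from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
V, H = 1000, 50
W = rng.normal(0, 0.01, (H, V))       # trainable topic matrix
E = rng.normal(0, 0.10, (H, V))       # stand-in for pre-trained embeddings, kept FIXED (the prior)
U = rng.normal(0, 0.01, (V, H))
b, c = np.zeros(V), np.zeros(H)
lam = 1.0                             # mixture weight of the embedding prior (a hyperparameter)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def docnade_e_log_likelihood(doc):
    """log p(v): at each step k, aggregate BOTH topic-matrix and embedding columns of v_<k."""
    log_p = 0.0
    w_sum, e_sum = np.zeros(H), np.zeros(H)   # running sums over the preceding words
    for v_k in doc:
        h = sigmoid(c + w_sum + lam * e_sum)  # hidden state with the embedding prior added
        log_p += np.log(softmax(b + U @ h)[v_k])
        w_sum += W[:, v_k]
        e_sum += E[:, v_k]
    return log_p

print(docnade_e_log_likelihood([3, 981, 17, 42, 3]))
```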
Proposed Variant 3: iDocNADE + Embedding Priors 'e' (iDocNADEe)
Variant 1: iDocNADE (full context)
Variant 2: DocNADEe (embedding priors)
Variant 3: iDocNADEe (full context + embedding priors)
Evaluation: Datasets, Statistics and Properties
➢ 8 short-text and 7 long-text datasets
➢ short-text → fewer than 25 words per document
➢ includes a corpus of few documents (e.g., 20NSsmall)
➢ number of topics: 50 and 200 (hidden layer size)
➢ Quantitatively evaluated using:
- generalization (perplexity, PPL)
- interpretability using topic coherence
- text/information retrieval (IR)
- text classification
Evaluation: Generalization via Perplexity (PPL)
Generalization (PPL) → lower is better
→ on short-text datasets (results table omitted)
→ Gain (%): 4.1% | 4.3% | 5.1%
Evaluation: Generalization via Perplexity (PPL)
Generalization (PPL) → lower is better
→ on long-text datasets (results table omitted)
→ Gain (%): 5.3% | 4.8% | 5.5%
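For reference, a sketch of how perplexity is typically computed from held-out document log-likelihoods in this line of work (exponentiated negative average per-word log-likelihood); the convention is assumed rather than quoted from the paper, and the numbers in the usage example are made up.

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """
    PPL = exp( - (1/N) * sum_d  log p(v_d) / |v_d| ),
    i.e., exponentiated negative average per-word log-likelihood over N held-out documents.
    """
    log_likelihoods = np.asarray(log_likelihoods, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-np.mean(log_likelihoods / doc_lengths)))

# toy usage with made-up log-likelihoods (natural log) for three held-out documents
print(perplexity([-120.0, -95.5, -210.3], [20, 15, 35]))
```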
Evaluation: Applicability (Information Retrieval)
IR-precision (on short-text data)
→ precision at retrieval fraction 0.02
→ higher is better
→ Gain (%): 5.6% | 7.4% | 11.1%
Evaluation: Applicability (Information Retrieval)
IR-precision (on long-text data)
→ precision at retrieval fraction 0.02
→ higher is better
→ Gain (%): 7.1% | 7.1% | 7.1%
Evaluation: Applicability (Information Retrieval)
→ precision-recall curves on TMNtitle and AGnewstitle (figures omitted)
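A sketch of the IR-precision metric as commonly used with DocNADE-style document representations: each held-out document is a query, training documents are ranked by cosine similarity of the document vectors, and precision is the label-match rate among the top 2% retrieved. The exact retrieval protocol of the paper may differ in details; all data below are toy values.

```python
import numpy as np

def ir_precision(query_vecs, query_labels, train_vecs, train_labels, fraction=0.02):
    """
    For each query document, rank training documents by cosine similarity of their
    document representations, then compute the fraction of the top `fraction`
    retrieved documents sharing the query's label; return the average over queries.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    q, t = normalize(query_vecs), normalize(train_vecs)
    sims = q @ t.T                                    # cosine similarities, shape (Q, T)
    k = max(1, int(fraction * train_vecs.shape[0]))   # number of documents retrieved
    top = np.argsort(-sims, axis=1)[:, :k]            # indices of the top-k training docs
    hits = (train_labels[top] == query_labels[:, None])
    return float(hits.mean())

# toy usage with random 50-dim document vectors and 3 label classes
rng = np.random.default_rng(3)
qv, tv = rng.normal(size=(10, 50)), rng.normal(size=(500, 50))
ql, tl = rng.integers(0, 3, 10), rng.integers(0, 3, 500)
print(ir_precision(qv, ql, tv, tl, fraction=0.02))
```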
Evaluation: Interpretability (Topic Coherence)
→ assesses the meaningfulness of the captured topics
→ coherence measure proposed by Roeder, Both, and Hinneburg (2015)
→ higher scores imply more coherent topics
→ reported on both short-text and long-text datasets (tables omitted)
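The coherence measure of Roeder, Both, and Hinneburg (2015) corresponds to gensim's "c_v" coherence; a sketch assuming topics are given as lists of top words and a tokenized reference corpus is available (the texts and topics below are toy data, not the paper's datasets).

```python
# pip install gensim
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# toy tokenized reference corpus and two toy topics (top words per topic)
texts = [
    ["shares", "price", "profits", "rises", "earnings"],
    ["price", "fall", "shares", "market", "trading"],
    ["church", "god", "faith", "belief", "religion"],
]
topics = [
    ["shares", "price", "profits", "earnings"],
    ["god", "faith", "church", "religion"],
]

dictionary = Dictionary(texts)
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())            # average coherence; higher = more coherent topics
print(cm.get_coherence_per_topic())  # one score per topic
```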
Evaluation: Qualitative Topics (e.g., 'religion')
→ example topic showing coherent topic words (table omitted)
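One common way to read such qualitative topics out of a DocNADE-style model is to rank, for each hidden unit (topic), the vocabulary words by their weight in the topic matrix W; the selection heuristic is an assumption rather than a quote from the paper, and the matrix and vocabulary below are toy values.

```python
import numpy as np

def top_topic_words(W, vocab, topn=10):
    """
    Read topics out of the topic matrix W (shape: n_topics x vocab_size):
    for each hidden unit / topic, return the `topn` words with the largest weights.
    """
    topics = []
    for h in range(W.shape[0]):
        best = np.argsort(-W[h])[:topn]
        topics.append([vocab[i] for i in best])
    return topics

# toy usage: 2 "topics" over a 6-word vocabulary
vocab = ["shares", "price", "earnings", "god", "faith", "church"]
W = np.array([
    [0.9, 0.8, 0.7, 0.0, 0.1, 0.0],   # a finance-like topic
    [0.0, 0.1, 0.0, 0.9, 0.8, 0.7],   # a religion-like topic
])
for h, words in enumerate(top_topic_words(W, vocab, topn=3)):
    print(f"Topic {h}: {', '.join(words)}")
```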
Conclusion: Take Away
➢ Leveraging full contextual information in a neural autoregressive topic model
➢ Introducing distributional priors via pre-trained word embeddings
➢ Gains, on average over 15 datasets:
   5.2% (404 vs 426) in perplexity
   2.8% (.74 vs .72) in topic coherence
   11.1% (.60 vs .54) in precision at retrieval fraction 0.02
   5.2% (.664 vs .631) in F1 for text categorization
➢ Learning better word/document representations for short and long texts
➢ State-of-the-art topic models unified with textual representation learning
Try it out: the code and data are available at https://github.com/pgcool/iDocNADEe
Thanks!
"textTOvec": latest work, to appear at ICLR 2019 | A Neural Topic Model with Language Structures