
Survey: Topical Phrases and Exploration of Text Corpora by Topics


University of Illinois at Urbana-Champaign
DEPARTMENT OF COMPUTER SCIENCE

CS 512 – Spring 2014

Ahmed El-Kishky
Mehrdad Biglari
Ramesh Kolavennu



  1. Title slide: Topical Phrases and Exploration of Text Corpora by Topics. University of Illinois at Urbana-Champaign, Department of Computer Science. Ahmed El-Kishky, Mehrdad Biglari, Ramesh Kolavennu. CS 512 – Spring 2014. Professor: Jiawei Han
  2. Introduction
     • This survey relates to a system for text exploration based on the ToPMine framework (Scalable Topical Phrase Mining from Large Text Corpora, El-Kishky, Song, Wang, Voss, Han).
  3. Outline
     • Unigram Topic Modeling
       – Latent Dirichlet Allocation
     • Post Topic Model Topical Phrase Construction
       – Turbo Topics
       – Keyphrase Extraction and Ranking by Topic (KERT)
       – Constructing Topical Hierarchies
     • Inferring Phrases and Topic
       – Bigram Language Model
       – Topical N-grams
       – Phrase-Discovering LDA
     • Separation of Phrase and Topic
       – ToPMine
     • Constraining LDA
       – Hidden Topic Markov Model
       – Sentence LDA
     • Frequent Pattern Mining and Topic Models
       – Sentence LDA
     • Phrase Extraction Literature
       – Common Phrase Extraction Techniques
     • Visualization
       – Visualizing Topic Models
  4. Latent Dirichlet Allocation
     LDA is a generative topic model that posits that each document is a mixture of a small number of topics, with each topic a multinomial over words.
     1. Sample a distribution over topics from a Dirichlet prior.
     2. For each word in the document:
        – Randomly choose a topic from the distribution over topics in step 1.
        – Randomly choose a word from the corresponding topic's distribution over the vocabulary.
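To make the two-step generative story concrete, here is a minimal, self-contained sketch in Python/NumPy. The toy vocabulary, topic count, and Dirichlet hyperparameters are illustrative assumptions, not values from the survey.

```python
import numpy as np

# Minimal sketch of the LDA generative story from the slide.
# The vocabulary, K, alpha, and beta below are made-up illustrative values.
rng = np.random.default_rng(0)

vocab = ["learning", "svm", "query", "index", "graph", "cluster"]
K = 2                              # number of topics (assumption)
alpha = np.full(K, 0.5)            # Dirichlet prior over topics per document
beta = np.full(len(vocab), 0.1)    # Dirichlet prior over words per topic

# Topic-word multinomials (one per topic), drawn once for the corpus.
phi = rng.dirichlet(beta, size=K)

def generate_document(n_words):
    # Step 1: sample this document's distribution over topics.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic from the document's topic distribution.
        z = rng.choice(K, p=theta)
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        w = rng.choice(len(vocab), p=phi[z])
        words.append(vocab[w])
    return words

print(generate_document(10))
```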
  5. LDA: Sample Topics
  6. Post Topic Modeling Phrase Construction
     • Turbo Topics
     • Keyphrase Extraction and Ranking by Topics (KERT)
     • Constructing Topical Hierarchies (CATHY)
  7. Turbo Topics
     Construction of topical phrases as a post-processing step to LDA:
     1. Perform LDA topic modeling on the input corpus.
     2. Find consecutive words within the corpus that are assigned the same topic.
     3. Perform recursive permutation tests on a back-off model to test the significance of candidate phrases.
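A small illustration of step 2, assuming per-token topic assignments from LDA are already available: scan the sequence for runs of consecutive words with the same topic. The token/topic pairs below are made up; the real method would follow this with the recursive permutation tests mentioned above.

```python
from itertools import groupby

# Hypothetical (token, topic) sequence produced by LDA inference.
tagged = [("support", 3), ("vector", 3), ("machine", 3),
          ("improves", 0), ("supervised", 3), ("learning", 3)]

candidate_runs = []
for topic, run in groupby(tagged, key=lambda pair: pair[1]):
    words = [w for w, _ in run]
    if len(words) > 1:                 # only multi-word runs can form phrases
        candidate_runs.append((topic, words))

print(candidate_runs)
# [(3, ['support', 'vector', 'machine']), (3, ['supervised', 'learning'])]
```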
  8. KERT
     Construction of topical phrases as a post-processing step to LDA:
     1. Perform LDA topic modeling on the input corpus.
     2. Partition each document into K documents (K = number of topics).
     3. For each topic, mine frequent patterns on the new corpus.
     4. Rank the output for each topic based on four heuristic measures:
        • Purity – how often a phrase appears in a topic relative to other topics
        • Completeness – threshold filtering of subphrases (e.g., "support vector" vs. "support vector machines")
        • Coverage – frequency of the phrase within the topic
        • Phrase-ness – ratio of expected occurrence of a phrase to actual occurrence
     [Example topic phrases from the slide: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, ...]
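A rough sketch of KERT's partition step (step 2), assuming per-token topic assignments are available: each document is split into K pseudo-documents by routing every token to its assigned topic, after which frequent-pattern mining runs separately per topic. The toy document and assignments are assumptions for illustration.

```python
from collections import defaultdict

K = 2  # number of topics (assumption)

# Hypothetical document as (word, assigned topic) pairs from LDA.
doc = [("support", 1), ("vector", 1), ("machines", 1),
       ("database", 0), ("query", 0), ("learning", 1)]

partitioned = defaultdict(list)        # topic id -> tokens routed to that topic
for word, topic in doc:
    partitioned[topic].append(word)

for topic in range(K):
    print(topic, partitioned[topic])
# 0 ['database', 'query']
# 1 ['support', 'vector', 'machines', 'learning']
```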
  9. KERT: Sample Results
  10. CATHY
      1. Create a term co-occurrence network.
      2. Cluster the link weights into "topics".
      3. To find subtopics, recursively cluster the subgraphs.
      [Diagram: a topical hierarchy over DBLP, with subtopics such as DB, DM, IR, and AI under Data & Information Systems]
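A minimal sketch of CATHY's first step, assuming the unit of co-occurrence is a document title: build a term co-occurrence network whose edge weights count how often two terms appear in the same title. The toy corpus is an assumption; clustering the edge weights into topics and recursing on each topic's subgraph would follow.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus of titles (bags of terms).
titles = [
    ["mining", "frequent", "patterns"],
    ["frequent", "patterns", "databases"],
    ["query", "databases"],
]

edge_weights = Counter()
for title in titles:
    # Each unordered term pair in a title adds one unit of edge weight.
    for u, v in combinations(sorted(set(title)), 2):
        edge_weights[(u, v)] += 1

# Only the network construction is shown; topic clustering comes next in CATHY.
for edge, w in edge_weights.most_common():
    print(edge, w)
```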
  11. CATHY: Sample Results
  12. Inferring Phrase and Topic
      • Bigram Topic Model
      • Topical N-grams
      • Phrase-Discovering LDA
  13. Bigram Topic Model
      1. Draw a discrete word distribution from a Dirichlet prior for each topic/word pair.
      2. For each document, draw a discrete topic distribution from a Dirichlet prior. Then for each word:
         – Draw a topic from the document-topic distribution from step 2.
         – Draw a word from the step 1 distribution indexed by the drawn topic and the previous word.
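A simplified generative sketch of the process above: every (topic, previous word) pair owns its own word distribution, so each word depends on both the sampled topic and the preceding word. Vocabulary size, topic count, and priors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K = 6, 2                 # vocabulary size and topic count (assumptions)
alpha, beta = 0.5, 0.1      # symmetric Dirichlet hyperparameters (assumptions)

# Step 1: one word distribution per (topic, previous word) pair; shape (K, V, V).
phi = rng.dirichlet(np.full(V, beta), size=(K, V))

def generate_document(n_words, start_word=0):
    # Step 2: per-document topic distribution.
    theta = rng.dirichlet(np.full(K, alpha))
    prev, words = start_word, []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)          # topic for this position
        w = rng.choice(V, p=phi[z, prev])   # word given topic and previous word
        words.append(int(w))
        prev = w
    return words

print(generate_document(8))
```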
  14. Topical N-Grams
      1. Draw a discrete word distribution from a Dirichlet prior for each topic.
      2. Draw Bernoulli distributions from a Beta prior for each topic and each word.
      3. Draw discrete distributions from a Dirichlet prior for each topic and each word.
      4. For each document, draw a discrete document-topic distribution from a Dirichlet prior. Then for each word in the document:
         – Draw a value, X, from the Bernoulli distribution conditioned on the previous word and its topic.
         – Draw a topic, T, from the document-topic distribution.
         – If X = 1, draw the word from the previous word's bigram word-topic distribution; otherwise draw it from the unigram distribution of topic T.
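A simplified generative sketch of the TNG story above. The Bernoulli "bigram switch" conditioned on the previous word and its topic decides whether the current word continues a phrase (a bigram draw) or is drawn from its topic's unigram distribution. Sizes, priors, and the exact indexing of the bigram distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
V, K = 6, 2                                           # assumptions

phi_uni = rng.dirichlet(np.full(V, 0.1), size=K)      # unigram dist per topic
phi_bi = rng.dirichlet(np.full(V, 0.1), size=(K, V))  # bigram dist per (topic, prev word)
psi = rng.beta(1.0, 1.0, size=(K, V))                 # Bernoulli switch per (topic, prev word)

def generate_document(n_words):
    theta = rng.dirichlet(np.full(K, 0.5))            # document-topic distribution
    prev_w, prev_z, out = 0, 0, []
    for _ in range(n_words):
        x = rng.random() < psi[prev_z, prev_w]        # continue a phrase?
        z = rng.choice(K, p=theta)                    # topic for this word
        if x:
            w = rng.choice(V, p=phi_bi[z, prev_w])    # bigram draw after previous word
        else:
            w = rng.choice(V, p=phi_uni[z])           # fresh unigram draw from topic z
        out.append((int(w), int(z), int(x)))
        prev_w, prev_z = w, z
    return out

print(generate_document(8))
```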
  15. TNG: Sample Results
  16. Phrase-Discovering LDA
      • Each sentence is viewed as a time series of words.
      • PD-LDA posits that the generative parameter (the topic) changes in accordance with changepoint indicators. As a result, topical phrases can be arbitrarily long but are always drawn from a single topic.
      • The process is as follows:
        1. Each token is drawn from a distribution conditioned on the context.
        2. The context consists of the phrase topic and the past m words. When m = 1, this is analogous to TNG; for larger contexts, the distributions are Pitman-Yor processes linked together in a hierarchical tree structure.
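A heavily simplified illustration of the changepoint idea only: a changepoint starts a new phrase with a freshly drawn topic, so phrases can be arbitrarily long while every word in a phrase shares one topic. The real model conditions each token on the previous m words through hierarchical Pitman-Yor processes; here the word draw is collapsed to a per-topic distribution purely for illustration, and all sizes and probabilities are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
V, K = 6, 2                                       # assumptions
phi = rng.dirichlet(np.full(V, 0.1), size=K)      # per-topic word distributions (simplification)
theta = rng.dirichlet(np.full(K, 0.5))            # topic distribution for the sentence

def generate_sentence(n_words, p_change=0.4):
    topic, phrases, current = None, [], []
    for i in range(n_words):
        changepoint = (i == 0) or (rng.random() < p_change)
        if changepoint:                           # start a new phrase with a new topic
            if current:
                phrases.append((topic, current))
            topic, current = int(rng.choice(K, p=theta)), []
        # every word in the current phrase is drawn from that phrase's single topic
        current.append(int(rng.choice(V, p=phi[topic])))
    phrases.append((topic, current))
    return phrases

print(generate_sentence(10))
```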
  17. PD-LDA: Sample Results
  18. Separating Phrases and Topics: ToPMine
      • ToPMine addresses the computational and quality issues of other topical phrase extraction methods by separating phrase finding from topic modeling.
      • First, the corpus is transformed from a bag of words to a bag of phrases via an efficient agglomerative grouping algorithm.
      • The resulting bag of phrases is then the input to PhraseLDA, a phrase-based topic model.
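A hedged sketch of the agglomerative phrase grouping: within a sentence, repeatedly merge the adjacent pair of units with the highest significance score until no merge clears a threshold. The significance statistic below is a simple stand-in for the paper's measure, and the counts and threshold are illustrative assumptions.

```python
from collections import Counter
import math

# Hypothetical corpus statistics (assumptions for illustration).
unigram_counts = Counter({"support": 50, "vector": 40, "machines": 30, "learn": 80, "fast": 60})
bigram_counts = Counter({("support", "vector"): 25, ("vector", "machines"): 20,
                         ("machines", "learn"): 2, ("learn", "fast"): 3})
total = sum(unigram_counts.values())

def significance(left, right):
    # Compare observed co-occurrence at the merge boundary against an independence baseline.
    observed = bigram_counts.get((left[-1], right[0]), 0)
    expected = unigram_counts[left[-1]] * unigram_counts[right[0]] / total
    return (observed - expected) / math.sqrt(expected + 1e-9)

def merge_phrases(sentence, threshold=2.0):
    units = [[w] for w in sentence]
    while len(units) > 1:
        scores = [significance(units[i], units[i + 1]) for i in range(len(units) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break                                        # no remaining merge is significant
        units[best] = units[best] + units.pop(best + 1)  # agglomerative merge of adjacent units
    return [" ".join(u) for u in units]

print(merge_phrases(["support", "vector", "machines", "learn", "fast"]))
# ['support vector machines', 'learn', 'fast']
```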
  19. ToPMine
      [Diagram: Step 1 and Step 2 of the ToPMine pipeline, ending in "Good Phrases!"]
  20. Constraining LDA
      • Constraining LDA's topic assignments has been performed with positive results.
      • Sentence LDA constrains all words in a sentence to share the same topic.
      • This model is adapted to work on mined phrases, where each phrase shares a single topic (a weaker assumption than constraining whole sentences).
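A minimal sketch of the shared-topic constraint, assuming illustrative sizes and priors: one topic is drawn per unit (a sentence in Sentence LDA, a mined phrase in the adaptation above) and every word in that unit is generated from the same topic.

```python
import numpy as np

rng = np.random.default_rng(4)
V, K = 6, 2                                       # assumptions
phi = rng.dirichlet(np.full(V, 0.1), size=K)      # per-topic word distributions

def generate_document(unit_lengths):
    theta = rng.dirichlet(np.full(K, 0.5))        # document-topic distribution
    doc = []
    for length in unit_lengths:
        z = int(rng.choice(K, p=theta))           # one topic for the whole unit
        unit = [int(rng.choice(V, p=phi[z])) for _ in range(length)]
        doc.append((z, unit))
    return doc

# Three units (sentences or mined phrases) of lengths 4, 2, and 3.
print(generate_document([4, 2, 3]))
```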
  21. Visualizing Topic Models
  22. Visualizing Topic Models
  23. Q&A
