CIS 890 – Information Retrieval
Project Proposal: “Topic Modeling using Wikipedia Concepts”
Svitlana Volkova
Topic Modeling
Topic modeling is a classic problem in information retrieval1:
- Latent Semantic Indexing/Analysis (LSI/LSA)
- Probabilistic Latent Semantic Indexing/Analysis (pLSI/pLSA)
- Latent Dirichlet Allocation (LDA) is the most popular topic model; it makes the “bag of words” assumption that “words are generated independently of each other”
1Wikipedia - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
From LSI and pLSA to LDA
polysemy/synonymy -> probability -> exchangeability
2[Sal83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1983.
3[Dee90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
4[Hof99] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
5[Ble03] Blei, D.M., Ng, A.Y., Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993–1022, 2003.
Language Model Representation
- probability of a sequence of words
- each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter
Generative vs. Discriminative: Topic Modeling with Latent Dirichlet Allocation
- a word is represented as a multinomial random variable w
- a topic is represented as a multinomial random variable z
- a document is represented as a Dirichlet random variable θ
- each corner of the simplex corresponds to a topic – a component of the vector z;
- a document is modeled as a point on the simplex – a multinomial distribution over the topics;
- a corpus is modeled as a Dirichlet distribution on the simplex6.
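The generative view above can be sketched in a few lines of NumPy: each document draws a topic mixture θ from a Dirichlet, and each token draws a topic z from θ and then a word w from that topic's multinomial. The vocabulary size, topic count, and hyperparameters below are illustrative assumptions, not values from the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 8, 2             # vocabulary size and number of topics (toy values)
alpha, beta = 0.5, 0.5  # symmetric Dirichlet hyperparameters (assumed)

# Each topic is a multinomial over the vocabulary, drawn from Dirichlet(beta).
phi = rng.dirichlet([beta] * V, size=K)        # shape (K, V)

def generate_document(n_words):
    """Generate one document under the LDA generative process."""
    theta = rng.dirichlet([alpha] * K)         # document = point on the topic simplex
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # choose a topic for this token
        w = rng.choice(V, p=phi[z])            # choose a word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document(20)
print(theta, doc[:5])
```

This is exactly the “bag of words” assumption on the slide: given θ and φ, every token is generated independently of its neighbors.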
Disadvantages of the “Bag of Words” Assumption
- TEXT ≠ a sequence of discrete word tokens
- the actual meaning cannot be captured by word co-occurrences alone
- word order is not important for syntax, but it is important for lexical meaning
- word order within “nearby” contexts and phrases is critical to capturing the meaning of text
Problem Statement
How can the “Information Retrieval” topic be represented? What about “Artificial Intelligence”?
Unigrams -> …, information, search, …, web
Unigrams -> agent, …, information, search, …
Issues with using unigrams for topic modeling:
- not representative enough for a single topic
- ambiguous (concepts shared across topics): system, modeling, information, data, structure…
Where can we get additional knowledge?
N-Grams for Topic Modeling
- Phrases as a whole carry more information than the sum of their individual components, e.g. “artificial intelligence”, “natural language processing”
- They are much more informative for determining the topics of a collection than individual words, e.g. in the Acknowledgments section of a paper: “National Institutes of Health” and “National Science Foundation” =?
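As a quick illustration of why phrases carry extra signal, even a raw bigram count over a toy corpus surfaces multi-word units that unigram counts miss. The sample sentences below are invented for the sketch.

```python
from collections import Counter

# Toy corpus; the sentences are invented for illustration.
corpus = [
    "artificial intelligence studies intelligent agents",
    "natural language processing is a field of artificial intelligence",
    "artificial intelligence and natural language processing overlap",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# Frequent bigrams are candidate phrases / topical units.
print(bigrams.most_common(3))
```

Here “artificial intelligence” dominates the bigram table, while the unigram table splits that evidence across two ambiguous words.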
Bigram Topic Models: Wallach’s Model
- biological neural networks in neuroscience
- neural networks in artificial intelligence
-> “neural network”
Wallach’s Bigram Topic Model (Wal’05)7 is based on the Hierarchical Dirichlet Language Model (Pet’94)8
7[Wal’05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005.
8[Pet’94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1, 1–19, 1994.
Bigram Topic Models: LDA-Collocation Model
The LDA-Collocation Model (Ste’05)9 showed how to take advantage of a bigram model in a Bayesian way: for each token it can decide whether to generate a bigram or a unigram.
9[Ste’05] Steyvers, M., & Griffiths, T. Matlab topic modeling toolbox 1.3. http://psiexp.ss.uci.edu/research/programs data/toolbox.htm, 2005.
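The key extension here is a per-token binary switch: conditioned on the previous word, the token is either generated as part of a collocation with its predecessor or drawn from a topic as in plain LDA. A minimal sketch of that generative step, with invented vocabulary, probabilities, and successor distributions:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["neural", "network", "learning", "data"]
K = 2
# Topic-word distributions (invented for the sketch).
phi = rng.dirichlet([1.0] * len(vocab), size=K)
# P(x=1 | previous word): probability the next token continues a collocation.
p_collocation = {"neural": 0.9, "network": 0.1, "learning": 0.1, "data": 0.1}
# Bigram successor distributions P(w_i | w_{i-1}) used when x=1 (invented).
successor = {"neural": [0.0, 1.0, 0.0, 0.0]}  # "neural" -> "network"

def next_token(prev_word, theta):
    """One LDA-COL-style generative step for the token after prev_word."""
    x = rng.random() < p_collocation.get(prev_word, 0.1)
    if x and prev_word in successor:
        w = rng.choice(len(vocab), p=successor[prev_word])  # bigram path
    else:
        z = rng.choice(K, p=theta)
        w = rng.choice(len(vocab), p=phi[z])                # unigram/topic path
    return vocab[w]

theta = rng.dirichlet([0.5] * K)
samples = [next_token("neural", theta) for _ in range(200)]
print(samples.count("network") / len(samples))
```

With the high switch probability after “neural”, most sampled successors are “network”, which is how the model keeps “neural network” together instead of splitting it across topics.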
N-gram Topic Models10
HMMLDA captures word dependencies:
- HMM -> short-range syntactic dependencies
- LDA -> long-range semantic dependencies
10[Wan’07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
LDA vs. TNG for Topic Modeling10
Methods for Collocation Discovery
- Counting frequency (Jus’95)11
- Variance-based collocation discovery (Sma’93)12
- Hypothesis testing -> assesses whether or not two words occur together more often than by chance:
  - t-test (Chu’89)13
  - χ2 test (Chu’91)14
  - likelihood ratio test (Dun’93)15
- Mutual information (Hod’96)16
11[Jus’95] Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 9–27, 1995.
12[Sma’93] Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177, 1993.
13[Chu’89] Church, K., & Hanks, P. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 76–83, 1989.
14[Chu’91] Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, pp. 115–164. Lawrence Erlbaum, 1991.
15[Dun’93] Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74, 1993.
16[Hod’96] Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2, 137–160, 1996.
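Two of these scores are straightforward to compute from corpus counts. A sketch of the t-test and pointwise mutual information for one candidate bigram; the corpus statistics below are illustrative numbers, not from this project's data.

```python
import math

# Illustrative corpus statistics for a candidate bigram (w1, w2).
N = 14_307_668        # total bigram tokens in the corpus
c_w1 = 15_828         # count of w1
c_w2 = 4_675          # count of w2
c_bigram = 8          # count of the bigram (w1, w2)

# t-test: compare the observed bigram probability with the
# null hypothesis that w1 and w2 occur independently.
x_bar = c_bigram / N                      # observed mean
mu = (c_w1 / N) * (c_w2 / N)              # expected mean under independence
t = (x_bar - mu) / math.sqrt(x_bar / N)   # s^2 ≈ x_bar for rare bigrams
print(round(t, 3))

# Pointwise mutual information: log-ratio of observed to
# expected co-occurrence probability.
pmi = math.log2(x_bar / mu)
print(round(pmi, 3))
```

A t value near or below 1 means the pair is not significantly more frequent than chance predicts, so it would be rejected as a collocation despite a positive PMI.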
Key Idea -> Wikipedia
Apply Wikipedia knowledge representation17 to topic modeling within a given collection
Wiki Concepts -> …, information retrieval, search engine, …, web crawler
Wiki Concepts -> intelligent agent, …, tree search, …, games
17http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis
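One simple way to inject this prior knowledge is greedy longest-match against a dictionary of Wikipedia concept titles, so multi-word concepts become single tokens before topic modeling. The concept set below is a tiny invented stand-in for the real article/category data.

```python
# Invented stand-in for a DB of Wikipedia concept titles.
concepts = {
    ("information", "retrieval"),
    ("search", "engine"),
    ("web", "crawler"),
    ("intelligent", "agent"),
}
max_len = max(len(c) for c in concepts)

def mark_concepts(tokens):
    """Greedily merge the longest n-gram matching a known Wiki concept."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            gram = tuple(tokens[i:i + n])
            if n > 1 and gram in concepts:
                out.append("_".join(gram))   # one token per concept
                i += n
                break
        else:
            out.append(tokens[i])            # no concept starts here
            i += 1
    return out

print(mark_concepts("a search engine uses a web crawler".split()))
```

After this preprocessing, any topic model that consumes tokens (including plain LDA) sees “search_engine” as a single, unambiguous vocabulary item.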
Science/Math/Technology (SMT) Wikipedia Visualization
3,599 / 6,474 / 3,164 articles18
18http://abeautifulwww.com/NewWikipediaActivityVisualizations_AB91/07WikipediaPS3150DPI.png
Tools for Using Wiki Knowledge
- Wiki category graph and article graph
- JWPL (Java Wikipedia Library)19: a free Java-based API that allows access to all information contained in Wikipedia
- NER -> multilingual NER20 (article + category + interwiki links)
19Torsten Zesch and Iryna Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1–8, April 2007. http://elara.tk.informatik.tu-darmstadt.de/publications/2007/hlt-textgraphs.pdf
20Watanabe, Yotaro, Asahara, Masayuki, & Matsumoto, Yuji. A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 649–657. http://www.aclweb.org/anthology/D/D07/D07-1068
Goal -> to develop a model that:
- automatically determines:
  - unigram words (based on the content)
  - phrases (based on prior knowledge from the Wiki category graph)
- simultaneously associates words and Wiki concepts with a mixture of topics
Important Milestones
1. Collect the NIPS data & preprocess it
2. Use the JWPL tool to import Wiki concepts into a SQL DB
3. Apply the LDA-COL technique to perform topic modeling on the NIPS collection (e.g. with Mallet)
4. Evaluate the actual performance (e.g. compare to N-grams)
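The topic-modeling milestone would normally use an off-the-shelf tool such as Mallet, but the core of LDA inference (collapsed Gibbs sampling) fits in a short sketch; the two-topic toy corpus, hyperparameters, and iteration count below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy corpus: two obvious themes (IR and neuroscience), invented for the sketch.
docs_text = [
    "search index query retrieval",
    "query retrieval search ranking",
    "neuron synapse brain cortex",
    "brain cortex neuron spike",
]
vocab = sorted({w for d in docs_text for w in d.split()})
w2i = {w: i for i, w in enumerate(vocab)}
docs = [[w2i[w] for w in d.split()] for d in docs_text]

K, V, D = 2, len(vocab), len(docs)
alpha, beta = 0.5, 0.5

# Count tables and random initial topic assignments.
ndk = np.zeros((D, K))   # doc-topic counts
nkw = np.zeros((K, V))   # topic-word counts
nk = np.zeros(K)         # topic totals
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):                      # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                   # remove current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # Full conditional for this token's topic.
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k                   # resample and restore counts
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

phi = (nkw + beta) / (nk[:, None] + V * beta)   # topic-word estimates
print(phi.shape)
```

Feeding this sampler concept-merged tokens rather than raw unigrams is exactly the change milestone 3 proposes; the evaluation in milestone 4 then compares the resulting topics against the unigram baseline.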
Data & Existing Tools
1. Collect a DB of Wiki concepts/categories -> SQL DB with Wiki concepts
2. Preprocess the collection (data set)
3. Perform topic modeling
4. Evaluate results in comparison to the unigram model / n-grams
Conclusions and Open Research Questions
- Should I use only Wiki concepts (N-grams) during topic modeling, or Wiki concepts + unigrams?
- How can the proposed approach be adapted to another domain with domain-specific knowledge?
Next step -> using interwiki links to perform polylingual topic modeling21
21[Mim’09] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore: Association for Computational Linguistics, August 2009, pp. 880–889. http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
Importance of N-grams
- information retrieval
- parsing
- machine translation
How do we discover phrases/collocations?
Collocations = word phrases?
- noun phrases: “strong tea”, “weapon of mass destruction”
- phrasal verbs: “make up” = ?
- other phrases: “rich and powerful”
A collocation is a phrase whose meaning goes beyond that of its individual words (e.g. “white house”).
[Man’99] Manning, C., & Schütze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
Patterns That Are Not Easily Explained
- Why “strong tea” but not “powerful tea”? (“strong” acquired its meaning from other active agents)
- Why “a stiff breeze” but not “a stiff wind”, while either “a strong breeze” or “a strong wind” is fine?
- Why “broad daylight” but not “bright daylight” or “narrow darkness”?