Project Proposal: Topic Modeling (IR)


  1. CIS 890 – Information Retrieval
     Project Proposal: “Topic Modeling using Wikipedia Concepts”
     Svitlana Volkova
  2. Topic Modeling
     - Topic modeling is a classic problem in information retrieval [1]
     - Latent Semantic Indexing/Analysis (LSI/LSA)
     - Probabilistic Latent Semantic Indexing/Analysis (pLSI/pLSA)
     - Latent Dirichlet Allocation (LDA) is the most popular topic model built on the “bag of words” assumption: words are generated independently of each other
     [1] Wikipedia - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
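As a minimal, self-contained illustration of bag-of-words LDA (added here; the toy corpus, topic count, and parameter values are assumptions for demonstration only, not part of the proposal), using the gensim library:

```python
# Minimal bag-of-words LDA sketch with gensim (illustrative only; the corpus,
# number of topics, and training settings below are assumed for the demo).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "information retrieval search engine web crawler".split(),
    "neural network learning artificial intelligence agent".split(),
    "search ranking web index retrieval".split(),
]

dictionary = Dictionary(docs)                   # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Print the top words of each discovered topic.
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```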
  3. From LSI and pLSA to LDA
     - polysemy/synonymy -> probability -> exchangeability
     [2] [Sal83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, USA, 1983.
     [3] [Dee90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
     [4] [Hof99] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
     [5] [Ble03] Blei, D. M., Ng, A. Y., Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  4. Language Model Representation
     - A language model assigns a probability to a sequence of words
     - In LDA, each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter
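To make the contrast on this slide concrete (an addition, not part of the deck): a plain unigram language model scores a word sequence as a product of independent word probabilities, whereas the LDA-style mixture generates each word from a randomly chosen topic.

```latex
% Unigram language model: probability of a word sequence
p(w_1, \dots, w_N) = \prod_{n=1}^{N} p(w_n)

% Mixture view used by LDA: each word comes from a topic z_n drawn from a
% document-level distribution \theta with a randomly chosen (Dirichlet) parameter
p(w_n \mid \theta, \beta) = \sum_{z_n} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta),
\qquad \theta \sim \mathrm{Dir}(\alpha)
```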
  5. Generative vs. Discriminative: Topic Modeling with Latent Dirichlet Allocation
     - a word is represented as a multinomial random variable w
     - a topic is represented as a multinomial random variable z
     - a document is represented as a Dirichlet random variable θ
     - each corner of the simplex corresponds to a topic – a component of the vector z
     - a document is modeled as a point on the simplex – a multinomial distribution over the topics
     - a corpus is modeled as a Dirichlet distribution on the simplex [6]
     [6] http://www.cs.berkeley.edu/~jordan
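For reference, a brief restatement (added here, not on the original slide) of the LDA generative process from Blei et al. [5] that these bullets describe:

```latex
% LDA generative process for a document of N words (Blei et al., 2003):
%   \theta - document-specific topic proportions (a point on the simplex)
%   z_n    - topic assignment of word n
%   w_n    - observed word n
\theta \sim \mathrm{Dir}(\alpha), \qquad
z_n \mid \theta \sim \mathrm{Mult}(\theta), \qquad
w_n \mid z_n, \beta \sim \mathrm{Mult}(\beta_{z_n})

% Joint distribution of topic proportions, assignments, and words:
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```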
  6. Disadvantages of the “Bag of Words” Assumption
     - Text is more than a bag of discrete word tokens
     - The actual meaning cannot be captured by word co-occurrences alone
     - Word order is not only important for syntax, it is also important for lexical meaning
     - Word order within nearby context and phrases is critical to capturing the meaning of text
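A trivial illustration of the point about word order (added, not from the slides): two sentences with opposite meanings are indistinguishable once reduced to bags of words.

```python
from collections import Counter

# Two sentences with opposite meanings...
a = Counter("man bites dog".split())
b = Counter("dog bites man".split())

# ...have identical bag-of-words representations.
print(a == b)  # True
```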
  7. Problem Statement
     - How can the topic “Information Retrieval” be represented?
       Unigrams -> …, information, search, …, web
     - What about “Artificial Intelligence”?
       Unigrams -> agent, …, information, search, …
     - Issues with using unigrams for topic modeling:
       - not representative enough for a single topic
       - ambiguous (concepts shared across topics): system, modeling, information, data, structure, …
     - Where can we get additional knowledge?
  8. N-grams for Topic Modeling
     - A phrase as a whole carries more information than the sum of its individual components,
       e.g. “artificial intelligence”, “natural language processing”
     - Phrases are far more indicative of a collection's topics than individual words are,
       e.g. “National Institutes of Health” and “National Science Foundation” in the Acknowledgments section of a paper
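One common way to turn such phrases into single tokens before topic modeling (an illustrative sketch added here, not the proposal's method; the toy sentences and thresholds are assumed) is collocation-based bigram merging with gensim's Phrases model:

```python
# Sketch: merge frequent collocations such as "artificial intelligence" into
# single tokens before topic modeling (corpus and thresholds are assumed).
from gensim.models.phrases import Phrases, Phraser

sentences = [
    "artificial intelligence studies intelligent agents".split(),
    "natural language processing is a subfield of artificial intelligence".split(),
    "artificial intelligence and natural language processing research".split(),
]

bigram = Phrases(sentences, min_count=2, threshold=1.0)  # learn bigram statistics
phraser = Phraser(bigram)                                # lightweight applier

print(phraser["advances in artificial intelligence research".split()])
# e.g. ['advances', 'in', 'artificial_intelligence', 'research']
```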
  9. Bigram Topic Models: Wallach's Model
     - The phrase “neural network” spans topics: biological neural networks in neuroscience vs. neural networks in artificial intelligence
     - Wallach's bigram topic model (Wal'05) [7] is based on the hierarchical Dirichlet language model (Pet'94) [8]
     [7] [Wal'05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005.
     [8] [Pet'94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1, 1–19, 1994.
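Roughly sketched (my paraphrase of the model's key step, added here; see [7] for the exact formulation), the bigram topic model replaces LDA's topic-word step with a word distribution conditioned on both the topic and the previous word:

```latex
% Sketch of the bigram topic model's word step (cf. Wallach, 2005):
z_n \mid \theta^{(d)} \sim \mathrm{Mult}(\theta^{(d)}), \qquad
w_n \mid z_n, w_{n-1} \sim \mathrm{Mult}(\sigma_{z_n, w_{n-1}})
% i.e. topic-specific bigram distributions \sigma_{z,v} with hierarchical
% Dirichlet priors, instead of LDA's topic-word distributions \beta_z.
```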
 10. Bigram Topic Models: LDA-Collocation Model
     - The LDA-Collocation model (Ste'05) [9] showed how to take advantage of a bigram model in a Bayesian way
     - For each token, it can decide whether to generate a bigram or a unigram
     [9] [Ste'05] Steyvers, M., & Griffiths, T. Matlab topic modeling toolbox 1.3. http://psiexp.ss.uci.edu/research/programs data/toolbox.htm, 2005.
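Schematically (an added sketch; see [9], and the summary in [10], for the precise definition), the model introduces a binary switch x_n per token that makes this bigram-vs-unigram decision:

```latex
% Sketch of the LDA-Collocation idea: a Bernoulli switch x_n decides whether
% token n continues a collocation with the previous word or is a topic word.
x_n \mid w_{n-1} \sim \mathrm{Bernoulli}(\psi_{w_{n-1}})

w_n \sim
\begin{cases}
\mathrm{Mult}(\sigma_{w_{n-1}}) & \text{if } x_n = 1 \quad \text{(bigram with the previous word)}\\
\mathrm{Mult}(\beta_{z_n})      & \text{if } x_n = 0 \quad \text{(unigram from topic } z_n\text{)}
\end{cases}
```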
 11. N-gram Topic Models [10]
     - HMM-LDA captures word dependencies
     - HMM -> short-range syntactic dependencies; LDA -> long-range semantic dependencies
     [10] [Wan'07] Xuerui Wang, Andrew McCallum, and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
 12. LDA vs. TNG for Topic Modeling
     (comparison of topics from [10])
 13. Methods for Collocation Discovery
     - Counting frequency (Jus'95) [11]
     - Variance-based collocation discovery (Sma'93) [12]
     - Hypothesis testing -> assess whether or not two words occur together more often than by chance:
       - t-test (Chu'89) [13]
       - χ² test (Chu'91) [14]
       - likelihood ratio test (Dun'93) [15]
     - Mutual information (Hod'96) [16]
     [11] Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 9–27, 1995.
     [12] Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177, 1993.
     [13] Church, K., & Hanks, P. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 76–83, 1989.
     [14] Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, pp. 115–164. Lawrence Erlbaum, 1991.
     [15] Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74, 1993.
     [16] Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2, 137–160, 1996.
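For illustration (added here, not part of the proposal), most of these association measures are available off the shelf in NLTK; the toy token stream below is an assumption:

```python
# Sketch: ranking candidate bigram collocations with the frequency,
# hypothesis-testing, and mutual-information measures listed above.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("the national science foundation funded the project and "
          "the national science foundation supported further work").split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

print(finder.nbest(measures.raw_freq, 3))          # counting frequency
print(finder.nbest(measures.student_t, 3))         # t-test
print(finder.nbest(measures.chi_sq, 3))            # chi-squared test
print(finder.nbest(measures.likelihood_ratio, 3))  # likelihood ratio test
print(finder.nbest(measures.pmi, 3))               # (pointwise) mutual information
```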
 14. Key Idea -> Wikipedia
     - Apply Wikipedia's knowledge representation to topic modeling within a given collection
     - Wiki concepts -> …, information retrieval, search engine, …, web crawler
     - Wiki concepts -> intelligent agent, …, tree search, …, games
     [17] http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis
 15. Science/Math/Technology (SMT) Wikipedia Visualization: 3,599 / 6,474 / 3,164 articles [18]
     [18] http://abeautifulwww.com/NewWikipediaActivityVisualizations_AB91/07WikipediaPS3150DPI.png
 16. Tools for Using Wiki Knowledge
     - Wiki category graph and article graph
     - JWPL (Java Wikipedia Library) [19]: a free Java-based API that provides access to all information contained in Wikipedia
     - NER -> multilingual NER [20] (article + category + interwiki links)
     [19] Torsten Zesch and Iryna Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1–8, April 2007. http://elara.tk.informatik.tu-darmstadt.de/publications/2007/hlt-textgraphs.pdf
     [20] Watanabe, Yotaro, Asahara, Masayuki, and Matsumoto, Yuji. A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 649–657. http://www.aclweb.org/anthology/D/D07/D07-1068
 17. Goal -> to develop a model that:
     - automatically determines:
       - unigram words (based on the content)
       - phrases (based on prior knowledge from the Wiki category graph)
     - simultaneously associates words and Wiki concepts with a mixture of topics
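A minimal sketch of how such a model's input could be prepared (my illustration under assumptions, not the proposal's actual method): known Wiki concept phrases are greedily matched and merged into single tokens, so a standard topic model sees a mixed stream of unigrams and Wiki concepts. The WIKI_CONCEPTS set and the merge_wiki_concepts helper are hypothetical.

```python
# Hypothetical sketch: merge known Wiki concept phrases into single tokens so
# that a topic model can use both unigrams and Wiki concepts.
# The concept set and the greedy longest-match strategy are assumptions.
WIKI_CONCEPTS = {
    ("information", "retrieval"),
    ("search", "engine"),
    ("natural", "language", "processing"),
}
MAX_LEN = max(len(c) for c in WIKI_CONCEPTS)

def merge_wiki_concepts(tokens):
    out, i = [], 0
    while i < len(tokens):
        # try the longest possible concept match starting at position i
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in WIKI_CONCEPTS:
                out.append("_".join(tokens[i:i + n]))  # Wiki concept token
                i += n
                break
        else:
            out.append(tokens[i])                      # plain unigram
            i += 1
    return out

print(merge_wiki_concepts("web search engine for information retrieval".split()))
# ['web', 'search_engine', 'for', 'information_retrieval']
```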
 18. Important Milestones
     - Collect the NIPS data and preprocess it (http://books.nips.cc/)
     - Use the JWPL tool to import Wiki concepts into a SQL DB (http://www.ukp.tu-darmstadt.de/software/jwpl/)
     - Apply the LDA-COL technique to perform topic modeling on the NIPS collection, e.g. with Mallet (http://mallet.cs.umass.edu/topics.php)
     - Evaluate the actual performance (e.g. compare to n-grams)
 19. Data & Existing Tools (pipeline)
     1. Collect a DB of Wiki concepts/categories -> SQL DB with Wiki concepts
     2. Preprocess the collection -> data set
     3. Perform topic modeling
     4. Evaluate results in comparison to the unigram model / n-grams
 20. Conclusions and Open Research Questions
     - Should only Wiki concepts (n-grams) be used during topic modeling, or Wiki concepts + unigrams?
     - How can the proposed approach be adapted to another domain with its own specific knowledge?
     - Next step -> using interwiki links to perform polylingual topic modeling [21]
     [21] [Mim'2009] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. “Polylingual topic models,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, August 2009, pp. 880–889. [Online]. Available: http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
 21. Importance of N-grams
     - Information retrieval
     - Parsing
     - Machine translation
     - How do we discover phrases/collocations?
 22. Collocations = word phrases?
     - Noun phrases: “strong tea”, “weapon of mass destruction”
     - Phrasal verbs: “make up” = ?
     - Other phrases: “rich and powerful”
     - A collocation is a phrase whose meaning goes beyond that of its individual words (e.g. “white house”)
     [Man'99] Manning, C., & Schütze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
 23. Patterns That Are Not Easily Explainable
     - Why “strong tea” but not “powerful tea”? (“strong” acquired its meaning from another active agent)
     - Why “a stiff breeze” but not “a stiff wind”, while either “a strong breeze” or “a strong wind” is fine?
     - Why “broad daylight” but not “bright daylight” or “narrow darkness”?
     [Man'99] Manning, C., & Schütze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
