Project Proposal: Topic Modeling (IR)


    Presentation Transcript

    • CIS 890 – Information Retrieval. Project Proposal: “Topic Modeling using Wikipedia Concepts”
      Svitlana Volkova
    • Topic Modeling
      Topic modeling is a classic problem in information retrieval1
      Latent Semantic Indexing/Analysis (LSI/LSA)
      Probabilistic Latent Semantic Indexing/Analysis (pLSI/pLSA)
      Latent Dirichlet Allocation (LDA) is the most popular topic model; it makes the “bag-of-words” assumption:
      “words are generated independently of each other”
      1Wikipedia - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
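LDA models of the kind cited above are commonly fit with collapsed Gibbs sampling. The following is a minimal, self-contained sketch on a toy corpus; the corpus, number of topics, and hyperparameters are illustrative assumptions, not part of the proposal.

```python
# Minimal collapsed Gibbs sampler for LDA (sketch, not any specific toolkit).
import numpy as np

# Toy corpus as word ids: two "IR-ish" and two "neuroscience-ish" documents.
docs = [
    [0, 1, 2, 0, 2],
    [2, 0, 1, 1, 0],
    [3, 4, 5, 3, 5],
    [5, 4, 3, 4, 3],
]
V, K, alpha, eta = 6, 2, 0.5, 0.1   # vocab size, topics, Dirichlet priors
rng = np.random.default_rng(0)

# Count matrices: document-topic, topic-word, and topic totals.
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)
z = [[int(rng.integers(K)) for _ in d] for d in docs]   # random init
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):                                    # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                                 # remove current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # Full conditional: P(z=k | rest) ∝ (n_dk + α)(n_kw + η)/(n_k + Vη)
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            k = int(rng.choice(K, p=p / p.sum()))       # resample topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

print(nkw)   # topic-word counts after sampling
```

On this separable toy data, each topic's counts typically concentrate on one of the two word groups.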
    • From LSI and pLSA to LDA
      polysemy/synonymy -> probability -> exchangeability
      2[Sal83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1983.
      3[Dee90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
      4[Hof99] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
      5[Ble03] Blei, D.M., Ng, A.Y., Jordan, M.I. Latent Dirichlet allocation, Journal of Machine Learning Research, 3, pp.993-1022, 2003.
    • Language Model Representation
      probability of a sequence of words
      Each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter.
    • Generative vs. Discriminative: Topic Modeling with Latent Dirichlet Allocation
      • a word is represented as a multinomial random variable 𝑤
      • a topic is represented as a multinomial random variable z
      • a document is represented as a Dirichlet random variable 𝜃
      • each corner of the simplex corresponds to a topic, a component of the vector 𝑧;
      • a document is modeled as a point on the simplex, i.e. a multinomial distribution over the topics;
      • a corpus is modeled as a Dirichlet distribution on the simplex6.
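The generative story above can be sketched directly: draw a topic mixture θ from a Dirichlet per document, then for each word draw a topic z from θ and a word w from that topic's multinomial. The vocabulary and topic-word probabilities below are illustrative assumptions.

```python
# Sketch of LDA's generative process on a toy two-topic vocabulary.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["search", "index", "query", "neuron", "network", "brain"]
K, V, alpha = 2, len(vocab), 0.5
# beta[k] is topic k's word distribution (each row sums to 1).
beta = np.array([
    [0.3, 0.3, 0.3, 0.03, 0.04, 0.03],   # an "IR" topic
    [0.03, 0.04, 0.03, 0.3, 0.3, 0.3],   # a "neuroscience" topic
])

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * K)   # document's topic mixture ~ Dirichlet
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)       # choose a topic for this word
        w = rng.choice(V, p=beta[z])     # choose a word from that topic
        words.append(vocab[w])
    return words

print(generate_document(8))
```

Note the bag-of-words assumption is visible in the loop: each word is drawn independently given θ, with no dependence on its neighbors.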
    • Disadvantages of the “Bag-of-Words” Assumption
      TEXT ≠ a sequence of discrete word tokens
      The actual meaning cannot be captured by word co-occurrences alone
      Word order is not only important for syntax, but also for lexical meaning
      Word order within “nearby” context and phrases is critical to capturing the meaning of text
    • Problem Statement
      How can the “Information Retrieval” topic be represented?
      What about “Artificial Intelligence”?
      Issues with using unigrams for topic modeling:
      Not representative enough for a single topic
      Ambiguous (concepts are shared across topics)
      system, modeling, information, data, structure…
      Where can we get additional knowledge?
      Unigrams -> …, information, search, …, web
      Unigrams -> agent, …, information, search, …
    • N-Grams for Topic Modeling
      Phrases as a whole carry more information than the sum of their individual components:
      e.g. “artificial intelligence”, “natural language processing”
      They are far more useful in determining the topics of a collection than individual words
      e.g. “National Institutes of Health” and “National Science Foundation” in the Acknowledgments section of a paper
    • Bigram Topic Models: Wallach’s Model
      Biological neural networks in neuroscience
      Neural networks in artificial intelligence
      “neural network”
      Wallach’s Bigram Topic Model (Wal‘05)7 is based on the Hierarchical Dirichlet Language Model (Pet’94)8
      7[Wal’05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005.
      8[Pet’94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1, 1–19, 1994.
    • Bigram Topic Models: LDA-Collocation Model
      The LDA-Collocation Model (Ste’05)9 showed how to take advantage of a bigram model in a Bayesian way
      It can decide whether to generate a bigram or a unigram
      9[Ste’05] Steyvers, M., & Griffiths, T. Matlab topic modeling toolbox 1.3.
      http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005
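The bigram-or-unigram decision in the LDA-Collocation model can be illustrated with a tiny generative sketch: a Bernoulli status variable per word chooses between drawing from a topic's word distribution (a unigram) and drawing from a distribution conditioned on the previous word (continuing a collocation). Everything below (single topic, four-word vocabulary, all probabilities) is a toy assumption, not the model's actual parameterization.

```python
# Sketch of the LDA-Collocation unigram/bigram switch.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["neural", "network", "model", "data"]
V = len(vocab)
phi = np.full(V, 1.0 / V)                  # one topic's word distribution
# sigma[w] = distribution over the word following w; "neural" strongly
# prefers "network", the other rows stay uniform.
sigma = np.full((V, V), 1.0 / V)
sigma[0] = [0.01, 0.97, 0.01, 0.01]
p_bigram = np.array([0.9, 0.2, 0.2, 0.2])  # P(x=1 | previous word)

words = [int(rng.choice(V, p=phi))]        # first word is always a unigram
for _ in range(9):
    prev = words[-1]
    if rng.random() < p_bigram[prev]:      # x = 1: extend a collocation
        words.append(int(rng.choice(V, p=sigma[prev])))
    else:                                  # x = 0: draw a fresh unigram
        words.append(int(rng.choice(V, p=phi)))
print([vocab[w] for w in words])
```

With these toy parameters, “neural” is usually followed by “network”, so the collocation emerges in the generated stream.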
    • N-gram Topic Models10
      HMMLDA captures word dependency
      HMM -> short-range syntactic; LDA -> long-range semantic
      10[Wan’07] Xuerui Wang, Andrew McCallum and Xing Wei Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval, Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007 - http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
    • LDA vs. TNG for Topics Modeling
    • Methods for Collocation Discovery
      • Counting frequency (Jus‘95)11
      11Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 9–27, 1995.
      • Variance-based collocation (Sma‘93)12
      12Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177, 1993.
      • Hypothesis testing -> assess whether or not two words occur together more often than by chance:
      t-test (Chu’89)13
      13Church, K., & Hanks, P. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 76–83), 1989.
      χ² test (Chu’91)14
      14Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon (pp. 115–164). Lawrence Erlbaum, 1991.
      likelihood ratio test (Dun’93)15
      15Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74, 1993.
      • Mutual information (Hod’96)16
      16Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2, 137–160, 1996.
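Two of the statistics listed above can be computed directly from unigram and bigram counts: the t-test and pointwise mutual information (a common instantiation of the mutual-information approach). A sketch on a toy corpus, loosely following the formulations in Manning & Schütze; the corpus itself is an illustrative assumption.

```python
# Collocation scores from raw counts: PMI and a t-test score.
import math
from collections import Counter

tokens = "new york is a new city new york has a new york feel".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)               # token count; N - 1 bigram positions

def pmi(w1, w2):
    # PMI = log P(w1, w2) / (P(w1) P(w2)); positive when the pair
    # co-occurs more often than independence predicts.
    p_joint = bigrams[(w1, w2)] / (N - 1)
    return math.log(p_joint / ((unigrams[w1] / N) * (unigrams[w2] / N)))

def t_score(w1, w2):
    # t ≈ (observed - expected) / sqrt(sample variance / n), with the
    # sample variance approximated by the observed bigram probability.
    x = bigrams[(w1, w2)] / (N - 1)               # observed
    mu = (unigrams[w1] / N) * (unigrams[w2] / N)  # expected if independent
    return (x - mu) / math.sqrt(x / (N - 1))

print(pmi("new", "york"), t_score("new", "york"))
```

Here “new york” scores positively on both statistics, since it co-occurs far more often than chance in the toy text.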
    • Key Idea -> Wikipedia
      apply Wikipedia’s knowledge representation to topic modeling within a given collection
      Wiki Concepts -> …, information retrieval, search engine, …, web crawler
      Wiki Concepts -> intelligent agent, …, tree search, …, games
    • Science/Math/Technology (SMT) Wikipedia Visualization: 3,599/6,474/3,164 articles18
      18http://abeautifulwww.com/NewWikipediaActivityVisualizations_AB91/07WikipediaPS3150DPI.png
    • Tools for Using Wiki Knowledge
      Wiki Category Graph and Article Graph
      JWPL (Java Wikipedia Library)19
      A free Java-based API that provides access to all information contained in Wikipedia
      NER -> Multilingual NER20 (article + category + interwiki links)
      19Torsten Zesch and Iryna Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1–8, April 2007.
      20Watanabe, Yotaro, Asahara, Masayuki, and Matsumoto, Yuji. A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 649–657. http://www.aclweb.org/anthology/D/D07/D07-1068
    • Goal -> to develop a model that:
      Automatically determines:
      • unigram words (based on the content)
      • phrases (based on prior knowledge from the Wiki category graph)
      Simultaneously associates words and Wiki concepts with a mixture of topics
    • Important Milestones
      Collect the NIPS data & preprocess it
      • http://books.nips.cc/
      Use the JWPL tool to import Wiki concepts into a SQL DB
      • http://www.ukp.tu-darmstadt.de/software/jwpl/
      Apply the LDA-COL technique to perform topic modeling on the NIPS collection (e.g. with Mallet)
      • http://mallet.cs.umass.edu/topics.php
      Evaluate the actual performance (e.g. compare to N-grams)
    • Data & Existing Tools
      1. Collect a DB of Wiki Concepts/Categories (SQL DB with Wiki Concepts)
      2. Preprocess the collection (Data Set)
      3. Perform Topic Modeling
      4. Evaluate results in comparison to unigram/n-gram models
    • Conclusions and Open Research Questions
      • Should only Wiki concepts (N-grams) be used during topic modeling, or Wiki concepts + unigrams?
      • How can the proposed approach be adapted to another domain with specific knowledge?
      Next step -> use InterWiki links to perform Polylingual Topic Modeling21
      21[Mim’2009] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. “Polylingual topic models,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, August 2009, pp. 880–889. [Online]. Available: http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
    • Importance of N-grams
      Information retrieval
      Machine translation
      How to discover phrases/collocations?
    • Collocations = word phrases?
      Noun phrases:
      “strong tea”, “weapon of mass destruction”
      Phrasal verbs:
      “make up” = ?
      Other phrases:
      “rich and powerful”
      A collocation is a phrase whose meaning goes beyond that of its individual words (e.g. “white house”)
      [Man’99] Manning, C., & Schutze, H. Foundations of statistical natural language processing. Cambridge, MA: MIT Press, 1999.
    • Not Easily Explainable Patterns
      • Why “strong tea” but not “powerful tea”?
      (“strong” -> acquired this meaning from another active agent)
      • Why “a stiff breeze” but not “a stiff wind”? …while either “a strong breeze” or “a strong wind” is fine…
      • Why “broad daylight” but not “bright daylight” or “narrow darkness”?
      [Man’99] Manning, C., & Schutze, H. Foundations of statistical natural language processing. Cambridge, MA: MIT Press, 1999.