
Topic Modeling


  1. 1. CIS 890 – Information Retrieval, Project Final Presentation. Topic Modeling with LDA Collocations on the NIPS Collection. Presenter: Svitlana Volkova. Instructor: Doina Caragea
  2. 2. Agenda: I. Introduction; II. Project Stages; III. Topic Modeling (LDA Model, HMMLDA Model, LDA-COL Model); IV. NIPS Collection; V. Experimental Results; VI. Conclusions
  3. 3. I. Project Overview
  4. 4. Generative vs. Discriminative Methods. Generative approaches build a probability density model over all variables in a system and manipulate it to compute classification and regression functions. Discriminative approaches attempt to compute the input-to-output mapping directly.
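As a hedged aside (not part of the original slides), the sketch below contrasts the two families on synthetic data: Gaussian Naive Bayes as a generative model of p(x, y) versus logistic regression as a discriminative model of p(y | x). scikit-learn and the toy dataset are assumptions of the example.

```python
# A minimal sketch contrasting a generative and a discriminative classifier
# on synthetic 2-D data. scikit-learn and the toy dataset are assumptions;
# the original slides do not prescribe any particular library.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB            # generative: models p(x, y)
from sklearn.linear_model import LogisticRegression   # discriminative: models p(y | x)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gen = GaussianNB().fit(X_tr, y_tr)
disc = LogisticRegression().fit(X_tr, y_tr)
print("generative (GaussianNB):       ", gen.score(X_te, y_te))
print("discriminative (LogisticReg.): ", disc.score(X_te, y_te))
```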
  5. 5. From LSI to pLSA to LDA (polysemy/synonymy -> probability -> exchangeability). TF-IDF: [Sal'83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1983. Latent Semantic Indexing (LSI): [Dee'90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990. Probabilistic Latent Semantic Indexing (pLSA): [Hof'99] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. Latent Dirichlet Allocation (LDA): [Ble'03] Blei, D.M., Ng, A.Y., Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022, 2003.
  6. 6. Topic Models: LDA
  7. 7. Language Models – the probability of a sequence of words. Topic Models (illustrated with example topics such as "Healthy Food" and "Text Mining") – each word of both observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution over topics.
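As a minimal illustration of the language-model view, the sketch below scores a word sequence under a toy unigram model; the corpus, and the unigram independence assumption itself, are invented for the example and are not from the slides.

```python
# A minimal sketch of the "probability of a sequence of words" view under a
# unigram language model estimated from a toy corpus. Corpus and numbers are
# illustrative only.
from collections import Counter

corpus = "healthy food is good healthy food is tasty text mining is fun".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(sequence):
    """P(w1..wn) = prod_i P(wi) under the unigram independence assumption."""
    p = 1.0
    for w in sequence:
        p *= counts[w] / total
    return p

print(unigram_prob("healthy food".split()))
```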
  8. 8. Disadvantages of the "Bag-of-Words" Assumption • TEXT ≠ sequence of discrete word tokens • The actual meaning cannot be captured by word co-occurrences alone • Word order is not important for syntax, but it is important for lexical meaning • Word order within "nearby" context and phrases is critical to capturing the meaning of text
  9. 9. Problem Statement
  10. 10. Collocations = word phrases? • Noun phrases: "strong tea", "weapon of mass destruction" • Phrasal verbs: "make up" = ? • Other phrases: "rich and powerful" • A collocation is a phrase whose meaning goes beyond that of the individual words (e.g., "white house"). [Man'99] Manning, C., & Schutze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
  11. 11. Problem Statement – How can the "Information Retrieval" topic be represented? Unigrams -> …, information, search, …, web. What about "Artificial Intelligence"? Unigrams -> agent, …, information, search, …. Issues with using unigrams for topic modeling: • not representative enough of a single topic • ambiguous (concepts are shared across topics) – system, modeling, information, data, structure…
  12. 12. II. Project Stages
  13. 13. Project Stages: 1. NIPS data collection and preprocessing (http://books.nips.cc/) 2. Learning topic models on the NIPS collection (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm) – Model 1: LDA; Model 2: HMMLDA; Model 3: LDA-COL 3. Results comparison for LDA, LDA-COL, HMMLDA and N-grams
  14. 14. What are the limitations of using Wiki concepts? [Figure: example topics (NLP, Active Learning (AL), Artificial Intelligence (AI), Computer Vision, Information Retrieval (IR), Machine Learning (ML)) mapped to Wiki concepts with counts, e.g. Information Retrieval: 35, Natural Language Processing: 8, Cognitive Science: 3, Object Recognition: 2, Visual Perception: 1.] Issues: • the Wiki concept graph and whether to follow links • the n-gram distribution over a single document is small • what level of concept abstraction to use
  15. 15. III. Topic Models: LDA
  16. 16. Topic Modeling with Latent Dirichlet Allocation: • a word is represented as a multinomial random variable w • a topic is represented as a multinomial random variable z • a document is represented as a Dirichlet random variable θ (its distribution over topics)
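The generative story just described can be simulated directly; the sketch below is an illustrative numpy version, where the vocabulary, the number of topics and the hyperparameter values are invented for the example rather than the NIPS settings reported later.

```python
# A minimal sketch of the LDA generative process: draw topic-word distributions
# and a document's topic proportions from Dirichlet priors, then generate each
# token by picking a topic and then a word. All values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["neural", "network", "retrieval", "query", "learning", "data"]
T, V, doc_len = 2, len(vocab), 10
alpha, beta = 0.5, 0.1

phi = rng.dirichlet(beta * np.ones(V), size=T)   # topic-word distributions
theta = rng.dirichlet(alpha * np.ones(T))        # this document's topic proportions

doc = []
for _ in range(doc_len):
    z = rng.choice(T, p=theta)                   # choose a topic for this token
    w = rng.choice(V, p=phi[z])                  # choose a word from that topic
    doc.append(vocab[w])
print(doc)
```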
  17. 17. Topic Simplex: • each corner of the simplex corresponds to a topic – a component of the vector θ • a document is modeled as a point on the simplex – a multinomial distribution over the topics • a corpus is modeled as a Dirichlet distribution on the simplex. http://www.cs.berkeley.edu/~jordan
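To make the simplex picture concrete, the sketch below draws document-topic vectors from Dirichlet priors with different concentration parameters: small alpha places documents near the corners (few dominant topics), large alpha spreads them toward the centre. The alpha values are illustrative assumptions.

```python
# A minimal sketch of how the Dirichlet prior places documents on the topic
# simplex. Each row sums to 1, i.e. is one point on a 3-topic simplex.
import numpy as np

rng = np.random.default_rng(1)
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(alpha * np.ones(3), size=5)   # 5 documents, 3 topics
    print(f"alpha={alpha}:")
    print(np.round(theta, 2))
```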
  18. 18. III. Topic Models: HMMLDA
  19. 19. Bigram Topic Models: Wallach's Model (example bigram: "neural network") • Wallach's Bigram Topic Model (Wal'05) is based on the hierarchical Dirichlet language model (Pet'94). [Wal'05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005. [Pet'94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1, 1–19, 1994.
  20. 20. III. Topic Models: LDA-COL
  21. 21. LDA-Collocation Model (Ste'05) • Can decide whether to generate a bigram or a unigram. [Ste'05] Steyvers, M., & Griffiths, T. Matlab Topic Modeling Toolbox 1.3. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005.
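A hedged sketch of this unigram-versus-bigram decision follows: each token is drawn either from the current topic or from a distribution conditioned on the previous word, according to a Bernoulli status variable. The distributions and the fixed collocation probability are invented for illustration; this is a simplification, not the toolbox implementation.

```python
# A simplified illustration of the LDA-COL switch: with probability p_colloc a
# token continues a collocation with the previous word, otherwise it is drawn
# from the current topic's word distribution. All distributions are invented.
import numpy as np

rng = np.random.default_rng(2)
vocab = ["neural", "network", "hidden", "units", "data"]
V = len(vocab)

theta_topic = rng.dirichlet(np.ones(V))          # word dist. of the current topic
next_word = rng.dirichlet(np.ones(V), size=V)    # P(w_i | w_{i-1}) for collocations
p_colloc = 0.3                                   # prob. of forming a collocation

prev, doc = 0, [vocab[0]]
for _ in range(9):
    if rng.random() < p_colloc:                  # status = 1: collocation word
        w = rng.choice(V, p=next_word[prev])
    else:                                        # status = 0: topic word
        w = rng.choice(V, p=theta_topic)
    doc.append(vocab[w])
    prev = w
print(" ".join(doc))
```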
  22. 22. Methods for Collocation Discovery: • Counting frequency (Jus'95) – Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 9–27, 1995. • Variance-based collocation (Sma'93) – Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177, 1993. • Hypothesis testing – assess whether two words occur together more often than by chance: t-test (Chu'89) – Church, K., & Hanks, P. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 76–83, 1989; χ2 test (Chu'91) – Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, pp. 115–164. Lawrence Erlbaum, 1991; likelihood ratio test (Dun'93) – Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74, 1993. • Mutual information (Hod'96) – Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2, 137–160, 1996.
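As an illustration of the counting and hypothesis-testing approaches above, the sketch below scores a candidate bigram with pointwise mutual information and the t-test statistic computed from raw counts; the tiny corpus is an assumption of the example.

```python
# A minimal sketch of collocation scoring on raw bigram counts: pointwise
# mutual information and the usual t-test approximation. Toy corpus only.
import math
from collections import Counter

tokens = ("the neural network uses a neural network to model "
          "the data the network learns").split()
N = len(tokens) - 1                      # number of bigram positions
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def pmi(w1, w2):
    p12 = bigrams[(w1, w2)] / N
    p1, p2 = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p12 / (p1 * p2))

def t_score(w1, w2):
    # t ≈ (observed - expected) / sqrt(observed / N)
    observed = bigrams[(w1, w2)] / N
    expected = (unigrams[w1] / N) * (unigrams[w2] / N)
    return (observed - expected) / math.sqrt(observed / N)

print("PMI(neural, network) =", round(pmi("neural", "network"), 2))
print("t(neural, network)   =", round(t_score("neural", "network"), 2))
```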
  23. 23. Topical N-grams. HMMLDA captures word dependencies: HMM -> short-range syntactic; LDA -> long-range semantic. [Wan'07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
  24. 24. Topical N-grams. [Wan'07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
  25. 25. IV. Data Collection: NIPS Abstracts
  26. 26. NIPS Collection Characteristics: Number of words W = 13649; Number of documents D = 1740; Number of topics T = 100; Number of iterations N = 50; LDA hyperparameter ALPHA = 0.5; LDA hyperparameter BETA = 0.01. [Figure: randomly sampled document titles from the NIPS collection.]
  27. 27. LDA Model Input/Output. Input: WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens; DS – a 1 x N vector where DS(k) contains the document index of the k-th word token. Output: WP – a sparse W x T matrix where WP(i,j) contains the number of times word i has been assigned to topic j; DP – a sparse D x T matrix where DP(d,j) contains the number of times a word token in document d has been assigned to topic j; Z – a 1 x N vector containing the topic assignments, where Z(k) is the topic assignment for token k.
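A hedged sketch of this encoding follows, written in Python rather than MATLAB: it builds WS/DS-style token and document index vectors from toy documents and reads top words per topic from a WP-style word-topic count matrix. The documents and the random counts are invented, and indices are 0-based here unlike the toolbox's 1-based MATLAB vectors.

```python
# A minimal sketch of the WS/DS token-stream encoding and of inspecting a
# WP-style word-topic count matrix. Toy data only; the MATLAB toolbox itself
# is not called here.
import numpy as np

docs = [["neural", "network", "learning"],
        ["query", "retrieval", "neural"]]

vocab = sorted({w for d in docs for w in d})
word_id = {w: i for i, w in enumerate(vocab)}

# WS(k): vocabulary index of the k-th token; DS(k): its document index
WS = np.array([word_id[w] for d in docs for w in d])
DS = np.array([d_idx for d_idx, d in enumerate(docs) for _ in d])
print("WS:", WS, "DS:", DS)

# A toy W x T word-topic count matrix (what WP would hold after sampling)
W, T = len(vocab), 2
WP = np.random.default_rng(3).integers(0, 10, size=(W, T))

for t in range(T):
    top = np.argsort(-WP[:, t])[:3]          # three most frequent words in topic t
    print(f"topic {t}:", [vocab[i] for i in top])
```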
  28. 28. HMMLDA Model Input/Output. Input: WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens; DS – a 1 x N vector where DS(k) contains the document index of the k-th word token. Output: WP – a sparse W x T matrix where WP(i,j) contains the number of times word i has been assigned to topic j; DP – a sparse D x T matrix where DP(d,j) contains the number of times a word token in document d has been assigned to topic j; MP – a sparse W x S matrix, where S is the number of HMM states and MP(i,j) contains the number of times word i has been assigned to HMM state j; Z – a 1 x N vector containing the topic assignments, where Z(k) is the topic assignment for token k; X – a 1 x N vector containing the HMM state assignments, where X(k) is the HMM state assigned to the k-th word token.
  29. 29. LDA-COL Model Input/Output. Input: WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens; DS – a 1 x N vector where DS(k) contains the document index of the k-th word token; WW – a sparse W x W matrix where WW(i,j) contains the number of times word i follows word j in the word stream; SI – a 1 x N vector where SI(k) = 1 only if the k-th word can form a collocation with the (k-1)-th word, and SI(k) = 0 otherwise. Output: WP – a sparse W x T matrix where WP(i,j) contains the number of times word i has been assigned to topic j; DP – a sparse D x T matrix where DP(d,j) contains the number of times a word token in document d has been assigned to topic j; WC – a 1 x W vector where WC(k) contains the number of times word k led to a collocation with the next word in the word stream; Z – a 1 x N vector containing the topic assignments, where Z(k) is the topic assignment for token k; C – a 1 x N vector containing the topic/collocation assignments: C(k) = 0 when token k was assigned to the topic model, C(k) = 1 when token k was assigned to a collocation with word token k-1.
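The C vector can be decoded back into phrases by gluing each token with C(k) = 1 to its predecessor; the sketch below shows this on invented tokens and assignments.

```python
# A minimal sketch of reading a C-style assignment vector back into phrases:
# whenever C(k) = 1 the k-th token is attached to the previous one.
# Tokens and assignments are invented for illustration.
tokens = ["white", "house", "press", "office", "data"]
C      = [0,       1,       0,       1,        0]   # 1 = collocation with previous token

phrases, current = [], [tokens[0]]
for tok, c in zip(tokens[1:], C[1:]):
    if c == 1:
        current.append(tok)              # extend the running collocation
    else:
        phrases.append(" ".join(current))
        current = [tok]
phrases.append(" ".join(current))
print(phrases)    # ['white house', 'press office', 'data']
```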
  30. 30. V. Experimental Results
  31. 31. Experiment Setup: 1. 100 topics; 2. Gibbs sampling – 50 iterations; 3. Optimized parameters: LDA – ALPHA = 0.5, BETA = 0.01; HMMLDA – ALPHA = 0.5, BETA = 0.01, GAMMA = 0.1; LDA-COL – ALPHA = 0.5, BETA = 0.01, GAMMA0 = 0.1, GAMMA1 = 0.1. [Gri'04] Griffiths, T., & Steyvers, M. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235, 2004.
  32. 32. LDA Model Results
  33. 33. Hidden Markov Model with Latent Dirichlet Allocation (HMMLDA) Model Results. [Hsu'06] Hsu, B. J., & Glass, J. Style and topic language model adaptation using HMM-LDA. Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.
  34. 34. LDA-COL Model Results
  35. 35. LDA vs. HMMLDA vs. LDA-COL
  36. 36. LDAs vs. Topical N-grams. [Wan'07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
  37. 37. LDAs vs. Topical N-grams
  38. 38. VI. Conclusions
  39. 39. Conclusions: I. HMMLDA showed the worst results because stop-word removal was not performed. II. LDA-COL performed best compared to LDA and HMMLDA, but worse than topical n-gram models. Future Work: Polylingual Topic Models. [Mim'09] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore: Association for Computational Linguistics, August 2009, pp. 880-889. http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
  40. 40. Acknowledgments: University of California, Irvine, Department of Cognitive Sciences, for the MATLAB Topic Modeling Toolbox. Dr. Caragea. Questions?
