
- 1. CIS 890 – Information Retrieval Project, Final Presentation. Topic Modeling with LDA Collocations on the NIPS Collection. Presenter: Svitlana Volkova. Instructor: Doina Caragea.
- 2. Agenda: I. Introduction; II. Project Stages; III. Topic Modeling (LDA, HMMLDA, and LDA-COL models); IV. NIPS Collection; V. Experimental Results; VI. Conclusions.
- 3. I. Project Overview
- 4. Generative vs. Discriminative Methods. Generative approaches build a probability model over all variables in a system and manipulate it to compute classification and regression functions. Discriminative approaches directly model the mapping from inputs to outputs.
- 5. From LSI to pLSA to LDA: polysemy/synonymy -> probability -> exchangeability. • TF-IDF [Sal'83]: Salton, G., & McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, 1983. • Latent Semantic Indexing (LSI) [Dee'90]: Deerwester, S., Dumais, S., Landauer, T., Furnas, G., & Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990. • Probabilistic Latent Semantic Indexing (pLSA) [Hof'99]: Hofmann, T. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International SIGIR Conference, 1999. • Latent Dirichlet Allocation (LDA) [Ble'03]: Blei, D. M., Ng, A. Y., & Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- 6. Topic Models: LDA
- 7. Language models assign a probability to a sequence of words (example topics: "Healthy Food", "Text Mining"). In topic models, each word of both observed and unseen documents is generated by a randomly chosen topic, which is in turn drawn from a document-specific distribution over topics.
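To make that generative view concrete, here is a minimal sketch, with made-up toy numbers (not the NIPS data), of how a topic model scores a word in a document: the probability of a word is a mixture over topics, p(w | d) = sum over z of p(w | z) * p(z | d).

```python
import numpy as np

# Toy example: 2 topics ("Healthy Food", "Text Mining") over a 4-word vocabulary.
vocab = ["apple", "diet", "corpus", "token"]
phi = np.array([[0.50, 0.40, 0.05, 0.05],   # p(word | topic 0)
                [0.05, 0.05, 0.50, 0.40]])  # p(word | topic 1)
theta_d = np.array([0.8, 0.2])              # p(topic | document d)

# p(w | d) = sum_z p(w | z) * p(z | d)
p_w_given_d = theta_d @ phi
for w, p in zip(vocab, p_w_given_d):
    print(f"p({w} | d) = {p:.3f}")
```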
- 8. Disadvantages of the "bag-of-words" assumption: • TEXT ≠ a sequence of independent discrete word tokens • The actual meaning cannot be captured by word co-occurrences alone • Word order is not important for syntax, but it is important for lexical meaning • Word order within nearby context and phrases is critical to capturing the meaning of text
- 9. Problem Statement
- 10. Collocations = word phrases? • Noun phrases: "strong tea", "weapon of mass destruction" • Phrasal verbs: "make up" • Other phrases: "rich and powerful" • A collocation is a phrase with a meaning beyond that of the individual words (e.g., "white house"). [Man'99] Manning, C., & Schütze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
- 11. Problem Statement. How can the "Information Retrieval" topic be represented? Unigrams -> …, information, search, …, web. What about "Artificial Intelligence"? Unigrams -> agent, …, information, search, … Issues with using unigrams for topic modeling: • not representative enough for a single topic • ambiguous (concepts are shared across topics) – system, modeling, information, data, structure, …
- 12. II. Project Stages
- 13. Project Stages: 1. NIPS data collection and preprocessing (http://books.nips.cc/). 2. Learning topic models on the NIPS collection with the Matlab Topic Modeling Toolbox (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm): Model 1: LDA; Model 2: HMMLDA; Model 3: LDA-COL. 3. Comparison of results for LDA, LDA-COL, HMMLDA, and topical n-grams.
- 14. What are the limitations of using wiki concepts? [Figure: a wiki concept graph mapping areas such as NLP, Active Learning (AL), Artificial Intelligence (AI), Computer Vision, Information Retrieval (IR), and Machine Learning (ML) to counts of wiki concepts, e.g., Information Retrieval, Cognitive Science, Natural Language Processing, Object Recognition, Visual Perception.] Open issues: whether to follow links, the n-gram distribution over a single document is small, and what level of concept abstraction to use.
- 15. III. Topic Models: LDA
- 16. Topic Modeling with Latent Dirichlet Allocation: a word is represented as a multinomial random variable w; a topic is represented as a multinomial random variable z; a document's topic proportions are represented as a Dirichlet random variable θ.
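A minimal sketch of the LDA generative process described above, using numpy and a toy topic-word matrix (the vocabulary, topic count, and hyperparameter here are illustrative, not the NIPS settings):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["neural", "network", "speech", "signal", "kernel", "margin"]
# Toy topic-word distributions phi (T topics x W words); each row sums to 1.
phi = np.array([[0.40, 0.40, 0.10, 0.10, 0.00, 0.00],
                [0.00, 0.10, 0.40, 0.40, 0.10, 0.00],
                [0.00, 0.00, 0.00, 0.10, 0.45, 0.45]])
T, W = phi.shape
alpha = 0.5  # symmetric Dirichlet hyperparameter on theta

def generate_document(n_words):
    # 1. Draw the document's topic proportions theta ~ Dirichlet(alpha).
    theta = rng.dirichlet([alpha] * T)
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)    # 2. Draw a topic z ~ Multinomial(theta).
        w = rng.choice(W, p=phi[z])   # 3. Draw a word w ~ Multinomial(phi_z).
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
print("theta:", np.round(theta, 2))
print("document:", " ".join(doc))
```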
- 17. Topic Simplex: each corner of the simplex corresponds to a topic (a component of the topic-proportion vector θ); a document is modeled as a point on the simplex, i.e., a multinomial distribution over the topics; a corpus is modeled as a Dirichlet distribution on the simplex. http://www.cs.berkeley.edu/~jordan
- 18. III. Topic Models: HMMLDA
- 19. Bigram Topic Models: Wallach's Model (e.g., "neural network"). Wallach's Bigram Topic Model [Wal'05] builds on the Hierarchical Dirichlet Language Model [Pet'94]. [Wal'05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005. [Pet'94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1:1–19, 1994.
- 20. III. Topic Models: LDA-COL
- 21. LDA-Collocation Model (LDA-COL) [Ste'05]: for each token, the model can decide whether to generate a unigram from a topic or a bigram (collocation) with the previous word. [Ste'05] Steyvers, M., & Griffiths, T. Matlab Topic Modeling Toolbox 1.3. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005.
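A simplified sketch of that per-token decision, in Python with toy distributions (my own illustrative reading of the collocation switch, not the toolbox's sampler): a binary choice either emits the word from the current topic or emits it from a bigram distribution conditioned on the previous word.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy distributions (illustrative only): 2 topics over a 4-word vocabulary.
vocab = ["neural", "network", "hidden", "units"]
phi = np.array([[0.6, 0.2, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])        # p(word | topic)
bigram = np.array([[0.0, 0.9, 0.05, 0.05],    # p(word | previous word)
                   [0.3, 0.0, 0.4, 0.3],
                   [0.1, 0.1, 0.0, 0.8],
                   [0.4, 0.3, 0.3, 0.0]])
theta = np.array([0.7, 0.3])                  # document's topic proportions
p_colloc = 0.3                                # chance of forming a collocation

doc, prev = [], None
for _ in range(8):
    if prev is not None and rng.random() < p_colloc:
        w = rng.choice(len(vocab), p=bigram[prev])  # bigram: condition on previous word
    else:
        z = rng.choice(len(theta), p=theta)         # unigram: pick a topic...
        w = rng.choice(len(vocab), p=phi[z])        # ...then a word from that topic
    doc.append(vocab[w])
    prev = w
print(" ".join(doc))
```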
- 22. Methods for collocation discovery: • Frequency counting [Jus'95] Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9–27, 1995. • Variance-based collocation discovery [Sma'93] Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19:143–177, 1993. • Hypothesis testing – assess whether two words occur together more often than chance: t-test [Chu'89] Church, K., & Hanks, P. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Meeting of the ACL, pp. 76–83, 1989; χ² test [Chu'91] Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, pp. 115–164. Lawrence Erlbaum, 1991; likelihood ratio test [Dun'93] Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19:61–74, 1993. • Mutual information [Hod'96] Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2:137–160, 1996.
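Most of these collocation statistics are available off the shelf; here is a short sketch with NLTK (an illustration, not part of the original project pipeline) that scores candidate bigrams with the likelihood ratio, chi-square, PMI, and t-test measures on a sample corpus:

```python
import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

nltk.download("genesis", quiet=True)           # any tokenized corpus will do
words = nltk.corpus.genesis.words("english-web.txt")

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                    # frequency counting: drop rare candidates

print("likelihood ratio:", finder.nbest(BigramAssocMeasures.likelihood_ratio, 5))
print("chi-square:      ", finder.nbest(BigramAssocMeasures.chi_sq, 5))
print("PMI:             ", finder.nbest(BigramAssocMeasures.pmi, 5))
print("Student's t:     ", finder.nbest(BigramAssocMeasures.student_t, 5))
```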
- 23. Topical N-grams. HMMLDA captures word dependencies: the HMM models short-range syntactic dependencies, while LDA models long-range semantic dependencies. [Wan'07] Wang, X., McCallum, A., & Wei, X. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
- 24. Topical N-grams. [Wan'07] Wang, X., McCallum, A., & Wei, X. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
- 25. IV. Data Collection: NIPS Abstracts
- 26. NIPS Collection characteristics: number of words W = 13649; number of documents D = 1740; number of topics T = 100; number of iterations N = 50; LDA hyperparameter ALPHA = 0.5; LDA hyperparameter BETA = 0.01. Randomly sampled document titles from the NIPS collection are shown.
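The project itself used the Matlab Topic Modeling Toolbox; as a rough Python analogue (an assumption for illustration, not the toolbox), the same settings map onto gensim's LdaModel roughly as below, with `nips_docs` as a placeholder for the preprocessed NIPS documents and BETA passed as gensim's `eta`. Note that gensim uses variational inference rather than the toolbox's Gibbs sampler.

```python
from gensim import corpora
from gensim.models import LdaModel

# nips_docs: placeholder for the tokenized NIPS documents (toy stand-in shown here).
nips_docs = [["neural", "network", "learning"], ["kernel", "margin", "support"]]

dictionary = corpora.Dictionary(nips_docs)
corpus = [dictionary.doc2bow(doc) for doc in nips_docs]

# T = 100 topics, ALPHA = 0.5, BETA = 0.01 (eta in gensim), 50 iterations.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=100, alpha=0.5, eta=0.01,
               iterations=50, passes=1, random_state=0)

for topic_id, words in lda.show_topics(num_topics=5, num_words=8, formatted=False):
    print(topic_id, [w for w, _ in words])
```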
- 27. LDA Model Input/Output. Inputs: WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens; DS – a 1 x N vector where DS(k) contains the document index of the k-th word token. Outputs: WP – a sparse W x T matrix where WP(i,j) contains the number of times word i has been assigned to topic j; DP – a sparse D x T matrix where DP(d,j) contains the number of times a word token in document d has been assigned to topic j; Z – a 1 x N vector containing the topic assignments, where Z(k) is the topic assignment for token k.
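To clarify the data layout, here is a small Python sketch (the toolbox itself is Matlab; this only reproduces the WS/DS convention) that flattens a toy tokenized corpus into the two 1 x N index vectors the sampler consumes:

```python
import numpy as np

# Placeholder tokenized corpus; indices are 1-based, matching the Matlab toolbox.
docs = [["neural", "network", "neural"], ["kernel", "network"]]

vocab = sorted({w for doc in docs for w in doc})
word_id = {w: i + 1 for i, w in enumerate(vocab)}   # 1-based vocabulary indices

WS, DS = [], []
for d, doc in enumerate(docs, start=1):             # 1-based document indices
    for w in doc:
        WS.append(word_id[w])                       # WS(k): vocabulary index of token k
        DS.append(d)                                # DS(k): document index of token k

WS, DS = np.array(WS), np.array(DS)
print("N =", len(WS))                               # number of word tokens
print("WS:", WS)
print("DS:", DS)
```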
- 28. HMMLDA Model Input/Output. Inputs: WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens; DS – a 1 x N vector where DS(k) contains the document index of the k-th word token. Outputs: WP – a sparse W x T matrix where WP(i,j) contains the number of times word i has been assigned to topic j; DP – a sparse D x T matrix where DP(d,j) contains the number of times a word token in document d has been assigned to topic j; MP – a sparse W x S matrix, where S is the number of HMM states, and MP(i,j) contains the number of times word i has been assigned to HMM state j; Z – a 1 x N vector containing the topic assignments, where Z(k) is the topic assignment for token k; X – a 1 x N vector containing the HMM state assignments, where X(k) is the HMM state assigned to the k-th word token.
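As a rough sketch of what the extra MP and X structures capture (my simplified reading of the HMMLDA composite model, not the toolbox code): an HMM walks over states, and only in a designated "semantic" state is the word drawn from the LDA topic component; in the other states it comes from that state's own word distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

vocab = ["the", "of", "neural", "network", "kernel"]
S = 2                                     # HMM states: 0 = syntactic, 1 = semantic (topic)
trans = np.array([[0.4, 0.6],             # p(next state | current state)
                  [0.5, 0.5]])
state_words = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # word dist. of the syntactic state
phi = np.array([[0.0, 0.0, 0.5, 0.5, 0.0],          # topic-word distributions
                [0.0, 0.0, 0.0, 0.3, 0.7]])
theta = np.array([0.6, 0.4])              # document's topic proportions

state, doc = 0, []
for _ in range(8):
    state = rng.choice(S, p=trans[state])           # X(k): HMM state assignment
    if state == 1:                                  # semantic state -> LDA component
        z = rng.choice(len(theta), p=theta)         # Z(k): topic assignment
        w = rng.choice(len(vocab), p=phi[z])
    else:                                           # syntactic state -> state's word dist.
        w = rng.choice(len(vocab), p=state_words)
    doc.append(vocab[w])
print(" ".join(doc))
```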
- 29. LDA-COL Model Input/Output. Inputs: WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens; DS – a 1 x N vector where DS(k) contains the document index of the k-th word token; WW – a sparse W x W matrix where WW(i,j) contains the number of times word i follows word j in the word stream; SI – a 1 x N vector where SI(k) = 1 only if the k-th word can form a collocation with the (k-1)-th word, and SI(k) = 0 otherwise. Outputs: WP – a sparse W x T matrix where WP(i,j) contains the number of times word i has been assigned to topic j; DP – a sparse D x T matrix where DP(d,j) contains the number of times a word token in document d has been assigned to topic j; WC – a 1 x W vector where WC(k) contains the number of times word k led to a collocation with the next word in the word stream; Z – a 1 x N vector containing the topic assignments, where Z(k) is the topic assignment for token k; C – a 1 x N vector containing the topic/collocation assignments: C(k) = 0 when token k was assigned to the topic model, and C(k) = 1 when token k was assigned to a collocation with word token k-1.
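A small sketch (Python rather than the toolbox's Matlab, purely illustrative) of how the WW succession-count matrix and the SI indicator vector could be assembled from the same WS/DS token stream, never allowing a collocation across a document boundary:

```python
import numpy as np
from scipy.sparse import lil_matrix

# Toy token stream as 1-based vocabulary indices, with DS marking each token's document.
WS = np.array([3, 4, 1, 3, 4, 2])
DS = np.array([1, 1, 1, 2, 2, 2])
W = WS.max()

WW = lil_matrix((W, W), dtype=np.int32)   # WW(i,j): times word i follows word j
SI = np.zeros(len(WS), dtype=np.int32)    # SI(k)=1 if token k may collocate with token k-1

for k in range(1, len(WS)):
    if DS[k] != DS[k - 1]:
        continue                          # no collocation across document boundaries
    i, j = WS[k], WS[k - 1]
    WW[i - 1, j - 1] += 1                 # shift to 0-based indices for the Python matrix
    SI[k] = 1

print(WW.toarray())
print("SI:", SI)
```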
- 30. V. Experimental Results
- 31. Experiment Setup: 1. 100 topics. 2. Gibbs sampling – 50 iterations. 3. Parameters (following [Gri'04]): LDA: ALPHA = 0.5, BETA = 0.01; HMMLDA: ALPHA = 0.5, BETA = 0.01, GAMMA = 0.1; LDA-COL: ALPHA = 0.5, BETA = 0.01, GAMMA0 = 0.1, GAMMA1 = 0.1. [Gri'04] Griffiths, T., & Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1):5228–5235, 2004.
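For reference, a minimal collapsed Gibbs sampler for plain LDA in the spirit of [Gri'04] (an illustrative sketch using the ALPHA/BETA values above on a toy input, not the toolbox's optimized sampler):

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_lda(WS, DS, W, D, T, n_iter=50, alpha=0.5, beta=0.01):
    """Collapsed Gibbs sampling for LDA; WS/DS are 0-based token/document index vectors."""
    N = len(WS)
    Z = rng.integers(T, size=N)                       # random initial topic assignments
    WP = np.zeros((W, T)); DP = np.zeros((D, T)); ZT = np.zeros(T)
    for k in range(N):
        WP[WS[k], Z[k]] += 1; DP[DS[k], Z[k]] += 1; ZT[Z[k]] += 1

    for _ in range(n_iter):
        for k in range(N):
            w, d, z = WS[k], DS[k], Z[k]
            WP[w, z] -= 1; DP[d, z] -= 1; ZT[z] -= 1  # remove token k's current assignment
            # p(z | rest) is proportional to (WP[w,z]+beta)/(ZT[z]+W*beta) * (DP[d,z]+alpha)
            p = (WP[w] + beta) / (ZT + W * beta) * (DP[d] + alpha)
            z = rng.choice(T, p=p / p.sum())
            Z[k] = z
            WP[w, z] += 1; DP[d, z] += 1; ZT[z] += 1
    return WP, DP, Z

# Toy run (not the NIPS data): 6 tokens over a 4-word vocabulary in 2 documents.
WS = np.array([0, 1, 0, 2, 3, 2]); DS = np.array([0, 0, 0, 1, 1, 1])
WP, DP, Z = gibbs_lda(WS, DS, W=4, D=2, T=3, n_iter=50)
print("Z:", Z)
```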
- 32. LDA Model Results
- 33. Hidden Markov Model with Latent Dirichlet Allocation (HMMLDA) Model Results. [Hsu'06] Hsu, B. J., & Glass, J. Style and topic language model adaptation using HMM-LDA. Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.
- 34. LDA-COL Model Results
- 35. LDA vs. HMMLDA vs. LDA-COL
- 36. LDAs vs. Topical N-grams. [Wan'07] Wang, X., McCallum, A., & Wei, X. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
- 37. LDAs vs. Topical N-grams
- 38. VI. Conclusions
- 39. Conclusions. I. HMMLDA showed the worst results because stop-word removal was not performed. II. LDA-COL performed best in comparison to LDA and HMMLDA, but worse than topical n-gram models. Future work: Polylingual Topic Models. [Mim'09] Mimno, D., Wallach, H. M., Naradowsky, J., Smith, D. A., & McCallum, A. Polylingual topic models. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore: ACL, August 2009, pp. 880–889. http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
- 40. Acknowledgments: University of California, Irvine, Department of Cognitive Sciences, for the Matlab Topic Modeling Toolbox; Dr. Caragea. Questions?
