Generative vs. Discriminative
• Generative approaches produce a probability density model over all variables in a system and manipulate it to compute classification and regression functions.
• Discriminative approaches provide a direct attempt to compute the input-to-output mappings for classification and regression.
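In standard probabilistic terms (a generic formulation, not taken from any of the papers cited below), a generative classifier models the joint distribution and obtains predictions via Bayes' rule, while a discriminative classifier parameterizes the conditional directly:

```latex
% Generative: model the joint p(x, y), then condition on the observed input x
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}

% Discriminative: model the conditional p(y | x) directly with parameters \theta
p(y \mid x) = f_{\theta}(x)
```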
From LSI -> pLSA -> LDA
polysemy/synonymy -> probability -> exchangeability
• TF-IDF, Salton and McGill (Sal'83)
  [Sal83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1983.
• Latent Semantic Indexing (LSI), Deerwester et al. (Dee'90)
  [Dee90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
• Probabilistic Latent Semantic Indexing (pLSA), Hofmann (Hof'99)
  [Hof99] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
• Latent Dirichlet Allocation (LDA), Blei et al. (Ble'03)
  [Ble03] Blei, D.M., Ng, A.Y., Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022, 2003.
– probability of the sequence of words
– each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a document-specific distribution over topics
Disadvantages of "Bag of Words"
• TEXT ≠ a sequence of discrete, independent word tokens
• The actual meaning cannot be captured by word co-occurrences alone
• Word order is not only important for syntax, it is also important for lexical meaning
• Word order within nearby context and phrases is critical to capturing the meaning of text
Collocations = word phrases?
• Noun phrases:
– “strong tea”, “weapon of mass destruction”
• Phrasal verbs:
– “make up” =?
• Other phrases:
– “rich and powerful”
• A collocation is a phrase whose meaning goes beyond that of the individual words (e.g. "white house")
[Man'99] Manning, C., & Schütze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
– How can the "Information Retrieval" topic be represented?
  Unigrams -> …, information, search, …, web
– What about "Artificial Intelligence"?
  Unigrams -> agent, …, information, search, …
– Issues with using unigrams for topic modeling:
  • Not representative enough of a single topic
  • Ambiguous (concepts share words)
    – system, modeling, information, data, structure, …
1. NIPS Data Collection and Preprocessing
2. Learning topic models on the NIPS collection
   - Model 1: LDA
   - Model 2: HMMLDA
   - Model 3: LDA-COL
3. Results comparison for LDA, LDA-COL, HMMLDA and n-grams
What are the limitations of using Wikipedia concepts?
[Figure: Wiki Concept Graph – concept nodes such as Natural Language Processing, Active Learning (AL), Artificial Intelligence (AI), Computer Vision, Information Retrieval (IR) and Machine Learning (ML), each annotated with counts of related concepts (Cognitive Science, Object Recognition, Visual Perception, etc.)]
– The n-gram distribution within a single document is small
– What level of concept abstraction should be used?
Topic Modeling with Latent Dirichlet Allocation (LDA)
• a word is represented as a multinomial random variable w
• a topic is represented as a multinomial random variable z
• a document is represented as a Dirichlet random variable θ
• each corner of the topic simplex corresponds to a topic – a component of the vector θ
• a document is modeled as a point on the simplex – a distribution over the topics
• a corpus is modeled as a Dirichlet distribution on the simplex
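A minimal sketch of the generative process this representation describes; the sizes and hyperparameters below are toy values chosen only for illustration, not the NIPS settings reported later in these slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and hyperparameters -- illustrative only.
W, T, D = 1000, 5, 10            # vocabulary size, number of topics, number of documents
alpha, beta = 0.5, 0.01          # Dirichlet hyperparameters

# Each topic is a multinomial over the vocabulary, drawn from Dirichlet(beta).
phi = rng.dirichlet(np.full(W, beta), size=T)

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))        # document = point on the topic simplex
    n_tokens = rng.poisson(100)                     # document length (arbitrary choice)
    z = rng.choice(T, size=n_tokens, p=theta)       # topic assignment for each token
    words = [rng.choice(W, p=phi[t]) for t in z]    # word drawn from the chosen topic
    corpus.append(words)
```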
Bigram Topic Models
• Wallach's Bigram Topic Model (Wal'05) is based on the Hierarchical Dirichlet Language Model (Pet'94)
[Wal'05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005.
[Pet'94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1, 1–19, 1994.
• The LDA collocation model (LDA-COL), implemented in the Matlab Topic Modeling Toolbox, can decide whether to generate a bigram or a unigram at each word position (Ste'05)
[Ste'05] Steyvers, M., & Griffiths, T. Matlab Topic Modeling Toolbox 1.3. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005.
Methods for Collocation Discovery
• Counting frequency (Jus'95)
  Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 9–27, 1995.
• Variance-based collocation (Sma'93)
  Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177, 1993.
• Hypothesis testing -> assess whether two words occur together more often than by chance (see the sketch after this list):
  – t-test (Chu'89)
    Church, K., & Hanks, P. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 76–83, 1989.
  – χ² test (Chu'91)
    Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, pp. 115–164. Lawrence Erlbaum, 1991.
  – Likelihood ratio test (Dun'93)
    Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74, 1993.
• Mutual information (Hod'96)
  Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2, 137–160, 1996.
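A brief sketch of two of these tests, following the formulations in Manning & Schütze (Man'99); the counts in the example call are made up:

```python
import math

def t_score(c1, c2, c12, N):
    """t-test for a candidate bigram.
    c1, c2: unigram counts of the two words; c12: bigram count; N: total bigrams."""
    x_bar = c12 / N                     # observed bigram probability
    mu = (c1 / N) * (c2 / N)            # expected probability under independence
    return (x_bar - mu) / math.sqrt(x_bar / N)   # s^2 ~ x_bar for rare events

def log_likelihood_ratio(c1, c2, c12, N):
    """Dunning's (1993) likelihood-ratio test for the same counts."""
    def log_l(k, n, p):
        return k * math.log(p) + (n - k) * math.log(1 - p)
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, N - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, N - c1, p))

# Made-up counts for a candidate bigram such as "white house" in a toy corpus:
print(t_score(c1=500, c2=300, c12=40, N=100_000))
print(log_likelihood_ratio(c1=500, c2=300, c12=40, N=100_000))
```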
HMMLDA captures word dependencies:
HMM -> short-range syntactic dependencies; LDA -> long-range semantic (topical) dependencies
[Wan’07] Xuerui Wang, Andrew McCallum and Xing Wei Topical N-grams:
Phrase and Topic Discovery, with an Application to Information Retrieval,
Proceedings of the 7th IEEE International Conference on Data Mining (ICDM),
2007 - http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
NIPS Collection Characteristics
• Number of words W = 13649
• Number of docs D = 1740
• Number of topics T = 100
• Number of iterations N = 50
• LDA hyperparameter ALPHA = 0.5
• LDA hyperparameter BETA = 0.01
Randomly sampled document titles from NIPS Collection
LDA Model Input/Output
Inputs:
• WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens
• DS – a 1 x N vector where DS(k) contains the document index of the k-th word token
Outputs:
• WP – a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j
• DP – a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j
• Z – a 1 x N vector containing the topic assignments, where N is the number of word tokens; Z(k) contains the topic assignment for token k
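The output matrices are simple count summaries of the token-level assignments. A small NumPy sketch of that relationship (0-based indices rather than Matlab's 1-based indexing; the toy arrays are hypothetical):

```python
import numpy as np

# Toy token stream: vocabulary indices, document indices, and sampled topic assignments.
WS = np.array([0, 3, 2, 3, 1, 0])   # WS(k): vocabulary index of token k
DS = np.array([0, 0, 0, 1, 1, 1])   # DS(k): document index of token k
Z  = np.array([1, 0, 1, 2, 0, 1])   # Z(k): topic assigned to token k
W, D, T = 4, 2, 3

# WP(i,j): number of times word i has been assigned to topic j
WP = np.zeros((W, T), dtype=int)
np.add.at(WP, (WS, Z), 1)

# DP(d,j): number of times a token in document d has been assigned to topic j
DP = np.zeros((D, T), dtype=int)
np.add.at(DP, (DS, Z), 1)
```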
HMMLDA Model Input/Output
Inputs:
• WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens
• DS – a 1 x N vector where DS(k) contains the document index of the k-th word token
Outputs:
• WP – a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j
• DP – a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j
• MP – a sparse W x S matrix, where S is the number of HMM states; MP(i,j) contains the number of times word i has been assigned to HMM state j
• Z – a 1 x N vector containing the topic assignments, where N is the number of word tokens; Z(k) contains the topic assignment for token k
• X – a 1 x N vector containing the HMM state assignments, where N is the number of word tokens; X(k) contains the HMM state assigned to the k-th word token
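Given WP and MP as described above, the most characteristic words of a topic or of an HMM state can be read off by sorting a column of the corresponding count matrix. A short illustrative helper (the `vocab` list, mapping vocabulary index to word string, is a hypothetical input):

```python
import numpy as np

def top_words(count_matrix, vocab, col, n=10):
    """Return the n words with the highest count in a given column.
    count_matrix: W x T (WP, topics) or W x S (MP, HMM states); vocab: list of W word strings."""
    order = np.argsort(count_matrix[:, col])[::-1][:n]
    return [vocab[i] for i in order]

# e.g. top words of topic 5 from WP, and of HMM state 2 from MP:
# print(top_words(WP, vocab, col=5))
# print(top_words(MP, vocab, col=2))
```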
LDA-COL Model Input/Output
Inputs:
• WS – a 1 x N vector where WS(k) contains the vocabulary index of the k-th word token, and N is the number of word tokens
• DS – a 1 x N vector where DS(k) contains the document index of the k-th word token
• WW – a sparse W x W matrix where WW(i,j) contains the count of the number of times that word i follows word j in the word stream
• SI – a 1 x N vector where SI(k) = 1 only if the k-th word can form a collocation with the (k-1)-th word, and SI(k) = 0 otherwise
Outputs:
• WP – a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j
• DP – a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j
• WC – a 1 x W vector where WC(k) contains the number of times word k led to a collocation with the next word in the word stream
• Z – a 1 x N vector containing the topic assignments, where N is the number of word tokens; Z(k) contains the topic assignment for token k
• C – a 1 x N vector containing the topic/collocation assignments, where N is the number of word tokens; C(k) = 0 when token k was assigned to the topic model, and C(k) = 1 when token k was assigned to a collocation with word token k-1
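Once C has been sampled, collocations can be recovered by joining each token with its predecessor whenever C(k) = 1. A rough sketch of that post-processing step (the token stream and C values are hypothetical):

```python
def extract_collocations(tokens, C):
    """Group consecutive tokens into phrases: token k starts a new phrase
    when C[k] == 0 and extends the previous phrase when C[k] == 1."""
    phrases = []
    for k, word in enumerate(tokens):
        if k > 0 and C[k] == 1:
            phrases[-1].append(word)     # token k forms a collocation with token k-1
        else:
            phrases.append([word])       # token k was generated by the topic model
    return [" ".join(p) for p in phrases]

# Hypothetical stream and sampled C vector:
print(extract_collocations(
    ["the", "white", "house", "issued", "a", "statement"],
    [0, 0, 1, 0, 0, 0]))   # -> ['the', 'white house', 'issued', 'a', 'statement']
```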
Hidden Markov Model with Latent Dirichlet Allocation (HMMLDA)
[Hsu'06] Hsu, B. J., & Glass, J. Style and topic language model adaptation using HMM-LDA. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.
LDAs vs. Topical N-grams
[Wan’07] Xuerui Wang, Andrew McCallum and Xing Wei Topical N-grams: Phrase and Topic Discovery, with
an Application to Information Retrieval, Proceedings of the 7th IEEE International Conference on Data
Mining (ICDM), 2007 - http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
I. HMMLDA showed the worst results, because stop-word removal was not performed
II. LDA-COL performed best in comparison to LDA and HMMLDA, but worse than the topical n-gram models
Polylingual Topic Models
[Mim'2009] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum, "Polylingual topic models," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, August 2009, pp. 880-889. http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
University of California, Irvine, Department of Cognitive Sciences, for the Matlab Topic Modeling Toolbox