α is a prior on the topic distributions of the documents in a corpus
α is a corpus-level parameter (chosen once)
α acts as a force on the topic combinations a document can take
The amount of smoothing is determined by α
Higher α → more smoothing → less "distinct" topics
Low α → pressure to pick, for each document, a topic distribution favoring just a few topics
Recommended value: α = 50/T (or less if T is very small)
High α: each doc's topic distribution θ is a smooth mix of all topics, e.g., topic distribution of Doc1 = (1/3, 1/3, 1/3)
Low α: each doc's topic distribution θ must favor few topics, e.g., topic distribution of Doc2 = (1, 0, 0)
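As an illustration (not from the slides), a minimal numpy sketch of how a symmetric Dirichlet prior with high vs. low α shapes sampled per-document topic distributions θ; T = 3 and the α values are arbitrary choices:

```python
# Minimal sketch: effect of a symmetric Dirichlet prior alpha on theta.
import numpy as np

T = 3                                   # number of topics (illustrative)
rng = np.random.default_rng(0)

for alpha in (10.0, 0.1):               # high alpha vs. low alpha
    thetas = rng.dirichlet(np.full(T, alpha), size=5)
    print(f"alpha = {alpha}")
    for theta in thetas:
        # high alpha -> smooth mixes near (1/3, 1/3, 1/3); low alpha -> sparse
        print("  theta =", np.round(theta, 2))
```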
Gibbs Sampling for LDA: probability that topic j is chosen for word w_i, conditioned on all other assigned topics of words in this doc and all other observed variables:
P(z_i = j | z_{-i}, w) ∝ (C^{WT}_{w_i,j} + β) / (Σ_w C^{WT}_{w,j} + W·β) · (C^{DT}_{d_i,j} + α) / (Σ_t C^{DT}_{d_i,t} + T·α)
C^{WT}_{w_i,j}: number of times word token w_i was assigned to topic j across all docs
C^{DT}_{d_i,j}: number of times topic j was already assigned to some word token in doc d_i
The right-hand side is unnormalized: divide by the sum over all T topics to obtain the probability of assigning topic j to word w_i.
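A minimal Python sketch of this collapsed Gibbs update, assuming numpy count matrices C_WT (word × topic) and C_DT (doc × topic); variable names are illustrative, not from the slides:

```python
# Sketch: resample the topic of word token i (word w_i in document d_i).
import numpy as np

def resample_topic(w_i, d_i, z_i, C_WT, C_DT, alpha, beta, rng):
    """C_WT[w, j]: word-topic counts across all docs; C_DT[d, j]: doc-topic counts."""
    W, T = C_WT.shape
    # Remove the token's current assignment from the count matrices.
    C_WT[w_i, z_i] -= 1
    C_DT[d_i, z_i] -= 1
    # Factor 1: how likely is word w_i under each topic j?
    p_w_given_j = (C_WT[w_i, :] + beta) / (C_WT.sum(axis=0) + W * beta)
    # Factor 2: how dominant is each topic j in document d_i?
    p_j_given_d = (C_DT[d_i, :] + alpha) / (C_DT[d_i, :].sum() + T * alpha)
    p = p_w_given_j * p_j_given_d
    p /= p.sum()                      # normalize over all T topics
    new_z = rng.choice(T, p=p)        # draw the new topic assignment
    # Add the token back with its new topic.
    C_WT[w_i, new_z] += 1
    C_DT[d_i, new_z] += 1
    return new_z
```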
Gibbs sampling is used to estimate topic assignment for each word of each doc
Factors affecting topic assignments
How likely is a word w for a topic j?
Probability of word w under topic j
How dominant is a topic j in a doc d?
Probability of topic j under the current topic distribution for document d
Once many tokens of a word have been assigned to topic j (across documents), the probability of assigning any particular token of that word to topic j increases; all other topics become less likely for word w (explaining away).
Once a topic j has been used multiple times in one document, the probability that any word from that document will be assigned to topic j increases; all other topics become less likely for words in that document (explaining away).
Gibbs Sampling Convergence (figure: topic assignments over iterations; black = topic 1, white = topic 2)
Each iteration updates count-matrices
count-matrices stop changing
Gibbs samples start to approximate the target distribution (i.e., the posterior distribution over z)
Gibbs sampling estimates posterior distribution of z. But we need word-distribution φ of each topic and topic-distribution θ of each document.
φ̂^{(j)}_{w_i} = (C^{WT}_{w_i,j} + β) / (Σ_w C^{WT}_{w,j} + W·β): number of times word w_i was associated with topic j over the number of times all words were associated with topic j; the predictive distribution of sampling a new token of word w_i from topic j
θ̂^{(d)}_j = (C^{DT}_{d,j} + α) / (Σ_t C^{DT}_{d,t} + T·α): number of times topic j was associated with doc d over the number of times all topics were associated with doc d; the predictive distribution of sampling a new token in document d from topic j
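A sketch of these point estimates under the same assumed count-matrix layout as in the sampler above:

```python
# Sketch: estimate phi and theta from the final count matrices.
import numpy as np

def estimate_phi_theta(C_WT, C_DT, alpha, beta):
    W, T = C_WT.shape
    # phi[w, j] = P(word w | topic j)
    phi = (C_WT + beta) / (C_WT.sum(axis=0, keepdims=True) + W * beta)
    # theta[d, j] = P(topic j | doc d)
    theta = (C_DT + alpha) / (C_DT.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta
```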
For each doc d and each word w of that doc, an author x is sampled from the doc's author distribution/set a_d.
For each doc d and each word w of that doc, a topic z is sampled from the topic distribution θ^(x) of the author x assigned to that word.
From the word distribution φ^(z) of the sampled topic z, a word w is sampled.
P(z | x, θ^(x)) · P(w | z, φ^(z))
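A small generative sketch of these three AT-model steps, assuming theta is indexed by author and phi by topic (illustrative names only, not the authors' code):

```python
# Sketch: per word, sample an author, then a topic, then the word itself.
import numpy as np

def generate_doc(authors_d, theta, phi, n_words, rng):
    """authors_d: author ids of the doc (a_d); theta[x]: topic distribution of
    author x; phi[z]: word distribution of topic z."""
    words = []
    for _ in range(n_words):
        x = rng.choice(authors_d)                   # author x ~ uniform over a_d
        z = rng.choice(len(theta[x]), p=theta[x])   # topic  z ~ theta^(x)
        w = rng.choice(len(phi[z]), p=phi[z])       # word   w ~ phi^(z)
        words.append(w)
    return words
```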
AT Model Latent Variables:
1) The author-topic assignment for each word
2) The topic distribution of each author (count matrix C^AT): determines which topics are used by which authors
3) The word distribution of each topic (count matrix C^WT)
Matrix Representation of the Author-Topic Model (figure: a_d and the words are observed; θ^(x) and φ^(z) are latent). Source: http://www.ics.uci.edu/~smyth/kddpapers/UCI_KD-D_author_topic_preprint.pdf
Add one fictitious author for each document (the author set a_d grows by 1)
uniform or non-uniform distribution over authors (including the fictitious author)
Each word is sampled either from a real author's or from the fictitious author's topic distribution.
i.e., we learn topic distributions for real authors and for the fictitious "authors" (= the documents themselves).
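A sketch of the fictitious-author bookkeeping under an assumed list-of-author-ids representation (hypothetical helper, not from the slides):

```python
# Sketch: add a unique pseudo-author per document so the model also learns a
# topic distribution for each document itself.
def add_fictitious_authors(doc_authors):
    """doc_authors: one list of real author ids per document."""
    n_real = 1 + max(a for authors in doc_authors for a in authors)
    return [authors + [n_real + d]      # pseudo-author id for document d
            for d, authors in enumerate(doc_authors)]
```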
Problem reported in (Hong, 2010): the topic distribution of each Twitter message learnt via the AT model was worse than with LDA using the USER schema, because messages are sparse and not all words of a message are used to learn the document's topic distribution.
Predictive Power of Different Models (Rosen-Zvi, 2005). Experiment: training data: 1,557 papers; test data: 183 papers (102 are single-authored). Test documents were chosen such that each author of a test document also appears as an author in the training set.
Author-Recipients-Topic (ART) Model (McCallum, 2004)
Words per message
Authors per message
Recipients per message
Sample for each word
an author-recipient pair AND
a topic, conditioned on the author-recipient pair's topic distribution θ^(A,R)
Learn 2 corpus-level variables:
Topic distribution of each author-recipient pair
Word-distribution for each topic
2 count matrices:
P(z | x, a_d, θ^(A,R)) · P(w | z, φ^(z))
Gibbs Sampling for the ART Model. Random start: sample an author-recipient pair and a topic for each word. Then, for each word token w_i, compute:
P(z_i = t, x_i = r | w_i, z_{-i}, x_{-i}, ·) ∝ (1 / R_{d_i}) · (C^{ART}_{(a,r),t} + α) / (Σ_{t'} C^{ART}_{(a,r),t'} + T·α) · (C^{WT}_{w_i,t} + β) / (Σ_{w'} C^{WT}_{w',t} + W·β)
R_{d_i}: number of recipients of the message to which word w_i belongs
C^{ART}_{(a,r),t}: number of times topic t was assigned to the author-recipient pair (a, r); the first denominator sums the counts of all other topics assigned to that pair
C^{WT}_{w_i,t}: number of times the current word token was assigned to topic t; the second denominator sums the counts of all other words assigned to topic t plus W·β (number of words times β)
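A sketch of this joint (recipient, topic) resampling step; the count-matrix layout (a dict C_ART keyed by author-recipient pair, a numpy matrix C_WT) and the names are assumptions, not the paper's code:

```python
# Sketch: jointly resample a recipient and a topic for word token w_i.
# Counts are assumed to already exclude the current assignment of this token.
import numpy as np

def resample_recipient_topic(w_i, author, recipients, C_ART, C_WT, alpha, beta, rng):
    """C_ART[(a, r)]: length-T topic-count vector of pair (a, r); C_WT[w, t]: word-topic counts."""
    W, T = C_WT.shape
    # Word factor: likelihood of w_i under each topic (independent of recipient).
    p_w_given_t = (C_WT[w_i, :] + beta) / (C_WT.sum(axis=0) + W * beta)
    p = np.zeros((len(recipients), T))
    for k, r in enumerate(recipients):
        n_art = C_ART[(author, r)]
        p_t_given_ar = (n_art + alpha) / (n_art.sum() + T * alpha)
        p[k, :] = p_t_given_ar * p_w_given_t / len(recipients)
    p /= p.sum()                               # normalize over all (r, t) pairs
    idx = rng.choice(p.size, p=p.ravel())      # draw one (recipient, topic) pair
    return recipients[idx // T], idx % T
```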
Word-topic assignments are drawn from the document's topic distribution θ, which is restricted to the topics corresponding to the labels Λ observed in d. The topic corresponding to a label l is the same across all documents containing label l.
The document's labels Λ are first generated using a Bernoulli coin toss for each topic k with a labeling prior Φ_k.
Constraining the topic model to use only those topics that correspond to a document’s (observed) label set.
Topic assignments are limited to the document’s labels
One-to-one correspondence between LDA’s latent topics and user tags/labels
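A sketch of this restriction, reusing the assumed C_WT/C_DT count matrices from the LDA sampler above and limiting the candidate topics to the document's observed labels:

```python
# Sketch: Labeled-LDA style update, where a word's topic is sampled only from
# the topics that correspond to the document's labels.
import numpy as np

def resample_topic_labeled(w_i, d_i, doc_labels, C_WT, C_DT, alpha, beta, rng):
    """doc_labels: topic ids corresponding to the labels observed for doc d_i."""
    W = C_WT.shape[0]
    labels = np.asarray(doc_labels)            # allowed topics for this document
    p_w = (C_WT[w_i, labels] + beta) / (C_WT[:, labels].sum(axis=0) + W * beta)
    p_d = (C_DT[d_i, labels] + alpha) / (C_DT[d_i, labels].sum() + len(labels) * alpha)
    p = p_w * p_d
    p /= p.sum()
    return labels[rng.choice(len(labels), p=p)]  # topic restricted to the labels
```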
Discovery of groups is guided by the emerging topics
Discovery of topics is guided by the emerging groups
The GT model is an extension of the blockstructures model: group membership is conditioned on a latent variable associated with the attributes of the relation (i.e., the words); this latent variable represents the topic that generated the words.
GT model discovers topics relevant to relationships between entities in the social network
David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003).
Dietz, L., Bickel, S. and Scheffer, T. (2007). Unsupervised prediction of citation influences. Proc. ICML, 2007.
Thomas Hofmann, Probabilistic Latent Semantic Analysis, Proc. of Uncertainty in Artificial Intelligence, UAI'99, (1999).
Thomas L. Griffiths, Joshua B. Tenenbaum, Mark Steyvers, Topics in semantic representation, (2007).
Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, Mark Steyvers, Learning author-topic models from text corpora, (2010).
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers and Padhraic Smyth, The author-topic model for authors and documents, In Proceedings of the 20th conference on Uncertainty in artificial intelligence (2004).
Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang, The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email, Tech-Report, (2004).
Nishith Pathak, Colin Delong, Arindam Banerjee, Kendrick Erickson, Social Topic Models for Community Extraction, In The 2nd SNA-KDD Workshop ’08, (2008).
Steyvers and Griffiths, Probabilistic Topic Models, (2006).
Ramage, Daniel and Hall, David and Nallapati, Ramesh and Manning, Christopher D., Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora, EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (2009)
Xuerui Wang, Natasha Mohanty, Andrew McCallum, Group and topic discovery from relations and text, (2005).
Hanna M. Wallach, David Mimno and Andrew McCallum, Rethinking LDA: Why Priors Matter (2009)