Thanks for your explanation of LDA, but I'm still a little confused about how a topic is assigned to each word and how the word is chosen. How do we do that? Do we just choose it randomly?
Topic models = probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts (Blei, 2003)
Aim: discover patterns of word-use and connect documents that exhibit similar patterns
Idea: documents are mixtures of topics and a topic is a probability distribution over words
Topic Models: 3 latent variables (Steyvers, 2006)
Word distribution per topic (word-topic matrix)
Topic distribution per doc (topic-doc matrix)
Topic assignment per word
[Figure: example topics, Topic 1 and Topic 2]
α is a prior on the topic-distribution of documents (of a corpus)
α is a corpus-level parameter (is chosen once)
α is a force on the topic combinations
Amount of smoothing determined by α
Higher α: more smoothing, less "distinct" topics
Low α: the pressure is to pick for each document a topic distribution favoring just a few topics
Recommended value: α = 50/T (or less if T is very small)
High α: each doc's topic distribution θ is a smooth mix of all topics, e.g., topic distribution of Doc1 = (1/3, 1/3, 1/3)
Low α: each doc's topic distribution θ must favor few topics, e.g., topic distribution of Doc2 = (1, 0, 0)
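The effect of α can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: a symmetric Dirichlet prior, T = 3 topics, and the hypothetical helper names `sample_dirichlet` and `mean_max_share`. It draws many topic distributions θ and reports how large the biggest topic share is on average.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha)."""
    # A Dirichlet draw is a vector of independent Gamma(alpha, 1) draws, normalized.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [g / total for g in draws]

def mean_max_share(alpha, num_topics=3, num_docs=2000, seed=0):
    """Average share of the most dominant topic over many sampled docs."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(num_docs))
    return sum(shares) / num_docs

high_alpha_share = mean_max_share(alpha=50.0)  # smooth mixes, close to 1/3 each
low_alpha_share = mean_max_share(alpha=0.1)    # most docs dominated by one topic
print(high_alpha_share, low_alpha_share)
```

With a high α the dominant topic holds only slightly more than 1/3 of a document, as for Doc1; with a low α it holds most of the document, as for Doc2.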
Gibbs Sampling for LDA
Probability that topic j is chosen for word w_i, conditioned on all other assigned topics of words in this doc and all other observed variables. It combines two counts:
Number of times word token w_i was assigned to topic j across all docs
Number of times topic j was already assigned to some word token in doc d_i
The result is unnormalized: divide the probability of assigning topic j to word w_i by the sum over all topics T
C^WT = number of times word token w_i was assigned to topic j
C^DT = number of times topic j was already assigned to some word token in doc d_i
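The slide's formula did not survive extraction. Written out, the update described above is commonly given in the following form, here following the notation of Steyvers and Griffiths (2006). The symbols W (vocabulary size), β (prior on the word distributions), and z_{-i} (all topic assignments except the current one) are assumptions taken from that formulation, not from the text above.

```latex
P(z_i = j \mid z_{-i}, w_i, d_i, \cdot) \;\propto\;
\frac{C^{WT}_{w_i j} + \beta}{\sum_{w=1}^{W} C^{WT}_{w j} + W\beta}
\cdot
\frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha}
```

The first factor answers how likely word w_i is under topic j, the second how dominant topic j is in document d_i. Both counts exclude the current assignment of token i.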
First Iteration:
For each word token, the count matrices C^WT and C^DT are first decremented by one for the entries that correspond to the current topic assignment
Then, a new topic is sampled from the current topic-distribution of a doc and the count matrices C^WT and C^DT are incremented with the new topic assignment.
Each Gibbs sample consists of the set of topic assignments to all N word tokens in the corpus, achieved by a single pass through all documents
Gibbs sampling is used to estimate topic assignment for each word of each doc
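The decrement, sample, increment loop can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the function name `gibbs_lda`, the toy corpus, and the values of α and β are all assumptions, and words are represented as integer ids.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, num_iters, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = random.Random(seed)
    c_wt = [[0] * num_topics for _ in range(vocab_size)]  # word-topic counts
    c_dt = [[0] * num_topics for _ in docs]               # doc-topic counts
    topic_totals = [0] * num_topics                       # column sums of c_wt
    z = []
    # Random start: assign a random topic to every word token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            j = rng.randrange(num_topics)
            z[d].append(j)
            c_wt[w][j] += 1
            c_dt[d][j] += 1
            topic_totals[j] += 1
    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Decrement the counts for the current assignment.
                j = z[d][i]
                c_wt[w][j] -= 1
                c_dt[d][j] -= 1
                topic_totals[j] -= 1
                # Unnormalized probability of each topic for this token.
                # The doc-side denominator is the same for all topics, so it is omitted.
                weights = [
                    (c_wt[w][t] + beta) / (topic_totals[t] + vocab_size * beta)
                    * (c_dt[d][t] + alpha)
                    for t in range(num_topics)
                ]
                # rng.choices normalizes the weights (division by the sum over topics).
                j = rng.choices(range(num_topics), weights=weights)[0]
                # Increment the counts for the new assignment.
                z[d][i] = j
                c_wt[w][j] += 1
                c_dt[d][j] += 1
                topic_totals[j] += 1
    return z, c_wt, c_dt

# Toy corpus (hypothetical): 4 word ids, 3 docs, 2 topics.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, c_wt, c_dt = gibbs_lda(toy_docs, num_topics=2, vocab_size=4,
                          alpha=0.5, beta=0.1, num_iters=50)
```

So the first assignment is random, but every later assignment is drawn from a distribution that depends on the current counts, which is what lets the topics sharpen over iterations.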
Factors affecting topic assignments
How likely is a word w for a topic j?
Probability of word w under topic j
How dominant is a topic j in a doc d?
Probability of topic j under the current topic distribution of document d
Once many tokens of a word have been assigned to topic j (across documents), the probability of assigning any particular token of that word to topic j increases; all other topics become less likely for word w (explaining away).
Once a topic j has been used multiple times in one document, the probability that any word from that document will be assigned to topic j increases; all other documents become less likely for topic j (explaining away).
Gibbs Sampling Convergence
[Figure legend: black = topic 1, white = topic 2]
Random Start
N iterations
Each iteration updates count-matrices
Convergence:
count-matrices stop changing
Gibbs samples start to approximate the target distribution (i.e., the posterior distribution over z)
Ignore some number of samples at the beginning (Burn-In period)
Consider only every n-th sample when averaging values to compute an expectation
Why?
Successive Gibbs samples are not independent; they form a Markov chain with some amount of correlation
The stationary distribution of the Markov chain is the desired joint distribution over the latent variables, but it may take a while for that stationary distribution to be reached
Techniques that may reduce autocorrelation between several latent variables are simulated annealing, collapsed Gibbs sampling, and blocked Gibbs sampling.
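Burn-in and thinning can be illustrated on any correlated chain. The sketch below uses a simple AR(1) process as a stand-in for a Gibbs chain; that is an assumption made purely for illustration, as are the helper names `keep_samples` and `lag1_autocorrelation` and the chosen burn-in and lag values.

```python
import random

def lag1_autocorrelation(xs):
    """Correlation between each sample and its successor."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

def keep_samples(chain, burn_in, lag):
    """Ignore the first burn_in samples, then keep only every lag-th sample."""
    return chain[burn_in::lag]

# Stand-in Markov chain: each sample depends strongly on the previous one.
rng = random.Random(0)
chain = [0.0]
for _ in range(21000 - 1):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

after_burn_in = keep_samples(chain, burn_in=1000, lag=1)
thinned = keep_samples(chain, burn_in=1000, lag=20)
full_corr = lag1_autocorrelation(after_burn_in)
thinned_corr = lag1_autocorrelation(thinned)
print(full_corr, thinned_corr)
```

Neighbouring samples of the full chain are strongly correlated, while the thinned samples are much closer to independent draws, which is why averages are computed over every n-th sample only.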
Gibbs sampling estimates posterior distribution of z. But we need word-distribution φ of each topic and topic-distribution θ of each document.
Word distribution φ of topic j, estimated from:
num of times word w_i was related with topic j
num of times all other words were related with topic j
Topic distribution θ of doc d, estimated from:
num of times topic j was related with doc d
num of times all other topics were related with doc d
These are the predictive distributions of sampling a new token of word i from topic j, and of sampling a new token in document d from topic j
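A sketch of how φ and θ could be read off the count matrices after sampling. It assumes the usual smoothed estimates with priors β and α as in Steyvers and Griffiths (2006); the function names and the example counts are hypothetical.

```python
def estimate_phi(c_wt, beta):
    """phi[j][w]: probability of word w under topic j, from word-topic counts."""
    vocab_size, num_topics = len(c_wt), len(c_wt[0])
    topic_totals = [sum(c_wt[w][j] for w in range(vocab_size))
                    for j in range(num_topics)]
    return [[(c_wt[w][j] + beta) / (topic_totals[j] + vocab_size * beta)
             for w in range(vocab_size)]
            for j in range(num_topics)]

def estimate_theta(c_dt, alpha):
    """theta[d][j]: probability of topic j in doc d, from doc-topic counts."""
    num_topics = len(c_dt[0])
    return [[(count + alpha) / (sum(row) + num_topics * alpha) for count in row]
            for row in c_dt]

# Hypothetical counts: 4 words x 2 topics, and 2 docs x 2 topics.
example_c_wt = [[3, 0], [1, 0], [0, 2], [0, 2]]
example_c_dt = [[4, 0], [0, 4]]
phi = estimate_phi(example_c_wt, beta=0.01)
theta = estimate_theta(example_c_dt, alpha=0.5)
```

Each row of `phi` and `theta` sums to one; the priors keep words and topics with a zero count from getting probability exactly zero.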
Sample author
For each doc d and each word w of that doc, an author x is sampled from the doc's author distribution/set a_d.
Sample topic
For each doc d and each word w of that doc, a topic z is sampled from the topic distribution θ^(x) of the author x who has been assigned to that word.
Sample word
From the word-distribution φ^(z) of each sampled topic z, a word w is sampled.
P(z | x, θ^(x)) and P(w | z, φ^(z))
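The three sampling steps of the author-topic model can be sketched as a generative routine. This assumes authors are drawn uniformly from a_d and that θ and φ are given as plain probability lists; the function name and the toy parameters are hypothetical.

```python
import random

def generate_at_document(authors, theta, phi, num_words, rng):
    """Generate (author, topic, word) triples for one document.

    authors: list of author ids of the document (a_d)
    theta:   theta[x] is the topic distribution of author x
    phi:     phi[z] is the word distribution of topic z
    """
    tokens = []
    for _ in range(num_words):
        x = rng.choice(authors)                                   # sample author
        z = rng.choices(range(len(theta[x])), weights=theta[x])[0]  # sample topic
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]      # sample word
        tokens.append((x, z, w))
    return tokens

# Toy setting: author 0 only writes about topic 0, author 1 only about topic 1,
# and each topic emits a single word id.
toy_theta = [[1.0, 0.0], [0.0, 1.0]]
toy_phi = [[1.0, 0.0], [0.0, 1.0]]
tokens = generate_at_document([0, 1], toy_theta, toy_phi,
                              num_words=20, rng=random.Random(0))
```

In the toy setting every word can be traced back to exactly one author, which is the kind of attribution the model tries to infer when only the words and the author set are observed.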
AT Model Latent Variables
1) Author-topic assignment for each word
2) Author-distribution of each topic: determines which topics are used by which authors (count matrix C^AT)
3) Word-distribution of each topic (count matrix C^WT)
Matrix Representation of Author-Topic Model (source: http://www.ics.uci.edu/~smyth/kddpapers/UCI_KD-D_author_topic_preprint.pdf)
[Figure: the words of the documents and the author sets a_d are observed; the matrices θ^(x) and φ^(z) are latent]
Add one fictitious author for each document: a_d + 1
uniform or non-uniform distribution over authors (including the fictitious author)
Each word is either sampled from a real author's or the fictitious author's topic distribution.
i.e., we learn topic distributions for real authors and for the fictitious "authors" (= documents).
Problem reported in (Hong, 2010): the topic distribution of each Twitter message learnt via the AT model was worse than LDA with the USER schema, because messages are sparse and not all words of one message are used to learn the document's topic distribution.
Predictive Power of different models (Rosen-Zvi, 2005)
Experiment:
Training data: 1,557 papers
Test data: 183 papers (102 are single-authored papers)
Test documents were chosen such that each author of a test set document also appears in the training set as an author.
Author-Recipients-Topic (ART) Model (McCallum, 2004)
Observed Variables:
Words per message
Authors per message
Recipients per message
Sample for each word
a recipient-author pair AND
a topic conditioned on the recipient-author pair's topic distribution θ^(A,R)
Learn 2 corpus-level variables:
Author-recipient-pair distribution for each topic
Word-distribution for each topic
2 count matrices:
Pair-topic
Word-topic
[Figure labels: R, x, P(z | x, a_d, θ^(A,R)), P(w | z, φ^(z))]
Gibbs Sampling ART-Model
Random Start:
Sample author-recipient pair for each word
Sample topic for each word
Compute for each word w_i:
Number of recipients of the message to which word w_i belongs
Number of times topic t was assigned to an author-recipient pair
Number of times current word token was assigned to topic t
Number of times all other topics were assigned to an author-recipient pair
Number of times all other words were assigned to topic t
Number of words * beta
Word-topic assignments are drawn from a document’s topic distribution θ which is restricted to the topic distribution Λ of the labels observed in d. Topic distribution of a label l is the same as topic distribution of all documents containing label l.
The document's labels Λ are first generated using a Bernoulli coin toss for each topic k with a labeling prior φ.
Constraining the topic model to use only those topics that correspond to a document’s (observed) label set.
Topic assignments are limited to the document’s labels
One-to-one correspondence between LDA’s latent topics and user tags/labels
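The restriction of topic assignments to a document's labels can be sketched as masking and renormalizing the per-topic weights before sampling. This is an illustration only: the function name `restrict_to_labels` is hypothetical, and it assumes that topic index k stands for label k (the one-to-one correspondence) and that at least one labelled topic has non-zero weight.

```python
def restrict_to_labels(topic_weights, label_topics):
    """Zero out all topics outside the document's label set and renormalize.

    topic_weights: unnormalized weights, one per topic
    label_topics:  set of topic indices that correspond to the doc's labels
    """
    masked = [w if t in label_topics else 0.0
              for t, w in enumerate(topic_weights)]
    total = sum(masked)
    return [m / total for m in masked]

# A document tagged with labels 0 and 2 can never be assigned topic 1.
restricted = restrict_to_labels([0.2, 0.3, 0.5], label_topics={0, 2})
```

A sampler that draws topics from `restricted` instead of the full weights only ever uses topics from the observed label set.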
Discovery of groups is guided by the emerging topics
Discovery of topics is guided by the emerging groups
GT model is an extension of the blockstructure model
Group membership is conditioned on a latent variable associated with the attributes of the relation (i.e., the words)
The latent variable represents the topics which have generated the words.
GT model discovers topics relevant to relationships between entities in the social network
For each event (an interaction between entities), pick the topic t of the event and then generate all the words describing the event according to the topic's word-distribution φ
For each entity s which interacts within this event, the group assignment g is chosen from a multinomial (discrete) distribution θ over groups that is specific to the topic t.
For each event we have a matrix V which stores whether groups of 2 entities behaved the same or not during an event.
Number of events (= interactions between entities)
Number of entities
To generate email e d a community c d is chosen uniformly at random
Based on the community c_d, the author a_d and the set of recipients ρ_d are chosen
To generate every word w (d,i) in that email, a recipient r (d,i) is chosen uniformly at random from the set of recipients ρ d
Based on the community c d , author a d and recipient r (d,i), a topic z (d,i) is chosen
The word w (d,i) itself is chosen based on the topic z (d,i)
Gibbs-sampling: alternates between updating latent communities c conditioned on other variables, and updating recipient-topic tuples (r, z) for each word conditioned on other variables.
Topics of a citing document are a “weighted sum” of documents it cites.
The weights of the terms capture the notion of the influence
Generative process
For each word of the citing publication d a cited publication c’ is picked from the set of all cited publications γ .
For each word in the citing publication d a topic is picked according to the current topic distribution which is a mix of the topic distribution of the assigned cited documents c’ .
D contains only nodes with outgoing citation links (the citing publications)
C contains nodes with incoming links (the cited publications). Documents in the original citation graph with incoming and outgoing links are represented as two nodes
Problem: bidirectional interdependence of links and topics caused by the topical atmosphere
Publications originated in one research area (such as Gibbs sampling, which originated in physics) will also be associated with topics they are often cited by (such as machine learning).
Problem: the Copycat Model enforces each word in a citing publication to be associated with a cited publication; this introduces noise
A citing publication may choose to draw a word's topic from the topic mixture θ_c of a cited publication (the topical atmosphere) or from its own topic mixture ψ_d.
The choice is modeled by a flip of an unfair coin s. The parameter λ of the coin is learned by the model, given an asymmetric beta prior, which prefers the topic mixture θ of a cited publication.
The parameter λ yields an estimate for how well a publication fits to all its citations
ψ: innovation topic mixture of a citing publication
γ: distribution of citation influences
λ: parameter of the coin flip, choosing to draw topics from θ or ψ
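The coin flip between the cited publications' topic mixtures and the citing publication's own mixture can be sketched as follows. The function name, the string labels "cited" and "innovation", and the reading of λ as the probability of drawing from a cited publication are assumptions made for illustration; the slides do not fix these details.

```python
import random

def sample_word_topic(theta_cited, psi_d, gamma_d, lam, rng):
    """Sample the topic of one word in a citing publication d.

    theta_cited: theta_cited[c] is the topic mixture of cited publication c
    psi_d:       innovation topic mixture of the citing publication
    gamma_d:     distribution of citation influences over the cited publications
    lam:         assumed probability that the coin picks a cited publication
    """
    if rng.random() < lam:
        # Pick a cited publication by its influence, then a topic from its mixture.
        c = rng.choices(range(len(theta_cited)), weights=gamma_d)[0]
        t = rng.choices(range(len(theta_cited[c])), weights=theta_cited[c])[0]
        return "cited", c, t
    # Otherwise draw the topic from the publication's own mixture.
    t = rng.choices(range(len(psi_d)), weights=psi_d)[0]
    return "innovation", None, t

rng = random.Random(0)
toy_theta_cited = [[0.9, 0.1], [0.2, 0.8]]
always_cited = [sample_word_topic(toy_theta_cited, [0.5, 0.5], [0.7, 0.3], 1.0, rng)
                for _ in range(50)]
never_cited = [sample_word_topic(toy_theta_cited, [0.5, 0.5], [0.7, 0.3], 0.0, rng)
               for _ in range(50)]
```

Under this reading, a λ close to 1 means almost all topics are inherited from the citations, while a λ close to 0 means the publication mostly draws on its own innovation mixture.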
David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003).
Dietz, L., Bickel, S. and Scheffer, T. (2007). Unsupervised prediction of citation influences. Proc. ICML, 2007.
Thomas Hofmann, Probabilistic Latent Semantic Analysis, Proc. of Uncertainty in Artificial Intelligence, UAI'99, (1999).
Thomas L. Griffiths, Joshua B. Tenenbaum, Mark Steyvers, Topics in semantic representation, (2007).
Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, Mark Steyvers, Learning author-topic models from text corpora, (2010).
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers and Padhraic Smyth, The author-topic model for authors and documents, In Proceedings of the 20th conference on Uncertainty in artificial intelligence (2004).
Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang, The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email, Tech-Report, (2004).
Nishith Pathak, Colin Delong, Arindam Banerjee, Kendrick Erickson, Social Topic Models for Community Extraction, In The 2nd SNA-KDD Workshop ’08, (2008).
Steyvers and Griffiths, Probabilistic Topic Models, (2006).
Ramage, Daniel and Hall, David and Nallapati, Ramesh and Manning, Christopher D., Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora, EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (2009)
Xuerui Wang, Natasha Mohanty, Andrew McCallum, Group and topic discovery from relations and text, (2005).
Hanna M. Wallach, David Mimno and Andrew McCallum, Rethinking LDA: Why Priors Matter (2009)
Thanks