Topic Models  Claudia Wagner Graz, 16.9.2010
Semantic Representation of Text (Griffiths, 2007)
a) Network Model (nodes and edges)
b) Space Model (points and proximity)
c) Probabilistic Models (words belong to a set of probabilistic topics)
Topic Models = probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts (Blei, 2003)
Aim: discover patterns of word use and connect documents that exhibit similar patterns
Idea: documents are mixtures of topics, and a topic is a probability distribution over words
Topic Models (figure source: http://www.cs.umass.edu/~wallach/talks/priors.pdf)
Topic Models (Steyvers, 2006)
3 latent variables:
Word distribution per topic (word-topic matrix)
Topic distribution per doc (topic-doc matrix)
Topic-word assignment
(Figure: example documents with words assigned to Topic 1 and Topic 2.)
Summary
Observed variables: word distribution per document
3 latent variables:
Topic distribution per document: P(z) = θ(d)
Word distribution per topic: P(w | z) = φ(z)
Word-topic assignment: P(z | w)
Training: learn the latent variables on a training collection of documents
Test: predict the topic distribution θ(d) of an unseen document d
Topic Models
pLSA (Hofmann, 1999)
LDA (Blei, 2003)
Author Model (McCallum, 1999)
Author-Topic Model (Rosen-Zvi, 2004)
Author-Recipient-Topic Model (McCallum, 2004)
Group-Topic Model (Wang, 2005)
Community-Author-Recipient-Topic Model (Pathak, 2008)
Semi-Supervised Topic Models
Labeled LDA (Ramage, 2009)
pLSA (Hofmann, 1999)
Problem: not a proper generative model for new documents!
Why? Because we do not learn any corpus-level parameter => we learn a separate topic distribution for each doc of the training set.
(Plate diagram: P(z | θ) and P(w | z); θ = topic distribution of a document; plates over the number of documents and the number of words.)
Latent Dirichlet Allocation (LDA) (Blei, 2003)
Advantage: we learn the topic distribution of a corpus => we can predict the topic distribution of an unseen document of this corpus by observing its words.
Hyper-parameters α and β are corpus-level parameters => they are only sampled once.
(Plate diagram: P(φ(z) | β) and P(w | z, φ(z)); plates over the number of documents and the number of words.)
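To make the generative view concrete, here is a minimal Python/numpy sketch of LDA's generative process as described above; the function name, corpus sizes and hyper-parameter values are illustrative assumptions, not values from the slides.

```python
import numpy as np

def generate_corpus_lda(D, T, V, n_words, alpha, beta, seed=0):
    """Illustrative LDA generative process (Blei, 2003)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=T)      # corpus-level: word distribution per topic
    docs = []
    for _ in range(D):
        theta = rng.dirichlet([alpha] * T)       # per-document topic distribution
        words = []
        for _ in range(n_words):
            z = rng.choice(T, p=theta)           # sample a topic for this word position
            w = rng.choice(V, p=phi[z])          # sample a word from that topic
            words.append(w)
        docs.append(words)
    return docs, phi

docs, phi = generate_corpus_lda(D=3, T=2, V=5, n_words=8, alpha=0.5, beta=0.1)
```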
Dirichlet Prior α
α is a prior on the topic distribution of documents (of a corpus)
α is a corpus-level parameter (chosen once)
α is a force on the topic combinations
Amount of smoothing determined by α
Higher α => more smoothing => less "distinct" topics
Low α => the pressure is to pick for each document a topic distribution favoring just a few topics
Recommended value: α = 50/T (or less if T is very small)
High α: each doc's topic distribution θ is a smooth mix of all topics, e.g. topic distribution of Doc1 = (1/3, 1/3, 1/3)
Low α: each doc's topic distribution θ must favor few topics, e.g. topic distribution of Doc2 = (1, 0, 0)
Dirichlet Prior β
β is a prior on the word distribution of topics
β is a corpus-level parameter (chosen once)
β is a force on the word combinations
Amount of smoothing determined by β
Higher β => more smoothing
Low β => the pressure is to pick for each topic a word distribution favoring just a few words
Recommended value: β = 0.01
High β: each topic's word distribution is a smooth mix over the vocabulary, e.g. word distribution of Topic1 = (1/3, 1/3, 1/3)
Low β: each topic's word distribution must favor few words, e.g. word distribution of Topic2 = (1, 0, 0)
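A small numpy sketch illustrating how the Dirichlet concentration parameters behave; the specific values (10.0, 0.1, ...) are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3  # number of topics

# High alpha: per-document topic distributions are smooth mixes of all topics
print(rng.dirichlet([10.0] * T, size=3))   # rows close to (1/3, 1/3, 1/3)

# Low alpha: per-document topic distributions favor just a few topics
print(rng.dirichlet([0.1] * T, size=3))    # rows close to (1, 0, 0), (0, 1, 0), ...

# The same mechanism applies to beta, only over the vocabulary instead of the topics
V = 5
print(rng.dirichlet([0.1] * V, size=2))    # small beta => sparse word distributions per topic
```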
Matrix Representation of LDA
(Figure: the observed word-document matrix is decomposed into the latent word-topic matrix φ(z) and the latent topic-document matrix θ(d).)
Statistical Inference and Parameter Estimation
Key problem: compute the posterior distribution of the hidden variables given a document, i.e. P(θ, z | w, α, β) = P(θ, z, w | α, β) / P(w | α, β)  (latent vars conditioned on observed vars and priors)
This posterior distribution is intractable for exact inference (Blei, 2003)
Statistical Inference and Parameter Estimation
How can we estimate the posterior distribution of the hidden variables given a corpus of training documents?
Directly (e.g. via expectation maximization, variational inference or expectation propagation algorithms)
Indirectly => i.e. estimate the posterior distribution over z (i.e. P(z))
Gibbs sampling, a form of Markov chain Monte Carlo, is often used to estimate the posterior probability over a high-dimensional random variable z
Markov Chain Example
Random variable X refers to the weather; X_t is the value of X at time point t
State space of X = {sunny, rain}
Transition probability matrix:
P(sunny|sunny) = 0.9, P(rain|sunny) = 0.1
P(sunny|rain) = 0.5, P(rain|rain) = 0.5
(Note: each row of the transition matrix must sum to 1.)
Today is sunny. What will the weather be tomorrow? The day after tomorrow?
source: http://en.wikipedia.org/wiki/Examples_of_Markov_chains
Markov Chain Example
With an increasing number of days n, the predictions for the weather tend towards a "steady state vector" q.
q is independent of the initial conditions; it must be unchanged when transformed by P.
This makes q an eigenvector of P (with eigenvalue 1), which means it can be derived from P.
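The same example in a few lines of numpy (a sketch, using the transition matrix above; the steady state comes out as roughly 5/6 sunny, 1/6 rain):

```python
import numpy as np

# Transition matrix, rows = today's weather, columns = tomorrow's weather, order [sunny, rain]
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

x = np.array([1.0, 0.0])                  # today is sunny
print(x @ P)                              # tomorrow: [0.9, 0.1]
print(x @ np.linalg.matrix_power(P, 2))   # day after tomorrow: [0.86, 0.14]

# Steady state q: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
q = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
q = q / q.sum()
print(q)                                  # approx. [0.833, 0.167]
```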
Gibbs Sampling
generates a sequence of samples from the joint probability distribution of two or more random variables.
Aim: compute the posterior distribution over the latent variable z
Prerequisite: we must know the conditional probability of z: P(z_i = j | z_-i, w_i, d_i, ·)
Why do we need to estimate P(z|w) via a random walk?
z is a high-dimensional random variable: if the number of topics T = 50 and the number of word tokens is 1000, there are 50^1000 possible topic assignments z, and we cannot visit all of them to compute P(z).
Gibbs Sampling for LDA
Random start
Iterative: for each word we compute
How dominant is topic z in doc d? => How often was topic z already used in doc d?
How likely is a word for topic z? => How often was word w already assigned to topic z?
Run Gibbs Sampling Example (1)
Random topic assignments (topic 1 or 2) for all 24 word tokens => 2 count matrices:
C_WT (words per topic):      topic1  topic2
  money                         3       2
  bank                          3       6
  loan                          2       1
  river                         2       2
  stream                        2       1
C_DT (topics per document):  topic1  topic2
  doc1                          4       4
  doc2                          4       4
  doc3                          4       4
Gibbs Sampling for LDA
Probability that topic j is chosen for word w_i, conditioned on all other assigned topics of words in this doc and all other observed vars.
Count the number of times word token w_i was assigned to topic j across all docs (C_WT).
Count the number of times topic j was already assigned to some word token in doc d_i (C_DT).
The result is unnormalized => divide the probability of assigning topic j to word w_i by the sum over all topics T.
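Written out, the update the slide describes is the standard collapsed Gibbs sampling formula (cf. Steyvers, 2006), reconstructed here since the original slide renders it as an image; the counts exclude the current assignment of token i:

```latex
P(z_i = j \mid z_{-i}, w_i, d_i, \cdot) \;\propto\;
\underbrace{\frac{C^{WT}_{w_i j} + \beta}{\sum_{w=1}^{W} C^{WT}_{w j} + W\beta}}_{\text{how likely is word } w_i \text{ for topic } j}
\;\cdot\;
\underbrace{\frac{C^{DT}_{d_i j} + \alpha}{\sum_{t=1}^{T} C^{DT}_{d_i t} + T\alpha}}_{\text{how dominant is topic } j \text{ in doc } d_i}
```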
Run Gibbs Sampling
Start: assign each word token to a random topic
C_WT = count of the number of times word token w_i was assigned to topic j
C_DT = count of the number of times topic j was already assigned to some word token in doc d_i
First Iteration: for each word token, the count matrices C_WT and C_DT are first decremented by one for the entries that correspond to the current topic assignment.
Then, a new topic is sampled from the current topic distribution of the doc, and the count matrices C_WT and C_DT are incremented with the new topic assignment.
Each Gibbs sample consists of the set of topic assignments to all N word tokens in the corpus, achieved by a single pass through all documents.
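A compact collapsed Gibbs sampler for LDA in Python/numpy, following the procedure above (a sketch under simplifying assumptions: symmetric priors, dense count matrices, no burn-in or sample averaging; function and variable names are my own):

```python
import numpy as np

def gibbs_lda(docs, W, T, alpha=None, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA using the count matrices of the
    slides: C_WT (words x topics) and C_DT (docs x topics).
    docs: list of documents, each a list of word ids in [0, W)."""
    rng = np.random.default_rng(seed)
    alpha = 50.0 / T if alpha is None else alpha
    D = len(docs)
    C_WT = np.zeros((W, T))
    C_DT = np.zeros((D, T))
    z = []

    # Start: assign each word token to a random topic and fill the count matrices
    for d, doc in enumerate(docs):
        z_d = rng.integers(T, size=len(doc))
        z.append(z_d)
        for w, t in zip(doc, z_d):
            C_WT[w, t] += 1
            C_DT[d, t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                # Decrement the counts of the current assignment
                C_WT[w, t_old] -= 1
                C_DT[d, t_old] -= 1
                # Unnormalized conditional P(z_i = j | ...); the document-side
                # denominator is constant over topics and cancels when normalizing
                p = ((C_WT[w, :] + beta) / (C_WT.sum(axis=0) + W * beta)
                     * (C_DT[d, :] + alpha))
                p /= p.sum()
                t_new = rng.choice(T, p=p)
                # Increment the counts of the new assignment
                C_WT[w, t_new] += 1
                C_DT[d, t_new] += 1
                z[d][i] = t_new

    # Point estimates of the latent distributions (see the parameter estimation slide below)
    phi = (C_WT + beta) / (C_WT.sum(axis=0) + W * beta)                     # W x T
    theta = (C_DT + alpha) / (C_DT.sum(axis=1, keepdims=True) + T * alpha)  # D x T
    return phi, theta, z
```

For example, gibbs_lda(docs=[[0, 1, 1, 2], [2, 3, 3, 4]], W=5, T=2) reproduces the kind of toy setup used in the running example (a 5-word vocabulary and 2 topics).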
Run Gibbs Sampling Example (2)
First Iteration, first word token ("money" in doc1, currently assigned to topic 1):
Decrement C_DT and C_WT for the current topic j:
  C_WT: money/topic1: 3 -> 2
  C_DT: doc1/topic1: 4 -> 3
Then sample a new topic from the current topic distribution of the doc.
Run Gibbs Sampling Example (2, continued)
The new topic sampled for this token is topic 2 => increment C_WT and C_DT for the new assignment:
C_WT (words per topic):      topic1  topic2
  money                         2       3
  bank                          3       6
  loan                          2       1
  river                         2       2
  stream                        2       1
C_DT (topics per document):  topic1  topic2
  doc1                          3       5
  doc2                          4       4
  doc3                          4       4
Run Gibbs Sampling Example (3)
With α = 50/T = 25 (T = 2 topics) and β = 0.01, the conditional probability is evaluated for both topics; in this example the word "bank" is assigned to Topic 2.
(Figure: the document factor compares how often topic j was used in doc d_i against how often all other topics were used in doc d_i.)
Summary: Run Gibbs Sampling
Gibbs sampling is used to estimate the topic assignment for each word of each doc.
Factors affecting topic assignments:
How likely is word w for topic j? => probability of word w under topic j
How dominant is topic j in doc d? => probability of topic j under the current topic distribution of document d
Once many tokens of a word have been assigned to topic j (across documents), the probability of assigning any particular token of that word to topic j increases => all other topics become less likely for word w (Explaining Away).
Once a topic j has been used multiple times in one document, the probability that any word from that document will be assigned to topic j increases => all other topics become less likely within that document (Explaining Away).
Gibbs Sampling Convergence
(Figure: word-topic assignments shown as pixels, black = topic 1, white = topic 2, from the random start through N iterations.)
Each iteration updates the count matrices.
Convergence: the count matrices stop changing and the Gibbs samples start to approximate the target distribution (i.e., the posterior distribution over z).
Gibbs Sampling Convergence
Ignore some number of samples at the beginning (burn-in period).
Consider only every n-th sample when averaging values to compute an expectation.
Why? Successive Gibbs samples are not independent => they form a Markov chain with some amount of correlation.
The stationary distribution of the Markov chain is the desired joint distribution over the latent variables, but it may take a while for that stationary distribution to be reached.
Techniques that may reduce autocorrelation between several latent variables are simulated annealing, collapsed Gibbs sampling and blocked Gibbs sampling.
Gibbs Sampling Parameter Estimation
Gibbs sampling estimates the posterior distribution of z, but we also need the word distribution φ of each topic and the topic distribution θ of each document.
φ: the number of times word w_i was assigned to topic j, relative to the number of times all other words were assigned to topic j => the predictive distribution of sampling a new token of word i from topic j.
θ: the number of times topic j was assigned in doc d, relative to the number of times all other topics were assigned in doc d => the predictive distribution of sampling a new token in document d from topic j.
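The corresponding point estimates, written in the count-matrix notation of these slides (cf. Steyvers, 2006); reconstructed here because the original slide shows them as an image:

```latex
\hat{\varphi}^{(j)}_{w} \;=\; \frac{C^{WT}_{wj} + \beta}{\sum_{w'=1}^{W} C^{WT}_{w'j} + W\beta}
\qquad\qquad
\hat{\theta}^{(d)}_{j} \;=\; \frac{C^{DT}_{dj} + \alpha}{\sum_{t=1}^{T} C^{DT}_{dt} + T\alpha}
```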
Author-Topic (AT) Model (Rosen-Zvi, 2004)
Aim: discover patterns of word use and connect authors that exhibit similar patterns
Idea/Intuition: words in a multi-author paper are assumed to be the result of a mixture of the authors' topic mixtures
Each author == distribution over topics
Each topic == distribution over words
Each document with multiple authors == distribution over topics that is a mixture of the distributions associated with the authors
AT-Model Algorithm
Sample author: for each doc d and each word w of that doc, an author x is sampled from the doc's author distribution/set a_d.
Sample topic: for each doc d and each word w of that doc, a topic z is sampled from the topic distribution θ(x) of the author x assigned to that word: P(z | x, θ(x)).
Sample word: from the word distribution φ(z) of the sampled topic z, a word w is sampled: P(w | z, φ(z)).
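A minimal Python sketch of this generative process (the shapes, hyper-parameter values and the uniform author choice within a_d are illustrative assumptions):

```python
import numpy as np

def generate_doc_at(authors_of_doc, theta, phi, n_words, rng):
    """Illustrative Author-Topic generative process: per word, sample an
    author x from a_d, a topic z from theta(x), and a word w from phi(z)."""
    words, topics, authors = [], [], []
    for _ in range(n_words):
        x = rng.choice(authors_of_doc)               # sample author from a_d
        z = rng.choice(theta.shape[1], p=theta[x])   # sample topic from theta(x)
        w = rng.choice(phi.shape[1], p=phi[z])       # sample word from phi(z)
        words.append(w); topics.append(z); authors.append(x)
    return words, topics, authors

# Toy setup: 3 authors, 2 topics, 5 vocabulary words
rng = np.random.default_rng(0)
A, T, V = 3, 2, 5
theta = rng.dirichlet([0.5] * T, size=A)   # per-author topic distributions
phi = rng.dirichlet([0.1] * V, size=T)     # per-topic word distributions
print(generate_doc_at(authors_of_doc=[0, 2], theta=theta, phi=phi, n_words=8, rng=rng))
```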
AT Model Latent Variables
1) Author and topic assignment for each word
2) Topic distribution of each author => determines which topics are used by which authors => count matrix C_AT
3) Word distribution of each topic => count matrix C_WT
Matrix Representation of the Author-Topic Model (source: http://www.ics.uci.edu/~smyth/kddpapers/UCI_KD-D_author_topic_preprint.pdf)
(Figure: the observed word-document matrix and the observed author-document matrix a_d are decomposed into the latent topic-author matrix θ(x) and the latent word-topic matrix φ(z).)
Example (1)
Random topic and author assignments for all word tokens => 2 count matrices:
C_WT (words per topic):      topic1  topic2
  money                         3       2
  bank                          3       6
  loan                          2       1
  river                         2       2
  stream                        2       1
C_AT (authors per topic):    topic1  topic2
  author1                       4       0
  author2                       8       8
  author3                       0       4
Gibbs Sampling for the Author-Topic Model
Estimate the posterior distribution of 2 random variables: z and x.
For each word, we draw an author x_i and a topic z_i (or the pair (z_i, x_i) as a block) conditioned on all other variables.
Blocked Gibbs sampling improves convergence of the Gibbs sampler when the variables are highly dependent.
Count the number of times author k was already assigned to topic j (C_AT).
Count the number of times word token w_i was assigned to topic j across all docs (C_WT).
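In the count-matrix notation of these slides, the blocked update for a word token w_i = m takes the form given in (Rosen-Zvi, 2004), reconstructed here; V is the vocabulary size, counts exclude token i, and the author x_i = k ranges over the authors in a_d:

```latex
P(z_i = j, x_i = k \mid w_i = m, z_{-i}, x_{-i}, a_d, \cdot) \;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\;\cdot\;
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```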
Problems of the AT Model
The AT model learns an author's topic distribution for a document corpus.
But we don't learn the topic distribution of documents.
The AT model cannot model idiosyncratic aspects of a document.
AT Model with Fictitious Authors
Add one fictitious author for each document: a_d + 1 authors per document.
Uniform or non-uniform distribution over authors (including the fictitious author).
Each word is either sampled from a real author's or the fictitious author's topic distribution.
I.e., we learn topic distributions for real authors and for the fictitious "authors" (= documents).
Problem reported in (Hong, 2010): the topic distribution of each Twitter message learnt via the AT model was worse than LDA with the USER schema => messages are sparse and not all words of one message are used to learn the document's topic distribution.
Predictive Power of Different Models (Rosen-Zvi, 2005)
Experiment:
Training data: 1,557 papers
Test data: 183 papers (102 are single-authored papers)
They chose the test documents in such a way that each author of a test-set document also appears in the training set as an author.
Author-Recipient-Topic (ART) Model (McCallum, 2004)
Observed variables: words per message, authors per message, recipients per message.
Sample for each word an author-recipient pair AND a topic conditioned on the author-recipient pair's topic distribution θ(A,R): P(z | x, a_d, θ(A,R)), P(w | z, φ(z)).
Learn 2 corpus-level variables: topic distribution for each author-recipient pair, and word distribution for each topic.
2 count matrices: pair-topic and word-topic.
Gibbs Sampling for the ART Model
Random start: sample an author-recipient pair for each word, sample a topic for each word.
Compute for each word w_i, using:
the number of recipients of the message to which word w_i belongs,
the number of times topic t was assigned to the author-recipient pair (and the number of times all other topics were assigned to that pair),
the number of times the current word token was assigned to topic t (and the number of times all other words were assigned to topic t, plus the number of words times β).
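Assembling the annotated pieces above into one formula gives an update of the following form (a reconstruction in the count-matrix style of these slides, not copied verbatim from the paper; R_{d_i} is the number of recipients of the message containing w_i, C^{PT} is the pair-topic count matrix, C^{WT} the word-topic count matrix, and counts exclude token i):

```latex
P(z_i = t, x_i = r \mid w_i, z_{-i}, x_{-i}, \cdot) \;\propto\;
\frac{1}{R_{d_i}}
\;\cdot\;
\frac{C^{PT}_{(a_{d_i}, r)\, t} + \alpha}{\sum_{t'} C^{PT}_{(a_{d_i}, r)\, t'} + T\alpha}
\;\cdot\;
\frac{C^{WT}_{w_i t} + \beta}{\sum_{w'} C^{WT}_{w' t} + W\beta}
```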
Labeled LDA (Ramage, 2009)
Word-topic assignments are drawn from a document's topic distribution θ, which is restricted to the topics Λ of the labels observed in d.
The topic distribution of a label l is the same as the topic distribution of all documents containing label l.
The document's labels Λ are first generated using a Bernoulli coin toss for each topic k with a labeling prior φ.
Constraining the topic model to use only those topics that correspond to a document's (observed) label set => topic assignments are limited to the document's labels.
One-to-one correspondence between LDA's latent topics and user tags/labels.
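A sketch of the only change this makes to the collapsed Gibbs step from the LDA sketch above: candidate topics are restricted to the document's observed label set (the names and the mapping from labels to topic ids are assumptions for illustration):

```python
import numpy as np

def sample_topic_labeled(w, d, label_topics, C_WT, C_DT, alpha, beta, rng):
    """One Labeled-LDA-style topic draw: same conditional as plain LDA,
    but restricted to the topics corresponding to document d's labels."""
    W = C_WT.shape[0]
    allowed = np.asarray(label_topics[d])        # topic ids allowed by d's label set
    p = ((C_WT[w, allowed] + beta) / (C_WT[:, allowed].sum(axis=0) + W * beta)
         * (C_DT[d, allowed] + alpha))
    p /= p.sum()
    return int(allowed[rng.choice(len(allowed), p=p)])
```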
Group-Topic Model (Wang, 2005)
Discovery of groups is guided by the emerging topics; discovery of topics is guided by the emerging groups.
The GT model is an extension of the blockstructures model => group membership is conditioned on a latent variable associated with the attributes of the relation (i.e., the words) => the latent variable represents the topics which have generated the words.
The GT model discovers topics relevant to relationships between entities in the social network.
Group-Topic Model (Wang, 2005)
Generative process for each event (an interaction between entities):
Pick the topic t of the event and then generate all the words describing the event according to the topic's word distribution φ.
For each entity s that interacts within this event, the group assignment g is chosen conditionally from a particular multinomial (discrete) distribution θ over groups for topic t.
For each event we have a matrix V which stores whether pairs of entities behaved the same or not during the event.
(Plate diagram: plates over the number of events (= interactions between entities) and the number of entities.)
CART Model (Pathak, 2008)
Generative process:
To generate email e_d, a community c_d is chosen uniformly at random.
Based on the community c_d, the author a_d and the set of recipients ρ_d are chosen.
To generate every word w_(d,i) in that email, a recipient r_(d,i) is chosen uniformly at random from the set of recipients ρ_d.
Based on the community c_d, author a_d and recipient r_(d,i), a topic z_(d,i) is chosen.
The word w_(d,i) itself is chosen based on the topic z_(d,i).
Gibbs sampling alternates between updating latent communities c conditioned on the other variables, and updating recipient-topic tuples (r, z) for each word conditioned on the other variables.
Copycat Model (Dietz, 2007)
The topics of a citing document are a "weighted sum" of the topics of the documents it cites; the weights capture the notion of influence.
Generative process:
For each word of the citing publication d, a cited publication c' is picked from the set of all cited publications γ.
For each word in the citing publication d, a topic is picked according to the current topic distribution, which is a mix of the topic distributions of the assigned cited documents c'.
Copycat Model (Dietz, 2007)
Example: a publication c is cited by two publications d1 and d2.
The topic mixture of c is estimated not only from the words in the cited publication c, but also from the words in d1 and d2 that are associated with c. This way, the topic mixture of c is influenced by the citing publications d1 and d2!
The topic distribution of the cited document c in turn influences the association of words in d1 and d2 with c.
All tokens that are associated with a cited publication are called the topical atmosphere of that cited publication.
(Figure: d1 and d2 cite c.)
Copycat Model (Dietz, 2007)
Bipartite citation graph with 2 disjoint node sets D and C:
D contains only nodes with outgoing citation links (the citing publications).
C contains nodes with incoming links (the cited publications).
Documents in the original citation graph with both incoming and outgoing links are represented as two nodes.
Copycat Model (Dietz, 2007)
Problem: bidirectional interdependence of links and topics caused by the topical atmosphere.
Publications that originated in one research area (such as Gibbs sampling, which originated in physics) will also be associated with the topics they are often cited by (such as machine learning).
Problem: the model forces each word in a citing publication to be associated with a cited publication => noise.
Citation Influence Model (Dietz, 2007)
The Copycat Model forces each word in a citing publication to be associated with a cited publication => this introduces noise.
A citing publication may instead choose to draw a word's topic from the topic mixture θ_c of a cited publication (the topical atmosphere) or from its own innovation topic mixture ψ_d.
The choice is modeled by a flip of an unfair coin s. The parameter λ of the coin is learned by the model, given an asymmetric beta prior which prefers the topic mixture θ of a cited publication.
The parameter λ yields an estimate for how well a publication fits all its citations.
(Figure legend: ψ = innovation topic mixture of a citing publication, γ = distribution of citation influences, λ = parameter of the coin flip choosing whether to draw topics from θ or ψ.)
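A minimal sketch of this word-level choice in Python (an illustration, not the paper's implementation: here the cited publication is picked uniformly, whereas the model draws it from the document's citation-influence distribution γ_d; all names are assumptions):

```python
import numpy as np

def draw_word(d, cited_of, theta_cited, psi_citing, lam, phi, rng):
    """Citation-influence-style word generation: flip the unfair coin s
    with parameter lambda_d; on 'cite', take the topic from a cited
    publication's mixture theta_c (the topical atmosphere), otherwise
    from the citing publication's own innovation mixture psi_d."""
    if rng.random() < lam[d]:                           # coin s = cite
        c = rng.choice(cited_of[d])                     # simplification: uniform over citations
        z = rng.choice(phi.shape[0], p=theta_cited[c])  # topic from theta_c
    else:                                               # coin s = innovate
        z = rng.choice(phi.shape[0], p=psi_citing[d])   # topic from psi_d
    return rng.choice(phi.shape[1], p=phi[z])           # word from phi(z)

# Toy setup: 2 cited publications, 1 citing publication, 3 topics, 6 words
rng = np.random.default_rng(0)
T, V = 3, 6
phi = rng.dirichlet([0.1] * V, size=T)
theta_cited = rng.dirichlet([0.5] * T, size=2)
psi_citing = rng.dirichlet([0.5] * T, size=1)
print(draw_word(0, cited_of={0: [0, 1]}, theta_cited=theta_cited,
                psi_citing=psi_citing, lam=[0.8], phi=phi, rng=rng))
```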
References
David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003).
Laura Dietz, Steffen Bickel, Tobias Scheffer: Unsupervised prediction of citation influences. Proc. ICML (2007).
Thomas Hofmann: Probabilistic Latent Semantic Analysis. Proc. of Uncertainty in Artificial Intelligence, UAI'99 (1999).
Thomas L. Griffiths, Joshua B. Tenenbaum, Mark Steyvers: Topics in semantic representation (2007).
Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, Mark Steyvers: Learning author-topic models from text corpora (2010).
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, Padhraic Smyth: The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (2004).
Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang: The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Tech report (2004).
Nishith Pathak, Colin DeLong, Arindam Banerjee, Kendrick Erickson: Social Topic Models for Community Extraction. In The 2nd SNA-KDD Workshop '08 (2008).
Mark Steyvers, Thomas Griffiths: Probabilistic Topic Models (2006).
Daniel Ramage, David Hall, Ramesh Nallapati, Christopher D. Manning: Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (2009).
Xuerui Wang, Natasha Mohanty, Andrew McCallum: Group and topic discovery from relations and text (2005).
Hanna M. Wallach, David Mimno, Andrew McCallum: Rethinking LDA: Why Priors Matter (2009).
