Supervised and Relational Topic Models


  1. 1. Supervised and Relational Topic Models. David M. Blei, Department of Computer Science, Princeton University. October 5, 2009. Joint work with Jonathan Chang and Jon McAuliffe.
  2. 2. Topic modeling • Large electronic archives of document collections require new statistical tools for analyzing text. • Topic models have emerged as a powerful technique for unsupervised analysis of large document collections. • Topic models posit latent topics in text using hidden random variables, and uncover that structure with posterior inference. • Useful for tasks like browsing, search, information retrieval, etc.
  3. 3. Examples of topic modeling. Top words from five example topics: • contractual, expectation, gain, promises, expectations, breach, enforcing, supra, note, perform • employment, industrial, local, jobs, employees, relations, unfair, agreement, economic, case • female, men, women, see, sexual, note, employer, discrimination, harassment, gender • markets, earnings, investors, sec, research, structure, managers, firm, risk, large • criminal, discretion, justice, civil, process, federal, see, officer, parole, inmates
  4. 4. Examples of topic modeling: [figure: topics fit to a corpus of computer science papers, with representative titles attached to each topic, e.g., a quantum computing/automata topic ("Quantum lower bounds by polynomials", "Dense quantum coding and quantum finite automata"), a scheduling/routing topic ("How bad is selfish routing?", "Nearly optimal algorithms and bounds for multilayer channel routing"), a constraints/databases topic ("On XML integrity constraints in the presence of DTDs", "Closure properties of constraints"), and a proofs/verification topic ("A mechanical proof of the Church-Rosser theorem", "Timed regular expressions")]
  5. 5. Examples of topic modeling: [figure: the top words of a topic tracked by decade from 1880 to 2000, shifting from "electric, machine, power, engine, steam, iron, battery, wire" in the early decades, to "tube, apparatus, glass, air, laboratory, pressure, mercury" mid-century, to "high, power, design, heat, system, devices, materials, current, gate, silicon" by 2000]
  6. 6. Examples of topic modeling: [figure: top words from many topics fit to a scientific corpus, covering areas such as neuroscience (neurons, brain, stimulus, synapses, ltp), cell biology (phosphorylation, p53, cell cycle, proteins), genetics (dna, gene, sequence, genome, mutations), physics (laser, optical, particles, quantum, electrons), astronomy (stars, galaxies, universe, planets), earth science (earthquakes, mantle, volcanic, fossil record), and climate (co2, carbon, ozone, climate change)]
  7. 7. Supervised topic models • These applications of topic modeling work in the same way. • Fit a model using a likelihood criterion. Then, hope that the resulting model is useful for the task at hand. • Supervised topic models and relational topic models fit topics explicitly to perform prediction. • Useful for building topic models that can • Predict the rating of a review • Predict the category of an image • Predict the links emitted from a document
  8. 8. Outline 1 Unsupervised topic models 2 Supervised topic models 3 Relational topic models
  9. 9. Probabilistic modeling 1 Treat data as observations that arise from a generative probabilistic process that includes hidden variables • For documents, the hidden variables reflect the thematic structure of the collection. 2 Infer the hidden structure using posterior inference • What are the topics that describe this collection? 3 Situate new data into the estimated model. • How does this query or new document fit into the estimated topic structure?
  10. 10. Intuition behind LDA Simple intuition: Documents exhibit multiple topics.
  11. 11. Generative model: [figure: topics (e.g., "gene 0.04, dna 0.02, genetic 0.01, ...", "life 0.02, evolve 0.01, organism 0.01, ...", "brain 0.04, neuron 0.02, nerve 0.01, ...", "data 0.02, number 0.02, computer 0.01, ..."), a document, and its topic proportions and per-word topic assignments] • Each document is a random mixture of corpus-wide topics • Each word is drawn from one of those topics
  12. 12. The posterior distribution: [figure: the same topics, topic proportions, and assignments, now hidden] • In reality, we only observe the documents • Our goal is to infer the underlying topic structure
  13. 13. Latent Dirichlet allocation: [graphical model: α → θd → Zd,n → Wd,n ← βk ← η, with plates N (words), D (documents), K (topics)] • α: Dirichlet parameter • θd: per-document topic proportions • Zd,n: per-word topic assignment • Wd,n: observed word • βk: topics • η: topic hyperparameter. Each piece of the structure is a random variable.
  14. 14. Latent Dirichlet allocation α θd Zd,n Wd,n βk η N D K • βk ∼ Dir(η), k = 1, ..., K • θd ∼ Dir(α), d = 1, ..., D • Zd,n | θd ∼ Mult(1, θd), d = 1, ..., D, n = 1, ..., N • Wd,n | θd, Zd,n, β1:K ∼ Mult(1, βZd,n), d = 1, ..., D, n = 1, ..., N
  15. 15. Latent Dirichlet allocation α θd Zd,n Wd,n βk η N D K 1 Draw each topic βk ∼ Dir(η), for k ∈ {1, ..., K}. 2 For each document: (a) draw topic proportions θd ∼ Dir(α); (b) for each word, draw Zd,n ∼ Mult(θd) and then draw Wd,n ∼ Mult(βZd,n).
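To make the generative process concrete, here is a minimal simulation sketch in Python/NumPy. The corpus sizes and hyperparameter values are illustrative choices, not from the slides.

```python
# A minimal sketch of the LDA generative process (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 1000, 5, 20, 100        # vocabulary size, topics, documents, words per document
alpha, eta = 0.1, 0.01               # symmetric Dirichlet hyperparameters

# 1. Draw each topic beta_k ~ Dir(eta) over the vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)          # K x V

docs = []
for d in range(D):
    # 2a. Draw per-document topic proportions theta_d ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N):
        # 2b-i. Draw a topic assignment z_{d,n} ~ Mult(theta_d).
        z = rng.choice(K, p=theta)
        # 2b-ii. Draw the word w_{d,n} ~ Mult(beta_{z_{d,n}}).
        words.append(rng.choice(V, p=beta[z]))
    docs.append(words)
```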
  16. 16. Latent Dirichlet allocation α θd Zd,n Wd,n βk η N D K • From a collection of documents, infer • Per-word topic assignment zd,n • Per-document topic proportions θd • Per-corpus topic distributions βk • Use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, etc.
  17. 17. Latent Dirichlet allocation α θd Zd,n Wd,n βk η N D K • Computing the posterior is intractable: p(θ, z1:N | w1:N, α, β1:K) = [ p(θ | α) ∏_{n=1}^N p(zn | θ) p(wn | zn, β1:K) ] / [ ∫ p(θ | α) ∏_{n=1}^N ∑_{z=1}^K p(z | θ) p(wn | z, β1:K) dθ ] • Several approximation techniques have been developed.
  18. 18. Latent Dirichlet allocation α θd Zd,n Wd,n βk η N D K • Mean field variational methods (Blei et al., 2001, 2003) • Expectation propagation (Minka and Lafferty, 2002) • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002) • Collapsed variational inference (Teh et al., 2006)
  19. 19. Example inference: [figure: bar plot of the inferred topic proportions for an example document, probability (0–0.4) against topic index]
  20. 20. Example topics • “Genetics”: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences • “Evolution”: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common • “Disease”: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis • “Computers”: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
  21. 21. Used in exploratory tools of document collections
  22. 22. LDA summary • LDA is a powerful model for • Visualizing the hidden thematic structure in large corpora • Generalizing new data to fit into that structure • LDA is a mixed membership model (Erosheva, 2004) that builds on the work of Deerwester et al. (1990) and Hofmann (1999). • For document collections and other grouped data, this might be more appropriate than a simple finite mixture. • The same model was independently invented for population genetics analysis (Pritchard et al., 2000).
  23. 23. LDA summary • Modular: It can be embedded in more complicated models. • General: The data generating distribution can be changed. • Variational inference is fast; it lets us analyze large data sets. • See Blei et al., 2003 for details and a quantitative comparison. See my website for code and other papers. • Jonathan Chang’s excellent R package “lda” contains Gibbs sampling code for this model and many others.
  24. 24. Supervised topic models • But LDA is an unsupervised model. How can we build a topic model that is good at the task we care about? • Many data are paired with response variables. • User reviews paired with a number of stars • Web pages paired with a number of “diggs” • Documents paired with links to other documents • Images paired with a category • Supervised topic models are topic models of documents and responses, fit to find topics predictive of the response.
  25. 25. Supervised LDA α θd Zd,n Wd,n βk K N Yd D η, σ² 1 Draw topic proportions θ | α ∼ Dir(α). 2 For each word: draw topic assignment zn | θ ∼ Mult(θ); draw word wn | zn, β1:K ∼ Mult(βzn). 3 Draw response variable y | z1:N, η, σ² ∼ N(ηᵀz̄, σ²), where z̄ = (1/N) ∑_{n=1}^N zn.
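A sketch of the same process with the added response draw, for a single document. The Gaussian response uses coefficients η drawn at random here purely for illustration; all sizes and hyperparameters are assumptions.

```python
# Minimal sketch of the sLDA generative process for one document (Gaussian response).
import numpy as np

rng = np.random.default_rng(1)
V, K, N = 1000, 5, 100
alpha, eta_dir, sigma2 = 0.1, 0.01, 1.0

beta = rng.dirichlet(np.full(V, eta_dir), size=K)    # topics, K x V
eta = rng.normal(size=K)                             # regression coefficients (illustrative)

theta = rng.dirichlet(np.full(K, alpha))             # 1. topic proportions
z = rng.choice(K, size=N, p=theta)                   # 2. per-word topic assignments
w = np.array([rng.choice(V, p=beta[k]) for k in z])  #    and the words themselves

zbar = np.bincount(z, minlength=K) / N               # empirical topic frequencies
y = rng.normal(eta @ zbar, np.sqrt(sigma2))          # 3. response ~ N(eta' zbar, sigma2)
```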
  26. 26. Supervised LDA α θd Zd,n Wd,n βk K N Yd D η, σ² • The response variable y is drawn after the document because it depends on z1:N, an assumption of partial exchangeability. • Consequently, y is necessarily conditioned on the words. • In a sense, this blends generative and discriminative modeling.
  27. 27. Supervised LDA α θd Zd,n Wd,n βk K N Yd D η, σ² • Given a set of document-response pairs, fit the model parameters by maximum likelihood. • Given a new document, compute a prediction of its response. • Both of these activities hinge on variational inference.
  28. 28. Variational inference (in general) • Variational methods are a deterministic alternative to MCMC. • Let x1:M be observations and z1:M be latent variables. Wait, keep the slide's notation: let x1:N be observations and z1:M be latent variables. • Our goal is to compute the posterior distribution p(z1:M | x1:N) = p(z1:M, x1:N) / ∫ p(z1:M, x1:N) dz1:M • For many interesting distributions, the marginal likelihood of the observations is difficult to compute efficiently.
  29. 29. Variational inference • Use Jensen’s inequality to bound the log probability of the observations: log p(x1:N) = log ∫ p(z1:M, x1:N) dz1:M = log ∫ p(z1:M, x1:N) [qν(z1:M) / qν(z1:M)] dz1:M ≥ E_qν[log p(z1:M, x1:N)] − E_qν[log qν(z1:M)] • We have introduced a distribution of the latent variables with free variational parameters ν. • We optimize those parameters to tighten this bound. • This is the same as finding the member of the family qν that is closest in KL divergence to p(z1:M | x1:N).
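The equivalence between tightening this bound and minimizing KL divergence follows from a standard identity (not shown on the slide): the log marginal likelihood decomposes into the bound plus the KL term, and the left-hand side does not depend on ν, so maximizing the bound over ν minimizes the KL divergence.

```latex
\log p(x_{1:N})
  = \underbrace{\mathbb{E}_{q_\nu}\!\left[\log p(z_{1:M}, x_{1:N})\right]
    - \mathbb{E}_{q_\nu}\!\left[\log q_\nu(z_{1:M})\right]}_{\text{variational bound}}
  \;+\; \mathrm{KL}\!\left(q_\nu(z_{1:M}) \,\big\|\, p(z_{1:M} \mid x_{1:N})\right)
```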
  30. 30. Mean-field variational inference • Factorization of qν determines complexity of optimization • In mean-field variational inference, qν is fully factored: qν(z1:M) = ∏_{m=1}^M qνm(zm) • The latent variables are independent. • Each is governed by its own variational parameter νm. • In the true posterior they can exhibit dependence (often, this is what makes exact inference difficult).
  31. 31. MFVI and conditional exponential families • Suppose the distribution of each latent variable, conditional on all other variables, is in the exponential family: p(zm | z−m, x) = hm(zm) exp{ gm(z−m, x)ᵀ zm − am(gm(z−m, x)) } • Assume qν is fully factorized, and each factor is in the same exponential family as the corresponding conditional: qνm(zm) = hm(zm) exp{ νmᵀ zm − am(νm) }
  32. 32. MFVI and conditional exponential families • Variational inference is the following coordinate ascent algorithm: iteratively set νm = E_qν[gm(Z−m, x)] for each m. • Notice the relationship to Gibbs sampling.
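As a concrete instance of this coordinate ascent, here is a sketch of the standard per-document mean-field updates for unsupervised LDA (the update equations follow Blei et al., 2003; the function and variable names are mine, and β is assumed strictly positive).

```python
# Sketch of LDA mean-field coordinate ascent for a single document.
import numpy as np
from scipy.special import digamma

def update_document(words, beta, alpha, n_iters=50):
    """words: list of word indices; beta: K x V topic matrix; alpha: scalar Dirichlet parameter."""
    K = beta.shape[0]
    N = len(words)
    gamma = np.full(K, alpha + N / K)      # variational Dirichlet parameters
    phi = np.full((N, K), 1.0 / K)         # variational multinomial parameters
    for _ in range(n_iters):
        # phi_{n,k} proportional to beta_{k, w_n} * exp(digamma(gamma_k));
        # the digamma(sum(gamma)) term is constant in k and cancels under normalization.
        log_phi = np.log(beta[:, words].T) + digamma(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha + sum_n phi_{n,k}
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```

In practice the updates repeat until γ changes little between iterations; each update conditions on the current values of the other variational parameters, mirroring how Gibbs sampling conditions on the other latent variables.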
  33. 33. Variational inference • Alternative to MCMC; replace sampling with optimization. • Deterministic approximation to posterior distribution. • Uses established optimization methods (block coordinate ascent; Newton-Raphson; interior-point). • Faster, more scalable than MCMC for large problems. • Biased, whereas MCMC is not. • Emerging as a useful framework for fully Bayesian and empirical Bayesian inference problems. Many open issues! • Good papers: Beal’s Ph.D. thesis, Wainwright and Jordan (2009)
  34. 34. Variational inference in sLDA α θd Zd,n Wd,n βk K N Yd D η, σ² • In sLDA the variational bound is E[log p(θ | α)] + ∑_{n=1}^N E[log p(Zn | θ)] + ∑_{n=1}^N E[log p(wn | Zn, β1:K)] + E[log p(y | Z1:N, η, σ²)] + H(q) • As in Blei, Ng, and Jordan (2003), we use the fully-factorized variational distribution q(θ, z1:N | γ, φ1:N) = q(θ | γ) ∏_{n=1}^N q(zn | φn)
  35. 35. Variational inference in sLDA • The distinguishing term is E[log p(y | Z1:N, η, σ²)] = −(1/2) log(2πσ²) − (y² − 2y ηᵀE[Z̄] + ηᵀE[Z̄Z̄ᵀ]η) / (2σ²) • The first expectation is E[Z̄] = φ̄ := (1/N) ∑_{n=1}^N φn. • The second expectation is E[Z̄Z̄ᵀ] = (1/N²) ( ∑_n ∑_{m≠n} φn φmᵀ + ∑_n diag{φn} ). • Linear in φn, which leads to an easy coordinate ascent algorithm.
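These two expectations are simple functions of the variational parameters φ; a small NumPy sketch (the function names are illustrative):

```python
import numpy as np

def expected_zbar(phi):
    """E[Zbar] = (1/N) * sum_n phi_n, with phi an N x K matrix."""
    return phi.mean(axis=0)

def expected_zbar_outer(phi):
    """E[Zbar Zbar^T] = (1/N^2) * (sum_n sum_{m != n} phi_n phi_m^T + sum_n diag(phi_n))."""
    N = phi.shape[0]
    s = phi.sum(axis=0)
    cross = np.outer(s, s) - phi.T @ phi   # all pairs minus the m == n terms
    return (cross + np.diag(s)) / N**2
```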
  36. 36. Maximum likelihood estimation • The M-step is an MLE under expected sufficient statistics. • Define: y = y1:D is the response vector; A is the D × K matrix whose rows are Z̄d. • The MLE of the coefficients solves the expected normal equations E[AᵀA] η = E[A]ᵀ y, i.e., η̂_new ← E[AᵀA]⁻¹ E[A]ᵀ y • The MLE of the variance is σ̂²_new ← (1/D) { yᵀy − yᵀ E[A] E[AᵀA]⁻¹ E[A]ᵀ y }
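Given the per-document expectations, the M-step reduces to solving a K × K linear system. A sketch (not the released implementation), where Ebar stacks the D expected topic frequencies E[Z̄d] and EAA accumulates E[AᵀA]:

```python
import numpy as np

def m_step(Ebar, EAA, y):
    """Expected normal equations for eta and the variance update.
    Ebar: D x K matrix whose d-th row is E[zbar_d]; EAA: K x K matrix E[A^T A]; y: length-D responses."""
    eta = np.linalg.solve(EAA, Ebar.T @ y)                # eta = E[A^T A]^{-1} E[A]^T y
    D = y.shape[0]
    sigma2 = (y @ y - y @ Ebar @ eta) / D                 # (1/D){y'y - y' E[A] E[A^T A]^{-1} E[A]' y}
    return eta, sigma2
```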
  37. 37. Prediction • We have fit SLDA parameters to a corpus, using variational EM. • We have a new document w1:N with unknown response value. • First, run variational inference in the unsupervised LDA model to obtain γ and φ1:N for the new document. (LDA ⇔ integrating the unobserved Y out of SLDA.) • Predict y using the SLDA expected value: E[Y | w1:N, α, β1:K, η, σ²] ≈ ηᵀ Eq[Z̄] = ηᵀ φ̄
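Prediction is then a dot product between the fitted coefficients and the expected empirical topic frequencies from LDA inference on the new document; a minimal sketch:

```python
import numpy as np

def predict_response(phi, eta):
    """Approximate E[Y | w_{1:N}] = eta' E_q[Zbar] = eta' phibar, where phi is the
    N x K variational matrix from LDA inference on the new document and eta the
    fitted sLDA coefficients."""
    phibar = phi.mean(axis=0)
    return float(eta @ phibar)
```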
  38. 38. Example: Movie reviews: [figure: the ten topics of a fitted model laid out along the real line by their coefficients, with the most negative topics containing words like "least, bad, awful, unfortunately, supposed, worse, dull, flat" and the most positive containing words like "both, motion, simple, perfect, fascinating, complex" and "one, however, cinematography, screenplay, performances, effective, picture"] • 10-topic sLDA model on movie reviews (Pang and Lee, 2005). • Response: Number of stars associated with each review. • Each component of coefficient vector η is associated with a topic.
  39. 39. Predictive R² (SLDA is red): [figure: predictive R² as a function of the number of topics, 5–50, comparing two methods]
  40. 40. Held-out likelihood (SLDA is red): [figure: per-word held-out log likelihood as a function of the number of topics, 5–50, ranging roughly from −6.42 to −6.37]
  41. 41. Diverse response types with GLMs • Want to work with response variables that don’t live in the reals: binary / multiclass classification, count data, waiting time. • Model the response with a generalized linear model: p(y | ζ, δ) = h(y, δ) exp{ (ζy − A(ζ)) / δ }, where ζ = ηᵀz̄. • Complicates inference, but allows for flexible modeling.
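For instance, a binary response could use a Bernoulli (logistic) response whose natural parameter is ζ = ηᵀz̄. This sketch shows only the response log probability, with h(y, δ) = 1 and δ = 1 (illustrative choices, not from the slides):

```python
import numpy as np

def bernoulli_response_logprob(y, zbar, eta):
    """log p(y | zbar, eta) for y in {0, 1}, with natural parameter
    zeta = eta' zbar and log-normalizer A(zeta) = log(1 + exp(zeta))."""
    zeta = eta @ zbar
    return y * zeta - np.log1p(np.exp(zeta))
```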
  42. 42. Example: Multi-class classification (sLDA for image classification, with Chong Wang): [figure: example images from the LabelMe and UIUC-Sport datasets with predicted class labels and annotations (e.g., highway: car, sign, road; inside city: buildings, car, sidewalk; tall building: trees, buildings, occluded, window; street: tree, car, sidewalk; plus forest, coast, mountain, open country), and plots of average accuracy over all classes (5 random train/test subsets) as a function of the number of topics, comparing multi-class sLDA and multi-class sLDA with annotations (both our models) against Fei-Fei and Perona (2005) and Bosch et al. (2006)] • A purely generative approach: a large number of topics increases the possibility of overfitting; on the other hand, it provides more latent features for building the classifier. • Image annotation: in the case of multi-class sLDA with annotations, we can use the same trained model for image annotation.
  43. 43. Supervised topic models • SLDA enables model-based regression where the predictor “variable” is a text document. • It can easily be used wherever LDA is used in an unsupervised fashion (e.g., images, genes, music). • SLDA is a supervised dimension-reduction technique, whereas LDA performs unsupervised dimension reduction. • LDA + regression compared to sLDA is like principal components regression compared to partial least squares. • Paper: Blei and McAuliffe, NIPS 2007.
  44. 44. Relational topic models: [figure: a citation network of machine learning papers, with abstract snippets such as “Utilizing prior concepts for learning”, “Irrelevant features and the subset selection problem”, “Learning with many irrelevant features”, “Evaluation and selection of biases in machine learning”, “An evolutionary approach to learning in robots”, “Using a genetic algorithm to learn strategies for collision avoidance and local navigation”, and “Improving tactical plans with genetic algorithms”] • Many data sets contain connected observations. • For example: citation networks of documents; hyperlinked networks of web pages; friend-connected social network profiles.
