Transcript of "kddvince"

  1. Probabilistic Topic Models. David M. Blei, Department of Computer Science, Princeton University. August 22, 2011.
  2. Information overload. As more information becomes available, it becomes more difficult to find and discover what we need. We need new tools to help us organize, search, and understand these vast amounts of information.
  3. Topic modeling. Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
     1. Discover the hidden themes that pervade the collection.
     2. Annotate the documents according to those themes.
     3. Use annotations to organize, summarize, and search the texts.
  4. Discover topics from a corpus
     • “Genetics”: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
     • “Evolution”: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
     • “Disease”: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
     • “Computers”: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
  5. Model the evolution of topics over time. [Figure: two panels, "Theoretical Physics" and "Neuroscience", tracking the terms FORCE, LASER, RELATIVITY, OXYGEN, NERVE, and NEURON over the years 1880 to 2000.]
  6. Model connections between topics. [Figure: a graph of connected topics, each node listing a topic's top words, for example neurons, brain, memory, dna, rna polymerase, galaxies, earthquake, carbon dioxide, and climate change.]
  7. Find hierarchies of topics. [Figure: a hierarchy of topics, each node listing top words (for example quantum, automata, routing, learning, graph, database, logic) alongside example paper titles such as "Quantum lower bounds by polynomials", "How bad is selfish routing?", "Authoritative sources in a hyperlinked environment", and "A mechanical proof of the Church-Rosser theorem".]
  8. Annotate images. [Figure: automatic image annotation. Example images are shown with predicted captions such as "sky water tree mountain people", "birds nest leaves branch tree", "people market pattern textile display", "fish water ocean tree coral", "sky water buildings people mountain", and "scotland water flowers hills tree".]
  9. Discover influential articles. [Figure: weighted influence (0.000 to 0.030) plotted against year (1880 to 2000), with the following articles highlighted.]
     • Derek E. Wildman et al., Implications of Natural Selection in Shaping 99.4% Nonsynonymous DNA Identity between Humans and Chimpanzees: Enlarging Genus Homo, PNAS (2003) [178 citations]
     • Jared M. Diamond, Distributional Ecology of New Guinea Birds, Science (1973) [296 citations]
     • William K. Gregory, The New Anthropogeny: Twenty-Five Stages of Vertebrate Evolution, from Silurian Chordate to Man, Science (1933) [3 citations]
     • W. B. Scott, The Isthmus of Panama in Its Relation to the Animal Life of North and South America, Science (1916) [3 citations]
  10. Predict links between articles. Table 2: Top eight link predictions made by RTM (ψe) and LDA + Regression (italicized) for two documents from Cora. The models were fit with 10 topics. Boldfaced titles indicate actual documents cited by or citing each document. Over the whole corpus, RTM improves precision over LDA + Regression by 80% when evaluated on the first 20 documents retrieved.
     Markov chain Monte Carlo convergence diagnostics: A comparative review Minorization conditions and convergence rates for Markov chain Monte Carlo Rates of convergence of the Hastings and Metropolis algorithms RTM (ψe) Possible biases induced by MCMC convergence diagnostics Bounding convergence time of the Gibbs sampler in Bayesian image restoration Self regenerative Markov chain Monte Carlo Auxiliary variable methods for Markov chain Monte Carlo with applications Rate of Convergence of the Gibbs Sampler by Gaussian Approximation Diagnosing convergence of Markov chain Monte Carlo algorithms Exact Bound for the Convergence of Metropolis Chains LDA + Regression Self regenerative Markov chain Monte Carlo Minorization conditions and convergence rates for Markov chain Monte Carlo Gibbs-markov models Auxiliary variable methods for Markov chain Monte Carlo with applications Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models Mediating instrumental variables A qualitative framework for probabilistic inference Adaptation for Self Regenerative MCMC Competitive environments evolve better solutions for complex tasks Coevolving High Level Representations A Survey of Evolutionary Strategies R
  11. Characterize political decisions. Topics, each shown by its top five terms:
     • tax credit, budget authority, energy, outlays, tax
     • county, eligible, ballot, election, jurisdiction
     • bank, transfer, requires, holding company, industrial
     • housing, mortgage, loan, family, recipient
     • energy, fuel, standard, administrator, lamp
     • student, loan, institution, lender, school
     • medicare, medicaid, child, chip, coverage
     • defense, iraq, transfer, expense, chapter
     • business, administrator, bills, business concern, loan
     • transportation, rail, railroad, passenger, homeland security
     • cover, bills, bridge, transaction, following
     • bills, tax, subparagraph, loss, taxable
     • loss, crop, producer, agriculture, trade
     • head, start, child, technology, award
     • computer, alien, bills, user, collection
     • science, director, technology, mathematics, bills
     • coast guard, vessel, space, administrator, requires
     • child, center, poison, victim, abuse
     • land, site, bills, interior, river
     • energy, bills, price, commodity, market
     • surveillance, director, court, electronic, flood
     • child, fire, attorney, internet, bills
     • drug, pediatric, product, device, medical
     • human, vietnam, united nations, call, people
     • bills, iran, official, company, sudan
     • coin, inspector, designee, automobile, lebanon
     • producer, eligible, crop, farm, subparagraph
     • people, woman, american, nation, school
     • veteran, veterans, bills, care, injury
     • dod, defense, defense and appropriation, military, subtitle
  12. Organize and browse large corpora
  13. This tutorial
     • What are topic models?
     • What kinds of things can they do?
     • How do I compute with a topic model?
     • What are some unanswered questions in this field?
     • How can I learn more?
  14. Related subjects. Topic modeling is a case study in modern machine learning with probabilistic models. It touches on
     • Directed graphical models
     • Conjugate priors and nonconjugate priors
     • Time series modeling
     • Modeling with graphs
     • Hierarchical Bayesian methods
     • Approximate posterior inference (MCMC, variational methods)
     • Exploratory and descriptive data analysis
     • Model selection and Bayesian nonparametric methods
     • Mixed membership models
     • Prediction from sparse and noisy inputs
  15. If you remember one picture... [Diagram with four labeled parts: Assumptions, Data, Inference algorithm, Discovered structure.]
  16. Organization
     • Introduction to topic modeling
        • Latent Dirichlet allocation
        • Open source implementations and tools
     • Beyond latent Dirichlet allocation
        • Modeling richer assumptions
        • Supervised topic modeling
        • Bayesian nonparametric topic modeling
     • Algorithms
        • Gibbs sampling
        • Variational inference
        • Online variational inference
     • Discussion, open questions, and resources
  17. Introduction to Topic Modeling
  18. Probabilistic modeling
     1. Data are assumed to be observed from a generative probabilistic process that includes hidden variables.
        • In text, the hidden variables are the thematic structure.
     2. Infer the hidden structure using posterior inference.
        • What are the topics that describe this collection?
     3. Situate new data into the estimated model.
        • How does a new document fit into the topic structure?
  19. Latent Dirichlet allocation (LDA). Simple intuition: Documents exhibit multiple topics.
  20. Generative model for LDA. [Figure with three labeled parts: Topics; Documents; Topic proportions and assignments. Example topics: (gene 0.04, dna 0.02, genetic 0.01, ...), (life 0.02, evolve 0.01, organism 0.01, ...), (brain 0.04, neuron 0.02, nerve 0.01, ...), (data 0.02, number 0.02, computer 0.01, ...).]
     • Each topic is a distribution over words.
     • Each document is a mixture of corpus-wide topics.
     • Each word is drawn from one of those topics.
  21. The posterior distribution. [Same figure: Topics; Documents; Topic proportions and assignments.]
     • In reality, we only observe the documents.
     • The rest of the structure consists of hidden variables.
  22. The posterior distribution. [Same figure: Topics; Documents; Topic proportions and assignments.]
     • Our goal is to infer the hidden variables.
     • I.e., compute their distribution conditioned on the documents: p(topics, proportions, assignments | documents).
  23. LDA as a graphical model. [Plate diagram with nodes α (proportions parameter), θd (per-document topic proportions), Zd,n (per-word topic assignment), Wd,n (observed word), βk (topics), and η (topic parameter), and plates N, D, and K.]
     • Encodes our assumptions about the data.
     • Connects to algorithms for computing with data.
     • See Pattern Recognition and Machine Learning (Bishop, 2006).
  24. LDA as a graphical model. [Plate diagram as on slide 23.]
     • Nodes are random variables; edges indicate dependence.
     • Shaded nodes are observed.
     • Plates indicate replicated variables.
  25. LDA as a graphical model. [Plate diagram as on slide 23.] The joint distribution of the hidden and observed variables is
     $$\prod_{i=1}^{K} p(\beta_i \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
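The factors of this joint can be read as a sampling recipe. The following is a minimal Python sketch of that recipe, not code from the tutorial: it assumes symmetric scalar hyperparameters `alpha` and `eta`, a fixed document length, and integer word ids, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, eta, seed=0):
    """Sample topics, proportions, assignments, and words in the order of the joint."""
    rng = random.Random(seed)
    # beta_k ~ Dirichlet(eta): one distribution over the vocabulary per topic.
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # theta_d ~ Dirichlet(alpha): topic proportions for this document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        assignments, words = [], []
        for _ in range(doc_len):
            # z_{d,n} ~ Multinomial(theta_d), then w_{d,n} ~ Multinomial(beta_{z_{d,n}}).
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=beta[z])[0]
            assignments.append(z)
            words.append(w)
        docs.append({"theta": theta, "z": assignments, "w": words})
    return beta, docs


beta, docs = generate_corpus(num_docs=3, doc_len=20, num_topics=4,
                             vocab_size=50, alpha=0.5, eta=0.1)
print(docs[0]["w"])
```

Only the `w` lists would be observed in practice; `beta`, `theta`, and `z` are the hidden variables that posterior inference tries to recover.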
  26. LDA. [Plate diagram as on slide 23.]
     • This joint defines a posterior.
     • From a collection of documents, infer
        • Per-word topic assignment zd,n
        • Per-document topic proportions θd
        • Per-corpus topic distributions βk
     • Then use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, exploration, ...
  27. LDA. [Plate diagram as on slide 23.] Approximate posterior inference algorithms:
     • Mean field variational methods (Blei et al., 2001, 2003)
     • Expectation propagation (Minka and Lafferty, 2002)
     • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
     • Collapsed variational inference (Teh et al., 2006)
     • Online variational inference (Hoffman et al., 2010)
     Also see Mukherjee and Blei (2009) and Asuncion et al. (2009).
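To make one of these algorithms concrete, here is a minimal Python sketch of collapsed Gibbs sampling. It is an illustration under stated assumptions rather than the tutorial's implementation: it uses the commonly cited update in which the probability of assigning a word to topic k is proportional to (n_dk + α)(n_kw + η) / (n_k + Vη), with symmetric scalar `alpha` and `eta`, and documents given as lists of integer word ids. The function name is hypothetical.

```python
import random


def collapsed_gibbs(docs, num_topics, vocab_size, alpha, eta, num_iters=200, seed=0):
    """Resample every z_{d,n} in turn; return document-topic and topic-word counts."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # words in doc d assigned to topic k
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    n_k = [0] * num_topics                                # total words assigned to topic k
    z = []
    # Start from a uniformly random assignment for every word.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][n]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional over topics given all other assignments.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + eta) / (n_k[j] + vocab_size * eta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                z[d][n] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw


toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 0, 1, 3, 4]]
doc_topic, topic_word = collapsed_gibbs(toy_docs, num_topics=2, vocab_size=6,
                                        alpha=0.5, eta=0.1, num_iters=50)
print(doc_topic)
```

Smoothed estimates of θd and βk can then be read off the counts, for example (n_dk + α) normalized over topics and (n_kw + η) normalized over the vocabulary.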
  28. Example inference. [Plate diagram as on slide 23.]
     • Data: The OCR’ed collection of Science from 1990–2000
        • 17K documents
        • 11M words
        • 20K unique terms (stop words and rare words removed)
     • Model: 100-topic LDA model using variational inference.
  29. Example inference. [Figure: bar plot of probability (0.0 to 0.4) against topics (1 to 100).]
  30. Example inference
     • “Genetics”: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
     • “Evolution”: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
     • “Disease”: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
     • “Computers”: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
  31. Example inference (II)
  32. Example inference (II). Four further topics, each shown by its top words:
     • problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
     • model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
     • selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
     • species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem
  33. Held out perplexity. [Figure 9 of Blei, Ng, and Jordan: perplexity against number of topics on the nematode abstracts (top) and Associated Press (bottom) corpora for LDA, the smoothed unigram model, the smoothed mixture of unigrams, and fold-in pLSI.]
     $$\text{perplexity} = \exp\left\{ - \frac{\sum_d \log p(w_d)}{\sum_d N_d} \right\}$$
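The perplexity formula on this slide reduces to a few lines of arithmetic. The Python sketch below assumes the per-document held-out log likelihoods log p(w_d) have already been computed by some model; the function and argument names are hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp of the negative total log likelihood divided by the total word count."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)


# A model that spreads probability uniformly over 1000 terms assigns each word
# log probability -log(1000), so its perplexity should be close to 1000.
lengths = [5, 15]
log_likelihoods = [-n * math.log(1000) for n in lengths]
print(perplexity(log_likelihoods, lengths))
```

Lower perplexity means the model assigns higher probability to the held-out words, which is why lower curves in the figure indicate better models.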
  34. Used to explore and browse document collections
  35. Aside: The Dirichlet distribution
     • The Dirichlet distribution is an exponential family distribution over the simplex, i.e., positive vectors that sum to one:
       $$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1}$$
     • It is conjugate to the multinomial. Given a multinomial observation, the posterior distribution of θ is a Dirichlet.
     • The parameter α controls the mean shape and sparsity of θ.
     • The topic proportions are a K dimensional Dirichlet. The topics are a V dimensional Dirichlet.
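The effect of α on sparsity can be checked by simulation. This Python sketch assumes a symmetric Dirichlet with the same scalar α on each of 10 components and summarizes each draw by its largest component; the helper names are hypothetical, and draws are produced by normalizing gamma variates.

```python
import random


def sample_symmetric_dirichlet(alpha, dim, rng):
    """One draw from a symmetric Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]


def mean_largest_component(alpha, dim=10, num_draws=2000, seed=0):
    """Average of the largest coordinate over many draws; near 1 means sparse draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        total += max(sample_symmetric_dirichlet(alpha, dim, rng))
    return total / num_draws


for alpha in (0.1, 1.0, 10.0, 100.0):
    print(alpha, round(mean_largest_component(alpha), 3))
```

Small α concentrates most of the mass of each draw on a few components, while large α pushes every draw toward the uniform vector. Much smaller values of α can make every gamma draw underflow to zero in floating point, so this sketch stops at 0.1.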
  36. α = 1. [Figure: 15 panels, each plotting value (0.0 to 1.0) against item (1 to 10).]
  37. α = 10. [Figure: 15 panels, each plotting value (0.0 to 1.0) against item (1 to 10).]
  38. α = 100. [Figure: 15 panels, each plotting value (0.0 to 1.0) against item (1 to 10).]
  39. α = 1. [Figure: 15 panels, each plotting value (0.0 to 1.0) against item (1 to 10).]
  40. α = 0.1. [Figure: 15 panels, each plotting value (0.0 to 1.0) against item (1 to 10).]
  41. α = 0.01. [Figure: 15 panels, each plotting value (0.0 to 1.0) against item (1 to 10).]
  42. α = 0.001. [Figure: 15 panels, each plotting value (0.0 to 1.0) against item (1 to 10).]
  43. Why does LDA “work”? Why does the LDA posterior put “topical” words together?
     • Word probabilities are maximized by dividing the words among the topics. (More terms means more mass to be spread around.)
     • In a mixture, this is enough to find clusters of co-occurring words.
     • In LDA, the Dirichlet on the topic proportions can encourage sparsity, i.e., a document is penalized for using many topics.
     • Loosely, this can be thought of as softening the strict definition of “co-occurrence” in a mixture model.
     • This flexibility leads to sets of terms that more tightly co-occur.
  44. Summary of LDA. [Plate diagram as on slide 23.]
     • LDA can
        • visualize the hidden thematic structure in large corpora
        • generalize new data to fit into that structure
     • Builds on Deerwester et al. (1990) and Hofmann (1999). It is a mixed membership model (Erosheva, 2004). Relates to multinomial PCA (Jakulin and Buntine, 2002).
     • Was independently invented for genetics (Pritchard et al., 2000).
  45. Implementations of LDA. There are many available implementations of topic modeling:
     • LDA-C*: A C implementation of LDA
     • HDP*: A C implementation of the HDP (“infinite LDA”)
     • Online LDA*: A python package for LDA on massive data
     • LDA in R*: Package in R for many topic models
     • LingPipe: Java toolkit for NLP and computational linguistics
     • Mallet: Java toolkit for statistical NLP
     • TMVE*: A python package to build browsers from topic models
     * available at www.cs.princeton.edu/~blei/
  46. Example: LDA in R (Jonathan Chang)
     Documents:
       perspective identifying tumor suppressor genes in human...
       letters global warming report leslie roberts article global....
       research news a small revolution gets under way the 1990s....
       a continuing series the reign of trial and error draws to a close...
       making deep earthquakes in the laboratory lab experimenters...
       quick fix for freeways thanks to a team of fast working...
       feathers fly in grouse population dispute researchers...
       ....
     Word-count data:
       245 1897:1 1467:1 1351:1 731:2 800:5 682:1 315:6 3668:1 14:1
       260 4261:2 518:1 271:6 2734:1 2662:1 2432:1 683:2 1631:7
       279 2724:1 107:3 518:1 141:3 3208:1 32:1 2444:1 182:1 250:1
       266 2552:1 1993:1 116:1 539:1 1630:1 855:1 1422:1 182:3 2432:1
       233 1372:1 1351:1 261:1 501:1 1938:1 32:1 14:1 4067:1 98:2
       148 4384:1 1339:1 32:1 4107:1 2300:1 229:1 529:1 521:1 2231:1
       193 569:1 3617:1 3781:2 14:1 98:1 3596:1 3037:1 1482:12 665:2
       ....
     R code:
       library(lda)
       # Read the word-count data; the variable name must match the sampler call below.
       documents <- read.documents("mult.dat")
       # The vocabulary file is not shown on the slide; "vocab.dat" is an assumed name.
       vocab <- read.vocab("vocab.dat")
       K <- 20          # number of topics
       alpha <- 1/20    # Dirichlet parameter on the topic proportions
       eta <- 0.001     # Dirichlet parameter on the topics
       # Run 1000 iterations of collapsed Gibbs sampling.
       model <- lda.collapsed.gibbs.sampler(documents, K, vocab, 1000, alpha, eta)
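The numeric lines on this slide look like the sparse format used by LDA-C, where each line starts with the number of distinct terms in the document and continues with term:count pairs. Assuming that reading, a minimal Python sketch of a parser for one such line might look as follows; the function name is hypothetical and this is not part of the R package shown above.

```python
def parse_sparse_line(line):
    """Split one line into its declared term count and a list of (term_id, count) pairs."""
    fields = line.split()
    num_terms = int(fields[0])
    pairs = []
    for field in fields[1:]:
        term_id, count = field.split(":")
        pairs.append((int(term_id), int(count)))
    return num_terms, pairs


# The slide shows only the start of each line, so the declared term count
# is not compared with the number of pairs parsed here.
num_terms, pairs = parse_sparse_line("245 1897:1 1467:1 1351:1 731:2 800:5")
print(num_terms, pairs)
```

Summing the counts in `pairs` gives the number of word tokens represented by the parsed portion of the line.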
  47. Top ten words of each of the 20 topics:
     Topic 1: dna, gene, sequence, genes, sequences, human, genome, genetic, analysis, two
     Topic 2: protein, cell, cells, proteins, receptor, fig, binding, activity, activation, kinase
     Topic 3: water, climate, atmospheric, temperature, global, surface, ocean, carbon, atmosphere, changes
     Topic 4: says, researchers, new, university, just, science, like, work, first, years
     Topic 5: mantle, high, earth, pressure, seismic, crust, temperature, earths, lower, earthquakes
     Topic 6: end, article, start, science, readers, service, news, card, circle, letters
     Topic 7: time, data, two, model, fig, system, number, different, results, rate
     Topic 8: materials, surface, high, structure, temperature, molecules, chemical, molecular, fig, university
     Topic 9: dna, rna, transcription, protein, site, binding, sequence, proteins, specific, sequences
     Topic 10: disease, cancer, patients, human, gene, medical, studies, drug, normal, drugs
     Topic 11: years, million, ago, age, university, north, early, fig, evidence, record
     Topic 12: species, evolution, population, evolutionary, university, populations, natural, studies, genetic, biology
     Topic 13: protein, structure, proteins, two, amino, binding, acid, residues, molecular, structural
     Topic 14: cells, cell, virus, hiv, infection, immune, human, antigen, infected, viral
     Topic 15: space, solar, observations, earth, stars, university, mass, sun, astronomers, telescope
     Topic 16: fax, manager, science, aaas, advertising, sales, member, recruitment, associate, washington
     Topic 17: cells, cell, gene, genes, expression, development, mutant, mice, fig, biology
     Topic 18: energy, electron, state, light, quantum, physics, electrons, high, laser, magnetic
     Topic 19: research, science, national, scientific, scientists, new, states, university, united, health
     Topic 20: neurons, brain, cells, activity, fig, channels, university, cortex, neuronal, visual
  48. Open source document browser (with Allison Chaney)