Latent Dirichlet Allocation (Nicolas Loeff)

  1. Latent Dirichlet Allocation. D. Blei, A. Ng, M. Jordan. Includes some slides adapted from J. Ramos at Rutgers, M. Steyvers and M. Rosen-Zvi at UCI, and L. Fei-Fei at UIUC.
  2. Overview
     • What is so special about text?
     • Classification methods
     • LSI
     • Unigram / mixture of unigrams
     • Probabilistic LSI (aspect model)
     • The LDA model
     • Geometric interpretation
  3. What is so special about text?
     • No obvious relation between features
     • High dimensionality (the vocabulary V is often larger than the number of documents!)
     • Importance of speed
  4. The need for dimensionality reduction
     • Representation: documents as vectors in word space, the 'bag of words' representation
     • It is a sparse representation, with V >> |D|
     • A need to define conceptual closeness
  5. Bag of words
     [Illustration: two example documents reduced to keyword sets. A passage on visual perception (Hubel and Wiesel's work on the visual cortex) maps to {sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image}; a passage on China's trade surplus maps to {China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value}.]
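     To make the representation concrete, a minimal counting sketch (not from the slides; the helper name, the toy vocabulary, and the abbreviated document string are mine):

         from collections import Counter

         def bag_of_words(text, vocabulary):
             # Word order is discarded; only the counts survive.
             counts = Counter(text.lower().split())
             return [counts[w] for w in vocabulary]

         vocab = ["china", "trade", "surplus", "yuan", "exports"]
         doc = "China is forecasting a trade surplus ... exports are helped by the yuan"
         print(bag_of_words(doc, vocab))  # -> [1, 1, 1, 1, 1]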
  6. Bag of words
     • The order of words in a document can be ignored; only the counts matter.
     • Probability theory: exchangeability (which includes IID) (Aldous, 1985).
     • Exchangeable RVs have a representation as a mixture distribution (de Finetti, 1990).
  7. What does this have to do with Vision? [Figure: an object maps to a bag of visual 'words'.]
  8. TF-IDF weighting scheme (Salton and McGill, 1983)
     • Given corpus D, word w, and document d, calculate the weight $w_{w,d} = f_{w,d} \cdot \log(|D| / f_{w,D})$, where $f_{w,d}$ is the count of w in d and $f_{w,D}$ is the number of documents in D containing w.
     • Many varieties of the basic scheme exist.
     • Search procedure: scan each d, compute each $w_{i,d}$, and return the set D' that maximizes $\sum_i w_{i,d}$.
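     A sketch of this weighting, assuming the basic variant on the slide (raw counts and a plain log inverse document frequency); the function name and structure are my own:

         import math
         from collections import Counter

         def tfidf_weights(docs):
             # w_{w,d} = f_{w,d} * log(|D| / f_{w,D}), where f_{w,d} is the count
             # of w in d and f_{w,D} is the number of documents containing w.
             tokenized = [d.lower().split() for d in docs]
             df = Counter(w for toks in tokenized for w in set(toks))
             n_docs = len(docs)
             weights = []
             for toks in tokenized:
                 tf = Counter(toks)
                 weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
             return weights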
  9. A spatial representation: Latent Semantic Analysis (Deerwester, 1990)
     • Start from the document/term count matrix, e.g.:

                   Doc1  Doc2  Doc3  ...
       LOVE          34     0     3
       SOUL          12     0     2
       RESEARCH       0    19     6
       SCIENCE        0    16     1
       ...

     • Apply the SVD to project into a lower-dimensional space (high dimensional, but not as high as |V|).
     • Each word is a single point in semantic space (dimensionality reduction).
     • Similarity is measured by the cosine of the angle between word vectors.
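     A small sketch of the LSA step on the slide's toy matrix; keeping k = 2 dimensions and scaling the left singular vectors by the singular values are my assumptions, not prescribed by the slide:

         import numpy as np

         # Rows: LOVE, SOUL, RESEARCH, SCIENCE; columns: Doc1..Doc3.
         X = np.array([[34, 0, 3],
                       [12, 0, 2],
                       [ 0, 19, 6],
                       [ 0, 16, 1]], dtype=float)

         U, s, Vt = np.linalg.svd(X, full_matrices=False)
         k = 2                               # latent "semantic" dimensions
         word_vecs = U[:, :k] * s[:k]        # each row: a word in semantic space

         def cosine(a, b):
             return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

         # LOVE should be close to SOUL and far from RESEARCH:
         print(cosine(word_vecs[0], word_vecs[1]), cosine(word_vecs[0], word_vecs[2]))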
  10. Feature vector representation. [Figure from: Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Pierre Baldi, Paolo Frasconi, Padhraic Smyth.]
  11. Classification: assigning words to topics. Different models for the data:
      • Discrete classifiers (e.g., SVM): predict a categorical output by modeling the boundaries between different classes of the data.
      • Density estimators / generative models (e.g., Naive Bayes): model the distribution of the data points themselves.
  12. Generative models: latent semantic structure
      • Distribution over words: $p(w) = \sum_{\ell} p(w, \ell) = \sum_{\ell} p(w \mid \ell)\, p(\ell)$, where $\ell$ is the latent structure.
      • Inferring latent structure: $p(\ell \mid w) = \frac{p(w \mid \ell)\, p(\ell)}{p(w)}$
  13. Topic models
      • Unsupervised learning of the topics ("gist") of documents: articles/chapters, conversations, emails, ... any verbal context
      • Topics are useful latent structures to explain semantic association
  14. Probabilistic generative model
      • Each document is a probability distribution over topics.
      • Each topic is a probability distribution over words.
  15. Generative process
      [Illustration: TOPIC 1 is a distribution over {money, bank, loan}; TOPIC 2 is a distribution over {river, stream, bank}. DOCUMENT 1 draws its words mostly from topic 1 and DOCUMENT 2 mostly from topic 2; each word is tagged with the topic that generated it, e.g. "money1 bank1 ... river2 stream2".]
      Bayesian approach: use priors.
      • Mixture weights ~ Dirichlet(α)
      • Mixture components ~ Dirichlet(β)
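     A sketch of this generative process; the topic-word probabilities below are illustrative stand-ins (the cartoon's exact values are not legible in the transcript):

         import numpy as np

         rng = np.random.default_rng(0)
         vocab = ["money", "bank", "loan", "river", "stream"]
         topics = np.array([[.4, .4, .2, .0, .0],    # TOPIC 1: money/bank/loan
                            [.0, .3, .0, .4, .3]])   # TOPIC 2: river/stream/bank
         alpha = np.array([1.0, 1.0])                # Dirichlet prior on weights

         def generate_document(n_words=10):
             theta = rng.dirichlet(alpha)            # per-document topic weights
             words = []
             for _ in range(n_words):
                 z = rng.choice(len(theta), p=theta)        # choose a topic
                 w = rng.choice(len(vocab), p=topics[z])    # choose a word from it
                 words.append(f"{vocab[w]}{z + 1}")         # tag word with its topic
             return " ".join(words)

         print(generate_document())   # e.g. "bank1 money1 stream2 bank2 ..."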
  16. Vision: Topic = object categories
  17. Simple model: unigram. The words of a document are drawn IID from a single multinomial distribution: $p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$
  18. Unigram mixture model. First choose a topic z, then generate the words conditionally independently given the topic: $p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$
  21. Probabilistic Latent Semantic Indexing (Hofmann, 1999)
      • A document d in the training set and a word w_n are conditionally independent given the topic z: $p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$
      • Not truly generative (d is a dummy r.v. over the training set).
      • The number of parameters grows with the size of the corpus (overfitting).
      • A document may contain several topics.
  22. Vision application: Sivic et al., 2005. [Plate diagram: d → z → w with plates over the N words and D documents; applied to discover the "face" category.]
  23.-27. LDA (the graphical model, built up over five slides). [Plate diagram: α → θ → z → w, with β governing the topic-word distributions; plates over the N words of each document and the D documents of the corpus.]
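     For reference, the generative story these plates encode, as given in the Blei, Ng & Jordan paper: draw $\theta \sim \mathrm{Dirichlet}(\alpha)$; then for each of the $N$ words draw a topic $z_n \sim \mathrm{Multinomial}(\theta)$ and a word $w_n$ from $p(w_n \mid z_n, \beta)$. The joint distribution is

         $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)$

     and $p(\mathbf{w} \mid \alpha, \beta)$ follows by summing over $\mathbf{z}$ and integrating over $\theta$.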
  28. Vision application: Fei-Fei Li, 2005. [Plate diagram: c → π → z → w with plates over the N words and D documents; applied to the "beach" category.]
  29. Example: word density distribution
  30. A geometric interpretation. [Figure: the topics are points on the word simplex; LDA places each document on the lower-dimensional sub-simplex they span.]
  31. LDA
      • Topics are sampled repeatedly within each document (like pLSI).
      • But the number of parameters does not grow with the size of the corpus.
      • Problem: inference.
  32. LDA - Inference
      • Coupling between the Dirichlet distributions makes exact inference intractable.
      • Blei, 2001: variational approximation (see the sketch below).
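     A minimal sketch of the per-document mean-field updates behind this approximation; the update equations follow the LDA paper, but the code and its names (variational_e_step, etc.) are my own, not the authors' implementation:

         import numpy as np
         from scipy.special import digamma

         def variational_e_step(word_ids, beta, alpha, n_iter=50):
             # beta: K x V topic-word matrix; alpha: length-K Dirichlet prior;
             # word_ids: indices of the document's words.
             K = beta.shape[0]
             N = len(word_ids)
             gamma = alpha + N / K                      # the paper's initialization
             for _ in range(n_iter):
                 # phi[n, k] proportional to beta[k, w_n] * exp(digamma(gamma[k]))
                 phi = beta[:, word_ids].T * np.exp(digamma(gamma))
                 phi /= phi.sum(axis=1, keepdims=True)
                 gamma = alpha + phi.sum(axis=0)        # gamma_k = alpha_k + sum_n phi[n,k]
             return gamma, phi   # gamma / gamma.sum() estimates the topic mixture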
  33. LDA - Inference. Other procedures:
      • Markov chain Monte Carlo (Griffiths et al., 2002)
      • Expectation propagation (Minka et al., 2002)
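     For comparison, one sweep of a collapsed Gibbs sampler in the spirit of the MCMC approach; the conditional is the standard collapsed form, and the function name and count-table layout are my assumptions:

         import numpy as np

         def gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta_, rng):
             # docs: list of word-id lists; z: current topic of every token;
             # ndk/nkw/nk: doc-topic, topic-word, and topic count tables.
             V = nkw.shape[1]
             for d, words in enumerate(docs):
                 for i, w in enumerate(words):
                     k = z[d][i]
                     ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1   # remove token
                     p = (nkw[:, w] + beta_) / (nk + V * beta_) * (ndk[d] + alpha)
                     k = rng.choice(len(nk), p=p / p.sum())       # resample its topic
                     z[d][i] = k
                     ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1   # add it back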
  34. Experiments. Perplexity: the inverse of the geometric mean per-word likelihood (a monotonically decreasing function of the likelihood):
      $\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left( - \frac{\sum_d \log p(\mathbf{w}_d)}{\sum_d N_d} \right)$
      Idea: lower perplexity implies better generalization.
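     Computing it from held-out per-document log-likelihoods is a one-liner; a small sketch (the numbers in the example are invented):

         import numpy as np

         def perplexity(log_likelihoods, doc_lengths):
             # exp( - sum_d log p(w_d) / sum_d N_d ): the inverse geometric
             # mean of the per-word likelihood over the held-out set.
             return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))

         print(perplexity([-350.0, -120.0], [50, 20]))   # ~ 824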
  35. Experiments – Nematode corpus
  36. Experiments – AP corpus
  37. Polysemy. Six topics learned by the model; note that a word such as PLAY or COURT appears with high probability in more than one topic:

      PRINTING    PLAY         TEAM        JUDGE      HYPOTHESIS    STUDY
      PAPER       PLAYS        GAME        TRIAL      EXPERIMENT    TEST
      PRINT       STAGE        BASKETBALL  COURT      SCIENTIFIC    STUDYING
      PRINTED     AUDIENCE     PLAYERS     CASE       OBSERVATIONS  HOMEWORK
      TYPE        THEATER      PLAYER      JURY       SCIENTISTS    NEED
      PROCESS     ACTORS       PLAY        ACCUSED    EXPERIMENTS   CLASS
      INK         DRAMA        PLAYING     GUILTY     SCIENTIST     MATH
      PRESS       SHAKESPEARE  SOCCER      DEFENDANT  EXPERIMENTAL  TRY
      IMAGE       ACTOR        PLAYED      JUSTICE    TEST          TEACHER
      PRINTER     THEATRE      BALL        EVIDENCE   METHOD        WRITE
      PRINTS      PLAYWRIGHT   TEAMS       WITNESSES  HYPOTHESES    PLAN
      PRINTERS    PERFORMANCE  BASKET      CRIME      TESTED        ARITHMETIC
      COPY        DRAMATIC     FOOTBALL    LAWYER     EVIDENCE      ASSIGNMENT
      COPIES      COSTUMES     SCORE       WITNESS    BASED         PLACE
      FORM        COMEDY       COURT       ATTORNEY   OBSERVATION   STUDIED
      OFFSET      TRAGEDY      GAMES       HEARING    SCIENCE       CAREFULLY
      GRAPHIC     CHARACTERS   TRY         INNOCENT   FACTS         DECIDE
      SURFACE     SCENES       COACH       DEFENSE    DATA          IMPORTANT
      PRODUCED    OPERA        GYM         CHARGE     RESULTS       NOTEBOOK
      CHARACTERS  PERFORMED    SHOT        CRIMINAL   EXPLANATION   REVIEW
  38. Choosing the number of topics
      • Subjective interpretability
      • Bayesian model selection: Griffiths & Steyvers (2004)
      • Generalization tests
      • Non-parametric Bayesian statistics: infinite models, i.e. models that grow with the size of the data
        – Teh, Jordan, Beal, & Blei (2004)
        – Blei, Griffiths, Jordan, Tenenbaum (2004)
