LDA and its applications



LDA is usually used for topic modelling.

  1. 1. LDA and its applications AI HACKERS
  2. 2. What is LDA? LDA stands for Latent Dirichlet Allocation. It models each of the k topics (let's say 50) as a probability distribution over words, and each of the d documents (let's say 5000) as a probability distribution over those topics. Mechanism: it uses a special kind of distribution called the Dirichlet distribution, which is the multivariate generalization of the Beta distribution.
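To make the Dirichlet distribution less abstract, here is a minimal sketch that draws topic mixtures from it using only the Python standard library. It relies on the standard fact that normalizing independent Gamma draws yields a Dirichlet sample. The function name `sample_dirichlet`, the choice of 3 topics, and the concentration values are illustrative assumptions, not part of the slides.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws

# Hypothetical setup: 3 topics with a symmetric concentration parameter.
# Small values tend to give mixtures dominated by one topic;
# large values tend to give mixtures close to uniform.
peaked_mix = sample_dirichlet([0.5, 0.5, 0.5], rng)
even_mix = sample_dirichlet([10.0, 10.0, 10.0], rng)

print("peaked:", [round(p, 3) for p in peaked_mix])
print("even:  ", [round(p, 3) for p in even_mix])
```

Each sample is itself a probability vector (non-negative, summing to 1), which is why the Dirichlet is a natural prior for "how much of each topic is in this document".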
  3. 3. LDA in layman terms. Sentence 1: I spend the evening watching football. Sentence 2: I ate nachos and guacamole. Sentence 3: I spend the evening watching football while eating nachos and guacamole. LDA might say something like: Sentence 1 is 100% about Topic 1; Sentence 2 is 100% about Topic 2; Sentence 3 is 65% Topic 1 and 35% Topic 2. It also tells us that Topic 1 is about football (50%) and evening (50%), and Topic 2 is about nachos (50%) and guacamole (50%).
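The numbers in the layman example can be combined into a per-word probability for the third sentence: the chance of a word is the sum over topics of (topic share in the sentence) times (word share in the topic). The sketch below hard-codes the percentages from the slide; the variable and function names are illustrative assumptions.

```python
# Topic-word distributions from the slide.
topic_words = {
    "topic_1": {"football": 0.5, "evening": 0.5},
    "topic_2": {"nachos": 0.5, "guacamole": 0.5},
}

# Topic mixture of the third sentence from the slide.
third_sentence_topics = {"topic_1": 0.65, "topic_2": 0.35}

def word_probability(word, doc_topics, topic_words):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        p_topic * topic_words[topic].get(word, 0.0)
        for topic, p_topic in doc_topics.items()
    )

p_football = word_probability("football", third_sentence_topics, topic_words)
p_nachos = word_probability("nachos", third_sentence_topics, topic_words)
print(p_football, p_nachos)
```

Here `p_football` works out to 0.65 × 0.5 and `p_nachos` to 0.35 × 0.5, so the football-related words are the more likely ones in that sentence under this toy model.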
  4. 4. Bayesian Network Example
  5. 5. LDA is a Bayesian network of probability density functions
  6. 6. LDA history: Andrew Ng, David Blei, Michael I. Jordan
  7. 7. A simple LDA
  8. 8. Packages used in Python: sudo pip install nltk; sudo pip install gensim; sudo pip install stop-words
  9. 9. Stop words. Stop words are commonly occurring words that don't contribute to topic modelling, e.g. the, and, or. However, removing stop words sometimes hurts topic modelling. For example, "Thor The Ragnarok" is a single topic, but if we apply stop-word removal, "The" is stripped out of it.
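A minimal sketch of stop-word removal, including the title pitfall mentioned on the slide. It uses a tiny hand-written stop list (just the three words from the slide) rather than the `stop-words` or `nltk` packages, and the function name `remove_stop_words` is an illustrative assumption.

```python
# Tiny illustrative stop list: only the three words named on the slide.
STOP_WORDS = {"the", "and", "or"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

sentence_tokens = remove_stop_words("I ate nachos and guacamole")
title_tokens = remove_stop_words("Thor The Ragnarok")

print(sentence_tokens)  # "and" is dropped
print(title_tokens)     # "the" is dropped, so the title no longer survives intact
```

One possible workaround, if titles matter for your topics, is to detect and join such phrases into single tokens before filtering.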
  10. 10. Porter’s Stemmer algorithm. A common NLP technique to reduce topically similar words to their root. For example, “stemming,” “stemmer,” and “stemmed” all have similar meanings; stemming reduces those terms to “stem.” This is important for topic modeling, which would otherwise view those terms as separate entities and reduce their importance in the model. It's a bunch of rules for reducing a word: sses -> ss; ies -> i; ational -> ate; tional -> tion; s -> ∅. When rules conflict, the longest matching rule wins. Bad idea unless you customize it.
  11. 11. Porter’s Stemmer algorithm: flowchart (figures: Arabic stemming process; simple stemming process)
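The suffix rules and the "longest rule wins" tie-break can be sketched in a few lines. This is a toy, not the Porter stemmer: it puts all five rules in one table, uses the standard Porter mapping sses -> ss, and ignores the conditions on the remaining stem that the real algorithm checks. The names `SUFFIX_RULES` and `apply_suffix_rules` are illustrative assumptions.

```python
# Simplified rule table: (suffix, replacement).
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("s", ""),
]

def apply_suffix_rules(word):
    """Apply the longest matching suffix rule once; return the word unchanged if none match."""
    matches = [(suffix, repl) for suffix, repl in SUFFIX_RULES if word.endswith(suffix)]
    if not matches:
        return word
    # When several rules match, the longest suffix wins.
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl

for example in ["caresses", "ponies", "relational", "conditional", "cats", "stem"]:
    print(example, "->", apply_suffix_rules(example))
```

"relational" matches both "ational" and "tional"; the longer rule is chosen, giving "relate" rather than "relation". For real work, `nltk.stem.porter.PorterStemmer` implements the full algorithm.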
  12. 12. Lemmatization  It goes one step further than stemming.  It obtains grammatically correct words and distinguishes words by their word sense with the use of a vocabulary (e.g., type can mean write or category).  It is a much more difficult and expensive process than stemming.
  13. 13. Lemmatization - Example
  14. 14. Bag of Words
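As a rough illustration of the bag-of-words idea, the sketch below turns the three example sentences from the layman slide into word-count vectors over a shared vocabulary, discarding word order. The tokenizer (lowercase, letters only) and all names are illustrative assumptions.

```python
import re
from collections import Counter

documents = [
    "I spend the evening watching football",
    "I ate nachos and guacamole.",
    "I spend the evening watching football while eating nachos and guacamole.",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

# One Counter (word -> count) per document: this is the "bag".
bags = [Counter(tokenize(doc)) for doc in documents]

# Shared vocabulary in a fixed order, then one count vector per document.
vocabulary = sorted(set(word for bag in bags for word in bag))
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Count vectors of this kind are the usual input to LDA: the model sees how often each word occurs in each document, not where it occurs.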
  15. 15. Word2Vec
  16. 16. CBOW v/s SKIP-GRAM
  17. 17. lda2vec – what really happens? The lda2vec model adds in skip-grams. A word predicts another word in the same window, as in word2vec, but the model also has the notion of a context vector that only changes at the document level, as in LDA.
  18. 18. Lda2Vec – PyTorch code  Source:  Go to 20newsgroups/.  Run get_windows.ipynb to prepare data.  Run python for training.  Run explore_trained_model.ipynb.  To use this on your data you need to edit get_windows.ipynb. Also there are hyperparameters in 20newsgroups/, utils/, utils/
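The "word predicts another word in the same window" part can be sketched as a data-preparation step. The function below emits (document id, pivot word, target word) triples; carrying the document id alongside each pair is one way to let a model add a document-level vector to the word vector. This is only an illustration of the idea under those assumptions, not the lda2vec training code, and `skipgram_triples` is a hypothetical name.

```python
def skipgram_triples(doc_id, tokens, window=2):
    """Return (doc_id, pivot, target) for every target within `window` positions of the pivot."""
    triples = []
    for i, pivot in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word does not predict itself
                triples.append((doc_id, pivot, tokens[j]))
    return triples

triples = skipgram_triples(0, ["i", "ate", "nachos"], window=1)
for triple in triples:
    print(triple)
```

The word part of each triple changes as the window slides, while the document id stays fixed for the whole document, mirroring the slide's point that the document-level context only changes between documents.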
  19. 19. Thank you