What is LDA?
LDA stands for Latent Dirichlet Allocation
It basically models each of K topics (let's say 50) as a distribution over words, and
each of D documents (let's say 5000) as a distribution over those K topics
Mechanism - It uses a special kind of distribution called the Dirichlet distribution,
which is a multivariate generalization of the Beta probability distribution
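As a sketch of that mechanism, we can draw one document's topic proportions from a symmetric Dirichlet prior with numpy (the concentration value alpha below is a hypothetical choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50        # number of topics, as in the example above
alpha = 0.1   # hypothetical concentration parameter; small alpha -> sparse mixtures
theta = rng.dirichlet(alpha * np.ones(K))

# Every Dirichlet draw is a valid probability distribution over the K topics:
# entries are non-negative and sum to 1.
print(theta.shape)
print(theta.sum())
```

With small alpha most of the probability mass lands on a few topics, which matches the intuition that a document is about only a handful of topics.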
LDA in layman terms
Sentence 1: I spent the evening watching football.
Sentence 2: I ate nachos and guacamole.
Sentence 3: I spent the evening watching football while eating nachos and guacamole.
LDA might say something like:
Sentence 1 is 100% about Topic 1
Sentence 2 is 100% about Topic 2
Sentence 3 is 65% Topic 1, 35% Topic 2
But it also tells us that
Topic 1 is about football (50%), evening (50%),
Topic 2 is about nachos (50%), guacamole (50%)
LDA is a Bayesian probabilistic network
Introduced by David Blei, Andrew Ng, and Michael I. Jordan
A simple LDA
Packages used in python
sudo pip install nltk
sudo pip install gensim
sudo pip install stop-words
Stop words are commonly occurring words which don't contribute to the topic:
the, and, or
However, sometimes removing stop words hurts topic modelling
For example, "Thor The Ragnarok" is a single topic, but if we apply the stop-words
mechanism, "The" will be removed.
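A minimal sketch of stop-word filtering, using a tiny hand-picked stop list for illustration (a real pipeline would load the list from nltk or the stop-words package installed above):

```python
# Hand-picked stop list, purely illustrative.
stop_words = {"the", "and", "or", "i", "a", "is", "while"}

sentence = "I spent the evening watching football while eating nachos and guacamole"

# Lowercase, split on whitespace, and drop stop words.
tokens = [w for w in sentence.lower().split() if w not in stop_words]
print(tokens)
# ['spent', 'evening', 'watching', 'football', 'eating', 'nachos', 'guacamole']
```

Only the content-bearing words survive, which is exactly what the topic model needs.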
Porter’s Stemmer algorithm
A common NLP technique to reduce topically similar words to their root. For example, "stemming," "stemmer,"
"stemmed" all have similar meanings; stemming reduces those terms to "stem."
Important for topic modeling, which would otherwise view those terms as separate entities and reduce
their importance in the model.
It's a bunch of rules for reducing a word:
sses -> ss
ies -> i
ational -> ate
tional -> tion
s -> ∅
when rules conflict, the one matching the longest suffix wins
Using it off the shelf is a bad idea unless you customize the rules for your domain.
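The suffix rules above can be seen in action with nltk's PorterStemmer (installed earlier). Note the outputs need not be real words:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# caresses -> caress (sses -> ss), ponies -> poni (ies -> i),
# cats -> cat (s -> null), stemming/stemmed -> stem
for word in ["caresses", "ponies", "cats", "stemming", "stemmed"]:
    print(word, "->", stemmer.stem(word))
```

"poni" illustrates why uncustomized stemming can be a bad idea: the output is a token for matching, not a dictionary word.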
Porter's Stemmer algorithm - Flowchart
[Figures: Arabic stemming process; simple stemming process]
Lemmatization
It goes one step further than stemming.
It obtains grammatically correct words and distinguishes words by their word
sense with the use of a vocabulary (e.g., type can mean write or category).
It is a much more difficult and expensive process than stemming.
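A toy sketch of the idea: real lemmatizers (e.g. nltk's WordNetLemmatizer) look words up in a vocabulary such as WordNet; the tiny lookup table below is purely illustrative, including the word-sense example from the slide.

```python
# Illustrative (word, part-of-speech) -> lemma table; not a real vocabulary.
LEMMAS = {
    ("typed", "verb"): "type",   # "type" in the sense of "to write"
    ("types", "noun"): "type",   # "type" in the sense of "category"
    ("better", "adj"): "good",   # irregular form: a stemmer cannot do this
    ("ate", "verb"): "eat",
}

def lemmatize(word, pos):
    """Return the dictionary form of `word`, using its part of speech
    to pick the right word sense; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(lemmatize("better", "adj"))
print(lemmatize("ate", "verb"))
```

"better" -> "good" shows why lemmatization needs a vocabulary: no suffix rule can recover that mapping.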
LDA2VEC - what really happens?
The lda2vec model adds in skip-grams:
a word predicts another word in the same window,
as in word2vec, but there is also a context vector
which only changes at the document level, as in LDA.
Lda2Vec - PyTorch code
Go to 20newsgroups/.
Run get_windows.ipynb to prepare data.
Run python train.py for training.
To use this on your own data you need to edit get_windows.ipynb. There are also
hyperparameters in 20newsgroups/train.py, utils/training.py, and utils/lda2vec_loss.py.