2. What is LDA?
LDA stands for Latent Dirichlet Allocation.
It models each topic k (let’s say 50 topics) as a probability distribution over words, and
each document d (let’s say 5000 documents) as a probability distribution over topics.
Mechanism - It uses a special kind of distribution called the Dirichlet distribution, which
is the multivariate generalization of the Beta distribution.
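As a rough illustration of the mechanism, the sketch below draws samples from a Dirichlet distribution using only the Python standard library. It relies on the standard fact that normalizing independent Gamma draws gives a Dirichlet sample. The helper name sample_dirichlet, the seed, and the alpha values are assumptions made for this example, not part of any library.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed (assumption) so runs are repeatable

# With two components the Dirichlet is just a Beta distribution:
# the first coordinate follows Beta(2, 5), the second is 1 minus the first.
beta_like = sample_dirichlet([2.0, 5.0], rng)

# A topic mixture for one document over 5 hypothetical topics.
# Alpha values below 1 tend to concentrate the mass on a few topics.
doc_topics = sample_dirichlet([0.5] * 5, rng)

print("Beta-like sample:", beta_like)
print("Document topic mixture:", doc_topics)
```

Every sample is a list of non-negative numbers that sum to 1, which is why it can serve as a topic mixture for a document or a word distribution for a topic.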
3. LDA in layman terms
Sentence 1: I spent the evening watching football.
Sentence 2: I ate nachos and guacamole.
Sentence 3: I spent the evening watching football while eating nachos and guacamole.
LDA might say something like:
Sentence 1 is 100% Topic 1
Sentence 2 is 100% Topic 2
Sentence 3 is 65% Topic 1, 35% Topic 2
It also tells us that
Topic 1 is about football (50%), evening (50%)
Topic 2 is about nachos (50%), guacamole (50%)
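To see how the two outputs fit together, here is a minimal sketch that combines the topic mixture of the third sentence (65% Topic 1, 35% Topic 2) with the topic word distributions above. In LDA, the probability of a word in a document is the sum over topics of P(topic | document) * P(word | topic). The function name word_probabilities is made up for this example.

```python
# Topic-word distributions taken from the example above.
topic_words = {
    "Topic 1": {"football": 0.5, "evening": 0.5},
    "Topic 2": {"nachos": 0.5, "guacamole": 0.5},
}

# Document-topic mixture for the third sentence.
doc_topics = {"Topic 1": 0.65, "Topic 2": 0.35}

def word_probabilities(doc_topics, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    probs = {}
    for topic, p_topic in doc_topics.items():
        for word, p_word in topic_words[topic].items():
            probs[word] = probs.get(word, 0.0) + p_topic * p_word
    return probs

sentence3 = word_probabilities(doc_topics, topic_words)
for word, p in sorted(sentence3.items()):
    print(f"{word}: {p:.3f}")
```

The four word probabilities add up to 1, with the football words weighted more heavily than the food words, which matches the 65/35 split.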
8. Packages used in Python
sudo pip install nltk
sudo pip install gensim
sudo pip install stop-words
9. Stop words
Stop words are commonly occurring words that don’t contribute to topic
modelling, for example:
the, and, or
However, removing stop words can sometimes hurt topic modelling.
E.g., Thor The Ragnarok is a single topic, but if we apply the stop-words mechanism, the
word “The” will be removed from it.
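A minimal sketch of stop-word removal, assuming a tiny hand-written stop list (STOP_WORDS below is hypothetical; real lists such as the ones shipped with nltk or stop-words are much longer). It also shows the side effect described above on a title that contains a stop word.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "and", "or", "i", "a", "is", "while"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [t.strip(".,!?:;") for t in text.lower().split()]
    return [t for t in tokens if t and t not in stop_words]

print(remove_stop_words("I ate nachos and guacamole."))
print(remove_stop_words("Thor The Ragnarok"))
```

The second call loses the word "the", so the title no longer survives as a single unit. One common workaround is to detect such phrases before stop-word removal and keep them as single tokens.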
10. Porter’s Stemmer algorithm
A common NLP technique to reduce topically similar words to their root. E.g., “stemming,” “stemmer,”
“stemmed” all have similar meanings; stemming reduces those terms to “stem.”
Important for topic modeling, which would otherwise view those terms as separate entities and reduce
their importance in the model.
It's a set of suffix-rewriting rules for reducing a word, for example:
sses -> ss
ies -> i
ational -> ate
tional -> tion
s -> ∅
When several rules match, the one with the longest suffix wins.
Using it as-is is a bad idea unless you customize it.
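The longest-suffix idea can be sketched in a few lines. This is not the full Porter algorithm: the real one applies its rules in several ordered steps and adds conditions on the remaining stem, which this sketch ignores. The sketch uses Porter's published rule sses -> ss, and the names RULES and apply_rules are made up for this example.

```python
# (suffix, replacement) pairs; a simplified subset of Porter-style rules.
RULES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("sses", "ss"),
    ("ies", "i"),
    ("s", ""),
]

def apply_rules(word, rules=RULES):
    """Rewrite the word with the rule whose suffix is the longest match."""
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word  # no rule applies, leave the word unchanged
    suffix, repl = max(matches, key=lambda m: len(m[0]))
    return word[: len(word) - len(suffix)] + repl

for w in ["caresses", "ponies", "relational", "conditional", "cats", "sky"]:
    print(w, "->", apply_rules(w))
```

For "caresses" both the "sses" and "s" rules match, and the longer one is used. "ponies" becomes "poni", which is not a dictionary word. That is typical of stemmers and one reason they usually need tuning.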
12. Lemmatization
It goes one step further than stemming.
It obtains grammatically correct words and distinguishes words by their word
sense with the use of a vocabulary (e.g., type can mean write or category).
It is a much more difficult and expensive process than stemming.
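A toy sketch of the idea, assuming a hypothetical mini-vocabulary keyed by word form and part of speech. Real lemmatizers, for example nltk's WordNetLemmatizer, rely on a large lexical database rather than a hand-written dictionary, which is a big part of why they are more expensive than stemming.

```python
# Hypothetical mini-vocabulary: (word form, part of speech) -> lemma.
LEMMAS = {
    ("meeting", "noun"): "meeting",  # "a meeting at noon"
    ("meeting", "verb"): "meet",     # "we are meeting at noon"
    ("better", "adj"): "good",       # irregular form, no suffix rule helps
    ("was", "verb"): "be",
    ("typed", "verb"): "type",
    ("types", "noun"): "type",
}

def lemmatize(word, pos):
    """Look up the lemma; fall back to the lowercased word if unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatize("meeting", "noun"))
print(lemmatize("meeting", "verb"))
print(lemmatize("better", "adj"))
```

The same surface form maps to different lemmas depending on its part of speech, and an irregular form such as "better" maps to a real word. A suffix-stripping stemmer can do neither.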
17. LDA2VEC – what really happens?
https://arxiv.org/pdf/1605.02019.pdf
The LDA2VEC model adds in skip-grams.
A word predicts another word in the same window,
as in word2vec, but it does so through a context vector: the word vector plus
a document vector, which only changes at the document level, as in LDA.
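A minimal sketch of how the context vector is formed, following the construction in the paper linked above: the document vector is a mixture of topic vectors (with proportions given by a softmax over per-document weights), and the context vector is the pivot word vector plus that document vector. The tiny 3-dimensional vectors and all variable names below are made up for illustration. A real model learns these values during training.

```python
import math

def softmax(weights):
    """Turn unnormalized document weights into topic proportions."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def document_vector(doc_weights, topic_vectors):
    """Weighted sum of topic vectors; shared by every word in the document."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [sum(p * t[i] for p, t in zip(proportions, topic_vectors))
            for i in range(dim)]

def context_vector(word_vector, doc_weights, topic_vectors):
    """Context vector = pivot word vector + document vector."""
    doc_vec = document_vector(doc_weights, topic_vectors)
    return [w + d for w, d in zip(word_vector, doc_vec)]

# Made-up values: 2 topics, 3-dimensional embeddings.
topic_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_weights = [2.0, 0.0]      # this document leans towards the first topic
pivot_word = [0.1, 0.2, 0.3]  # word2vec-style vector of the pivot word

ctx = context_vector(pivot_word, doc_weights, topic_vectors)
print("Topic proportions:", softmax(doc_weights))
print("Context vector:", ctx)
```

In training, this context vector replaces the plain word vector in the skip-gram objective, so the prediction of nearby words depends on both the pivot word and the document's topic mixture.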
18. Lda2Vec – PyTorch code
Source: https://github.com/TropComplique/lda2vec-pytorch
Go to 20newsgroups/.
Run get_windows.ipynb to prepare data.
Run python train.py for training.
Run explore_trained_model.ipynb.
To use this on your data you need to edit get_windows.ipynb. Also there are
hyperparameters in 20newsgroups/train.py, utils/training.py, utils/lda2vec_loss.py.