2. Problem Statement
1. How do we summarize a large number of documents (research papers, news articles, reviews, tweets, etc.)?
2. Does there exist a way to cluster documents in a fuzzy manner?
3. How can we find documents that talk about the same subjects and abstractions?
3. Applications
1. Understanding high-level concepts in a research article
2. Extracting topics/abstractions from reviews
3. Tagging documents with appropriate tags
4. Nomenclature
❖ Document - data in the form of text / text data
❖ Corpus - collection of documents
❖ Bag of words - representation of a document as a multiset of its words
❖ Topics - collections of highly correlated words
❖ Term Frequency - no. of times a word appears in a document
❖ Document Frequency - no. of documents that contain a word
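A minimal sketch of these counts on a toy corpus (the three example sentences and all variable names are made up for illustration):

from collections import Counter

# Hypothetical toy corpus: each document is reduced to a bag of words (a multiset).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

bags = [Counter(doc.split()) for doc in corpus]   # bag-of-words per document

# Term Frequency: number of times a word appears in one document.
tf_doc0 = bags[0]                                 # e.g. tf_doc0["the"] == 2

# Document Frequency: number of documents that contain the word.
df = Counter(word for bag in bags for word in bag.keys())

print(tf_doc0["the"], df["sat"])                  # -> 2 2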
5. Topic Modelling
Topic modelling is a technique to extract the hidden topics from large volumes of text.
https://github.com/chdoig/pytexas2015-topic-modeling
6. Word Representation
1. Term Frequency
2. Term Frequency - Inverse Document Frequency (TF-IDF)
3. Word Embeddings
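A small sketch contrasting raw term-frequency counts with TF-IDF, assuming scikit-learn is available; the two-sentence corpus is made up. Word embeddings, the third representation, are dense learned vectors and come up again with Word2Vec later.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]

tf = CountVectorizer().fit_transform(corpus)      # raw term-frequency counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # counts reweighted by inverse document frequency

print(tf.toarray())
print(tfidf.toarray().round(2))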
8. Natural Language Processing
Stop Words: Frequently occurring words such as the, in, but, etc.
Lemmatization: Derive the root word with the help of a vocabulary and morphological analysis. Ex. Running → Run
Part of Speech Tagging: Tag each word with its appropriate part of speech
Coreference Resolution: Find mentions (e.g. pronouns and noun phrases) that refer to the same entity in a text and link them together
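A rough preprocessing sketch with spaCy, assuming the "en_core_web_sm" model has been downloaded separately; coreference resolution is omitted here because it is not part of spaCy's base pipeline.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The runners were running quickly in the park")

for token in doc:
    if token.is_stop:                             # stop words like "the", "were", "in"
        continue
    print(token.text, token.lemma_, token.pos_)   # lemma (running -> run) and part-of-speech tag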
10. Pros and Cons - LSA
Pros:
● LSA is easy to train and tune (no hyperparameters except the rank)
● Embeddings usually work fine in downstream tasks such as clustering, classification, regression and similarity search
Cons:
● The major drawback is that the embeddings are not interpretable
● The probabilistic model behind LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), whereas a Poisson distribution is observed in practice
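A minimal LSA sketch, assuming scikit-learn: a truncated SVD of the TF-IDF matrix, with the rank as the only real knob, matching the "no hyperparameters except rank" point above. The corpus and the choice of rank 2 are purely illustrative.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["topic models find hidden themes",
          "word vectors capture word meaning",
          "hidden themes summarize documents"]

X = TfidfVectorizer().fit_transform(corpus)
lsa = TruncatedSVD(n_components=2, random_state=0)   # rank k = 2
doc_embeddings = lsa.fit_transform(X)                # dense, hard-to-interpret embeddings

print(doc_embeddings.shape)                          # (3, 2)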
11. Latent Dirichlet Allocation
Generative Process (see the sketch below):
● Assume that a document consists of N words
● For each word:
○ Select a topic from the document-topic distribution
○ Draw a word from that topic's topic-word distribution
https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158
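A toy simulation of this generative story using numpy; the values of K, V, N and the Dirichlet hyperparameters alpha and beta are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 8                       # topics, vocabulary size, words in the document
alpha, beta = 0.5, 0.1                   # Dirichlet hyperparameters (illustrative values)

theta = rng.dirichlet([alpha] * K)       # document-topic distribution
phi = rng.dirichlet([beta] * V, size=K)  # topic-word distribution, one row per topic

doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)           # select a topic from the document-topic distribution
    w = rng.choice(V, p=phi[z])          # draw a word from that topic's word distribution
    doc.append(w)

print(doc)                               # word ids of the generated document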
12. Collapsed Gibbs Sampling - LDA
● Go through each document and randomly assign each word to one of the K topics.
● This random assignment already gives both distributions (document-topic and topic-word), although poor ones.
● For each document d:
○ For each word w in d:
■ For each topic t, compute:
1. p(topic t | document d)
2. p(word w | topic t)
3. Reassign w to a new topic sampled with probability proportional to p(topic t | document d) * p(word w | topic t); we assume that all topic assignments except the one for the current word are correct, and then update the assignment of the current word using our model of how documents are generated (see the sketch below).
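A compact sketch of the reassignment step above, assuming the count arrays n_mz (document-topic), n_tz (word-topic) and n_z (per-topic totals) have already been built from the random initialisation; alpha, beta and the function name are illustrative.

import numpy as np

def resample_topic(m, t, z_old, n_mz, n_tz, n_z, alpha=0.5, beta=0.1):
    """Reassign word t in document m from topic z_old to a newly sampled topic."""
    K, V = n_mz.shape[1], n_tz.shape[0]
    # remove the current assignment ("all assignments except the current word")
    n_mz[m, z_old] -= 1
    n_tz[t, z_old] -= 1
    n_z[z_old] -= 1
    # p(topic t | document d) * p(word w | topic t) for every topic
    p = (n_mz[m] + alpha) * (n_tz[t] + beta) / (n_z + V * beta)
    z_new = np.random.choice(K, p=p / p.sum())
    # record the new assignment
    n_mz[m, z_new] += 1
    n_tz[t, z_new] += 1
    n_z[z_new] += 1
    return z_new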
13. LDA - Posterior using Gibbs Sampling
This method integrates out all random variables except the word-topic assignments {z_mn} and draws each z_mn from its posterior,
where n_mz is the word count of document m with topic z, n_tz is the count of word t with topic z, n_z is the total word count with topic z, and -mn means "except z_mn".
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
The posterior of z_mn is the following:
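(The formula on the original slide is an image; the following is the standard collapsed-Gibbs posterior consistent with the counts defined above, where t = w_mn, alpha and beta are the Dirichlet hyperparameters and V is the vocabulary size; these symbols are assumptions not defined on the slide itself.)

p(z_{mn} = z \mid \mathbf{z}_{-mn}, \mathbf{w}) \;\propto\; \left(n_{mz}^{-mn} + \alpha\right) \cdot \frac{n_{tz}^{-mn} + \beta}{n_{z}^{-mn} + V\beta}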
15. Word2Vec
- Represents each word as an n-dimensional vector
- Created a huge buzz in the NLP community
- Commendable results from a simple algorithm
- Simple vector arithmetic captures analogies, e.g.
King - Man + Woman = Queen
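A hedged sketch of this arithmetic with gensim's Word2Vec (4.x parameter names); the toy sentences and parameter values are made up, and the famous analogy only really emerges when training on a large corpus.

from gensim.models import Word2Vec

sentences = [["king", "queen", "man", "woman"],
             ["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"]]

model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50, seed=1)

# vector arithmetic: king - man + woman -> nearest neighbours
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))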
19. Evaluation - Keeping a Human in the Loop
Word intrusion: For each trained topic, take its first ten words, substitute one of them with another, randomly chosen word (the intruder!) and see whether a human can reliably tell which one it was.
Topic intrusion: Subjects are shown the title and a snippet from a document. Along with the document they are presented with four topics. Three of those topics are the highest-probability topics assigned to that document. The remaining intruder topic is chosen randomly from the other, low-probability topics in the model.
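A small sketch of constructing one word-intrusion question; the topic word list and the intruder vocabulary are invented for illustration.

import random

random.seed(0)
topic_words = ["model", "topic", "word", "document", "distribution",
               "probability", "corpus", "latent", "dirichlet", "allocation"]
other_vocab = ["banana", "guitar", "volcano"]

question = topic_words[:10]
intruder = random.choice(other_vocab)
question[random.randrange(len(question))] = intruder   # substitute one top word with the intruder
random.shuffle(question)

print(question)                 # list shown to the annotator
print("intruder:", intruder)    # ground truth for scoring the annotator's guess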
20. References
1. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
2. Hoffman, M., Bach, F., and Blei, D. M. "Online Learning for Latent Dirichlet Allocation." NIPS (2010).
3. Reed, C. "Latent Dirichlet Allocation: Towards a Deeper Understanding." Tutorial (2012).
4. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. "Distributed Representations of Words and Phrases and their Compositionality." NIPS (2013).
5. Mikolov, T., Chen, K., Corrado, G., and Dean, J. "Efficient Estimation of Word Representations in Vector Space." (2013).
6. Moody, Christopher E. "Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec." (2016).