This tutorial on topic modelling introduces the basic ingredients common to all topic models and builds up to the development of a new model at the end.
Topic Modelling: Tutorial on Usage and Applications
1. Tutorial on Topic Modelling
by Ayush Jain
Prepared as an assignment for CS410: Text Information Systems in Spring
2. Topic Models
• Discover hidden themes that pervade the collection
• Tag the documents on the basis of these themes
• Organize, summarize, and search the documents on the basis of these themes
3. Takeaways from this tutorial
• What are probabilistic topic models?
• What kind of things can they do?
• How do we train/infer a topic model?
• How do we evaluate a topic model?
4. Tools
• Topic models are a special application of probability theory. In particular, they touch
– Probabilistic graphical models
– Conjugate and non-conjugate priors
– Approximate posterior inference
– Exploratory data analysis
5. The Key Steps in Every Topic Model
• Make assumptions
• Collect data
• Infer posterior
• Evaluate
• Predict
6. Outline
• Latent Dirichlet Allocation – application of the key steps
– Graphical model encoding the assumptions
– Inference algorithms – Gibbs sampling
• Topic models for more complex tasks
– Rating prediction
• A completely novel topic model incorporating sentiments (that we'll develop!)
7. Latent Dirichlet Allocation
• Already covered in the course
• Application of the key steps
– Make assumptions
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is drawn from a topic
8. Latent Dirichlet Allocation
• Graphical Model
• Encodes the assumptions
• Allows us to break the joint probability down into a product of conditionals (written out below)
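In the standard LDA notation (topics β_k, per-document proportions θ_d, per-word assignments z_{d,n}, observed words w_{d,n}), this factorization reads:

p(β, θ, z, w) = ∏_{k=1}^{K} p(β_k | η) · ∏_{d=1}^{D} [ p(θ_d | α) ∏_{n=1}^{N_d} p(z_{d,n} | θ_d) p(w_{d,n} | β_{z_{d,n}}) ]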
10. Latent Dirichlet Allocation
• Application of the key steps
– Make assumptions (II)
• Choose probability distributions
– Choosing conjugate distributions makes life easier!
» E.g., the multinomial and Dirichlet are conjugate distributions
11. Aside: Conjugate Distributions
• Dirichlet distribution: a prior over θ, the probabilities of seeing the different sides of a die:
p(θ | α) ∝ ∏_{i=1}^{V} θ_i^{α_i − 1}
• Multinomial distribution: the number of occurrences of the different sides (W) of the die is distributed multinomially, so p(W | θ) is multinomial:
p(W | θ) ∝ ∏_{i=1}^{V} θ_i^{x_i}, where x_i is the number of times side i was observed
• Posterior distribution: again a Dirichlet, with the observed counts added to the prior parameters:
p(θ | W, α) ∝ ∏_{i=1}^{V} θ_i^{x_i + α_i − 1}, i.e. θ | W ~ Dir(α + x)
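A minimal numerical sketch of this conjugacy (the die, the prior strengths, and the counts are made up for illustration):

import numpy as np

alpha = np.ones(6)                 # uniform Dirichlet prior over a 6-sided die
x = np.array([3, 0, 1, 5, 2, 1])   # observed counts of each side

# Conjugacy: the posterior is again Dirichlet, with parameters alpha + x
posterior = alpha + x
theta_mean = posterior / posterior.sum()  # posterior mean estimate of theta
print(theta_mean)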
12. Latent Dirichlet Allocation
• Application of the key steps
– Collect data
• The corpus on which you want to detect themes
13. Latent Dirichlet Allocation
• Application of the key steps
– Infer Posterior
• Probabilistic graphical models provide inference algorithms
– Mean-field variational methods
– Expectation propagation (similar to EM)
– Gibbs sampling (most popular)
– Variational inference
14. Aside: Gibbs Sampling
– Used when samples need to be drawn from a joint distribution, but the joint distribution is difficult to sample from directly
– Goal: sample X = (x1, …, xn) from the joint pdf p(x1, …, xn)
– The conditional distributions are relatively straightforward to sample from
– Procedure (a toy example follows):
• Begin with some initial X^(i)
• Sample x_j^(i+1) from p(x_j | x_1^(i+1), …, x_{j−1}^(i+1), x_{j+1}^(i), …, x_n^(i))
• Repeat
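As a toy illustration (not from the slides), here is a Gibbs sampler for a standard bivariate normal with correlation rho, where both conditionals are known in closed form:

import numpy as np

def gibbs_bivariate_normal(rho=0.8, iters=10_000, seed=0):
    # Each conditional of a standard bivariate normal is a 1-D normal:
    # x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1.
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                  # initial X^(0)
    samples = np.empty((iters, 2))
    sd = np.sqrt(1 - rho**2)
    for i in range(iters):
        x1 = rng.normal(rho * x2, sd)  # sample x1 from p(x1 | x2)
        x2 = rng.normal(rho * x1, sd)  # sample x2 from p(x2 | x1), using the new x1
        samples[i] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal()
print(np.corrcoef(samples[1000:].T))   # empirical correlation ≈ rho after burn-in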
15. Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• Here, X is all of the parameters to be inferred
– Per-word topic assignments z_{d,n}
– Per-document topic proportions θ_d
– Per-corpus topic–word distributions β_k
• Extremely high dimensional!
• Solution:
– Integrate out θ and β
– Conjugate distributions make the integration straightforward!
16. Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• After all computation:
P(z_{d,n} = k | Z_{−(d,n)}, W; α, β) ∝ (n_{d,:}^{k,−(d,n)} + α_k) · (n_{:,v}^{k,−(d,n)} + β_v) / Σ_{r=1}^{V} (n_{:,r}^{k,−(d,n)} + β_r)
• n_{d,:}^{k,−(d,n)}: the number of words in document d that belong to topic k, excluding the n-th word
• v: the index in the vocabulary of the n-th word of the d-th document
• Linear time in the number of tokens!
17. Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling), continued
• Further improvements exploit the sparsity of the problem when the corpus and the number of topics are large (a code sketch follows)
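To make the update above concrete, here is a minimal collapsed Gibbs sampler for LDA (a sketch under simplifying assumptions: symmetric scalar hyperparameters alpha and beta, with function and variable names of my own choosing):

import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V)
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))  # words in doc d assigned to topic k
    n_kv = np.zeros((K, V))  # times vocabulary word v is assigned to topic k
    n_k = np.zeros(K)        # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initialization
    for d, doc in enumerate(docs):           # accumulate the initial counts
        for n, v in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]                  # remove the current assignment
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # P(z = k | rest) ∝ (n_dk + α)(n_kv + β) / (n_k + V·β),
                # the slide's update with symmetric α and β
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                  # record the new assignment
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    return z, n_dk, n_kv

Each sweep touches every token exactly once and does O(K) work per token, which is the linear-time behaviour noted on the slide.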
18. Topic Models: Evaluation
• Underlying topics are subjective
– Makes the evaluation difficult
– Workaround: evaluate on a downstream application
• Document classification
• Information Retrieval
• Rating Prediction
19. Topic Models: Evaluation
• Use the trained model to predict the probability of unseen documents
– Better models assign higher probability
• Even better:
– Predict the probability of the second half of each document, using the first halves as part of the corpus
– Does not require whole documents to be held out
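The standard way to report this (not spelled out on the slide) is perplexity over a held-out set D_test:

perplexity(D_test) = exp( − Σ_d log p(w_d) / Σ_d N_d )

where N_d is the number of tokens in document d; lower perplexity means the model assigns higher probability to the held-out text.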
20. Beyond LDA: Rating Prediction
• Predict ratings associated with text
• Additional assumption:
– The rating is conditional on the topic assignments of the different words
• Graphical Model (figure omitted)
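One concrete way to encode this assumption (as in supervised LDA, which this slide appears to describe) is to draw the rating from a normal distribution whose mean is linear in the document's empirical topic frequencies z̄_d:

r_d ~ N(η^T z̄_d, σ²), where z̄_d = (1/N_d) Σ_n z_{d,n}

Here η is a vector of per-topic regression weights learned together with the topics.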
21. Beyond LDA: Rating Prediction
• Topics
– Least, problem, unfortunately, supposed, worse, flat, dull
– Bad, guys, watchable, not, one, movie
– Both, motion, simple, perfect, fascinating, power
– Cinematography, screenplay, performances, pictures, effective, sound
• Notice how the assumption affects the extracted topics
– Because the overall rating depends on the number of words in the different topics, topics become collections of words that appear in similarly rated documents
– Topics express sentiment but lose their original meaning!
22. Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction
– Joint topic and sentiment modelling
Generative Model (a numerical sketch follows)
1. Choose aspects and the words W_dij belonging to each aspect
2. Calculate each aspect rating from its aspect words:
s_di = Σ_{j=1}^{n} β_ij W_dij
3. The overall rating is a weighted sum of the aspect ratings:
r_d ~ N( Σ_{i=1}^{k} α_di Σ_{j=1}^{n} β_ij W_dij , δ² )
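A small numpy sketch of steps 2–3 for a single review (array names, shapes, and values are my own illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 10                            # aspects, words per aspect
W = rng.poisson(1.0, size=(k, n))       # aspect-word counts for one review d (made up)
beta = rng.normal(0, 1, size=(k, n))    # per-aspect word sentiment weights
alpha = rng.dirichlet(np.ones(k))       # aspect importance weights for this review
delta = 0.5                             # rating noise

s = (beta * W).sum(axis=1)              # step 2: s_di = sum_j beta_ij * W_dij
r = rng.normal(alpha @ s, delta)        # step 3: r_d ~ N(sum_i alpha_di * s_di, delta^2)
print(s, r)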
27. Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction
– Results
• Requires aspect keyword supervision – any way to remove it? (Think LDA!)
28. Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction without aspect keyword supervision
– The aspect-modelling module from LDA is included
29. Beyond LDA: Topic Phrase Mining
• Motivation:
– "machine learning" is a phrase and should be assigned to one topic
• Assigning "machine" to "Industry" and "learning" to "Education" is incorrect
• Approach (a sketch follows this list):
– Extract high-frequency phrases
• If a phrase is infrequent, so is any super-phrase
• If a document does not contain a frequent phrase of length n, it also does not contain any of length > n
• Use hierarchical clustering to find frequent phrases
– Apply LDA on the phrase tokens
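A minimal sketch of the frequency-based pruning idea (my own simplification: it counts contiguous n-grams and uses the "infrequent phrase implies infrequent super-phrase" property to prune candidates, rather than the full clustering algorithm):

from collections import Counter

def frequent_phrases(docs, min_count=5, max_len=4):
    # docs: list of token lists; returns the frequent n-grams for each length n
    unigrams = Counter(t for d in docs for t in d)
    frequent = {1: {(w,) for w, c in unigrams.items() if c >= min_count}}
    for n in range(2, max_len + 1):
        counts = Counter()
        for doc in docs:
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                # prune: a phrase can be frequent only if both of its
                # (n-1)-length sub-phrases are frequent
                if gram[:-1] in frequent[n - 1] and gram[1:] in frequent[n - 1]:
                    counts[gram] += 1
        frequent[n] = {g for g, c in counts.items() if c >= min_count}
        if not frequent[n]:   # no frequent phrase of length n => none longer
            break
    return frequent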
31. Sentiment Analysis
• Make assumptions
– Each (topic, sentiment) pair has a vocabulary
• "quick delivery" has higher probability under (service, +) than under (service, −) or (food quality, +)
– Each (topic, rating) pair has a sentiment distribution
• Positive sentiments for food quality are more likely to appear in highly rated reviews
• A 4-star rated restaurant is likely to have good food quality even if it does not provide wireless
– Each review has
• An overall rating
• A topic distribution: different users might talk about different aspects in their reviews
33. Sentiment Analysis
• Graphical Model
Generative Process (a runnable sketch follows)
1. Choose a word distribution for every (topic, sentiment) pair
2. Choose a sentiment distribution for every (topic, rating) pair
3. For each review:
• Choose a rating
• Choose a topic distribution
• For each word in the review:
• Choose a topic
• Choose a sentiment
• Choose a word
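A runnable sketch of this generative process (the Dirichlet priors, the dimensions, and the fixed review length are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
K, S, R, V = 5, 2, 5, 1000    # topics, sentiments, ratings, vocabulary size
N_WORDS = 50                  # words per review (fixed here for simplicity)

# Step 1: word distribution phi for every (topic, sentiment) pair
phi = rng.dirichlet(np.full(V, 0.1), size=(K, S))
# Step 2: sentiment distribution pi for every (topic, rating) pair
pi = rng.dirichlet(np.full(S, 1.0), size=(K, R))

def generate_review():
    # Step 3: one review
    r = rng.integers(R)                      # choose a rating
    theta = rng.dirichlet(np.full(K, 0.1))   # choose a topic distribution
    words = []
    for _ in range(N_WORDS):
        z = rng.choice(K, p=theta)           # choose a topic
        s = rng.choice(S, p=pi[z, r])        # choose a sentiment
        w = rng.choice(V, p=phi[z, s])       # choose a word
        words.append(w)
    return r, words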
39. Sentiment Analysis
• Inference
Parameters to be inferred:
1. Per-document topic distribution
2. Rating distribution
3. Sentiment distribution
4. Word distributions
Use collapsed Gibbs sampling! Integrate out φ and π
42. Sentiment Analysis
• Evaluation – Yelp
– Opinion summarization
• For all reviews of a given restaurant:
– 15% of words assigned to topic "Vegetarian"
– 5% to "Breakfast" (Eggs), with sentiment 0.78
– 3% to "Staff Attitude", with sentiment 0.82
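Given the inferred per-word topic and sentiment assignments from the sampler above, such a summary could be computed roughly like this (a hypothetical post-processing step, not from the slides):

import numpy as np

def opinion_summary(z, s, K):
    # z, s: per-word topic and sentiment assignments (0 = negative, 1 = positive)
    # pooled over all reviews of one restaurant
    z, s = np.asarray(z), np.asarray(s)
    for k in range(K):
        mask = z == k
        share = mask.mean()                            # fraction of words on topic k
        sentiment = s[mask].mean() if mask.any() else float("nan")
        print(f"topic {k}: {share:.0%} of words, sentiment {sentiment:.2f}")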
43. Topic Modelling: Future Work
• Missing links
– Model selection: which model to pick for which application
– Incorporating linguistic structure/NLP:
• How can our knowledge of language help?
– Bag of words:
• Most models are based on the unigram bag-of-words model
• Context is lost – words like "good" or "nice" are often associated with particular words in context, e.g. "good standard of living", "nice view from the hotel"