This tutorial on topic modelling introduces the basic ingredients common to all topic models and builds up to the development of a new model at the end.
Topic Modelling: Tutorial on Usage and Applications
1. Tutorial on Topic Modelling
by Ayush Jain
Prepared as an assignment for CS410: Text Information Systems in Spring
2. Topic Models
• Discover hidden themes that pervade the collection
• Tag the documents on the basis of these themes
• Organize, summarize, and search the documents on the basis of these themes
3. Takeaways from this tutorial
• What are probabilistic topic models?
• What kind of things can they do?
• How do we train/infer a topic model?
• How do we evaluate a topic model?
4. Tools
• Topic models are a special application of probability theory. In particular, they touch
– Probabilistic graphical models
– Conjugate and non-conjugate priors
– Approximate posterior inference
– Exploratory data analysis
5. The Key Steps in Every Topic Model
• Make assumptions
• Collect data
• Infer posterior
• Evaluate
• Predict
6. Outline
• Latent Dirichlet Allocation – application of the key steps
– Graphical model encoding the assumptions
– Inference algorithms – Gibbs sampling
• Topic models for more complex tasks
– Rating prediction
• A completely novel topic model incorporating sentiments (that we'll develop!)
7. Latent Dirichlet Allocation
• Already covered in the course
• Application of the key steps
– Make assumptions
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is drawn from a topic
8. Latent Dirichlet Allocation
• Graphical Model
• Encodes the assumptions
• Allows us to break the joint probability down into a product of conditionals (written out below)
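In the standard LDA notation (topics β_k, per-document proportions θ_d, per-word assignments z_{d,n}, observed words w_{d,n}), this factorization reads:

p(β, θ, z, w) = ∏_{k=1}^{K} p(β_k | η) · ∏_{d=1}^{D} [ p(θ_d | α) ∏_{n=1}^{N_d} p(z_{d,n} | θ_d) p(w_{d,n} | β_{z_{d,n}}) ]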
10. Latent Dirichlet Allocation
• Application of the key steps
– Make assumptions (II)
• Choose probability distributions
– Choosing conjugate distributions makes life easier!
» E.g., the multinomial and Dirichlet are conjugate distributions
11. Aside: Conjugate Distributions
• Dirichlet distribution: a prior over θ, the probabilities of seeing the different sides of a die:
p(θ | α) ∝ ∏_{i=1}^{V} θ_i^{α_i − 1}
• Multinomial distribution: the number of occurrences of the different sides (W) of the die is distributed multinomially, so p(W | θ) is multinomial:
p(W | θ) ∝ ∏_{i=1}^{V} θ_i^{x_i}, where x_i is the number of times side i was observed
• Posterior distribution: again a Dirichlet, with the observed counts added to the prior parameters:
p(θ | W, α) ∝ ∏_{i=1}^{V} θ_i^{x_i + α_i − 1}, i.e. θ | W ~ Dir(α + x)
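A minimal numerical sketch of this conjugacy (the die, the prior strengths, and the counts are made up for illustration):

import numpy as np

alpha = np.ones(6)                 # uniform Dirichlet prior over a 6-sided die
x = np.array([3, 0, 1, 5, 2, 1])   # observed counts of each side

# Conjugacy: the posterior is again Dirichlet, with parameters alpha + x
posterior = alpha + x
theta_mean = posterior / posterior.sum()  # posterior mean estimate of theta
print(theta_mean)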
12. Latent Dirichlet Allocation
• Application of the key steps
– Collect data
• The corpus on which you want to detect themes
13. Latent Dirichlet Allocation
• Application of the key steps
– Infer Posterior
• Probabilistic graphical models provide inference algorithms
– Mean-field variational methods
– Expectation propagation (similar to EM)
– Gibbs sampling (most popular)
– Variational inference
14. Aside: Gibbs Sampling
– Used when samples need to be drawn from a joint distribution, but the joint distribution is difficult to sample from directly
– Goal: sample X = (x1, …, xn) from the joint pdf p(x1, …, xn)
– The conditional distributions are relatively straightforward to sample from
– Procedure (a toy example follows):
• Begin with some initial X^(i)
• Sample x_j^(i+1) from p(x_j | x_1^(i+1), …, x_{j−1}^(i+1), x_{j+1}^(i), …, x_n^(i))
• Repeat
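As a toy illustration (not from the slides), here is a Gibbs sampler for a standard bivariate normal with correlation rho, where both conditionals are known in closed form:

import numpy as np

def gibbs_bivariate_normal(rho=0.8, iters=10_000, seed=0):
    # Each conditional of a standard bivariate normal is a 1-D normal:
    # x1 | x2 ~ N(rho * x2, 1 - rho**2), and symmetrically for x2 | x1.
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                  # initial X^(0)
    samples = np.empty((iters, 2))
    sd = np.sqrt(1 - rho**2)
    for i in range(iters):
        x1 = rng.normal(rho * x2, sd)  # sample x1 from p(x1 | x2)
        x2 = rng.normal(rho * x1, sd)  # sample x2 from p(x2 | x1), using the new x1
        samples[i] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal()
print(np.corrcoef(samples[1000:].T))   # empirical correlation ≈ rho after burn-in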
15. Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• Here, X is all of the parameters to be inferred
– Per-word topic assignments z_{d,n}
– Per-document topic proportions θ_d
– Per-corpus topic–word distributions β_k
• Extremely high dimensional!
• Solution:
– Integrate out θ and β
– Conjugate distributions make the integration straightforward!
16. Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling)
• After all computation:
P(z_{d,n} = k | Z_{−(d,n)}, W; α, β) ∝ (n_{d,:}^{k,−(d,n)} + α_k) · (n_{:,v}^{k,−(d,n)} + β_v) / Σ_{r=1}^{V} (n_{:,r}^{k,−(d,n)} + β_r)
• n_{d,:}^{k,−(d,n)}: the number of words in document d that belong to topic k, excluding the n-th word
• v: the index in the vocabulary of the n-th word of the d-th document
• Linear time in the number of tokens!
17. Latent Dirichlet Allocation
• Application of the key steps
– Infer posterior (Gibbs sampling), continued
• Further improvements exploit the sparsity of the problem when the corpus and the number of topics are large (a code sketch follows)
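To make the update above concrete, here is a minimal collapsed Gibbs sampler for LDA (a sketch under simplifying assumptions: symmetric scalar hyperparameters alpha and beta, with function and variable names of my own choosing):

import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V)
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))  # words in doc d assigned to topic k
    n_kv = np.zeros((K, V))  # times vocabulary word v is assigned to topic k
    n_k = np.zeros(K)        # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initialization
    for d, doc in enumerate(docs):           # accumulate the initial counts
        for n, v in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]                  # remove the current assignment
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # P(z = k | rest) ∝ (n_dk + α)(n_kv + β) / (n_k + V·β),
                # the slide's update with symmetric α and β
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                  # record the new assignment
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    return z, n_dk, n_kv

Each sweep touches every token exactly once and does O(K) work per token, which is the linear-time behaviour noted on the slide.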
18. Topic Models: Evaluation
• Underlying topics are subjective
– Makes the evaluation difficult
– Workaround: evaluate on a downstream application
• Document classification
• Information Retrieval
• Rating Prediction
19. Topic Models: Evaluation
• Use the trained model to predict the probability of unseen documents
– Better models assign higher probability
• Even better:
– Predict the probability of the second half of each document, using the first halves as part of the corpus
– Does not require whole documents to be held out
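The standard way to report this (not spelled out on the slide) is perplexity over a held-out set D_test:

perplexity(D_test) = exp( − Σ_d log p(w_d) / Σ_d N_d )

where N_d is the number of tokens in document d; lower perplexity means the model assigns higher probability to the held-out text.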
20. Beyond LDA: Rating Prediction
• Predict ratings associated with text
• Additional assumption:
– The rating is conditional on the topic assignments of the different words
• Graphical Model (figure omitted)
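One concrete way to encode this assumption (as in supervised LDA, which this slide appears to describe) is to draw the rating from a normal distribution whose mean is linear in the document's empirical topic frequencies z̄_d:

r_d ~ N(η^T z̄_d, σ²), where z̄_d = (1/N_d) Σ_n z_{d,n}

Here η is a vector of per-topic regression weights learned together with the topics.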
21. Beyond LDA: Rating Prediction
• Topics
– Least, problem, unfortunately, supposed, worse, flat, dull
– Bad, guys, watchable, not, one, movie
– Both, motion, simple, perfect, fascinating, power
– Cinematography, screenplay, performances, pictures, effective, sound
• Notice how the assumption affects the extracted topics
– Because the overall rating depends on the number of words in the different topics, topics become collections of words that appear in similarly rated documents
– Topics express sentiment but lose their original meaning!
22. Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction
– Joint topic and sentiment modelling
Generative Model (a numerical sketch follows)
1. Choose aspects and the words W_dij belonging to each aspect
2. Calculate each aspect rating from its aspect words:
s_di = Σ_{j=1}^{n} β_ij W_dij
3. The overall rating is a weighted sum of the aspect ratings:
r_d ~ N( Σ_{i=1}^{k} α_di Σ_{j=1}^{n} β_ij W_dij , δ² )
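A small numpy sketch of steps 2–3 for a single review (array names, shapes, and values are my own illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 10                            # aspects, words per aspect
W = rng.poisson(1.0, size=(k, n))       # aspect-word counts for one review d (made up)
beta = rng.normal(0, 1, size=(k, n))    # per-aspect word sentiment weights
alpha = rng.dirichlet(np.ones(k))       # aspect importance weights for this review
delta = 0.5                             # rating noise

s = (beta * W).sum(axis=1)              # step 2: s_di = sum_j beta_ij * W_dij
r = rng.normal(alpha @ s, delta)        # step 3: r_d ~ N(sum_i alpha_di * s_di, delta^2)
print(s, r)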
27. Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction
– Results
• Requires aspect keyword supervision – any way to remove it? (Think LDA!)
28. Beyond LDA: Rating Prediction
• Latent Aspect Rating Prediction without aspect keyword supervision
– The aspect-modelling module from LDA is included
29. Beyond LDA: Topic Phrase Mining
• Motivation:
– "machine learning" is a phrase and should be assigned to one topic
• Assigning "machine" to "Industry" and "learning" to "Education" is incorrect
• Approach (a sketch follows this list):
– Extract high-frequency phrases
• If a phrase is infrequent, so is any super-phrase
• If a document does not contain a frequent phrase of length n, it also does not contain any of length > n
• Use hierarchical clustering to find frequent phrases
– Apply LDA on the phrase tokens
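A minimal sketch of the frequency-based pruning idea (my own simplification: it counts contiguous n-grams and uses the "infrequent phrase implies infrequent super-phrase" property to prune candidates, rather than the full clustering algorithm):

from collections import Counter

def frequent_phrases(docs, min_count=5, max_len=4):
    # docs: list of token lists; returns the frequent n-grams for each length n
    unigrams = Counter(t for d in docs for t in d)
    frequent = {1: {(w,) for w, c in unigrams.items() if c >= min_count}}
    for n in range(2, max_len + 1):
        counts = Counter()
        for doc in docs:
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                # prune: a phrase can be frequent only if both of its
                # (n-1)-length sub-phrases are frequent
                if gram[:-1] in frequent[n - 1] and gram[1:] in frequent[n - 1]:
                    counts[gram] += 1
        frequent[n] = {g for g, c in counts.items() if c >= min_count}
        if not frequent[n]:   # no frequent phrase of length n => none longer
            break
    return frequent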
31. Sentiment Analysis
• Make assumptions
– Each (topic, sentiment) pair has a vocabulary
• "quick delivery" has higher probability under (service, +) than under (service, −) or (food quality, +)
– Each (topic, rating) pair has a sentiment distribution
• Positive sentiments for food quality are more likely to appear in highly rated reviews
• A 4-star rated restaurant is likely to have good food quality even if it does not provide wireless
– Each review has
• An overall rating
• A topic distribution: different users might talk about different aspects in their reviews
33. Sentiment Analysis
• Graphical Model
Generative Process (a runnable sketch follows)
1. Choose a word distribution for every (topic, sentiment) pair
2. Choose a sentiment distribution for every (topic, rating) pair
3. For each review:
• Choose a rating
• Choose a topic distribution
• For each word in the review:
• Choose a topic
• Choose a sentiment
• Choose a word
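A runnable sketch of this generative process (the Dirichlet priors, the dimensions, and the fixed review length are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
K, S, R, V = 5, 2, 5, 1000    # topics, sentiments, ratings, vocabulary size
N_WORDS = 50                  # words per review (fixed here for simplicity)

# Step 1: word distribution phi for every (topic, sentiment) pair
phi = rng.dirichlet(np.full(V, 0.1), size=(K, S))
# Step 2: sentiment distribution pi for every (topic, rating) pair
pi = rng.dirichlet(np.full(S, 1.0), size=(K, R))

def generate_review():
    # Step 3: one review
    r = rng.integers(R)                      # choose a rating
    theta = rng.dirichlet(np.full(K, 0.1))   # choose a topic distribution
    words = []
    for _ in range(N_WORDS):
        z = rng.choice(K, p=theta)           # choose a topic
        s = rng.choice(S, p=pi[z, r])        # choose a sentiment
        w = rng.choice(V, p=phi[z, s])       # choose a word
        words.append(w)
    return r, words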
39. Sentiment Analysis
• Inference
Parameters to be inferred:
1. Per-document topic distribution
2. Rating distribution
3. Sentiment distribution
4. Word distributions
Use collapsed Gibbs sampling! Integrate out φ and π
42. Sentiment Analysis
• Evaluation – Yelp
– Opinion summarization
• For all reviews of a given restaurant:
– 15% of words assigned to topic "Vegetarian"
– 5% to "Breakfast" (Eggs), with sentiment 0.78
– 3% to "Staff Attitude", with sentiment 0.82
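Given the inferred per-word topic and sentiment assignments from the sampler above, such a summary could be computed roughly like this (a hypothetical post-processing step, not from the slides):

import numpy as np

def opinion_summary(z, s, K):
    # z, s: per-word topic and sentiment assignments (0 = negative, 1 = positive)
    # pooled over all reviews of one restaurant
    z, s = np.asarray(z), np.asarray(s)
    for k in range(K):
        mask = z == k
        share = mask.mean()                            # fraction of words on topic k
        sentiment = s[mask].mean() if mask.any() else float("nan")
        print(f"topic {k}: {share:.0%} of words, sentiment {sentiment:.2f}")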
43. Topic Modelling: Future Work
• Missing links
– Model selection: which model to pick for which application
– Incorporating linguistic structure/NLP:
• How can our knowledge of language help?
– Bag of words:
• Most models are based on the unigram bag-of-words model
• Context is lost – words like "good" or "nice" are often associated with particular words in context, e.g. "good standard of living", "nice view from the hotel"