Tutorial on Topic Modelling
by Ayush Jain
Prepared as an assignment for CS410: Text Information Systems in Spring
Topic Models
•  Discover hidden themes that pervade the collection
•  Tag the documents on the basis of these themes
•  Organize, summarize and search the documents on the basis of these themes
Takeaways from this tutorial
•  What are probabilistic topic models?
•  What kind of things can they do?
•  How do we train/infer a topic model?
•  How do we evaluate a topic model?
Tools
•  Topic models are a special application of probability theory. In particular, they touch on:
– Probabilistic graphical models
– Conjugate and non-conjugate priors
– Approximate posterior inference
– Exploratory data analysis
The Key Steps in Every Topic Model
Make assumptions
Collect	Data	
Infer	posterior	
Evaluate	
Predict
Outline
•  Latent Dirichlet Allocation – Application of
key steps
– Graphical Model encoding the assumptions
– Inference Algorithms – Gibbs Sampling
•  Topic Models for more complex tasks
– Rating prediction
•  A completely novel topic model
incorporating sentiments (that we’ll
develop!)
Latent Dirichlet Allocation
•  Already covered in the course
•  Application of the key steps
– Make assumptions
•  Each topic is a distribution over words
•  Each document is a mixture of topics
•  Each word is drawn from a topic (see the sketch below)
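A minimal runnable sketch of these three assumptions as a generative process (numpy; the corpus sizes and the Dirichlet hyperparameters alpha and eta below are illustrative assumptions, not values from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, D, N = 4, 1000, 100, 50    # topics, vocabulary, documents, words per doc (illustrative)
    alpha, eta = 0.1, 0.01           # Dirichlet hyperparameters (assumed values)

    # Each topic is a distribution over words
    beta = rng.dirichlet(np.full(V, eta), size=K)            # K x V
    docs = []
    for d in range(D):
        theta = rng.dirichlet(np.full(K, alpha))             # each document is a mixture of topics
        z = rng.choice(K, size=N, p=theta)                   # a topic for each word position
        docs.append([rng.choice(V, p=beta[k]) for k in z])   # each word is drawn from its topic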
Latent Dirichlet Allocation
•  Graphical Model
•  Encodes assumptions
•  Allows us to break the joint probability down into a product of conditionals
Latent Dirichlet Allocation
•  Graphical Model
Latent Dirichlet Allocation
•  Application of the key steps
– Make assumptions (II)
•  Choose probability distributions
–  Choosing conjugate distributions makes life easier!
»  Eg: Multinomial and Dirichlet are conjugate
distributions
Aside: Conjugate Distributions
•  Dirichlet Distribution:
–  θ ~ Dir(α): the probabilities of seeing the different sides of a die
•  Multinomial Distribution:
–  The number of occurrences of the different sides (W) of the die is distributed in a multinomial manner: p(W | θ) is multinomial
•  Posterior distribution (worked out below):
–  θ | W ~ Dir(α + x), where x_i is the number of times side i was observed
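Worked out, this is the standard Dirichlet–multinomial conjugacy the slide relies on (in LaTeX):

    p(\theta \mid W) \propto p(W \mid \theta)\, p(\theta)
                     \propto \prod_{i=1}^{V} \theta_i^{x_i} \cdot \prod_{i=1}^{V} \theta_i^{\alpha_i - 1}
                     = \prod_{i=1}^{V} \theta_i^{x_i + \alpha_i - 1}
                     \;\Rightarrow\; \theta \mid W \sim \mathrm{Dir}(\alpha + x)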
Latent Dirichlet Allocation
•  Application of the key steps
– Make assumptions (II)
•  Choose probability distributions
–  Choosing conjugate distributions makes life easier!
»  Eg: Multinomial and Dirichlet are conjugate
distributions
– Collect Data
•  Corpus on which you want to detect themes
Latent Dirichlet Allocation
•  Application of the key steps
– Infer Posterior
•  Probabilistic graphical models provide algorithms
–  Mean field variational methods
–  Expectation Propagation (similar to EM)
–  Gibbs Sampling (most popular)
–  Variational Inference
Aside: Gibbs Sampling
– Used when samples need to be drawn from a joint distribution, but the joint distribution is difficult to sample from directly
– Sample X = (x1, …, xn) from the joint pdf p(x1, …, xn)
– The conditional distributions are relatively straightforward
– Procedure (see the sketch below):
•  Begin with some initial X^(i)
•  Sample x_j^(i+1) from p(x_j^(i+1) | x_1^(i+1), …, x_{j-1}^(i+1), x_{j+1}^(i), …, x_n^(i))
•  Repeat
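A minimal runnable sketch of the procedure for a toy target whose conditionals are known exactly (a standard bivariate normal with correlation rho; the target and rho are illustrative, not from the slides):

    import numpy as np

    rho = 0.8
    rng = np.random.default_rng(0)
    x1, x2 = 0.0, 0.0                 # some initial X^(0)
    samples = []
    for _ in range(10_000):
        # Each coordinate is sampled from its conditional given the latest values of the others
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # p(x1 | x2) for a standard bivariate normal
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # p(x2 | x1), using the fresh x1
        samples.append((x1, x2))

    print(np.corrcoef(np.array(samples[1000:]).T))       # off-diagonal entry ≈ rho after burn-in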
Latent Dirichlet Allocation
•  Application of the key steps
– Infer Posterior (Gibbs Sampling)
•  Here, X is all parameters to be inferred
–  Per-word topic assignments z_{d,n}
–  Per-document topic proportions θ_d
–  Per-corpus topic-word distributions β_k
•  Extremely high dimensional!
•  Solution:
–  Integrate out θ and β (collapsed Gibbs sampling)
–  Conjugate distributions make the integration straightforward!
Latent Dirichlet Allocation
•  Application of the key steps
– Infer Posterior (Gibbs Sampling)
•  After all computation:

    P(Z_{d,n} = k | Z_{-(d,n)}, W; α, β) ∝ (n_{d,·}^{k,-(d,n)} + α_k) · (n_{·,v}^{k,-(d,n)} + β_v) / Σ_{r=1}^{V} (n_{·,r}^{k,-(d,n)} + β_r)

•  n_{d,·}^{k,-(d,n)}: the number of words in document d that belong to topic k, excluding the n-th word
•  v: the index in the vocabulary of the n-th word of the d-th document
•  Linear time in the number of tokens!
Latent Dirichlet Allocation
•  Application of the key steps
– Infer Posterior (Gibbs Sampling)
•  After all computation (see the sketch below):

    P(Z_{d,n} = k | Z_{-(d,n)}, W; α, β) ∝ (n_{d,·}^{k,-(d,n)} + α_k) · (n_{·,v}^{k,-(d,n)} + β_v) / Σ_{r=1}^{V} (n_{·,r}^{k,-(d,n)} + β_r)

•  Linear time in the number of tokens!
•  Further improvements exploit the sparsity of the problem when the corpus and the number of topics are large
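A minimal sketch of one sweep of this collapsed Gibbs update (the count matrices, symmetric scalar hyperparameters, and corpus format are illustrative assumptions):

    import numpy as np

    def gibbs_sweep(docs, z, n_dk, n_kv, n_k, alpha, beta, rng):
        """One pass of collapsed Gibbs sampling for LDA.
        docs[d][n] is the vocabulary index of the n-th token of document d,
        z[d][n] its current topic; n_dk (D x K), n_kv (K x V), n_k (K,) are counts."""
        K, V = n_kv.shape
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                # Remove the token from the counts: the -(d,n) statistics
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # P(Z_dn = k | rest) ∝ (n_dk + α) · (n_kv + β) / (n_k + V·β)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Put the token back under its newly sampled topic
                z[d][n] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1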
Topic Models: Evaluation
•  Underlying topics are subjective
– Makes the evaluation difficult
– Workaround: Look at application and evaluate
•  Document classification
•  Information Retrieval
•  Rating Prediction
Topic Models: Evaluation
•  Use the trained model to predict the probability of unseen documents
– Better models assign higher probability
•  Even better (sketched below):
– Predict the probability of the second half of each document using the first half
– Does not require whole documents to be held out
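A minimal sketch of the document-completion idea (illustrative: a simple EM-style fixed point estimates the topic mixture theta from the first half, then scores the second half under the learned topic-word matrix beta):

    import numpy as np

    def completion_log_lik(first_half, second_half, beta, alpha=0.1, iters=50):
        """first_half/second_half: arrays of word indices; beta: learned K x V matrix."""
        K, _ = beta.shape
        theta = np.full(K, 1.0 / K)
        for _ in range(iters):
            resp = theta[:, None] * beta[:, first_half]    # K x N topic responsibilities
            resp /= resp.sum(axis=0, keepdims=True)
            theta = resp.sum(axis=1) + alpha               # smoothed expected topic counts
            theta /= theta.sum()
        p_w = theta @ beta                                 # predictive word distribution
        return np.log(p_w[second_half]).sum()              # higher is better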
Beyond LDA: Rating Prediction
•  Predict ratings associated with text
•  Additional assumption:
– The rating is conditional on the topic assignments of the different words (sketched below)
•  Graphical Model:
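A minimal sketch of that assumption, in the spirit of supervised LDA (the slides do not pin down the exact model; the regression weights eta and noise sigma are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    K, N = 4, 60
    eta = np.array([-2.0, -1.0, 1.0, 2.0])   # per-topic rating weights (assumed)
    sigma = 0.5

    theta = rng.dirichlet(np.full(K, 0.1))                   # the document's topic mixture
    z = rng.choice(K, size=N, p=theta)                       # per-word topic assignments
    z_bar = np.bincount(z, minlength=K) / N                  # empirical topic frequencies
    rating = rng.normal(eta @ z_bar, sigma)                  # rating conditional on the assignments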
Beyond LDA: Rating Prediction
•  Topics
–  Least, problem, unfortunately, supposed, worse, flat, dull
–  Bad, guys, watchable, not, one, movie
–  Both, motion, simple, perfect, fascinating, power
–  Cinematography, screenplay, performances, pictures, effective, sound
•  Notice how the assumption affects the extracted topics
–  Because the overall rating depends on the number of words in different topics, topics become collections of words that appear in similarly rated documents
–  Topics express sentiment but lose their original meaning!
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Joint Topic and Sentiment Modelling
Generative Model
1.  Choose aspects and words W_{dij} for each aspect
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Joint Topic and Sentiment Modelling
Generative Model
1.  Choose aspects and words for each aspect
2.  Calculate aspect ratings based on the aspect words:

    s_{di} = Σ_{j=1}^{n} β_{ij} W_{dij}
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Joint Topic and Sentiment Modelling
Generative Model
1.  Choose aspects and words for each aspect
2.  Calculate aspect ratings based on the aspect words
3.  Overall rating is a weighted sum of the aspect ratings (sketched below):

    r_d ~ N( Σ_{i=1}^{k} α_{di} Σ_{j=1}^{n} β_{ij} W_{dij}, δ² )
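A minimal sketch of these generative steps for one document (dimensions and parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    k, n = 5, 20                               # aspects, words per aspect (illustrative)
    W = rng.integers(0, 3, size=(k, n))        # W_dij: aspect-word counts in document d
    beta = rng.normal(0.0, 0.5, size=(k, n))   # per-aspect word sentiment weights β_ij
    alpha_d = rng.dirichlet(np.ones(k))        # aspect weights α_di for this document
    delta = 0.3

    s_d = (beta * W).sum(axis=1)               # aspect ratings: s_di = Σ_j β_ij · W_dij
    r_d = rng.normal(alpha_d @ s_d, delta)     # overall rating: noisy weighted sum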
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Joint Topic and Sentiment Modelling
Inference via EM:
E-Step: Infer the aspect ratings s_d and the aspect weights α_d
M-Step: Update (µ, Σ, β, δ)
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Results
•  Detects sentiments without supervision
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Results
•  Requires keyword supervision – any way to remove it? (Think LDA!)
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction without
Aspect Keyword Supervision
–  Aspect Modelling Module from LDA included
Beyond LDA: Topic Phrase Mining
•  Motivation:
–  “machine learning” is a phrase and should be assigned to one topic
•  Assigning “machine” to “Industry” and “learning” to “Education” is incorrect
•  Approach:
–  Extract high-frequency phrases
•  If a phrase is infrequent, so is any super-phrase
•  If a document does not contain a frequent phrase of length n, it also does not contain any of length > n
•  Use hierarchical clustering to find frequent phrases (sketched below)
–  Apply LDA on phrase tokens
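A minimal sketch of the frequent-phrase step using the downward-closure property above (an Apriori-style contiguous n-gram pass; the slides' agglomerative merging is simplified away, and the corpus and threshold are illustrative):

    from collections import Counter

    def frequent_phrases(docs, min_count=2, max_len=4):
        """Grow frequent contiguous phrases length by length: if a phrase is
        infrequent, so is any super-phrase, so only survivors get extended."""
        phrases, frequent = {}, set()
        for n in range(1, max_len + 1):
            counts = Counter()
            for doc in docs:
                for i in range(len(doc) - n + 1):
                    gram = tuple(doc[i:i + n])
                    # Prune: both (n-1)-gram sub-phrases must already be frequent
                    if n == 1 or (gram[:-1] in frequent and gram[1:] in frequent):
                        counts[gram] += 1
            frequent = {g for g, c in counts.items() if c >= min_count}
            if not frequent:
                break
            phrases.update({g: counts[g] for g in frequent})
        return phrases

    docs = [["machine", "learning", "is", "fun"],
            ["machine", "learning", "and", "data", "mining"],
            ["deep", "machine", "learning"]]
    print(frequent_phrases(docs))   # ('machine', 'learning') survives with count 3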
Sentiment Analysis
•  Let’s build our own simple model using the key steps!
•  Use case: restaurant reviews (Yelp)
Sentiment Analysis
•  Make Assumptions
–  Each (topic, sentiment) pair has a vocabulary
•  ‘quick delivery’ has more probability for (service, +) than for
(service, -) or (food quality, +)
–  Each (topic, rating) pair has a sentiment distribution
•  + sentiments for food quality are more likely to appear in
highly rated reviews
•  A 4-star rated restaurant is likely to have good food quality
even if it does not provide wireless
–  Each review has
•  Overall rating
•  Topic distribution: Different users might talk about different
aspects in their reviews
Sentiment Analysis
•  Graphical Model – Generative Process
1.  Choose a word distribution for every (topic, sentiment) pair
Sentiment Analysis
•  Graphical Model – Generative Process
1.  Choose a word distribution for every (topic, sentiment) pair
2.  Choose a sentiment distribution for every (topic, rating) pair
Sentiment Analysis
•  Graphical Model – Generative Process
1.  Choose a word distribution for every (topic, sentiment) pair
2.  Choose a sentiment distribution for every (topic, rating) pair
3.  For each review:
•  Choose a rating
Sentiment Analysis
•  Graphical Model – Generative Process
1.  Choose a word distribution for every (topic, sentiment) pair
2.  Choose a sentiment distribution for every (topic, rating) pair
3.  For each review:
•  Choose a rating
•  Choose a topic distribution
Sentiment Analysis
•  Graphical Model – Generative Process
1.  Choose a word distribution for every (topic, sentiment) pair
2.  Choose a sentiment distribution for every (topic, rating) pair
3.  For each review:
•  Choose a rating
•  Choose a topic distribution
•  For each word in the review:
•  Choose a topic
Sentiment Analysis
•  Graphical Model – Generative Process
1.  Choose a word distribution for every (topic, sentiment) pair
2.  Choose a sentiment distribution for every (topic, rating) pair
3.  For each review:
•  Choose a rating
•  Choose a topic distribution
•  For each word in the review:
•  Choose a topic
•  Choose a sentiment
Sentiment Analysis
•  Graphical Model – Generative Process (sketched below)
1.  Choose a word distribution for every (topic, sentiment) pair
2.  Choose a sentiment distribution for every (topic, rating) pair
3.  For each review:
•  Choose a rating
•  Choose a topic distribution
•  For each word in the review:
•  Choose a topic
•  Choose a sentiment
•  Choose a word
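A minimal sketch of this generative process (numpy; the numbers of topics, sentiments, rating levels, and the Dirichlet hyperparameters are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    K, S, R, V = 5, 2, 5, 500        # topics, sentiments (+/-), rating levels, vocabulary
    D, N = 100, 40                   # reviews, words per review

    phi = rng.dirichlet(np.full(V, 0.01), size=(K, S))   # 1. word dist. per (topic, sentiment)
    pi = rng.dirichlet(np.ones(S), size=(K, R))          # 2. sentiment dist. per (topic, rating)

    reviews = []
    for d in range(D):                                   # 3. for each review
        r = rng.integers(R)                              #    choose a rating
        theta = rng.dirichlet(np.full(K, 0.1))           #    choose a topic distribution
        words = []
        for _ in range(N):                               #    for each word in the review
            zt = rng.choice(K, p=theta)                  #    choose a topic
            zs = rng.choice(S, p=pi[zt, r])              #    choose a sentiment
            words.append(rng.choice(V, p=phi[zt, zs]))   #    choose a word
        reviews.append((r, words))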
Sentiment Analysis
•  Inference – parameters to be inferred:
1.  Per-document topic distribution
2.  Rating distribution
3.  Sentiment distribution
4.  Word distributions
Use Collapsed Gibbs Sampling!
Integrate out φ and π
Sentiment Analysis
•  Evaluation – Yelp
–  Sandwich: sandwich, slaw, primanti, coleslaw, cole, market, pastrami,
reuben, bro, mayo, famous, cheesesteak, rye, zucchini, swiss, sammy,
peppi, burgh, messi
–  Vietnamese: pho, noodl, bowl, soup, broth, sprout, vermicelli, peanut,
lemongrass, leaf
–  Payment options: server, check, custom, card, return, state, credit,
coupon, accept, tip, treat, gift, refill
–  Location: locat, park, street, drive, hill, window, south, car, downtown,
number, corner, distance
–  Ambience: crowd, fun, group, rock, play, loud, music, young, sing, club,
ticket, meet, entertain, dance, band, song
Sentiment Analysis
•  Evaluation – Yelp
– Rating prediction
Sentiment Analysis
•  Evaluation – Yelp
– Opinion Summarization
•  For all reviews of a given restaurant:
–  15% of words assigned to topic “Vegetarian”
–  5% to “Breakfast” (Eggs) with sentiment 0.78
–  3% to “Staff Attitude” with sentiment 0.82
Topic Modelling: Future Work
•  Missing Links
– Model selection: Which model to pick for
which applications
– Incorporating linguistic structure/NLP:
•  How can our knowledge of language help?
– Bag of words:
•  Most models are based on the unigram bag-of-words model
•  Context is lost – words like “good” or “nice” are often associated with specific words in context, e.g. ‘good standard of living’, ‘nice view from the hotel’
Topic Modelling
Questions?

Topic Modelling
Thank You!
