Tutorial on Topic Modelling
by Ayush Jain
Prepared as an assignment for CS410: Text Information Systems in Spring
Topic Models
•  Discover hidden themes that pervade the collection
•  Tag the documents on the basis of these themes
•  Organize, summarize and search the documents on the basis of these themes
Takeaways from this tutorial
•  What are probabilistic topic models?
•  What kind of things can they do?
•  How do we train/infer a topic model?
•  How do we evaluate a topic model?
Tools
•  Topic models are a special application of
probability theory. In particular, they touch on:
– Probabilistic graphical Models
– Conjugate and non-conjugate priors
– Approximate posterior inference
– Exploratory data analysis
The Key Steps in every Topic Model
•  Make assumptions
•  Collect data
•  Infer posterior
Outline
•  Latent Dirichlet Allocation – Application of
key steps
– Graphical Model encoding the assumptions
– Inference Algorithms – Gibbs Sampling
•  Topic Models for more complex tasks
– Rating prediction
•  A completely novel topic model
incorporating sentiments (that we’ll
Latent Dirichlet Allocation
•  Already covered in course
•  Application of the key steps
– Make assumptions
•  Each topic is a distribution over words
•  Each document is a mixture of topics
•  Each word is drawn from a topic
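The three assumptions above can be read as a recipe for generating a corpus. The following is a minimal sketch of that recipe, assuming symmetric Dirichlet priors; the function name `generate_corpus` and the hyperparameter names `alpha` and `eta` are illustrative choices, not taken from the slides.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, eta=0.1, seed=0):
    """Sample a toy corpus from the LDA generative assumptions."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over words.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs, assignments = [], []
    for _ in range(n_docs):
        # Each document is a mixture of topics.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Each word is drawn from a topic: pick the topic, then the word.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        w = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(w)
        assignments.append(z)
    return topics, docs, assignments

topics, docs, assignments = generate_corpus(
    n_docs=3, doc_len=10, n_topics=2, vocab_size=20)
```

Inference runs this recipe in reverse: only `docs` is observed, and `topics` and `assignments` have to be recovered.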
Latent Dirichlet Allocation
•  Graphical Model
•  Encodes assumptions
•  Allows us to break down the joint probability into a product of conditionals
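For reference, the factorization that the LDA graphical model implies can be written out as follows. This is the standard form; the symbol $\eta$ for the Dirichlet prior on topics, and the index bounds $K$, $D$ and $N_d$, are names assumed here rather than read off the slide.

```latex
p(\beta_{1:K}, \theta_{1:D}, z, w \mid \alpha, \eta)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} \Big[ p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \beta_{z_{d,n}}) \Big]
```

Each factor corresponds to one arrow in the graphical model: topics depend on their prior, topic proportions on theirs, each assignment on its document's proportions, and each word on the topic it was assigned to.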
Latent Dirichlet Allocation
•  Application of the key steps
– Make assumptions (II)
•  Choose probability distributions
–  Choosing conjugate distributions makes life easier!
»  E.g., Multinomial and Dirichlet are conjugate
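Aside: Conjugate Distributions
•  Dirichlet Distribution: $\theta \sim \mathrm{Dir}(\alpha)$
–  $\theta$: Probability of seeing different sides of a die
•  Multinomial Distribution: $p(W \mid \theta)$ is multinomial
–  The number of occurrences of different sides ($W$) of the die is
distributed in a multinomial manner
•  Posterior distribution: $p(\theta \mid W) \propto p(W \mid \theta)\, p(\theta)$ is again Dirichlet, $\theta \mid W \sim \mathrm{Dir}(\alpha_1 + x_1, \dots, \alpha_K + x_K)$
–  $x_i$: The number of times side $i$ was observed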
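Because the Dirichlet is conjugate to the multinomial, the posterior update reduces to adding the observed counts to the prior parameters. A small sketch with a six-sided die, where the roll counts are made-up illustration data:

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Return the Dirichlet posterior parameters after multinomial counts."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return alpha + counts

prior = np.ones(6)                        # uniform prior over the six faces
rolls = np.array([3, 1, 0, 2, 4, 10])     # hypothetical observed counts x_i
posterior = dirichlet_posterior(prior, rolls)
posterior_mean = posterior / posterior.sum()  # expected face probabilities
```

No integral has to be evaluated numerically, which is the sense in which conjugate choices make life easier.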
Latent Dirichlet Allocation
•  Application of the key steps
– Collect Data
•  Corpus on which you want to detect themes
Latent Dirichlet Allocation
•  Application of the key steps
– Infer Posterior
•  Probabilistic graphical models provide algorithms
–  Mean field variational methods
–  Expectation Propagation (similar to EM)
–  Gibbs Sampling (most popular)
–  Variational Inference
Aside: Gibbs Sampling
– Used when samples need to be drawn from a
joint distribution, but the joint distribution is
difficult to sample from directly
– Sample $X = (x_1, \dots, x_n)$ from the joint pdf $p(x_1, \dots, x_n)$
– Conditional distributions are relatively easy to sample from
– Procedure:
•  Begin with some initial $X^{(i)}$
•  For each $j$, sample $x_j^{(i+1)}$ from $p\big(x_j^{(i+1)} \mid x_1^{(i+1)}, \dots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \dots, x_n^{(i)}\big)$
•  Repeat
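The procedure can be illustrated on a case where both conditionals are known in closed form. The sketch below assumes a standard bivariate normal with correlation `rho`, for which $x_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)$ and symmetrically for $x_2$; this target is chosen only for illustration and is not from the slides.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)     # conditional standard deviation
    x1, x2 = 0.0, 0.0                # initial state X^(0)
    samples = np.empty((n_samples, 2))
    for i in range(burn_in + n_samples):
        # x1 is drawn given the value of x2 from the previous sweep.
        x1 = rng.normal(rho * x2, sd)
        # x2 is drawn given the freshly updated x1.
        x2 = rng.normal(rho * x1, sd)
        if i >= burn_in:
            samples[i - burn_in] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
```

After the burn-in sweeps are discarded, the empirical correlation of `samples` should be close to `rho`, although consecutive samples remain correlated with each other.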
Latent Dirichlet Allocation
•  Application of the key steps
– Infer Posterior (Gibbs Sampling)
•  Here, X is all parameters to be inferred
–  Per-word topic assignment $z_{d,n}$
–  Per-document topic proportions $\theta_d$
–  Per-corpus topic-word distributions $\beta_k$
•  Extremely high dimensional!
•  Solution:
–  Integrate out $\theta$ and $\beta$
–  Conjugate distributions make the integration tractable
Latent Dirichlet Allocation
•  Application of the key steps
– Infer Posterior (Gibbs Sampling)
•  After all computation:
$$P\big(Z_{d,n} = k \mid Z_{-(d,n)}, W; \alpha, \beta\big) \propto \big(n_{d,:}^{k,-(d,n)} + \alpha_k\big)\, \frac{n_{:,v}^{k,-(d,n)} + \beta_v}{\sum_r \big(n_{:,r}^{k,-(d,n)} + \beta_r\big)}$$
•  $n_{d,:}^{k,-(d,n)}$: The number of words in document d that
belong to topic k, except for the n-th word
•  $n_{:,v}^{k,-(d,n)}$: The number of times word v is assigned to topic k
across the corpus, except for the n-th word of document d
•  v: Index of the n-th word in the d-th document in the vocabulary
•  Linear time in the number of tokens!
•  Further improvements that use the sparsity of the
problem when corpus and number of topics are large
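The update can be turned into a short sampler that keeps three count tables and resamples one token at a time. This is a minimal sketch, assuming symmetric scalar priors `alpha` and `beta` (so the denominator becomes the topic total plus `vocab_size * beta`); the function and variable names are illustrative, and none of the sparsity optimizations are included.

```python
import numpy as np

def lda_collapsed_gibbs(docs, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))    # words in doc d on topic k
    n_kv = np.zeros((n_topics, vocab_size))   # word v assigned to topic k
    n_k = np.zeros(n_topics)                  # all words assigned to topic k
    z = []
    for d, doc in enumerate(docs):            # random initial assignments
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for v, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kv[k, v] += 1
            n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                # Remove the current token so counts are "except (d, n)".
                n_dk[d, k] -= 1
                n_kv[k, v] -= 1
                n_k[k] -= 1
                # Unnormalized conditional over topics for this token.
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) \
                    / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1
                n_kv[k, v] += 1
                n_k[k] += 1
    return z, n_dk, n_kv

toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 4], [0, 1, 1, 0]]
z, n_dk, n_kv = lda_collapsed_gibbs(toy_docs, n_topics=2, vocab_size=5,
                                    n_iters=20)
```

Each sweep touches every token once and does work proportional to the number of topics per token, which is where the linear cost in the number of tokens comes from.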
Topic Models: Evaluation
•  Underlying topics are subjective
– Makes the evaluation difficult
– Workaround: Look at application and evaluate
•  Document classification
•  Information Retrieval
•  Rating Prediction
Topic Models: Evaluation
•  Use the trained model to predict probabilities
of seeing unseen documents
– Better models would give higher probability to unseen documents
•  Even better:
– Predict the probability of the second half of
documents using the first halves as the corpus
– Does not require documents to be held out
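One common way to report such predictive probabilities is per-word perplexity, where lower is better. The sketch below scores the second half of a document given topic proportions `theta` that are assumed to have been estimated from its first half; the toy topics and both `theta` vectors are made-up values for illustration.

```python
import numpy as np

def perplexity(theta, topics, word_ids):
    """Per-word perplexity of word_ids under a topic mixture."""
    # p(w) = sum_k theta[k] * topics[k, w] for every scored word.
    word_probs = theta @ topics[:, word_ids]
    return float(np.exp(-np.mean(np.log(word_probs))))

# Two hypothetical topics over a four-word vocabulary.
topics = np.array([[0.4, 0.4, 0.1, 0.1],
                   [0.1, 0.1, 0.4, 0.4]])
second_half = np.array([0, 1, 0])         # held-back words of one document
theta_good = np.array([0.9, 0.1])         # proportions matching the document
theta_bad = np.array([0.1, 0.9])          # proportions that do not match

good = perplexity(theta_good, topics, second_half)
bad = perplexity(theta_bad, topics, second_half)
```

The better-matching proportions assign higher probability to the held-back words and therefore yield the lower perplexity.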
Beyond LDA: Rating Prediction
•  Predict ratings associated with text
•  Additional assumption:
•  Rating is conditional on the topic assignment to different words
•  Graphical Model:
Beyond LDA: Rating Prediction
•  Topics
–  Least, problem, unfortunately, supposed, worse, flat, dull
–  Bad, guys, watchable, not, one, movie
–  Both, motion, simple, perfect, fascinating, power
–  Cinematography, screenplay, performances, pictures, effective, sound
•  Notice how the assumption affects the extracted topics
–  Because of the dependence of the overall rating on number of words in
different topics, topics are collections of words that appear in similarly
ranked documents
–  Topics express sentiment but lose their original meaning!
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Joint Topic and Sentiment Modelling
Generative Model
1.  Choose aspects and words for each aspect
2.  Calculate aspect rating based on aspect words: $s_{di} = \sum_j \beta_{ij} W_{dij}$
3.  Overall rating is weighted sum of aspect ratings: $r_d \sim N\big(\sum_i \alpha_{di} \sum_j \beta_{ij} W_{dij},\ \delta^2\big)$
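Steps 2 and 3 are plain weighted sums and can be checked numerically. In the sketch below, `beta[i, j]` is the sentiment weight of word j under aspect i, `W[i, j]` is the frequency of word j in the text assigned to aspect i for one review, and `alpha[i]` is that review's weight on aspect i; all numbers are made up for illustration.

```python
import numpy as np

def aspect_ratings(beta, W):
    """s_i = sum_j beta[i, j] * W[i, j] for a single review."""
    return np.sum(beta * W, axis=1)

def overall_rating_mean(alpha, beta, W):
    """Mean of the overall rating: sum_i alpha[i] * s_i."""
    return float(alpha @ aspect_ratings(beta, W))

beta = np.array([[1.0, -1.0],     # hypothetical word weights, aspect 0
                 [0.5, 2.0]])     # hypothetical word weights, aspect 1
W = np.array([[0.5, 0.5],         # word frequencies in aspect-0 text
              [1.0, 0.0]])        # word frequencies in aspect-1 text
alpha = np.array([0.4, 0.6])      # aspect weights for this review

s = aspect_ratings(beta, W)
mean_rating = overall_rating_mean(alpha, beta, W)

# The observed rating is modelled as Gaussian noise around that mean.
rng = np.random.default_rng(0)
delta = 0.1                       # assumed noise standard deviation
r = rng.normal(mean_rating, delta)
```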
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Joint Topic and Sentiment Modelling
Generative Model
E-Step: Infer aspect ratings $s_d$ and aspect weights $\alpha_d$
M-Step: Update $(\mu, \Sigma, \beta, \delta)$
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction
–  Results
•  Detects sentiments without supervision
•  Requires keyword supervision – Any way to remove? (Think LDA!)
Beyond LDA: Rating Prediction
•  Latent Aspect Rating Prediction without
Aspect Keyword Supervision
–  Aspect Modelling Module from LDA included
Beyond LDA: Topic Phrase
•  Motivation:
–  machine learning is a phrase and should be assigned
to one topic
•  Assigning machine to “Industry” and learning to “Education”
is incorrect
•  Approach:
–  Extract high frequency phrases
•  If a phrase is infrequent, so is any super-phrase
•  If a document does not contain a frequent phrase of length n,
it also does not contain any of length > n
•  Use hierarchical clustering to find frequent phrases
–  Apply LDA on phrase tokens
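The first pruning rule (an infrequent phrase cannot have a frequent super-phrase) is enough to write a simple level-wise miner for contiguous phrases. The sketch below is a simplification under that one rule: it omits the per-document pruning and the hierarchical clustering step, and the function name and `min_support` threshold are illustrative.

```python
from collections import Counter

def frequent_phrases(docs, min_support, max_len=4):
    """Count contiguous phrases that occur at least min_support times."""
    frequent = {}
    prev = None                   # frequent phrases of length n - 1
    for n in range(1, max_len + 1):
        counts = Counter()
        for doc in docs:
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                # Skip candidates with an infrequent (n-1)-word sub-phrase.
                if n > 1 and (gram[:-1] not in prev or gram[1:] not in prev):
                    continue
                counts[gram] += 1
        prev = {g: c for g, c in counts.items() if c >= min_support}
        if not prev:              # nothing frequent at this length: stop
            break
        frequent.update(prev)
    return frequent

toy_docs = [["machine", "learning", "is", "fun"],
            ["machine", "learning", "rocks"],
            ["deep", "learning"]]
phrases = frequent_phrases(toy_docs, min_support=2)
```

Once phrases such as ("machine", "learning") are treated as single tokens, LDA assigns each phrase to one topic rather than splitting its words across topics.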
Sentiment Analysis
•  Let’s build our own simple model using the
key steps!
•  Use case:
Sentiment Analysis
•  Make Assumptions
–  Each (topic, sentiment) pair has a vocabulary
•  ‘quick delivery’ has more probability for (service, +) than for
(service, -) or (food quality, +)
–  Each (topic, rating) pair has a sentiment distribution
•  + sentiments for food quality are more likely to appear in
highly rated reviews
•  A 4-star rated restaurant is likely to have good food quality
even if it does not provide wireless
–  Each review has
•  Overall rating
•  Topic distribution: Different users might talk about different
aspects in their reviews
Sentiment Analysis
•  Graphical Model
Generative Process
1.  Choose word distribution for all (topic, sentiment) pairs
2.  Choose sentiment distribution for all (topic, rating) pairs
3.  For each review
•  Choose rating
•  Choose topic distribution
•  For each word in review:
•  Choose topic
•  Choose sentiment
•  Choose word
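The generative process can be sketched as a sampler to make the dependencies explicit. The code below assumes flat Dirichlet priors and a uniform rating distribution, neither of which is specified on the slides; the names `phi` (word distributions) and `pi` (sentiment distributions) are also assumptions of this sketch.

```python
import numpy as np

def generate_reviews(n_reviews, review_len, n_topics, n_sentiments,
                     n_ratings, vocab_size, seed=0):
    """Sample toy reviews from the topic-sentiment generative process."""
    rng = np.random.default_rng(seed)
    # 1. One word distribution per (topic, sentiment) pair.
    phi = rng.dirichlet(np.ones(vocab_size), size=(n_topics, n_sentiments))
    # 2. One sentiment distribution per (topic, rating) pair.
    pi = rng.dirichlet(np.ones(n_sentiments), size=(n_topics, n_ratings))
    reviews = []
    for _ in range(n_reviews):
        # 3. Per review: a rating and a topic distribution.
        rating = int(rng.integers(n_ratings))       # uniform, by assumption
        theta = rng.dirichlet(np.ones(n_topics))
        tokens = []
        for _ in range(review_len):
            t = int(rng.choice(n_topics, p=theta))              # topic
            s = int(rng.choice(n_sentiments, p=pi[t, rating]))  # sentiment
            w = int(rng.choice(vocab_size, p=phi[t, s]))        # word
            tokens.append((t, s, w))
        reviews.append((rating, tokens))
    return phi, pi, reviews

phi, pi, reviews = generate_reviews(
    n_reviews=4, review_len=8, n_topics=3, n_sentiments=2,
    n_ratings=5, vocab_size=30)
```

The sentiment of a word depends on both its topic and the review's rating, which is what lets the model tie positive food-quality words to highly rated reviews.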
Sentiment Analysis
•  Inference
Parameters to be inferred
1.  Per-document topic distribution
2.  Rating distribution
3.  Sentiment distribution
4.  Word distributions
Use Collapsed Gibbs Sampling!
Integrate out $\phi$ and $\pi$
Sentiment Analysis
•  Evaluation – Yelp
–  Sandwich: sandwich, slaw, primanti, coleslaw, cole, market, pastrami,
reuben, bro, mayo, famous, cheesesteak, rye, zucchini, swiss, sammy,
peppi, burgh, messi
–  Vietnamese: pho, noodl, bowl, soup, broth, sprout, vermicelli, peanut,
lemongrass, leaf
–  Payment options: server, check, custom, card, return, state, credit,
coupon, accept, tip, treat, gift, refill
–  Location: locat, park, street, drive, hill, window, south, car, downtown,
number, corner, distance
–  Ambience: crowd, fun, group, rock, play, loud, music, young, sing, club,
ticket, meet, entertain, dance, band, song
Sentiment Analysis
•  Evaluation – Yelp
– Rating prediction
Sentiment Analysis
•  Evaluation – Yelp
– Opinion Summarization
•  For all reviews of this restaurant
–  15% words assigned to topic “Vegetarian”
–  5% to “Breakfast” (Eggs) with sentiment 0.78
–  3% to “Staff Attitude” with sentiment 0.82
Topic Modelling: Future Work
•  Missing Links
– Model selection: Which model to pick for
which applications
– Incorporating linguistic structure/NLP:
•  How can our knowledge of language help?
– Bag of words:
•  Most models are based on the unigram bag of
words model
•  Context is lost – words like good or nice are often
associated with certain words within context, e.g.,
‘good standard of living’, ‘nice view from the hotel’
Topic Modelling
Thank You!

Connect Wave/ connectwave Pitch Deck Presentation
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024

Topic Modelling: Tutorial on Usage and Applications

  • 1. Tutorial on Topic Modelling by Ayush Jain. Prepared as an assignment for CS410: Text Information Systems in Spring
  • 2. Topic Models • Discover hidden themes that pervade the collection • Tag the documents on the basis of these themes • Organize, summarize and search the documents on the basis of these themes
  • 3. Takeaways from this tutorial • What are probabilistic topic models? • What kind of things can they do? • How do we train/infer a topic model? • How do we evaluate a topic model?
  • 4. Tools • Topic models are a special application of probability theory. In particular, they touch – Probabilistic graphical models – Conjugate and non-conjugate priors – Approximate posterior inference – Exploratory data analysis
  • 5. The Key Steps in Every Topic Model • Make assumptions • Collect data • Infer posterior • Evaluate • Predict
  • 6. Outline • Latent Dirichlet Allocation – Application of key steps – Graphical model encoding the assumptions – Inference algorithms – Gibbs sampling • Topic models for more complex tasks – Rating prediction • A completely novel topic model incorporating sentiments (that we’ll develop!)
  • 7. Latent Dirichlet Allocation • Already covered in course • Application of the key steps – Make assumptions • Each topic is a distribution over words • Each document is a mixture of topics • Each word is drawn from a topic
  • 8. Latent Dirichlet Allocation • Graphical Model • Encodes assumptions • Allows us to break down the joint probability into a product of conditionals
  • 9. Latent Dirichlet Allocation • Graphical Model
  • 10. Latent Dirichlet Allocation • Application of the key steps – Make assumptions (II) • Choose probability distributions – Choosing conjugate distributions makes life easier! » E.g.: Multinomial and Dirichlet are conjugate distributions
  • 11. Aside: Conjugate Distributions • Dirichlet distribution: $\theta \sim \mathrm{Dir}(\alpha)$, where $\theta$ is the probability of seeing the different sides of a die • Multinomial distribution: – The number of occurrences of the different sides ($W$) of the die is distributed in a multinomial manner, i.e. $p(W \mid \theta)$ is multinomial • Posterior distribution: $p(\theta \mid W)$ is again Dirichlet, with parameters $\alpha_i + x_i$, where $x_i$ is the number of times side $i$ was observed
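The conjugacy in the aside above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original slides: the function names, the symmetric prior of 1.0 and the roll counts are all made up for illustration.

```python
def dirichlet_posterior(alpha, counts):
    """Dirichlet posterior parameters after multinomial observations.

    With a Dir(alpha) prior and observed counts x, the posterior is
    Dir(alpha + x); no integration is needed.
    """
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + x for a, x in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical example: a six-sided die with a symmetric prior.
prior = [1.0] * 6
rolls = [3, 0, 2, 5, 1, 1]  # x_i: how often side i was observed (made up)
posterior = dirichlet_posterior(prior, rolls)
posterior_mean = dirichlet_mean(posterior)
```

The posterior mean is a smoothed version of the observed frequencies, which is why the same prior reappears as the additive α and β terms in the LDA sampling formula later in the deck.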
  • 12. Latent Dirichlet Allocation • Application of the key steps – Make assumptions (II) • Choose probability distributions – Choosing conjugate distributions makes life easier! » E.g.: Multinomial and Dirichlet are conjugate distributions – Collect data • Corpus on which you want to detect themes
  • 13. Latent Dirichlet Allocation • Application of the key steps – Infer posterior • Probabilistic graphical models provide algorithms – Mean field variational methods – Expectation Propagation (similar to EM) – Gibbs Sampling (most popular) – Variational Inference
  • 14. Aside: Gibbs Sampling – Used when samples need to be drawn from a joint distribution, but the joint distribution is difficult to sample from directly – Sample $X = (x_1, \dots, x_n)$ from the joint pdf $p(x_1, \dots, x_n)$ – The conditional distributions are relatively straightforward – Procedure: • Begin with some initial $X^{(i)}$ • For each $j$, sample $x_j^{(i+1)}$ from $p(x_j \mid x_1^{(i+1)}, \dots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \dots, x_n^{(i)})$ • Repeat
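The procedure above can be tried on a toy target where both conditionals are known in closed form. The sketch below is an illustration only, not taken from the slides: it assumes a standard bivariate normal with correlation `rho`, for which each coordinate given the other is normal with mean `rho` times the other coordinate and variance `1 - rho**2`. The burn-in length and seed are arbitrary choices.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0  # arbitrary initial state
    samples = []
    for i in range(burn_in + n_samples):
        # Each coordinate is resampled given the latest value of the other.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if i >= burn_in:
            samples.append((x, y))
    return samples


def correlation(pairs):
    """Empirical Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(p[0] for p in pairs) / n
    mean_y = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pairs) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in pairs) / n
    var_y = sum((p[1] - mean_y) ** 2 for p in pairs) / n
    return cov / math.sqrt(var_x * var_y)


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
estimated_rho = correlation(samples)
```

The sampler never evaluates the joint density; it only needs the two conditionals, which is the property LDA inference relies on.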
  • 15. Latent Dirichlet Allocation • Application of the key steps – Infer posterior (Gibbs Sampling) • Here, X is all parameters to be inferred – Per-word topic assignment $z_{d,n}$ – Per-document topic proportions $\theta_d$ – Per-corpus topic-word distributions $\beta_k$ • Extremely high dimensional! • Solution: – Integrate out $\theta$ and $\beta$ – Conjugate distributions make the integration straightforward!
  • 16. Latent Dirichlet Allocation • Application of the key steps – Infer posterior (Gibbs Sampling) • After all computation: $P(Z_{d,n} = k \mid Z_{-(d,n)}, W; \alpha, \beta) \propto \left(n_{d,:}^{k,-(d,n)} + \alpha_k\right) \frac{n_{:,v}^{k,-(d,n)} + \beta_v}{\sum_{r=1}^{V} \left(n_{:,r}^{k,-(d,n)} + \beta_r\right)}$ • $n_{d,:}^{k,-(d,n)}$: the number of words in document $d$ that belong to topic $k$, excluding the $n$-th word • $v$: index in the vocabulary of the $n$-th word in the $d$-th document • Linear time in the number of tokens!
  • 17. Latent Dirichlet Allocation • Application of the key steps – Infer posterior (Gibbs Sampling) • After all computation: $P(Z_{d,n} = k \mid Z_{-(d,n)}, W; \alpha, \beta) \propto \left(n_{d,:}^{k,-(d,n)} + \alpha_k\right) \frac{n_{:,v}^{k,-(d,n)} + \beta_v}{\sum_{r=1}^{V} \left(n_{:,r}^{k,-(d,n)} + \beta_r\right)}$ • Linear time in the number of tokens! • Further improvements exploit the sparsity of the problem when the corpus and the number of topics are large
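The sampling rule on the two slides above translates fairly directly into code. The following is a minimal sketch under stated assumptions, not the tutorial's own implementation: documents are lists of integer word ids, the priors are symmetric (a single scalar `alpha` and `beta`, so the denominator becomes the topic total plus `vocab_size * beta`), and the function name, defaults and toy corpus are hypothetical.

```python
import random


def lda_collapsed_gibbs(docs, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors (a sketch)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]              # doc-topic counts
    n_kv = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # tokens per topic
    z = []                                              # topic of each token

    # Random initial assignment of every token to a topic.
    for d, doc in enumerate(docs):
        z_d = []
        for v in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kv[k][v] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                # Remove the current token from all counts: the -(d,n) terms.
                k = z[d][n]
                n_dk[d][k] -= 1
                n_kv[k][v] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kv[t][v] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                n_dk[d][k] += 1
                n_kv[k][v] += 1
                n_k[k] += 1
    return z, n_dk, n_kv


# Toy corpus over a 4-word vocabulary (made up for illustration).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, n_dk, n_kv = lda_collapsed_gibbs(docs, n_topics=2, vocab_size=4)
```

Each sweep touches every token once and does work proportional to the number of topics per token, which matches the "linear time in the number of tokens" remark on the slide.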
  • 18. Topic Models: Evaluation • Underlying topics are subjective – Makes the evaluation difficult – Workaround: look at the application and evaluate there • Document classification • Information retrieval • Rating prediction
  • 19. Topic Models: Evaluation • Use the trained model to predict the probability of unseen documents – Better models give higher probability • Even better: – Predict the probability of the second half of each document using the first half as the corpus – Does not require documents to be held out
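The document-completion idea above can be sketched as follows. This is one possible reading, not the tutorial's procedure: it assumes the topic-word matrix `beta` is already trained and strictly positive, estimates the document's topic proportions from the first half with a few EM iterations (an assumed choice of estimator), and reports perplexity on the second half. All names and numbers are hypothetical.

```python
import math


def estimate_theta(words, beta, n_iters=50):
    """Estimate topic proportions for a word list with beta held fixed."""
    n_topics = len(beta)
    theta = [1.0 / n_topics] * n_topics
    for _ in range(n_iters):
        expected = [0.0] * n_topics
        for v in words:
            joint = [theta[k] * beta[k][v] for k in range(n_topics)]
            norm = sum(joint)
            for k in range(n_topics):
                expected[k] += joint[k] / norm
        theta = [c / len(words) for c in expected]
    return theta


def completion_perplexity(doc, beta):
    """Perplexity of the second half of doc given its first half."""
    half = len(doc) // 2
    theta = estimate_theta(doc[:half], beta)
    rest = doc[half:]
    log_lik = sum(
        math.log(sum(theta[k] * beta[k][v] for k in range(len(beta))))
        for v in rest
    )
    return math.exp(-log_lik / len(rest))


# Two hypothetical topics over a 4-word vocabulary.
beta = [[0.4, 0.4, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4]]
consistent_doc = [0, 1, 0, 1, 0, 1]  # second half matches the first
shifted_doc = [0, 1, 0, 2, 3, 2]     # second half switches topic
perplexity_consistent = completion_perplexity(consistent_doc, beta)
perplexity_shifted = completion_perplexity(shifted_doc, beta)
```

Lower perplexity means the model assigned higher probability to the unseen half, so the consistent document scores better than the one that changes topic midway.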
  • 20. Beyond LDA: Rating Prediction • Predict ratings associated with text • Additional assumption: • Rating is conditional on the topic assignment to different words • Graphical Model:
  • 21. Beyond LDA: Rating Prediction • Topics – Least, problem, unfortunately, supposed, worse, flat, dull – Bad, guys, watchable, not, one, movie – Both, motion, simple, perfect, fascinating, power – Cinematography, screenplay, performances, pictures, effective, sound • Notice how the assumption affects the extracted topics – Because the overall rating depends on the number of words in different topics, topics are collections of words that appear in similarly rated documents – Topics express sentiment but lose their original meaning!
  • 22. Beyond LDA: Rating Prediction • Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling • Generative Model 1. Choose aspects and words for each aspect, $W_{dij}$
  • 23. Beyond LDA: Rating Prediction • Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling • Generative Model 1. Choose aspects and words for each aspect 2. Calculate aspect rating based on aspect words: $s_{di} = \sum_{j=1}^{n} \beta_{ij} W_{dij}$
  • 24. Beyond LDA: Rating Prediction • Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling • Generative Model 1. Choose aspects and words for each aspect 2. Calculate aspect rating based on aspect words 3. Overall rating is a weighted sum of aspect ratings: $r_d \sim \mathcal{N}\left(\sum_{i=1}^{k} \alpha_{di} \sum_{j=1}^{n} \beta_{ij} W_{dij},\ \delta^2\right)$
  • 25. Beyond LDA: Rating Prediction • Latent Aspect Rating Prediction – Joint Topic and Sentiment Modelling • Generative Model • E-step: infer aspect ratings $s_d$ and aspect weights $\alpha_d$ • M-step: update $(\mu, \Sigma, \beta, \delta)$
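Steps 2 and 3 of the generative model above are plain weighted sums, which a small worked example makes concrete. The numbers below are invented for illustration (two aspects, two words per aspect); only the two formulas come from the slides, and the function names are hypothetical.

```python
def aspect_ratings(beta, word_weights):
    """s_di = sum_j beta_ij * W_dij, one rating per aspect i."""
    return [
        sum(b * w for b, w in zip(beta_i, w_i))
        for beta_i, w_i in zip(beta, word_weights)
    ]


def expected_overall_rating(alpha, ratings):
    """Mean of r_d: sum_i alpha_di * s_di (the variance delta^2 is ignored)."""
    return sum(a * s for a, s in zip(alpha, ratings))


# Hypothetical review d with k = 2 aspects and n = 2 words per aspect.
beta = [[2.0, -1.0],    # word sentiment weights beta_ij for aspect 0
        [1.0, 3.0]]     # and for aspect 1
W_d = [[0.5, 0.5],      # word frequencies W_dij within aspect 0
       [0.25, 0.75]]    # and within aspect 1
alpha_d = [0.4, 0.6]    # aspect weights alpha_di for this review

s_d = aspect_ratings(beta, W_d)
mean_rating = expected_overall_rating(alpha_d, s_d)
```

Here the aspect ratings come out as 0.5 and 2.5, and the expected overall rating is 0.4 × 0.5 + 0.6 × 2.5 = 1.7; the observed rating would be a normal draw around that mean.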
  • 26. Beyond LDA: Rating Prediction • Latent Aspect Rating Prediction – Results • Detects sentiments without supervision
  • 27. Beyond LDA: Rating Prediction • Latent Aspect Rating Prediction – Results • Requires keyword supervision – Any way to remove? (Think LDA!)
  • 28. Beyond LDA: Rating Prediction • Latent Aspect Rating Prediction without Aspect Keyword Supervision – Aspect modelling module from LDA included
  • 29. Beyond LDA: Topic Phrase Mining • Motivation: – "machine learning" is a phrase and should be assigned to one topic • Assigning "machine" to "Industry" and "learning" to "Education" is incorrect • Approach: – Extract high-frequency phrases • If a phrase is infrequent, so is any super-phrase • If a document does not contain a frequent phrase of length n, it also does not contain any of length > n • Use hierarchical clustering to find frequent phrases – Apply LDA on phrase tokens
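The first pruning rule above (an infrequent phrase cannot have a frequent super-phrase) is enough to build a level-wise phrase counter. The sketch below illustrates only that rule on contiguous word sequences; it does not implement the document-level pruning or the hierarchical clustering step mentioned on the slide, and the function name, `min_support` threshold and toy corpus are assumptions.

```python
from collections import Counter


def frequent_phrases(docs, min_support, max_len=5):
    """Count contiguous phrases, growing only from frequent shorter ones."""
    frequent = {}
    previous_level = None
    for n in range(1, max_len + 1):
        counts = Counter()
        for doc in docs:
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                # Skip candidates with an infrequent sub-phrase.
                if n > 1 and (gram[:-1] not in previous_level
                              or gram[1:] not in previous_level):
                    continue
                counts[gram] += 1
        level = {g: c for g, c in counts.items() if c >= min_support}
        if not level:
            break  # no longer phrase can be frequent
        frequent.update(level)
        previous_level = level
    return frequent


# Made-up corpus of tokenised documents.
corpus = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "models"],
    ["deep", "learning"],
]
phrases = frequent_phrases(corpus, min_support=2)
```

With these inputs "machine learning" survives as a single phrase token, so a later LDA pass over phrase tokens would assign it to one topic rather than splitting its two words.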
  • 30. Sentiment Analysis • Let’s build our own simple model using the key steps! • Use case:
  • 31. Sentiment Analysis • Make assumptions – Each (topic, sentiment) pair has a vocabulary • ‘quick delivery’ has more probability for (service, +) than for (service, -) or (food quality, +) – Each (topic, rating) pair has a sentiment distribution • + sentiments for food quality are more likely to appear in highly rated reviews • A 4-star rated restaurant is likely to have good food quality even if it does not provide wireless – Each review has • Overall rating • Topic distribution: different users might talk about different aspects in their reviews
  • 32. Sentiment Analysis • Graphical Model • Generative Process 1. Choose word distribution for all (topic, sentiment) pairs
  • 33. Sentiment Analysis • Graphical Model • Generative Process 1. Choose word distribution for all (topic, sentiment) pairs 2. Choose sentiment distribution for all (topic, rating) pairs
  • 34. Sentiment Analysis • Graphical Model • Generative Process 1. Choose word distribution for all (topic, sentiment) pairs 2. Choose sentiment distribution for all (topic, rating) pairs 3. For each review • Choose rating
  • 35. Sentiment Analysis • Graphical Model • Generative Process 1. Choose word distribution for all (topic, sentiment) pairs 2. Choose sentiment distribution for all (topic, rating) pairs 3. For each review • Choose rating • Choose topic distribution
  • 36. Sentiment Analysis • Graphical Model • Generative Process 1. Choose word distribution for all (topic, sentiment) pairs 2. Choose sentiment distribution for all (topic, rating) pairs 3. For each review • Choose rating • Choose topic distribution • For each word in review: • Choose topic
  • 37. Sentiment Analysis • Graphical Model • Generative Process 1. Choose word distribution for all (topic, sentiment) pairs 2. Choose sentiment distribution for all (topic, rating) pairs 3. For each review • Choose rating • Choose topic distribution • For each word in review: • Choose topic • Choose sentiment
  • 38. Sentiment Analysis • Graphical Model • Generative Process 1. Choose word distribution for all (topic, sentiment) pairs 2. Choose sentiment distribution for all (topic, rating) pairs 3. For each review • Choose rating • Choose topic distribution • For each word in review: • Choose topic • Choose sentiment • Choose word
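The full generative process on the slide above can be simulated directly, which is a useful sanity check before writing any inference code. The sketch below follows the three numbered steps but fills in details the slides do not specify: Dirichlet priors with made-up concentration values, a uniform prior over ratings, integer ids for topics, sentiments, ratings and words, and hypothetical function names.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]


def sample_index(rng, probs):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def generate_reviews(n_reviews, n_words, n_topics, n_sentiments,
                     n_ratings, vocab_size, seed=0):
    rng = random.Random(seed)
    # 1. Word distribution for every (topic, sentiment) pair.
    word_dist = {
        (k, s): sample_dirichlet(rng, 0.5, vocab_size)
        for k in range(n_topics) for s in range(n_sentiments)
    }
    # 2. Sentiment distribution for every (topic, rating) pair.
    sentiment_dist = {
        (k, r): sample_dirichlet(rng, 1.0, n_sentiments)
        for k in range(n_topics) for r in range(n_ratings)
    }
    reviews = []
    # 3. Generate each review.
    for _ in range(n_reviews):
        rating = rng.randrange(n_ratings)  # uniform prior (assumed)
        topic_dist = sample_dirichlet(rng, 1.0, n_topics)
        tokens = []
        for _ in range(n_words):
            k = sample_index(rng, topic_dist)
            s = sample_index(rng, sentiment_dist[(k, rating)])
            w = sample_index(rng, word_dist[(k, s)])
            tokens.append((k, s, w))
        reviews.append((rating, tokens))
    return reviews


reviews = generate_reviews(n_reviews=3, n_words=10, n_topics=2,
                           n_sentiments=2, n_ratings=5, vocab_size=20)
```

Each token carries its latent topic and sentiment alongside the word id; in real data only the word and the review's rating are observed, and the rest is what inference has to recover.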
  • 39. Sentiment Analysis • Inference • Parameters to be inferred 1. Per-document topic distribution 2. Rating distribution 3. Sentiment distribution 4. Word distributions • Use collapsed Gibbs sampling! Integrate out $\varphi$ and $\pi$
  • 40. Sentiment Analysis • Evaluation – Yelp – Sandwich: sandwich, slaw, primanti, coleslaw, cole, market, pastrami, reuben, bro, mayo, famous, cheesesteak, rye, zucchini, swiss, sammy, peppi, burgh, messi – Vietnamese: pho, noodl, bowl, soup, broth, sprout, vermicelli, peanut, lemongrass, leaf – Payment options: server, check, custom, card, return, state, credit, coupon, accept, tip, treat, gift, refill – Location: locat, park, street, drive, hill, window, south, car, downtown, number, corner, distance – Ambience: crowd, fun, group, rock, play, loud, music, young, sing, club, ticket, meet, entertain, dance, band, song
  • 41. Sentiment Analysis • Evaluation – Yelp – Rating prediction
  • 42. Sentiment Analysis • Evaluation – Yelp – Opinion Summarization • For all reviews of this restaurant – 15% of words assigned to topic “Vegetarian” – 5% to “Breakfast” (Eggs) with sentiment 0.78 – 3% to “Staff Attitude” with sentiment 0.82
  • 43. Topic Modelling: Future Work • Missing links – Model selection: which model to pick for which applications – Incorporating linguistic structure/NLP: • How can our knowledge of language help? – Bag of words: • Most models are based on the unigram bag-of-words model • Context is lost: words like "good" or "nice" are often associated with certain words within context, e.g. ‘good standard of living’, ‘nice view from the hotel’