SlideShare a Scribd company logo
1 of 21
Introduction to Topic Modelling
Shubhmay Potdar
1 2 3
Problem Statement
How to summarize large
no. of documents
(research paper, news
articles, reviews, tweets
etc.) ?
Is there exist any way
through which documents
can be clustered in a fuzzy
way ?
How can we find
documents which talk
about same subjects and
abstractions ?
1 2 3
Applications
Understanding high level
concepts in a research
article
Extract topics/ abstractions
about reviews
Tagging documents with
appropriate tags
Nomenclature
❖ Document - data in the form of text/ text data
❖ Corpus - collection of documents
❖ Bag of words - document representation in the form of multiset
❖ Topics - collection of highly correlation words
❖ Term Frequency - No. of times a word appear in a document
❖ Document Frequency - No. of documents that contain a word
Topic Modelling
Topic Modeling is a technique
to extract the hidden topics
from large volumes of text.
https://github.com/chdoig/pytexas2015-topic-modeling
1 2 3
Word Representation
Term Frequency Term Frequency
Inverse Document
Frequency
Word Embeddings
TF-IDF
Natural Language Processing
Stop Words: Frequently occurring words like (the, in, but etc.)
Lemmatization: Derive root word with the help of vocabulary and morphological analysis.
Ex. Running - Run
Part of Speech Tagging: Tag a word with appropriate Part of Speech
Coreference Resolution: Process of finding relational links among the words within a
sentence
Latent Semantic Analysis
https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
Pros and Cons - LSA
Pros
● LSA is easy to train and tune (no
hyperparameters except rank)
● Embeddings usually work fine in
downstream tasks such as clusterization,
classification, regression, similarity-search
Cons:
● Major drawback is that embeddings are not
interpretable
● The probabilistic model of LSA does not
match observed data: LSA assumes that
words and documents form a joint Gaussian
model (ergodic hypothesis), while a Poisson
distribution has been observed
Latent Dirichlet Allocation
Generative Process:
● Assume that document consists
of N words
● Do for each word
○ Select a topic using
document topic
distribution
○ Draw a word from the
topic using topic word
distribution
https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158
Collapsed Gibbs Sampling - LDA
● Go through each document, and randomly assign each word to one of the K topics.
● This random assignment has already given as both the distribution
● for each document d…
○ For each word w in d…
■ for each topic t,
1. p(topic t | document d)
2. p(word w | topic t)
3. Reassign w a new topic = p(topic t | document d) * p(word w | topic t), we’re assuming that all topic
assignments except for the current word in question are correct, and then updating the assignment
of the current word using our model of how documents are generated.
LDA - Posterior using Gibbs Sampling
This method is to integral out random variables except for word topic {z_mn} and draw each z_mn from posterior.
where n_mz is a word count of document m with topic z, n_tz is a count of word t with topic z, n_z is a word count with
topic z and -mn means “except z_mn.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of machine Learning research 3.Jan (2003): 993-1022.
The posterior of z_mn is the following:
Perplexity
Word2Vec
- Representation of word into n dimensional
vector
- Created a huge buzz in the NLP community
- Commendable results, simple algorithm
- Can perform simple mathematical operation to
achieve
King - Man + Women = Queen
Ida2Vec
https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/
Evaluation - Keeping Human in loop
Word intrusion : For each trained topic, take first ten
words, substitute one of them with another, randomly
chosen word (intruder!) and see whether a human can
reliably tell which one it was.
Topic intrusion: Subjects are shown the title and a
snippet from a document. Along with the document
they are presented with four topics. Three of those
topics are the highest probability topics assigned to
that document. The remaining intruder topic is chosen
randomly from the other low-probability topics in the
model
References
1. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of machine Learning research
3.Jan (2003): 993-1022.
2. Blei, D. M. D., Edu, B. B., Ng, A. Y., Edu, A. S., Jordan, M. I., Edu, J. B. Baldwin, T. (2014). Online learning for latent dirichlet
allocation.
3. Reed, C. (2012). Latent Dirichlet Allocation: Towards a Deeper Understanding. Tutorial, (January)
4. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, J. D. (2007). Distributed Representations of Words and
Phrases and their Compositionality
5. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space (2013)
6. E Moody, Christopher. (2016). Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec.
Thanks

More Related Content

What's hot

Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Bhaskar Mitra
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distancesGanesh Borle
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Sebastian Ruder
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrievalmghgk
 
Merging controlled vocabularies through semantic alignment based on linked data
Merging controlled vocabularies through semantic alignment based on linked dataMerging controlled vocabularies through semantic alignment based on linked data
Merging controlled vocabularies through semantic alignment based on linked dataJohn Pap
 
SFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLET
SFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLETSFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLET
SFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLETSouth Tyrol Free Software Conference
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimEdgar Marca
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLPRupak Roy
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)Sihan Chen
 

What's hot (20)

Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...Exploring Session Context using Distributed Representations of Queries and Re...
Exploring Session Context using Distributed Representations of Queries and Re...
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
 
Ir 03
Ir   03Ir   03
Ir 03
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
 
Merging controlled vocabularies through semantic alignment based on linked data
Merging controlled vocabularies through semantic alignment based on linked dataMerging controlled vocabularies through semantic alignment based on linked data
Merging controlled vocabularies through semantic alignment based on linked data
 
Ir 02
Ir   02Ir   02
Ir 02
 
SFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLET
SFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLETSFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLET
SFScon18 - Gabriele Sottocornola - Probabilistic Topic Models with MALLET
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensim
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
Ijcai 2007 Pedersen
Ijcai 2007 PedersenIjcai 2007 Pedersen
Ijcai 2007 Pedersen
 
Using lexical chains for text summarization
Using lexical chains for text summarizationUsing lexical chains for text summarization
Using lexical chains for text summarization
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
 

Similar to Introduction to Topic Modelling Techniques

A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisJonathan Stray
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
Concurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsConcurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationEugene Nho
 
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)Svitlana volkova
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...IRJET Journal
 
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING
EXPERT OPINION AND COHERENCE BASED TOPIC MODELINGEXPERT OPINION AND COHERENCE BASED TOPIC MODELING
EXPERT OPINION AND COHERENCE BASED TOPIC MODELINGijnlc
 

Similar to Introduction to Topic Modelling Techniques (20)

A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Concurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector RepresentationsConcurrent Inference of Topic Models and Distributed Vector Representations
Concurrent Inference of Topic Models and Distributed Vector Representations
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
Lec1
Lec1Lec1
Lec1
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Author Topic Model
Author Topic ModelAuthor Topic Model
Author Topic Model
 
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING
EXPERT OPINION AND COHERENCE BASED TOPIC MODELINGEXPERT OPINION AND COHERENCE BASED TOPIC MODELING
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 

Recently uploaded

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 

Recently uploaded (20)

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 

Introduction to Topic Modelling Techniques

  • 1. Introduction to Topic Modelling Shubhmay Potdar
  • 2. 1 2 3 Problem Statement How to summarize large no. of documents (research paper, news articles, reviews, tweets etc.) ? Is there exist any way through which documents can be clustered in a fuzzy way ? How can we find documents which talk about same subjects and abstractions ?
  • 3. 1 2 3 Applications Understanding high level concepts in a research article Extract topics/ abstractions about reviews Tagging documents with appropriate tags
  • 4. Nomenclature ❖ Document - data in the form of text/ text data ❖ Corpus - collection of documents ❖ Bag of words - document representation in the form of multiset ❖ Topics - collection of highly correlation words ❖ Term Frequency - No. of times a word appear in a document ❖ Document Frequency - No. of documents that contain a word
  • 5. Topic Modelling Topic Modeling is a technique to extract the hidden topics from large volumes of text. https://github.com/chdoig/pytexas2015-topic-modeling
  • 6. 1 2 3 Word Representation Term Frequency Term Frequency Inverse Document Frequency Word Embeddings
  • 8. Natural Language Processing Stop Words: Frequently occurring words like (the, in, but etc.) Lemmatization: Derive root word with the help of vocabulary and morphological analysis. Ex. Running - Run Part of Speech Tagging: Tag a word with appropriate Part of Speech Coreference Resolution: Process of finding relational links among the words within a sentence
  • 10. Pros and Cons - LSA Pros ● LSA is easy to train and tune (no hyperparameters except rank) ● Embeddings usually work fine in downstream tasks such as clusterization, classification, regression, similarity-search Cons: ● Major drawback is that embeddings are not interpretable ● The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed
  • 11. Latent Dirichlet Allocation Generative Process: ● Assume that document consists of N words ● Do for each word ○ Select a topic using document topic distribution ○ Draw a word from the topic using topic word distribution https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158
  • 12. Collapsed Gibbs Sampling - LDA ● Go through each document, and randomly assign each word to one of the K topics. ● This random assignment has already given as both the distribution ● for each document d… ○ For each word w in d… ■ for each topic t, 1. p(topic t | document d) 2. p(word w | topic t) 3. Reassign w a new topic = p(topic t | document d) * p(word w | topic t), we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
  • 13. LDA - Posterior using Gibbs Sampling This method is to integral out random variables except for word topic {z_mn} and draw each z_mn from posterior. where n_mz is a word count of document m with topic z, n_tz is a count of word t with topic z, n_z is a word count with topic z and -mn means “except z_mn. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of machine Learning research 3.Jan (2003): 993-1022. The posterior of z_mn is the following:
  • 15. Word2Vec - Representation of word into n dimensional vector - Created a huge buzz in the NLP community - Commendable results, simple algorithm - Can perform simple mathematical operation to achieve King - Man + Women = Queen
  • 16.
  • 17.
  • 19. Evaluation - Keeping Human in loop Word intrusion : For each trained topic, take first ten words, substitute one of them with another, randomly chosen word (intruder!) and see whether a human can reliably tell which one it was. Topic intrusion: Subjects are shown the title and a snippet from a document. Along with the document they are presented with four topics. Three of those topics are the highest probability topics assigned to that document. The remaining intruder topic is chosen randomly from the other low-probability topics in the model
  • 20. References 1. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of machine Learning research 3.Jan (2003): 993-1022. 2. Blei, D. M. D., Edu, B. B., Ng, A. Y., Edu, A. S., Jordan, M. I., Edu, J. B. Baldwin, T. (2014). Online learning for latent dirichlet allocation. 3. Reed, C. (2012). Latent Dirichlet Allocation: Towards a Deeper Understanding. Tutorial, (January) 4. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, J. D. (2007). Distributed Representations of Words and Phrases and their Compositionality 5. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space (2013) 6. E Moody, Christopher. (2016). Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec.