Document Clustering using LDA | Haridas Narayanaswamy [Pramati]

Pramati Technologies
Document clustering using LDA
Haridas N <haridas.n@imaginea.com>
@haridas_n
Agenda
● Introduction to LDA
● Other Clustering Methods
● Model pipeline and Training
● Evaluate LDA model results
○ How to measure the quality of results
○ Evaluate the coherence of the topics
○ Cross-check that the patents in a cluster are similar
LDA: Find natural categories of
millions of documents, and
suggest a name for each
category.
LDA - Latent Dirichlet Allocation
● A generative probabilistic model: it generates documents from topics, and topics from vocabulary terms.
● An unsupervised model
● Other clustering algorithms: LSI, PLSI and K-means
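The "documents from topics, topics from vocabs" generative story can be sketched in a few lines of plain Python. This is a toy illustration with made-up topics and words, not a trained model:

```python
import random

random.seed(0)

# Two toy topics: each is a probability distribution over the vocabulary.
topics = {
    0: {"patent": 0.5, "claim": 0.3, "filing": 0.2},
    1: {"neural": 0.4, "training": 0.4, "model": 0.2},
}

def generate_document(topic_mixture, length=10):
    """Draw each token by first sampling a topic, then a word from it."""
    doc = []
    for _ in range(length):
        topic_id = random.choices(
            list(topic_mixture), weights=list(topic_mixture.values()))[0]
        word_dist = topics[topic_id]
        word = random.choices(list(word_dist), weights=list(word_dist.values()))[0]
        doc.append(word)
    return doc

# A document that is 70% topic 0 and 30% topic 1.
doc = generate_document({0: 0.7, 1: 0.3})
print(doc)
```

Training LDA runs this story in reverse: given only the documents, infer the topic distributions and per-document mixtures.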
Clustering Models
LSI
● Dimensionality reduction method using Truncated SVD.
● Document-term matrix D = N x V
● SVD factorises D into N x T and T x V
● It lacks interpretability of the topics.
● And the representation quality isn't that good.
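The factorisation can be illustrated with NumPy's SVD on a toy document-term matrix (a sketch; production LSI would use a sparse truncated-SVD solver):

```python
import numpy as np

# Toy document-term matrix D (4 documents x 5 terms).
D = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Full SVD, then truncate to T = 2 latent topics.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
T = 2
doc_topic = U[:, :T] * s[:T]   # N x T: documents in topic space
topic_term = Vt[:T, :]         # T x V: topics over terms

# Low-rank reconstruction: D is approximated by (N x T)(T x V).
D_hat = doc_topic @ topic_term
print(np.round(D_hat, 2))
```

The singular vectors mix positive and negative weights, which is why the resulting "topics" are hard to interpret compared to LDA's probability distributions.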
PLSI
● Extends LSI by turning it into a probabilistic model
LDA Model
● Plate notation of LDA: a probabilistic graphical model.
● Uses Bayesian inference to estimate the latent distributions.
● Uses Dirichlet priors for the topic and vocab distributions, hence the name LDA.
● Alpha and Beta are the Dirichlet priors
● K topics
● N vocabs
● M documents
K-means clustering
● K-means is applied on top of the Document x Topic dataset.
● After the patents are rearranged by spatial location, each cluster can be assigned a topic number based on the patents already in it.
● LDA acts as a dimensionality reduction from the sparse Document x Vocab matrix to the dense Document x Topic matrix.
● K-means does a good job on dense vectors.
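The idea can be sketched with a minimal Lloyd's-algorithm K-means over toy dense document-topic vectors (hypothetical data; in practice you would use Spark MLlib's or scikit-learn's KMeans):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dense Document x Topic matrix: two obvious groups of documents.
doc_topic = np.vstack([
    rng.normal([0.9, 0.1], 0.05, size=(5, 2)),  # docs dominated by topic 0
    rng.normal([0.1, 0.9], 0.05, size=(5, 2)),  # docs dominated by topic 1
])

def kmeans(X, k, iters=20):
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each document to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned documents
        # (keep the old centroid if a cluster goes empty).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels

labels = kmeans(doc_topic, k=2)
print(labels)
```

Because the doc-topic vectors are low-dimensional and dense, the Euclidean distances K-means relies on are meaningful, unlike in the raw sparse Document x Vocab space.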
Feature Extraction
Feature Engineering
● Tokenization and text cleanup
● Apply standard and custom stopword filtering
● Noun-chunk extraction using spaCy- or NLTK-based taggers
● N-gram features
○ If a lot of data is available, unigrams alone give pretty good results.
● Stemming / Lemmatization
● TF-IDF based feature selection
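The cleanup steps above can be sketched with only the standard library (the stopword list here is a hypothetical stub; a real pipeline would use spaCy/NLTK tokenizers and a full stopword set):

```python
import re
from collections import Counter

# Hypothetical minimal stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of", "for", "and", "to", "in"}

def extract_features(text):
    # Tokenize: lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword filtering.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Unigram plus bigram features.
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = extract_features(
    "The claim of the patent is a method for training a model.")
print(Counter(features))
```

The resulting token counts per document are what feed the BOW matrix in the model pipeline.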
Model Pipeline
Documents → Tokenize → Pre-processing → BOW (D x V) → LDA → (D x T) & (T x V)
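The BOW (D x V) stage of the pipeline can be sketched as a plain term-count matrix (a toy illustration; real pipelines would use Spark's CountVectorizer or Gensim's Dictionary):

```python
from collections import Counter

# Two toy pre-processed documents.
docs = [
    ["patent", "claim", "patent"],
    ["model", "training", "model", "claim"],
]

# Build the vocabulary (V) from all documents.
vocab = sorted({w for doc in docs for w in doc})

# D x V matrix of raw term counts; one row per document.
bow = [[Counter(doc)[w] for w in vocab] for doc in docs]

print(vocab)
print(bow)
```

LDA then factorises this D x V count matrix into the dense (D x T) and (T x V) outputs shown above.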
Training
Tech stack
● Developed on Spark MLlib (or use Gensim if the dataset is smaller)
● Has to handle millions of documents
● We use a cluster with 300 GB RAM and 50 CPU cores.
● S3 to persist the data
● Pre- and post-processing pipelines
Hyper parameters
● Doc concentration prior ( Alpha )
● Topic concentration prior ( Beta )
● Number of topics ( K )
● Iterations
● Vocab size / feature size ( N ) in the BOW representation
● Max-df tuning
● Custom stopwords to further prune noisy vocabs
Model Evaluation
Challenges on model evaluation
● LDA is an unsupervised model - how do we cross-check convergence?
● Test-set validation?
● What measure do we use for grid search?
● How do we compare two LDA runs?
● We want to avoid the human bias involved in comparing topics manually
Model Evaluation Methods
● Perplexity - ensure the log-likelihood is at its maximum, which drives perplexity down.
● Plot the sum of probabilities of the top-10 vocabs from the Topic x Vocab matrix.
● Topic coherence evaluation
● Topic dependency score
● Manual evaluation framework
Perplexity
● A measure of whether a probabilistic model's likelihood function has reached its maximum.
● Applied on a held-out or test dataset.
● Used to tune one parameter while keeping the others constant - similar to elbow-point identification in K-means.
● Perplexity doesn't measure the contextual information between words; it works at the per-word level.
● So it isn't directly usable as the final model-evaluation metric, but we can use it to tune the model's hyperparameters.
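Perplexity is just the exponential of the negative average per-word log-likelihood on the held-out set. A small sketch with made-up likelihood numbers:

```python
import math

# Hypothetical held-out log-likelihoods from two LDA runs,
# each evaluated over the same 10,000 held-out tokens.
held_out_tokens = 10_000
log_likelihood_run1 = -65_000.0
log_likelihood_run2 = -62_000.0

def perplexity(log_likelihood, num_tokens):
    """exp of the negative average per-word log-likelihood."""
    return math.exp(-log_likelihood / num_tokens)

p1 = perplexity(log_likelihood_run1, held_out_tokens)
p2 = perplexity(log_likelihood_run2, held_out_tokens)

# Higher held-out likelihood => lower perplexity => better fit.
print(p1, p2)
```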
Probability sum of top 10 vocabs from T x V matrix
Wordcloud based on the word weightage for a topic
Coherence Scores
● The method that matches manual verification most closely.
● Checks whether the co-occurring words of a topic really appear together in the documents.
● We can control the context window: full document, paragraph, or sentence level.
● A custom sliding window can also be applied.
● The Gensim library provides off-the-shelf implementations of the standard coherence scores.
Different Coherence methods
● UMass - Boolean document estimation
● UCI - Sliding-window based document estimation
Different Coherence methods (contd.)
● NPMI - Sliding window based co-occurrence counting.
● Etc..
● Java Implementation - https://github.com/dice-group/Palmetto
● Reference:- https://labs.imaginea.com/post/how-to-measure-topic-coherence/
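The UMass score for a topic can be computed directly from boolean document counts: for each pair of the topic's top words, take log((co-document frequency + 1) / document frequency of the higher-ranked word) and sum. A toy sketch:

```python
import math
from itertools import combinations

# Toy corpus: each document as a set of words (boolean occurrence).
docs = [
    {"patent", "claim", "filing"},
    {"patent", "claim"},
    {"patent", "model"},
    {"model", "training"},
]

def umass_coherence(top_words, docs):
    """Sum of log((D(wi, wj) + 1) / D(wj)) over pairs, with wj ranked above wi."""
    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        wj, wi = top_words[j], top_words[i]
        d_wj = sum(wj in d for d in docs)                    # doc frequency
        d_wi_wj = sum(wi in d and wj in d for d in docs)     # co-doc frequency
        score += math.log((d_wi_wj + 1) / d_wj)
    return score

coherent = umass_coherence(["patent", "claim"], docs)       # co-occur often
incoherent = umass_coherence(["patent", "training"], docs)  # never co-occur
print(coherent, incoherent)
```

Frequently co-occurring top words score near zero; words that never share a document are pushed strongly negative.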
Coherence scores are used to compare models - UMass
● LDA Run 1: -5.403614
● LDA Run 2: -2.780710
● LDA Run 3: -3.300038
● The higher the score, the better - so Run 2 is the best of these.
Topic dependency - Jaccard Distance
● Find how close or distant the topics are
● Helps to know whether your topics are highly dependent (overlapping) or specific in nature
● Very easy to calculate using the top-N words from each topic-vocab distribution.
● The median overlap score can be used as the optimisation target for grid search.
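The Jaccard-based distance can be computed directly on the top-N word sets of each topic pair (toy word lists for illustration):

```python
from itertools import combinations
from statistics import median

# Top-N words per topic (toy example).
topics = {
    0: {"patent", "claim", "filing", "application"},
    1: {"model", "training", "neural", "layer"},
    2: {"patent", "claim", "model", "application"},
}

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; 1.0 means fully distinct topics."""
    return 1 - len(a & b) / len(a | b)

distances = [
    jaccard_distance(topics[i], topics[j])
    for i, j in combinations(topics, 2)
]
print(distances, median(distances))
```

A high median distance means the topics are specific; a low one means they overlap heavily and K may need tuning.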
Grid search for best parameters
● Make use of LDADE.
● Differential evolution can optimise any black-box function.
● Best fit if you are training on a small dataset, as you need hundreds of training runs to find a good parameter set - otherwise you need a big cluster to cut the training time.
● LDADE reduces the overall search space, but the number of runs is still not small.
● Rule of thumb: if your model trains within a few minutes, it's ideal.
● Topic variance between two runs is used as the loss function.
● Reference: https://labs.imaginea.com/reference/lda-tuning/
Summary
● LDA is used to find latent topics in documents
● LDA converges well and accumulates good words for each topic to describe it.
● Usable as feature extraction from a document
● Model evaluation is the difficult part - use coherence scores along with other measures.
QA
Thank you
Haridas N <hn@haridas.in>