
- 1. Document clustering using LDA Haridas N <haridas.n@imaginea.com> @haridas_n
- 2. Agenda ● Introduction to LDA ● Other clustering methods ● Model pipeline and training ● Evaluate LDA model results ○ How to measure the quality of results ○ Evaluate the coherence of the topics ○ Cross-check that the patents in a cluster are similar
- 3. LDA: Find natural categories of millions of documents, and suggest a name for each category.
- 4. LDA - Latent Dirichlet Allocation ● A generative probabilistic model, which generates documents from topics and topics from vocabulary terms. ● An unsupervised model. ● Other clustering algorithms include LSI, PLSI and K-Means.
- 5. Clustering Models
- 6. LSI ● Dimensionality reduction using truncated SVD. ● Document matrix D = N x V ● SVD factors D into N x T and T x V ● It lacks interpretability of the topics. ● And the representation quality isn't that good.
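The LSI factorisation above can be sketched with plain numpy; the toy 4x4 document-term matrix here is illustrative, not from the talk:

```python
import numpy as np

# Toy document-term matrix D (N=4 documents x V=4 vocab terms):
# docs 0-1 share one vocabulary block, docs 2-3 share the other.
D = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

# Full SVD, then keep only the top T singular values (truncated SVD).
U, s, Vt = np.linalg.svd(D, full_matrices=False)
T = 2
doc_topic = U[:, :T] * s[:T]   # N x T document representation
topic_term = Vt[:T, :]         # T x V topic representation

# Rank-T reconstruction approximates D with far fewer dimensions.
D_hat = doc_topic @ topic_term
```

In the truncated space, documents 0 and 1 land close together and far from documents 2 and 3, which is exactly the clustering signal the deck relies on; what LSI cannot give you is a readable interpretation of each latent dimension.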
- 7. PLSI ● An extension of LSI that makes it a probabilistic model.
- 8. LDA Model ● Plate notation of the LDA probabilistic graphical model. ● Uses Bayesian inference to find the best likelihood estimate. ● Uses Dirichlet priors for topics and vocabs, hence the name LDA. ● Alpha and beta are the Dirichlet priors ● K topics ● N vocabs ● M documents
- 9. K-Means clustering ● K-Means applied on top of the Document x Topic dataset. ● After the patents are rearranged by spatial location, we can assign the topic number based on the existing patents in each cluster. ● LDA acts as a dimensionality reduction from the sparse Document x Vocab matrix to the dense Document x Topic matrix. ● K-Means does a good job on dense vectors.
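A minimal stand-in for this step, assuming the LDA stage already produced dense per-document topic vectors (the four toy vectors below are made up):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny K-Means on dense vectors, e.g. rows of the D x T doc-topic matrix."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centers

# Dense topic vectors: two documents mostly topic 0, two mostly topic 1.
docs = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centers = kmeans(docs, k=2)
```

On sparse Document x Vocab rows this distance computation is dominated by shared zeros; running it on the dense Document x Topic rows instead is the dimensionality-reduction trick the slide describes.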
- 10. Feature Extraction
- 11. Feature Engineering ● Tokenization and text cleanups ● Apply standard and custom stopword filtering ● Noun-chunk extraction using spaCy- or NLTK-based taggers. ● N-gram features ○ If a lot of data is available, unigrams alone give pretty good results. ● Stemming / lemmatization ● TF-IDF based feature selection
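The tokenize / stopword / n-gram steps can be sketched in a few lines of stdlib Python (the stopword list here is a tiny illustrative subset, not the one used in the talk):

```python
import re

STOPWORDS = {"the", "a", "of", "is", "and", "to", "in"}

def features(text, ngram=2):
    """Tokenize, drop stopwords, and emit unigrams plus joined n-grams."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    grams = list(tokens)
    for n in range(2, ngram + 1):
        grams += ["_".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return grams

feats = features("The clustering of patent documents")
# ['clustering', 'patent', 'documents', 'clustering_patent', 'patent_documents']
```

Noun-chunking, lemmatization, and TF-IDF pruning would slot in between the token filter and the n-gram expansion.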
- 12. Model Pipeline: Documents → Tokenize → Pre-Processing → BOW (D x V) → LDA → (D x T and T x V)
- 13. Training
- 14. Tech stack ● Developed on Spark MLlib (or you can use gensim if the dataset is smaller) ● Has to handle millions of documents ● We use a cluster with 300 GB RAM and 50 CPU cores. ● S3 to persist the data ● Pre- and post-processing pipelines
- 15. Hyperparameters ● Doc-concentration prior (alpha) ● Topic-concentration prior (beta) ● Number of topics (K) ● Iterations ● Vocab size or feature size (N) in BOW format ● Max-df tuning ● Custom stopwords to further prune noisy vocabs
- 16. Model Evaluation
- 17. Challenges in model evaluation ● LDA is an unsupervised model, so how do we cross-check convergence? ● Test-set validation? ● What measure do we use for grid search? ● How do we compare two LDA runs? ● We want to avoid the human bias involved in comparing topics manually
- 18. Model Evaluation Methods ● Perplexity - ensure the log-likelihood function is at its maximum, which brings perplexity down. ● Plot the sum of probabilities of the top 10 vocabs from the Topic x Vocab matrix. ● Topic coherence evaluation ● Topic dependency score ● Manual evaluation framework
- 19. Perplexity ● A measure of whether a probabilistic model's likelihood function has reached its maximum. ● Applied on a held-out or test dataset. ● This measure is used to tune one parameter while keeping the others constant - similar to elbow-point identification in K-Means. ● Perplexity doesn't measure the contextual information between words; it works at the per-word level. ● So it isn't directly usable as a final model evaluation metric. We can use it to tune the hyperparameters of the model.
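The per-word nature of the measure is clear from the formula: perplexity is the exponential of the negative average held-out log-likelihood per token. A minimal sketch with made-up numbers:

```python
import math

def perplexity(log_likelihood, num_tokens):
    """Per-word perplexity: exp of the negative average log-likelihood."""
    return math.exp(-log_likelihood / num_tokens)

# A model that assigns each of 1000 held-out tokens probability 1/50
# has log-likelihood 1000 * log(1/50) and perplexity ~50: it is as
# "confused" as a uniform choice over 50 words.
ll = 1000 * math.log(1 / 50)
ppl = perplexity(ll, 1000)
```

Lower is better, but since each token is scored independently, two topic models with identical per-token probabilities get identical perplexity even if one produces far more coherent topics.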
- 20. Probability sum of top 10 vocabs from T x V matrix
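This diagnostic reduces to one line per topic: sum the k largest entries of each row of the T x V matrix. The two toy topic distributions below are illustrative:

```python
def top_k_mass(topic_vocab_row, k=10):
    """Sum of the k largest word probabilities in one topic's distribution."""
    return sum(sorted(topic_vocab_row, reverse=True)[:k])

# Two toy topics over a 20-word vocab: one peaked, one near-uniform.
peaked = [0.15] * 5 + [0.25 / 15] * 15
flat = [1 / 20] * 20
```

A well-converged topic concentrates most of its probability mass on its top words, so the peaked row scores well above the near-uniform one; plotting this sum per topic flags topics that never sharpened.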
- 21. Wordcloud based on the word weightage for a topic
- 22. Coherence Scores ● The method that matches manual verification most closely. ● Checks whether a topic's co-occurring words really appear together in the documents. ● We can control the context window: full document, paragraph, or sentence. ● A custom sliding window can also be applied. ● The gensim library provides off-the-shelf implementations of the standard coherence scores.
- 23. Different Coherence methods ● UMass - Boolean document estimation ● UCI - sliding-window-based document estimation
- 24. Different Coherence methods ● NPMI - sliding-window-based co-occurrence counting ● Etc. ● Java implementation - https://github.com/dice-group/Palmetto ● Reference: https://labs.imaginea.com/post/how-to-measure-topic-coherence/
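A minimal pure-Python sketch of the UMass variant (Boolean document co-occurrence), with a toy corpus and toy topics invented for illustration:

```python
import math
from itertools import combinations

def umass(top_words, docs):
    """UMass coherence: sum of log((D(w_i, w_j) + 1) / D(w_i)) over pairs,
    where D(...) counts documents containing all the given words and
    w_i is the higher-ranked word of the pair."""
    def d(*words):
        return sum(all(w in doc for w in words) for doc in docs)
    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        score += math.log((d(top_words[i], top_words[j]) + 1) / d(top_words[i]))
    return score

# Toy corpus: each document is its set of tokens.
docs = [{"cell", "protein", "gene"}, {"cell", "protein"},
        {"patent", "claim"}, {"patent", "device"}, {"cell", "gene"}]

coherent = umass(["cell", "protein", "gene"], docs)   # words co-occur
incoherent = umass(["cell", "patent", "gene"], docs)  # "patent" never joins
```

Scores are log-ratios, so they are negative or zero; the coherent topic scores strictly higher than the mixed one. Gensim's `CoherenceModel` wraps this and the sliding-window variants behind one interface.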
- 25. Coherence scores are used to compare models - UMass ● LDA Run 1 - -5.403614 ● LDA Run 2 - -2.780710 ● LDA Run 3 - -3.300038 ● Higher (less negative) scores are better, so Run 2 is the best here.
- 26. Topic dependency - Jaccard distance ● Find how close or distant the topics are ● Helpful to know whether your topics are very dependent or specific in nature ● It's very easy to calculate using the top N words from each topic-vocab distribution. ● The median overlap score can be used as an optimisation parameter for grid search.
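The calculation really is a one-liner over each topic's top-N word set; the three toy topics below are invented for illustration:

```python
def jaccard_distance(top_a, top_b):
    """1 - |A ∩ B| / |A ∪ B| over two topics' top-N word sets."""
    a, b = set(top_a), set(top_b)
    return 1 - len(a & b) / len(a | b)

t1 = ["cell", "protein", "gene", "dna"]
t2 = ["cell", "protein", "enzyme", "assay"]   # overlaps t1: dependent topics
t3 = ["engine", "valve", "piston", "pump"]    # disjoint from t1: specific topic
```

A grid search can then maximise the median pairwise distance: runs whose topics are near-duplicates of each other score low, while runs with specific, well-separated topics score close to 1.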
- 27. Grid search for best parameters ● Make use of LDADE. ● Differential evolution optimises any black-box function. ● Best fit if you are training on a small dataset, since you need to run hundreds of model trainings to find a good parameter set - or you need a big cluster to reduce the training time. ● LDADE reduces the overall search space, but the number of runs is still not very low. ● Rule of thumb: if your model trains within a few minutes, it's ideal. ● Topic variance between two runs is used as the loss function. ● Reference: https://labs.imaginea.com/reference/lda-tuning/
- 28. Summary ● LDA has been used to find latent topics in documents ● LDA converges well and accumulates good words for each topic to describe it well. ● Usable as feature extraction from a document ● Model evaluation is the difficult part; use coherence scores along with other measures.
- 29. QA
- 30. Thank you Haridas N <hn@haridas.in>
