
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]



This talk covers how to find the latent topics in a collection of unlabelled documents (unsupervised learning). It also covers Latent Dirichlet Allocation (LDA), a type of document clustering model. LDA can be used in several NLP pipelines, e.g. document clustering, topic evaluation, feature extraction, document similarity studies and text summarisation. Evaluating the quality of results from such unsupervised models is a challenge; we will discuss a few effective evaluation methods.

Published in: Technology

  1. Document clustering using LDA. Haridas N, @haridas_n
  2. Agenda
     ● Introduction to LDA
     ● Other clustering methods
     ● Model pipeline and training
     ● Evaluate LDA model results
       ○ How to measure the quality of results
       ○ Evaluate the coherence of the topics
       ○ Cross-check that the patents in a cluster are similar
  3. LDA: Find natural categories among millions of documents, and suggest a name for each category.
  4. LDA - Latent Dirichlet Allocation
     ● Generative probabilistic model: documents are generated from topics, and topics from vocabulary words.
     ● An unsupervised model
     ● Related methods are LSI, PLSI and K-means
  5. Clustering Models
  6. LSI
     ● Dimensionality reduction method using truncated SVD.
     ● The document matrix D is N x V (N documents, V vocabulary words).
     ● Truncated SVD approximately factorises D into N x T and T x V matrices (T topics).
     ● The resulting topics are hard to interpret.
     ● The representation quality isn't that good either.
  7. PLSI
     ● Extends LSI by making it a probabilistic model
  8. LDA Model
     ● Plate notation of the LDA probabilistic graphical model.
     ● Uses Bayesian inference to find the parameters with the highest likelihood.
     ● Uses Dirichlet priors on the topic and vocabulary distributions, hence the name LDA
     ● Alpha and Beta are the Dirichlet prior parameters
     ● K topics
     ● N vocabulary words
     ● M documents
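The generative story behind this slide can be sketched in plain Python. This is a toy sampler under stated assumptions, not the Spark MLlib or gensim implementation: the function names, the symmetric scalar priors and the fixed document length are all hypothetical choices made for illustration.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, num_topics, num_docs, doc_length, alpha, beta, seed=0):
    """Generate documents following the LDA generative process.

    alpha: Dirichlet prior on each document's topic mixture.
    beta:  Dirichlet prior on each topic's word distribution.
    doc_length is fixed here for simplicity (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # One word distribution per topic (K rows, one column per vocabulary word).
    topic_word = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs, doc_topic = [], []
    for _ in range(num_docs):
        theta = sample_dirichlet(alpha, num_topics, rng)  # topic mixture of this document
        words = []
        for _ in range(doc_length):
            topic = rng.choices(range(num_topics), weights=theta)[0]  # pick a topic
            word = rng.choices(vocab, weights=topic_word[topic])[0]   # pick a word from it
            words.append(word)
        docs.append(words)
        doc_topic.append(theta)
    return docs, doc_topic, topic_word


vocab = ["engine", "fuel", "piston", "battery", "cell", "anode"]
docs, doc_topic, topic_word = generate_corpus(
    vocab, num_topics=2, num_docs=3, doc_length=8, alpha=0.5, beta=0.5
)
```

Training LDA runs this story in reverse: given only the documents, it infers the document-topic and topic-word distributions.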
  9. K-means clustering
     ● K-means is applied on top of the Document x Topic matrix.
     ● Once the patents are grouped by their position in topic space, each cluster can be assigned a topic number based on the patents already in it.
     ● LDA acts as a dimensionality reduction step, turning the sparse Document x Vocab matrix into a dense Document x Topic matrix.
     ● K-means does a good job on dense vectors.
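A minimal sketch of K-means over dense Document x Topic rows, written in plain Python so it is self-contained. The function names, the random initialisation from existing rows and the fixed iteration count are assumptions of this sketch; a production pipeline would more likely use a library implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster dense vectors (for example Document x Topic rows) into k groups."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input rows.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical Document x Topic rows for four documents and three topics.
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
]
labels, centroids = kmeans(doc_topic, k=2)
```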
  10. Feature Extraction
  11. Feature Engineering
      ● Tokenization and text clean-ups
      ● Apply standard and custom stopword filtering
      ● Noun-chunk extraction using spaCy or NLTK-based taggers.
      ● N-gram features
        ○ If a lot of data is available, unigrams alone give pretty good results.
      ● Stemming / lemmatization
      ● TF-IDF based feature selection
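A few of these steps can be sketched with the standard library only. The stopword list, the regular expression and the function names below are hypothetical; noun-chunk extraction, stemming and TF-IDF selection are left out because they would normally come from spaCy, NLTK or a similar library.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; real pipelines use standard plus custom lists.
STOPWORDS = {"the", "a", "an", "are", "is", "in", "of", "and"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text, stopwords=STOPWORDS, use_bigrams=False):
    """Tokenize, drop stopwords and one-letter tokens, optionally add bigrams."""
    tokens = [t for t in tokenize(text) if t not in stopwords and len(t) > 1]
    if use_bigrams:
        tokens = tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens


def prune_by_max_df(docs, max_df=0.8):
    """Drop terms that appear in more than max_df of the documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, count in doc_freq.items() if count / len(docs) <= max_df}
    return [[t for t in doc if t in keep] for doc in docs]


unigrams = preprocess("The cats are running in the garden")
with_bigrams = preprocess("The cats are running in the garden", use_bigrams=True)
pruned = prune_by_max_df(
    [["patent", "claim"], ["patent", "engine"], ["patent", "battery"]]
)
```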
  12. Model Pipeline: Documents -> Tokenize -> Pre-processing -> BOW (D x V) -> LDA -> D x T and T x V
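The bag-of-words stage of this pipeline can be illustrated as follows. It is a sketch with hypothetical function names; it builds a dense list-of-lists count matrix, whereas a pipeline handling millions of documents would use a sparse representation.

```python
from collections import Counter


def build_vocab(docs):
    """Collect the sorted set of distinct terms across all tokenized documents."""
    return sorted({term for doc in docs for term in doc})


def bag_of_words(docs, vocab):
    """Return a D x V count matrix: one row per document, one column per term."""
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            if term in index:  # terms outside the vocabulary are ignored
                row[index[term]] = count
        matrix.append(row)
    return matrix


docs = [["engine", "fuel", "engine"], ["battery", "fuel"]]
vocab = build_vocab(docs)
bow = bag_of_words(docs, vocab)
```

LDA then takes this D x V matrix and produces the D x T (document-topic) and T x V (topic-word) matrices.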
  13. Training
  14. Tech stack
      ● Developed on Spark MLlib (or you can use gensim if the dataset is smaller)
      ● Has to handle millions of documents
      ● We use a cluster with 300 GB RAM and 50 CPU cores.
      ● S3 to persist the data
      ● Pre- and post-processing pipelines
  15. Hyperparameters
      ● Doc-concentration prior (Alpha)
      ● Topic-concentration prior (Beta)
      ● Number of topics (K)
      ● Iterations
      ● Vocab size or feature size (N), in BOW format.
      ● Max-df tuning
      ● Custom stopwords to further prune noisy vocabulary words.
  16. Model Evaluation
  17. Challenges in model evaluation
      ● LDA is an unsupervised model, so how do we cross-check convergence?
      ● Test set validation?
      ● What measure do we use for grid search?
      ● How do we compare two LDA runs?
      ● We want to avoid the human bias involved in comparing topics
  18. Model Evaluation Methods
      ● Perplexity: check that the log-likelihood has reached its maximum, which corresponds to the lowest perplexity.
      ● Plot the sum of probabilities of the top 10 vocabulary words from the Topic x Vocab matrix.
      ● Topic coherence evaluation
      ● Topic dependency score
      ● Manual evaluation framework.
  19. Perplexity
      ● A measure of whether a probabilistic model's likelihood function has reached its maximum.
      ● Applied on a held-out or test dataset.
      ● Used to tune one parameter at a time while keeping the others constant, similar to finding the elbow point in K-means.
      ● Perplexity doesn't capture contextual information between words; it works at the per-word level.
      ● So it's not directly usable as the final model evaluation metric, but we can use it to tune the hyperparameters of the model.
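As a worked illustration, perplexity is the exponential of the negative average per-token log-likelihood on held-out text. The sketch below assumes the topic mixture of each held-out document is already available (in practice it has to be inferred by the trained model), and the function and argument names are hypothetical.

```python
import math


def perplexity(held_out_docs, doc_topic, topic_word):
    """exp(-log-likelihood / token count) for documents given as lists of word ids.

    doc_topic[d][k]  : probability of topic k in document d (assumed given).
    topic_word[k][w] : probability of word id w under topic k.
    """
    log_likelihood = 0.0
    token_count = 0
    for doc, theta in zip(held_out_docs, doc_topic):
        for word_id in doc:
            # Marginalise over topics to get the word probability in this document.
            p_word = sum(theta[k] * topic_word[k][word_id] for k in range(len(theta)))
            log_likelihood += math.log(p_word)
            token_count += 1
    return math.exp(-log_likelihood / token_count)


# A model that spreads probability uniformly over 4 words has perplexity 4.
uniform_topics = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
uniform_score = perplexity([[0, 1, 2], [3, 3]], [[0.5, 0.5], [0.5, 0.5]], uniform_topics)

# A model that concentrates probability on the observed word scores lower.
peaked_topics = [[0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]]
peaked_score = perplexity([[0, 0, 0]], [[0.5, 0.5]], peaked_topics)
```

Because each token is scored on its own, reordering the words of a document leaves the score unchanged, which is the per-word limitation noted on the slide.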
  20. Probability sum of the top 10 vocabulary words from the T x V matrix
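The quantity plotted on this slide can be computed as below. This is a sketch: the function name is hypothetical, and the topic-word matrix is assumed to be a list of per-topic probability rows.

```python
def top_n_probability_mass(topic_word, n=10):
    """For each topic, sum the probabilities of its n most probable words."""
    return [sum(sorted(row, reverse=True)[:n]) for row in topic_word]


# Toy T x V matrix with 2 topics and 4 words; n=2 keeps the example small.
topic_word = [
    [0.5, 0.3, 0.1, 0.1],      # peaked topic: a few words carry most of the mass
    [0.25, 0.25, 0.25, 0.25],  # flat topic: mass is spread evenly
]
masses = top_n_probability_mass(topic_word, n=2)
```

A topic whose top words carry a large share of the mass is concentrated on a few words, while a low sum indicates a flat, less specific topic.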
  21. Word cloud based on the word weights for a topic
  22. Coherence Scores
      ● The method that most closely matches manual verification.
      ● Rewards topic words that actually co-occur in the documents.
      ● We can control the context window: full document, paragraph or sentence.
      ● A custom sliding window can also be applied.
      ● The gensim library provides off-the-shelf implementations of the standard coherence scores.
  23. Different Coherence methods
      ● UMass: Boolean document estimation
      ● UCI: sliding-window-based document estimation
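A minimal sketch of the UMass idea, using Boolean document co-occurrence counts over tokenized documents. It sums log((D(wi, wj) + 1) / D(wj)) over ordered pairs of a topic's top words; library implementations may aggregate differently (for example by averaging), so absolute values need not match theirs. The function names and the choice to skip words absent from the corpus are assumptions of this sketch.

```python
import math


def umass_coherence(top_words, documents, epsilon=1.0):
    """UMass-style coherence for one topic.

    top_words: topic words ordered from most to least probable.
    documents: tokenized documents; only word presence matters (Boolean counts).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # assumption: skip words that never occur in the corpus
            score += math.log((doc_freq(w_i, w_j) + epsilon) / d_j)
    return score


documents = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"], ["car", "road"]]
related = umass_coherence(["cat", "dog"], documents)
unrelated = umass_coherence(["cat", "road"], documents)
```

Words that share documents score closer to zero, while words that never appear together are pushed further negative.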
  24. Different Coherence methods
      ● NPMI: sliding-window-based co-occurrence counting.
      ● Etc.
      ● Java Implementation -
      ● Reference:-
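The NPMI variant can be sketched in the same spirit: probabilities are estimated from sliding windows rather than whole documents, and each word pair is scored with normalised pointwise mutual information. The window size, the function names and the conventions for pairs that never or always co-occur are assumptions of this sketch, not a reproduction of any particular library.

```python
import math
from itertools import combinations


def sliding_windows(tokens, size):
    """Return the word sets seen through a window of `size` tokens moved one step at a time."""
    if len(tokens) <= size:
        return [set(tokens)]
    return [set(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def npmi_coherence(top_words, documents, window=10):
    """Mean NPMI over all pairs of a topic's top words (needs at least two words)."""
    windows = [w for doc in documents for w in sliding_windows(doc, window)]
    total = len(windows)

    def prob(*words):
        # Fraction of windows that contain all of the given words.
        return sum(1 for w in windows if all(x in w for x in words)) / total

    scores = []
    for w_i, w_j in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(w_i), prob(w_j), prob(w_i, w_j)
        if p_ij == 0:
            scores.append(-1.0)  # convention: never co-occur
        elif p_ij == 1:
            scores.append(1.0)   # convention: always co-occur
        else:
            scores.append(math.log(p_ij / (p_i * p_j)) / -math.log(p_ij))
    return sum(scores) / len(scores)


documents = [["a", "b", "c", "d"], ["a", "b", "x", "y"], ["x", "y", "z", "w"]]
close_pair = npmi_coherence(["a", "b"], documents, window=2)
distant_pair = npmi_coherence(["a", "z"], documents, window=2)
```

NPMI is bounded between -1 and 1, which makes scores easier to compare across topics than raw PMI.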
  25. Coherence scores are used to compare models - UMass
      ● LDA Run 1: -5.403614
      ● LDA Run 2: -2.780710
      ● LDA Run 3: -3.300038
      ● The higher the score, the better, so Run 2 scores best here.
  26. Topic dependency - Jaccard Distance
      ● Find how close or distant the topics are
      ● Helpful to know whether your topics are highly dependent on each other or specific in nature
      ● It's very easy to calculate, using the top N words from each topic-vocab distribution.
      ● The median overlap score can be used as an optimisation parameter for grid search.
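A short sketch of this calculation: the Jaccard distance is computed for every pair of topics from their top-N word lists, and the median of those distances is used as a single summary number. The function names and the toy topics are hypothetical.

```python
from itertools import combinations
from statistics import median


def jaccard_distance(words_a, words_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def topic_overlap_summary(topics_top_words):
    """Return all pairwise Jaccard distances between topics and their median."""
    distances = [
        jaccard_distance(a, b) for a, b in combinations(topics_top_words, 2)
    ]
    return distances, median(distances)


topics = [
    ["engine", "fuel", "piston"],
    ["engine", "fuel", "valve"],
    ["battery", "cell", "anode"],
]
distances, median_distance = topic_overlap_summary(topics)
```

A median distance near 1 means the topics share few top words (specific topics); a value near 0 means they overlap heavily (dependent topics).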
  27. Grid search for best parameters
      ● Make use of LDADE.
      ● It uses differential evolution, a method that can optimise any black-box function
      ● Best suited to training on a small dataset, as you need hundreds of training runs to find a good parameter set. Otherwise you need a big cluster to reduce the training time.
      ● LDADE reduces the overall search space, but the number of training runs is still not small
      ● Rule of thumb: it's ideal if your model trains within a few minutes.
      ● The topic variance between two runs is used as the loss function.
      ● Reference:
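To make the differential evolution idea concrete, here is a minimal generic optimiser (the common rand/1/bin scheme) applied to a stand-in loss. It is not the LDADE code: in an LDADE-style search the loss would train an LDA model with the candidate parameters and return the topic variance between runs, which is far too slow to show here. The function names, the control parameters `f` and `cr`, and the quadratic stand-in are assumptions of this sketch.

```python
import random


def differential_evolution(loss, bounds, pop_size=20, generations=100,
                           f=0.8, cr=0.9, seed=0):
    """Minimise a black-box `loss` over box `bounds` with rand/1/bin DE."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [loss(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct candidates other than the current one.
            others = [p for k, p in enumerate(population) if k != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # guarantees at least one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    lo, hi = bounds[d]
                    value = a[d] + f * (b[d] - c[d])
                    trial.append(min(max(value, lo), hi))  # clip to the bounds
                else:
                    trial.append(population[i][d])
            trial_score = loss(trial)
            if trial_score <= scores[i]:  # keep the trial only if it is no worse
                population[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


def stand_in_loss(params):
    """Hypothetical cheap loss; a real one would train LDA and measure topic variance."""
    x, y = params
    return (x - 3.0) ** 2 + (y + 1.0) ** 2


best_params, best_score = differential_evolution(stand_in_loss, [(-5.0, 5.0), (-5.0, 5.0)])
```

Each generation costs `pop_size` loss evaluations, which is why the slide recommends models that train within a few minutes.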
  28. Summary
      ● LDA is used to find latent topics in documents
      ● LDA converges well enough and gathers good words for each topic to describe it well.
      ● Can be used for feature extraction from documents
      ● Model evaluation is the difficult part; use coherence scores along with other measures.
  29. Q&A
  30. Thank you. Haridas N