
This talk covers how to find latent topics in a collection of documents without any labels (unsupervised learning). It also covers Latent Dirichlet Allocation (LDA), a document clustering model that can be used in multiple NLP pipelines, e.g. document clustering, topic evaluation, feature extraction, document similarity studies and text summarisation. Evaluating the quality of results from such unsupervised models is a challenge; we will discuss a few effective evaluation methods.


- 1. Document clustering using LDA Haridas N <haridas.n@imaginea.com> @haridas_n
- 2. Agenda ● Introduction to LDA ● Other clustering methods ● Model pipeline and training ● Evaluating LDA model results ○ How to measure the quality of results ○ Evaluate the coherence of the topics ○ Cross-check that the patents in a cluster are similar
- 3. LDA: Find natural categories of millions of documents, and suggest a name for each category.
- 5. LDA - Latent Dirichlet Allocation ● A generative probabilistic model, which generates documents from topics and topics from vocabulary terms. ● An unsupervised model ● Other clustering algorithms: LSI, PLSI and K-means
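To make the generative story concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python. This is a sketch for illustration only; the talk's actual models are trained with Spark MLlib or gensim, and all function and variable names here are my own. It recovers the Document x Topic (theta) and Topic x Vocab (phi) distributions from token counts and the Dirichlet priors alpha and beta:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Count tables: document-topic counts, topic-word counts, topic totals.
    ndk = [[0] * num_topics for _ in docs]
    nkw = [defaultdict(int) for _ in range(num_topics)]
    nk = [0] * num_topics
    # Random initial topic assignment for every token.
    z = [[rng.randrange(num_topics) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment, then resample.
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Collapsed posterior: (doc-topic + alpha) * (topic-word + beta) / norm.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Posterior point estimates: phi is Topic x Vocab, theta is Document x Topic.
    phi = [{w: (nkw[t][w] + beta) / (nk[t] + V * beta) for w in vocab}
           for t in range(num_topics)]
    theta = [[(ndk[d][t] + alpha) / (len(docs[d]) + num_topics * alpha)
              for t in range(num_topics)] for d in range(len(docs))]
    return phi, theta
```

Both outputs are proper distributions: each row of theta and each row of phi sums to 1, which is what downstream steps (K-means, coherence) rely on.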
- 7. LSI ● Dimensionality-reduction method using truncated SVD. ● Document-term matrix D is N x V ● SVD factorises D into N x T and T x V matrices ● It lacks the interpretability of the topics. ● And the representation quality isn't that good.
- 8. PLSI ● An extension of LSI that makes it a probabilistic model
- 9. LDA Model ● Plate notation of the LDA probabilistic graphical model. ● Uses Bayesian inference to find the best likelihood estimate. ● Uses Dirichlet priors for the topic and vocab distributions, hence the name LDA ● Alpha and Beta are the Dirichlet priors ● K topics ● N vocabs ● M documents
- 10. K-means clustering ● K-means applied on top of the Document x Topic dataset. ● After the patents are rearranged by spatial location, we can assign each cluster a topic number based on the patents already in it. ● LDA acts as a dimensionality reduction from the sparse Document x Vocab dataset to the dense Document x Topic matrix. ● K-means does a good job on dense vectors.
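The clustering step above can be sketched as plain Lloyd's K-means over the dense Document x Topic vectors. This is a minimal illustration only; in practice you would use Spark MLlib's or scikit-learn's KMeans, and the names here are my own:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means over dense Document x Topic vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each document goes to its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centroids
```

Because LDA topic vectors are low-dimensional and dense, the Euclidean distances here are meaningful, which is exactly why the slide pairs K-means with LDA rather than running it on the sparse Document x Vocab matrix.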
- 12. Feature Engineering ● Tokenization and text cleanups ● Apply standard and custom stopword filtering ● Noun-chunk extraction using spacy- or nltk-based taggers. ● N-gram features ○ If a lot of data is available, unigrams alone give pretty good results. ● Stemming / lemmatization ● TF-IDF based feature selection
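A minimal sketch of this feature pipeline using only the standard library (in practice spacy/nltk and a real TF-IDF vectoriser would be used; the stopword list and helper names here are illustrative):

```python
import math
import re
from collections import Counter

# Illustrative stopword list; extend with domain-specific custom stopwords.
STOPWORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Lowercase, keep letter runs only, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def add_bigrams(tokens):
    """Append bigram features to the unigram stream (n-gram features)."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def tfidf(docs_tokens):
    """Per-document TF-IDF scores; low-scoring terms can be pruned as features."""
    n = len(docs_tokens)
    df = Counter(t for d in docs_tokens for t in set(d))
    out = []
    for d in docs_tokens:
        tf = Counter(d)
        out.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return out
```

A term that appears in every document gets an IDF of zero, which is the same effect the slide's max-df tuning aims for: corpus-wide words carry no topical signal.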
- 13. Model Pipeline: Documents → Tokenize → Pre-processing → BOW (D x V) → LDA → D x T & T x V
- 14. Training
- 15. Tech stack ● Developed on Spark MLlib (or you can use gensim if the dataset is smaller) ● Has to handle millions of documents ● We use a cluster with 300 GB RAM and 50 CPU cores. ● S3 to persist the data ● Pre- and post-processing pipelines
- 16. Hyperparameters ● Doc-concentration prior (alpha) ● Topic-concentration prior (beta) ● Number of topics (K) ● Iterations ● Vocab size or feature size (N) in BOW format ● Max-df tuning ● Custom stopwords to further prune noisy vocabs.
- 17. Model Evaluation
- 18. Challenges in model evaluation ● LDA is an unsupervised model: how do we cross-check convergence? ● Test-set validation? ● What measure do we use for grid search? ● How do we compare two LDA runs? ● We want to avoid the human bias involved in comparing topics manually
- 19. Model Evaluation Methods ● Perplexity: ensure the log-likelihood function is at its maximum, which brings perplexity down. ● Plot the sum of the probabilities of the top 10 vocabs from the Topic x Vocab matrix. ● Topic coherence evaluation ● Topic dependency score ● Manual evaluation framework.
- 20. Perplexity ● A measure of whether a probabilistic model's likelihood function has reached its maximum. ● Applied on a held-out or test dataset. ● This measure is used to tune one parameter while keeping the others constant, similar to elbow-point identification in K-means. ● Perplexity doesn't measure the contextual information between words; it is a per-word measure. ● So it's not directly usable as a final model evaluation metric. We can use it to tune the hyperparameters of the model.
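Perplexity itself is simple to compute once the model assigns a probability to each held-out token: it is the exponential of the negative mean per-word log-likelihood. A small sketch (function and argument names are illustrative):

```python
import math

def perplexity(doc_word_probs):
    """Perplexity over a held-out set: exp(-mean per-word log-likelihood).

    doc_word_probs: for each held-out document, the model's probability of
    each observed token, i.e. p(w | d) = sum_t theta[d][t] * phi[t][w].
    Lower perplexity means the model predicts held-out words better.
    """
    log_lik = sum(math.log(p) for doc in doc_word_probs for p in doc)
    n_tokens = sum(len(doc) for doc in doc_word_probs)
    return math.exp(-log_lik / n_tokens)
```

A sanity check on the definition: if the model is uniform over a vocabulary of size V, every token has probability 1/V and the perplexity is exactly V, i.e. the model is "as confused as" a V-sided die.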
- 21. Probability sum of top 10 vocabs from T x V matrix
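This plot is easy to produce from the Topic x Vocab matrix: for each topic, sort the word probabilities and sum the top N. A sketch, assuming each topic is simply a list of word probabilities (names are illustrative):

```python
def top_word_mass(topic_vocab, top_n=10):
    """For each row of the Topic x Vocab matrix, sum the top-N word probabilities.

    A well-converged topic concentrates probability mass on a few words, so
    this sum should be relatively high; a near-uniform topic scores low.
    """
    return [sum(sorted(row, reverse=True)[:top_n]) for row in topic_vocab]
```

Plotting these sums across topics makes degenerate, flat topics stand out immediately.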
- 22. Wordcloud based on the word weightage for a topic
- 23. Coherence Scores ● The method that best matches manual verification. ● Checks whether a topic's top words really co-occur in the documents or not. ● We can control the context window: full document, paragraph or sentence. ● A custom sliding window can also be applied. ● The gensim library provides off-the-shelf implementations of the standard coherence scores.
- 24. Different Coherence methods ● Umass - Boolean document estimation ● UCI - Sliding window based document estimation
- 25. Different Coherence methods ● NPMI - Sliding window based co-occurrence counting, etc. ● Java implementation: https://github.com/dice-group/Palmetto ● Reference: https://labs.imaginea.com/post/how-to-measure-topic-coherence/
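For illustration, the UMass score can be computed directly from Boolean document co-occurrence counts; this is a sketch of the published formula with its add-one smoothing, not a replacement for gensim's CoherenceModel or Palmetto, and the helper names are my own:

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence for one topic's ranked top-N words.

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs, where D counts the
    documents containing the given word(s) and w_l is ranked above w_m.
    Higher (less negative) means the topic's words co-occur more often.
    """
    doc_sets = [set(d) for d in docs]
    def df(*words):
        # Boolean document frequency: in how many docs do all words appear?
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for wl, wm in combinations(top_words, 2):
        score += math.log((df(wl, wm) + 1) / df(wl))
    return score
```

Because it only needs Boolean per-document counts, UMass is cheap to compute on the training corpus itself, whereas the sliding-window measures (UCI, NPMI) need position-level co-occurrence statistics.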
- 26. Coherence scores are used to compare models - UMass ● LDA Run 1: -5.403614 ● LDA Run 2: -2.780710 ● LDA Run 3: -3.300038 ● The higher the score, the better; here Run 2 is the best.
- 27. Topic dependency - Jaccard distance ● Find how close or distant the topics are ● Helpful to know whether your topics are highly dependent or specific in nature ● It's very easy to calculate using the top N words from each topic-vocab distribution. ● The median overlap score can be used as an optimisation parameter for grid search.
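The calculation is indeed only a few lines given each topic's top-N word list; the median of the pairwise distances is the grid-search signal mentioned above (function names are illustrative):

```python
import statistics

def topic_jaccard_distance(topic_a, topic_b, top_n=10):
    """Jaccard distance between two topics' top-N word sets.

    1.0 means no shared top words (specific, independent topics);
    0.0 means identical top words (fully dependent topics).
    """
    a, b = set(topic_a[:top_n]), set(topic_b[:top_n])
    return 1 - len(a & b) / len(a | b)

def median_topic_distance(topics, top_n=10):
    """Median pairwise Jaccard distance; usable as a grid-search objective."""
    dists = [topic_jaccard_distance(a, b, top_n)
             for i, a in enumerate(topics) for b in topics[i + 1:]]
    return statistics.median(dists)
```

Maximising the median distance during grid search pushes the model toward topic sets whose top words do not overlap, i.e. specific rather than dependent topics.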
- 28. Grid search for the best parameters ● Make use of LDADE. ● Differential evolution methods optimise any black-box function ● Best fit if you are training on a small dataset, as you need to do hundreds of model-training runs to find a good parameter set; otherwise you need a big cluster to reduce the training time. ● LDADE reduces the overall search space, but the number of runs is still not very low ● A rule of thumb you can apply: if your model trains within a few minutes, it's ideal. ● The topic variance between two runs is used as the loss function. ● Reference: https://labs.imaginea.com/reference/lda-tuning/
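The differential-evolution core that LDADE builds on fits in a short function. This is a generic DE/rand/1/bin sketch, not the LDADE implementation; in real tuning the `loss` callback would train LDA with the candidate hyperparameters and return the topic variance between two runs:

```python
import random

def differential_evolution(loss, bounds, pop_size=15, f=0.8, cr=0.9,
                           gens=60, seed=0):
    """Minimal DE/rand/1/bin: minimise a black-box loss over box-bounded params."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [loss(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            # Mutation: combine three distinct other members (rand/1 scheme).
            a, b, c = rng.sample([x for j, x in enumerate(pop) if j != i], 3)
            jrand = rng.randrange(dim)  # guarantee at least one mutated dim
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or j == jrand:
                    trial.append(min(max(a[j] + f * (b[j] - c[j]), lo), hi))
                else:
                    trial.append(pop[i][j])
            # Greedy selection: keep the trial only if it improves the loss.
            s = loss(trial)
            if s < scores[i]:
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

The expense the slide warns about is visible here: pop_size x gens loss evaluations, each of which is a full LDA training run, which is why small datasets or a big cluster are needed.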
- 29. Summary ● LDA has been used to find latent topics in documents ● LDA converges well enough and accumulates good words for each topic to describe it well. ● Can be used for feature extraction from a document ● Model evaluation is the difficult part; use coherence scores along with other measures.
- 30. QA
- 31. Thank you Haridas N <hn@haridas.in>