Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks

Experimental work on the use of topic modeling to implement and improve some common Information Retrieval and Word Sense Disambiguation tasks.
It first describes the scenario, the pre-processing pipeline, and the framework used. It then discusses the investigation of several hyperparameter configurations for the LDA algorithm.
The work then addresses the retrieval of relevant documents through two approaches: inferring the topic distribution of a held-out document (or query) and comparing it with those of the collection's documents to retrieve similar ones, or an approach driven by probabilistic querying. The last part of the work investigates the word sense disambiguation task.
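
The first retrieval approach, comparing the topic distribution of a query with those of the collection's documents, can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes Jensen-Shannon divergence as the comparison measure (one of the divergence metrics named in the slides), and the document ids and topic distributions are made-up values for T = 3 topics.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and at most log(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def rank_by_topic_similarity(query_theta, doc_thetas):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(doc_thetas,
                  key=lambda d: js_divergence(query_theta, doc_thetas[d]))

# Hypothetical topic distributions (illustrative numbers only).
doc_thetas = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.10, 0.10, 0.80],
    "doc_c": [0.60, 0.30, 0.10],
}
query_theta = [0.75, 0.20, 0.05]

ranking = rank_by_topic_similarity(query_theta, doc_thetas)
```

Jensen-Shannon is used here because it is symmetric and stays finite when one distribution has zero entries; a symmetrised Kullback-Leibler divergence would need smoothed distributions to avoid division by zero.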

Published in: Data & Analytics


  1. Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks. Università degli Studi di Milano - Bicocca. Di Donato Leonardo. Text Mining Course - Prof. Fabio Stella.
  2. Introduction. The amount of digital unstructured information is superabundant and continues to grow at an astonishing rate (it doubles every two years). People cannot manage it: information overload. Problems: crawling, representing, storing, summarizing, clustering, searching ... (general rule: every problem is an opportunity). Opportunity: automatically extract value from chaos. What value? How to do it?
  3. Goals. The value that we want to extract is clusters of semantically related documents. Our purpose is [1] the unsupervised clustering of a text dataset, [2] the implementation of information retrieval procedures that exploit the representation of documents at the topic level, and [3] the modeling of the ability to computationally identify the meaning of words in context (word sense disambiguation). Dataset. Our document collection is a partition of the Associated Press dataset: ~2300 English news texts (dating back to the '90s). Like any text document, it is often messy and has flaws and noise, so we need to clean the data and we need a structured representation of it.
  4. Pre-Processing. Google Refine [ link ]: [1] replacement of abbreviations and common entities with expressions that normalize them (e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million}); [2] correction of flaws; [3] stripping of metadata entities through regular expressions. MALLET [ link ]: [1] conversion of all characters to lowercase; [2] tokenization; [3] stop-word removal; [4] proportional vocabulary cut-off, with threshold 0.03; [5] term-frequency representation of each document. The corpus is a single file in which every line is one document. Results: |W| = 32349 token types, 241908 words.
  5. Topic Models. Probabilistic generative models for uncovering the underlying semantic structure of a document collection, based on a Bayesian analysis of the original texts [Blei, 2003]. Goal: discover patterns of word use and connect documents that exhibit similar patterns. Idea: documents are mixtures of topics (assignments), and each topic is a multinomial probability distribution over words. Which topics generated the given corpus of documents with maximum likelihood? We have to infer 3 latent variables: [1] the word distribution of each topic, Φ(j) = P(W|Z = j); [2] the topic distribution of each document, Θ(d) = P(Z|D = d); [3] the word-topic assignments, P(Z|W).
  6. Topic Models. The Latent Dirichlet Allocation (LDA) model associates two smoothing hyper-parameters, α and β, with [2] and [1] respectively. αj can be read as a prior count of the number of times topic j has been selected for a document (α1, ..., αT are the parameters of a Dirichlet prior). β is the parameter of a Dirichlet prior and can be read as a prior count of the words extracted from a topic (before observing any corpus document). We need to estimate the distributions Φ and Θ. Different methods can be used (e.g., Gibbs sampling), and the two distributions can then be computed directly from the matrices of counts.
  7. Tuning. Which are the best values for the hyper-parameters? Usually α = 50/T and β = 0.01 give the best results [Steyvers and Griffiths, 2007]. What is the optimal number of topics T, and the optimal number of iterations I? It depends on the specific problem and remains an open question; we set T = 35 and T = 40. There are topic evaluation techniques that try to address this problem, and we used one of them (the topic coherence metric, which evaluates the semantic coherence of a topic) to compare two model configurations: symmetric α versus asymmetric α.
  8. Symmetric α versus Asymmetric α. An asymmetric configuration (AS) for the α hyper-parameters allows the degree of topic sparseness to be calibrated more flexibly. It has been empirically demonstrated that optimizing the Dirichlet hyper-parameters (α1, ..., αT) of the topic-document distribution makes a huge difference: topics are not dominated by very common words and they are more stable as their number increases [Wallach, 2009]. Our experiments did not confirm this: the average topic coherence for the AS configuration was worse than for the symmetric (SS) configuration. Why? In our corpus there is no topic that tends to occur in each document (or the optimal T may be greater, or the answer is simply more trivial ...).
  9. Top topics for symmetric α and T = 35
  10. Post-Processing - Information Retrieval. Why should we use topic models to improve information retrieval tasks? [1] We can cluster queries according to the extracted topics; [2] two documents that share no common words can still be measured as similar. The query likelihood model is a basic approach to information retrieval in this context (a generative model): we can evaluate how well a document matches a query by specifying how the words of the query may have been generated by a language model. We derive a language model for each document (a mixture of topics), so the relevant documents will have a topic distribution that is likely to have generated the set of words contained in (or associated with) the query → document similarity.
  11. Document Similarity. Two approaches to compute the similarity between documents: [1] probabilistic query approach; [2] comparison of the topic distributions of documents. How? Through divergence metrics (e.g., symmetrised Kullback-Leibler, Jensen-Shannon).
  12. Similar documents for query "forest fire": AP880727-0015 X Fire-spitting helicopters were dispatched to Yellowstone National Park on Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ...
  13. Post-Processing - Word Sense Disambiguation. The ability to identify the meaning of words in context in a computational manner is usually referred to as Word Sense Disambiguation. Four elements: [1] selection of word senses (i.e., the classes); [2] use of external knowledge sources; [3] representation of context; [4] selection of an automatic classification method. Input: a user-specified context document dc that contains the word wx to be disambiguated. [1] → given the s words most similar to wx, for each of them we build a sense document capturing synsets, glosses, example phrases, and other relevant relations from WordNet. [2] → WordNet is the external knowledge source used to create the sense documents ds. [3] → the topical and the semantic features. [4] → comparison of document dc with each of the s documents ds (using one of the two approaches presented): the most similar one gives the sense of word wx in context dc.
  14. Word Similarity. Two possible approaches to compute the similarity between words: [1] associative relation; [2] comparison of the (topics-words) P(Z|W) distributions.
  15. Words similar to token "arab"
  16. Future Work. Topic modeling → ● train an LDA model with asymmetric α for increasing values of T and evaluate the resulting quality of the topics ● train an LDA model with asymmetric α on a vocabulary to which no proportional cut-off has been applied ● investigate a possible implementation of a multiple-chain model to obtain more stable topics ● use other topic evaluation metrics. Information retrieval → ● assess and fine-tune the prior probability of a document in the query likelihood model ● use other high-frequency metrics (e.g., α-skew) for the comparison of distributions. Word sense disambiguation → ● implement and evaluate other methods to compare the context document and the sense documents (e.g., compute P(dc, ds) under the assumption that they are conditionally independent given the topic variable) ● refine the mechanism of sense selection (e.g., choosing each of the s most probable words within a probability interval in order to minimize the risk that all the most similar words refer to strictly correlated meanings).
  17. Thank you for your attention.
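
Two steps from the slides lend themselves to a short sketch: computing Φ and Θ from the matrices of counts (slide 6) and scoring documents with the query likelihood model (slide 10). The sketch below is an illustration under stated assumptions, not the author's implementation. The slides do not give the formulas, so it assumes symmetric α and β, the smoothed count-ratio estimator commonly used after Gibbs sampling, and a uniform prior over documents. The function names, the three-word vocabulary, and all counts are hypothetical.

```python
import math

def estimate_phi(word_topic_counts, beta):
    """phi[j][w] = (C_wj + beta) / (sum over w of C_wj + W * beta)."""
    num_words = len(word_topic_counts)
    num_topics = len(word_topic_counts[0])
    phi = []
    for j in range(num_topics):
        column = [word_topic_counts[w][j] for w in range(num_words)]
        denom = sum(column) + num_words * beta
        phi.append([(c + beta) / denom for c in column])
    return phi

def estimate_theta(doc_topic_counts, alpha):
    """theta[d][j] = (C_dj + alpha) / (sum over j of C_dj + T * alpha)."""
    theta = []
    for row in doc_topic_counts:
        denom = sum(row) + len(row) * alpha
        theta.append([(c + alpha) / denom for c in row])
    return theta

def query_log_likelihood(query_word_ids, phi, theta_d):
    """log P(q | d) = sum over query words of log sum_j phi[j][w] * theta_d[j]."""
    return sum(
        math.log(sum(phi[j][w] * theta_d[j] for j in range(len(theta_d))))
        for w in query_word_ids
    )

# Toy data (made up): word ids 0="forest", 1="fire", 2="stock"; T = 2 topics.
word_topic_counts = [[3, 0], [1, 0], [0, 4]]  # rows: words, columns: topics
doc_topic_counts = [[4, 0], [0, 4]]           # rows: documents, columns: topics

phi = estimate_phi(word_topic_counts, beta=0.01)     # beta as on slide 7
theta = estimate_theta(doc_topic_counts, alpha=0.5)  # toy alpha, not 50/T

query = [0, 1]  # word ids for "forest fire"
# With a uniform document prior, ranking by log P(q | d) ranks by P(d | q).
ranked = sorted(range(len(theta)),
                key=lambda d: query_log_likelihood(query, phi, theta[d]),
                reverse=True)
```

Scores are accumulated in log space so that longer queries do not underflow. Because β > 0, every word has non-zero probability under every topic, so the logarithm is always defined.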
