Information retrieval as statistical translation


Presentation on the research paper "Information Retrieval as Statistical Translation" by Adam Berger and John Lafferty, 1999.

  1. Information Retrieval as Statistical Translation
     ADAM BERGER & JOHN LAFFERTY, 1999
     Bhavesh Singh, 2010cs50281
  3. INTRODUCTION
     • Information Retrieval (IR): obtaining information resources relevant to an information need from a collection of information resources (documents).
     • Predicting relevance is the central goal of IR.
     • This paper presents a new probabilistic approach to IR based on the ideas and methods of statistical machine translation.
     • A model is a medium between data and understanding.
     • Ultimately, document retrieval systems must be sophisticated enough to handle polysemy and synonymy.
  4. INTRODUCTION (…cont.): SOME BASIC TERMINOLOGY
     • PRECISION is the fraction of the retrieved documents that are relevant to the user's information need.
     • RECALL is the fraction of the relevant documents that are successfully retrieved.
     • There is an inverse relationship between precision and recall.
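The two definitions above can be illustrated with a minimal Python sketch; the document IDs are invented placeholders, not data from the paper:

```python
# Precision and recall for a single retrieval run. The document IDs
# used below are illustrative placeholders only.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant          # relevant documents actually returned
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# 4 documents retrieved, 2 of which are among the 3 relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
print(p, r)  # 0.5 0.6666666666666666
```

Returning more documents tends to raise recall while lowering precision, which is the inverse relationship the slide mentions.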
  5. MODEL OF QUERY GENERATION
     • The user U has an information need I.
     • From this need, the user generates an ideal document d.
     • Ideal document: a perfect fit for the user, but almost certainly not present in the retrieval system's collection of documents.
     • The user selects a set of key terms from d and generates a query q from this set.
     • In this setting, the task of a retrieval system is to find those documents most similar to d.
  6. THE RETRIEVAL SYSTEM'S TASK
     • Find the most likely documents given the query, that is, those d for which p(d | q, U) is highest.
     • By Bayes' law: p(d | q, U) = p(q | d, U) p(d | U) / p(q | U).
     • The denominator p(q | U) is fixed for a given query and user, so we can ignore it for the purpose of ranking documents and define the relevance of a document to a query as p(q | d, U) p(d | U).
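The ranking rule above can be sketched in Python; `query_likelihood` and `prior` are hypothetical callables standing in for p(q | d, U) and p(d | U), and the scores below are made up for illustration:

```python
def rank_documents(query, docs, query_likelihood, prior):
    """Rank documents by p(q | d, U) * p(d | U); the denominator
    p(q | U) is constant across documents, so it is dropped."""
    return sorted(docs,
                  key=lambda d: query_likelihood(query, d) * prior(d),
                  reverse=True)

# Toy example with invented likelihoods and a uniform prior.
likelihoods = {"d1": 0.2, "d2": 0.5}
ranking = rank_documents("q", ["d1", "d2"],
                         lambda q, d: likelihoods[d],
                         lambda d: 1.0)
print(ranking)  # ['d2', 'd1']
```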
  7. 2-POISSON MODEL (PREVIOUS WORK)
     • The 2-Poisson model is a mixture, that is, a linear combination, of two Poisson distributions:
       p(X = x) = α e^(−λ) λ^x / x! + (1 − α) e^(−μ) μ^x / x!
       where λ is the mean frequency of term t in its elite set Et, the few documents in which t occurs densely and non-randomly, and μ is its mean frequency in the remaining documents.
     • In the context of IR, the 2-Poisson model is used to model the probability distribution of the frequency X of a term in a collection of documents.
     • The effectiveness of the 2-Poisson model for document retrieval was never tested, for two reasons. First, learning the three parameters for each term with the Expectation Maximization (EM) algorithm is expensive, and large collections in general contain millions of terms. Second, the model does not take document size into account, so it would have to be extended to normalize for different document lengths.
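A minimal sketch of the mixture above, assuming α is the probability that a document is elite for the term and λ, μ are the elite and non-elite mean frequencies (the parameter values are invented):

```python
import math

def two_poisson_pmf(x, alpha, lam_elite, lam_other):
    """Mixture of two Poissons for the frequency x of a term: elite
    documents have mean lam_elite, the rest have mean lam_other."""
    def poisson(x, lam):
        return math.exp(-lam) * lam ** x / math.factorial(x)
    return alpha * poisson(x, lam_elite) + (1 - alpha) * poisson(x, lam_other)

# The mixture is a proper distribution: it sums to 1 over all frequencies.
total = sum(two_poisson_pmf(x, 0.3, 4.0, 0.5) for x in range(60))
print(round(total, 6))  # 1.0
```

Fitting α, λ, and μ per term is the EM cost the slide refers to.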
  8. STATISTICAL MACHINE TRANSLATION
     • Automatic translation by computer was first contemplated by Warren Weaver when modern computers were in their infancy.
     • The central problem of statistical MT is to build a system that automatically learns how to translate text by inspecting a large set of sentences in one language along with their translations into another language.
     • Let the probability of each English word e translating to each French word f be t(f | e).
  9. STATISTICAL MT (…cont.)
     • The probability that an English sentence e = {e1, e2, …} translates to a French sentence f = {f1, f2, …} is calculated as
       p(f | e) = Γ Π_j Σ_k t(f_j | e_k)
       where Γ is a normalizing factor.
     • The hidden variable in this model is the alignment a between the French and English words: a_j = k means that the kth English word translates to the jth French word; summing over all alignments yields the product of sums above.
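Because the alignment sum factorizes, each French word contributes an independent sum over the English words; a sketch with a toy, hand-made translation table `t` (the probabilities are invented for illustration):

```python
def sentence_translation_prob(f_words, e_words, t, gamma=1.0):
    """p(f | e) = gamma * product over French words f_j of the
    sum over English words e_k of t(f_j | e_k)."""
    prob = gamma
    for f in f_words:
        prob *= sum(t.get((f, e), 0.0) for e in e_words)
    return prob

# Toy translation table: t[(french, english)] = t(f | e).
t = {("le", "the"): 0.7, ("chat", "cat"): 0.8, ("chat", "the"): 0.1}
print(sentence_translation_prob(["le", "chat"], ["the", "cat"], t))
# 0.7 * (0.1 + 0.8) = 0.63
```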
  10. MODEL OF DOCUMENT-QUERY TRANSLATION
     • First, a word w is chosen at random from the document d according to a distribution l(w | d) that we call the document language model.
     • Next, w is translated into the word or phrase q according to a translation model with parameters t(q | w).
     • Thus, the probability of choosing q as a representative of the document d is
       p(q | d) = Σ_w t(q | w) l(w | d).
     • We assume the sample-size model φ(n | d) is a Poisson distribution with mean λ(d):
       φ(n | d) = e^(−λ(d)) λ(d)^n / n!
  11. MODEL OF DOCUMENT-QUERY TRANSLATION (…cont.)
     • Treating the number of samples n as Poisson-distributed, the probability that a particular query q = q1, q2, … qm is generated is
       p(q | d) = φ(m | d) Π_{i=1..m} Σ_w t(q_i | w) l(w | d).
     • This is Model 1 of document-query translation, inspired by the IBM statistical translation models.
     • To fit the translation probabilities of Model 1, the Expectation Maximization (EM) algorithm is used.
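Model 1's query likelihood can be sketched as follows; `doc_lm` stands for l(w | d) and `t` for the translation table, both toy inputs invented for illustration (fitting `t` with EM is not shown):

```python
import math

def model1_query_likelihood(query, doc_lm, t, lam):
    """p(q | d) = phi(m | d) * product over query terms q_i of the
    sum over document words w of t(q_i | w) * l(w | d)."""
    m = len(query)
    phi = math.exp(-lam) * lam ** m / math.factorial(m)  # Poisson sample size
    prob = phi
    for q in query:
        prob *= sum(t.get((q, w), 0.0) * lw for w, lw in doc_lm.items())
    return prob

# Toy document language model and translation table.
doc_lm = {"feline": 0.5, "pet": 0.5}
t = {("cat", "feline"): 0.9, ("cat", "pet"): 0.4}
print(model1_query_likelihood(["cat"], doc_lm, t, lam=1.0))
```

Note that "cat" gets probability mass even though it never occurs in the document: the translation table bridges synonymy, which is the point of the model.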
  12. MODEL 0, THE SIMPLEST CASE: WORD-FOR-WORD TRANSLATION
     • The simplest version of Model 1, which we distinguish as Model 0, is one where each word w can be translated only as itself; that is, the translation probabilities are "diagonal":
       t(q | w) = 1 if q = w, and 0 otherwise.
     • Under this model, the query terms are chosen simply according to their frequency of occurrence in the document, and the query probability is simply
       p(q | d) = φ(m | d) Π_{i=1..m} l(q_i | d).
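With the diagonal translation table, the sum over document words collapses and each query term is scored by its own frequency in the document; a sketch, with a toy language model invented for illustration:

```python
import math

def model0_query_likelihood(query, doc_lm, lam):
    """Model 0: t(q | w) = 1 iff q == w, so p(q | d) reduces to
    phi(m | d) * product over query terms of l(q_i | d)."""
    m = len(query)
    phi = math.exp(-lam) * lam ** m / math.factorial(m)  # Poisson sample size
    prob = phi
    for q in query:
        prob *= doc_lm.get(q, 0.0)  # a term absent from d zeroes the score
    return prob

doc_lm = {"cat": 0.5, "dog": 0.25, "fish": 0.25}  # toy l(w | d)
print(model0_query_likelihood(["cat", "dog"], doc_lm, lam=2.0))
```

Unlike Model 1, a query term that never occurs in the document drives the whole score to zero, which is why Model 0 cannot handle synonymy.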
  13. EXPERIMENTAL RESULTS
     • Precision-recall plots. The left plot compares Model 1 to Model 0 on the SDR data. The right plot compares the same language model scored according to Model 0, demonstrating that the approximations are very good.
  14. CRITIQUE
     • The 2-Poisson model was never tested, one reason being that learning three parameters per term is expensive, since the Expectation Maximization algorithm takes several iterations to converge.
     • This paper likewise fits the translation probabilities of Model 1 with the EM algorithm, so that is also an expensive operation.
     • The efficiency of EM in Model 1 is not discussed in depth; it should be elaborated further.
  15. REFERENCES
     [1] Adam Berger and John Lafferty, "Information Retrieval as Statistical Translation", 1999.
     [2] Giambattista Amati, "Two Poisson Model", Fondazione Ugo Bordoni.
     [3] Robert Barbey, "Information Retrieval as Statistical Translation".
     [4] Wikipedia article on "Information Retrieval".
  16. THANK YOU