
- 1. Introduction to Information Retrieval: Language Models for Information Retrieval, by C.D. Manning, P. Raghavan, and H. Schütze.
  Presentation by Dustin Smith
  The University of Texas at Austin, School of Information
  dustin.smith@utexas.edu
  10/3/2011
  INF384H / CS395T: Concepts of Information Retrieval
- 2. Christopher Manning – background
  BA, Australian National University, 1989 (majors in mathematics, computer science, and linguistics)
  PhD, Stanford Linguistics, 1995
  Asst Professor, Carnegie Mellon University Computational Linguistics Program, 1994-96
  Lecturer, University of Sydney Dept of Linguistics, 1996-99
  Asst Professor, Stanford University Depts of Computer Science and Linguistics, 1999-2006
  Current: Assoc Professor, Stanford University Depts of Linguistics and Computer Science
- 3. Prabhakar Raghavan – background
  Undergraduate degree in electrical engineering from IIT Madras
  PhD in computer science from UC Berkeley
  Current: Working at Yahoo! Labs and a Consulting Professor of Computer Science at Stanford University
- 4. Hinrich Schütze – background
  Technical University of Braunschweig: Vordiplom Mathematik, Vordiplom Informatik
  University of Stuttgart: Diplom Informatik (MSCS)
  Stanford University: PhD, Computational Linguistics
  Current: Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing, University of Stuttgart
- 5. Chapter/Presentation Outline
  Introduction to the concept of language models
  Finite automata and language models
  Types of language models
  Multinomial distributions over words
  Description of the query likelihood model
  Using query likelihood language models in IR
  Estimating the query generation probability
  Ponte and Croft's experiments
  Comparison of the language modeling approach to IR against other approaches to IR
  Description of various extensions to the language modeling approach
- 6. Language Models
  Based on the concept that a document is a good match for a query if the document's model is likely to generate the query.
  An alternative to the straightforward, traditional query-document probability model.
- 7. Finite automata and language models (238)
  - In figure 12.1 the alphabet is {"I", "wish"} and the language produced by the model is {"I wish", "I wish I wish", "I wish I wish I wish", ...}
- 8. The process is analogous for a document model
- 9. Figure 12.2 represents a single node with a single distribution over terms such that ∑_{t ∈ V} P(t) = 1.
- 10. Calculating phrase probability with stop/continue probability included (238)
  - The resulting probabilities are very small.
- 11. This calculation is shown with stop probabilities, but in practice these are left out.
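The calculation above can be sketched in a few lines. This is a minimal illustration with made-up term probabilities (not the book's figures), using one common convention for the stop probability: continue after each word except the last, then stop once.

```python
def phrase_probability(words, term_probs, p_stop=0.2):
    """P(phrase | M): multiply the per-term probabilities, then account
    for continuing after each word except the last and stopping once."""
    p = 1.0
    for w in words:
        p *= term_probs.get(w, 0.0)  # an unseen term gets probability 0
    return p * (1.0 - p_stop) ** (len(words) - 1) * p_stop

# A toy model over the alphabet {"I", "wish"} from the automaton example:
model = {"I": 0.5, "wish": 0.5}
print(phrase_probability(["I", "wish"], model))  # 0.25 * 0.8 * 0.2 = 0.04
```

As the slide notes, the stop/continue factors are usually dropped in practice, since they are the same for every document and so do not change the ranking.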
- 12. Comparison of document models (239-240)
  - In theory these models represent different documents, different alphabets, and different languages.
- 13. Given a query s = "frog said that toad likes that dog", our two model probabilities are calculated by simply multiplying term probabilities.
- 14. It is evident why P(s|M1) scores higher than P(s|M2): M1 assigns higher probabilities to the query's terms, so their product is greater.
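The two-model comparison can be reproduced directly. The term probabilities below are hypothetical stand-ins for the book's table, chosen only so that M1 favors the query's rarer terms:

```python
# Hypothetical unigram term distributions for two document models
# (illustrative values, not the figures from the textbook).
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04,
      "toad": 0.01, "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04,
      "toad": 0.0001, "likes": 0.04, "dog": 0.01}

def p_query(query, model):
    """P(s|M): product of the model's probabilities for each query term."""
    p = 1.0
    for t in query.split():
        p *= model.get(t, 0.0)
    return p

s = "frog said that toad likes that dog"
print(p_query(s, M1) > p_query(s, M2))  # True: M1 is likelier to generate s
```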
- 15. Types of language models (240)
  - Unigram language model
- 16. Bigram language model
- 17. Section conclusion
- 18. Which M_d to use?
  Chain rule
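The difference between the two model types is the conditioning context in the chain rule: P_uni(t1 t2 t3) = P(t1) P(t2) P(t3), while P_bi(t1 t2 t3) = P(t1) P(t2|t1) P(t3|t2). A rough sketch of both estimates from a single token stream (plain MLE, no smoothing):

```python
from collections import Counter

def train(tokens):
    """MLE unigram and bigram estimates from one token stream."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_uni = {t: c / n for t, c in uni.items()}
    # P(b|a) ~ count(a b) / count(a); slightly off for the final token, ignored here
    p_bi = {(a, b): c / uni[a] for (a, b), c in bi.items()}
    return p_uni, p_bi

def unigram_prob(seq, p_uni):
    p = 1.0
    for t in seq:
        p *= p_uni.get(t, 0.0)
    return p

def bigram_prob(seq, p_uni, p_bi):
    p = p_uni.get(seq[0], 0.0)  # chain rule starts from the first term
    for pair in zip(seq, seq[1:]):
        p *= p_bi.get(pair, 0.0)
    return p

p_uni, p_bi = train("i wish i wish".split())
print(unigram_prob(["i", "wish"], p_uni))       # 0.5 * 0.5 = 0.25
print(bigram_prob(["i", "wish"], p_uni, p_bi))  # 0.5 * 1.0 = 0.5
```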
- 19. Using query likelihood language models in IR (242-243)
  Using Bayes rule: P(d|q) = P(q|d) P(d) / P(q)
  P(q) is the same for all documents, and with P(d) uniform across documents, ranking by P(d|q) is equivalent to ranking by P(q|d)
  In the query likelihood model we construct a language model M_d from each document
  Goal: rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query
  The Query Likelihood Model
- 20. Using query likelihood language models in IR (242-243)
  Multinomial unigram language model:
  P(q|M_d) = K_q ∏_{t ∈ V} P(t|M_d)^{tf_{t,d}}
  K_q, the multinomial coefficient for the query q, is constant for a given query, so it is dropped for ranking
  Query generation process:
  1. Infer a LM for each document
  2. Estimate P(q|M_d), the probability of generating the query according to each of these document models
  3. Rank the documents according to these probabilities
- 21. Estimating the query generation probability (244)
  - Query generation probability = P(q|M_d)
  - M_d is the language model of document d
- 22. tf_{t,d} is the raw term frequency of term t in document d
- 23. L_d is the number of tokens in document d; the MLE is P(t|M_d) = tf_{t,d} / L_d
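Putting the pieces together, here is a sketch of query-likelihood ranking with the unsmoothed MLE P(t|M_d) = tf_{t,d}/L_d. The documents are toy examples of my own; note how a single query term missing from a document zeroes its entire score, which is what motivates the smoothing methods on the following slide.

```python
from collections import Counter

def mle_model(doc_tokens):
    """P(t|M_d) = tf_{t,d} / L_d."""
    tf = Counter(doc_tokens)
    return {t: c / len(doc_tokens) for t, c in tf.items()}

def query_likelihood(query_tokens, model):
    p = 1.0
    for t in query_tokens:
        p *= model.get(t, 0.0)  # unseen term -> whole product becomes 0
    return p

docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "click click".split(),
}
models = {d: mle_model(toks) for d, toks in docs.items()}
q = "shears click".split()
ranked = sorted(models, key=lambda d: query_likelihood(q, models[d]), reverse=True)
print(ranked)  # ['d1', 'd2']: only d1's model can generate "shears"
```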
- 24. Smoothing Methods (245-246)
  - Linear interpolation: P(t|d) = λ P_mle(t|M_d) + (1 − λ) P(t|M_c)
- 25. Bayesian smoothing: P(t|d) = (tf_{t,d} + α P(t|M_c)) / (L_d + α)
- 26. Note: MLE = maximum likelihood estimate
  Conceptually the same: the probability estimate for a word present in the document combines a discounted MLE and a fraction of the estimate of its prevalence in the whole collection.
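Both smoothing methods admit a short implementation. A sketch under the usual formulations (linear interpolation, also known as Jelinek-Mercer smoothing, and Bayesian smoothing with a Dirichlet prior); the document, collection model, and parameter values are illustrative:

```python
def linear_interpolation(t, tf, doc_len, p_coll, lam=0.5):
    """lam * MLE(t|d) + (1 - lam) * P(t | collection)."""
    return lam * tf.get(t, 0) / doc_len + (1 - lam) * p_coll.get(t, 0.0)

def bayesian_smoothing(t, tf, doc_len, p_coll, alpha=2000):
    """(tf_{t,d} + alpha * P(t | collection)) / (L_d + alpha)."""
    return (tf.get(t, 0) + alpha * p_coll.get(t, 0.0)) / (doc_len + alpha)

# Toy document: 4 tokens, "frog" occurs twice; toy collection model.
tf, L = {"frog": 2, "said": 1, "that": 1}, 4
p_coll = {"frog": 0.01, "said": 0.05, "that": 0.1, "toad": 0.005}

print(linear_interpolation("toad", tf, L, p_coll))        # 0.0025: nonzero despite tf = 0
print(bayesian_smoothing("toad", tf, L, p_coll, alpha=100))
```

Either way, a term absent from the document keeps a small nonzero probability borrowed from the collection, so one missing query term no longer zeroes the whole query likelihood.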
- 28. Ponte and Croft's Experiments (246)
  - 1998 experiments
- 29. First experiments on the language modeling approach to IR
- 30. Performed on TREC topics 202-250 over TREC disks 2 and 3
  LM much better than tf-idf (especially at higher recall levels)
- 31. LM vs. BIM vs. XML retrieval (249)
  Language models and the most successful XML retrieval models approach relevance modeling in a roundabout way, as opposed to the BIM model, which evaluates relevance directly.
  LM initially appears not to include relevance modeling
  The most successful XML retrieval models assume that queries and documents are objects of the same type
  BIM models have relevance as the central variable that is evaluated
  Language Modeling Versus Other Approaches in IR
- 32. LM vs. traditional tf-idf (249)
  The LM has significant relations to tf-idf models; they differ on a more conceptual level:
  Both directly use term frequency
  Both have a method of mixing document frequency and collection frequency to produce probabilities
  Both treat terms independently
  LM intuitions are more probabilistic than geometric
  LM mathematical models are more principled rather than heuristic
  LM differs in its use of tf and df
- 33. Document Likelihood Model (250)
  Downsides:
  - Requires much more smoothing
- 34. Results in worse estimates
  Features:
  - Uses a query to generate a document with a query language model (M_q)
- 35. Easier to incorporate relevance feedback by expanding the query with terms from relevant documents
  Extended Language Modeling Approaches
- 36. Three language model approaches (250)
  - Query likelihood
- 37. Using a document model to produce a relevant query
- 38. Document likelihood
- 39. Using a query model to produce a relevant document
- 40. Model comparison
- 41. Comparing the query model and the document model directly
- 42. Kullback-Leibler (KL) divergence (251)
  - KL divergence is an asymmetric divergence measure originating in information theory, which measures how bad the probability distribution M_q is at modeling M_d (pg. 251)
- 43. Outperforms the query likelihood and document likelihood models
- 44. But scores are not comparable across queries
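In the slide's formulation, the score is KL(M_d ‖ M_q) = ∑_{t ∈ V} P(t|M_d) log( P(t|M_d) / P(t|M_q) ). A minimal sketch with toy distributions; the second distribution must be smoothed so that every vocabulary term has nonzero probability:

```python
import math

def kl_divergence(p, q, vocab):
    """KL(p || q) = sum over vocab of p(t) * log(p(t) / q(t)).
    Terms with p(t) = 0 contribute nothing; q(t) must be > 0."""
    total = 0.0
    for t in vocab:
        pt = p.get(t, 0.0)
        if pt > 0.0:
            total += pt * math.log(pt / q[t])
    return total

m_d = {"frog": 0.5, "toad": 0.5}                 # toy document model
m_q = {"frog": 0.25, "toad": 0.25, "dog": 0.5}   # toy smoothed query model
print(kl_divergence(m_d, m_q, {"frog", "toad", "dog"}))  # log 2 ~ 0.693
```

KL divergence is zero only when the two distributions are identical and grows with their mismatch; because its magnitude depends on the query model itself, the resulting scores are not comparable across queries, as the slide notes.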
- 45. Translation Model – Features (251)
  An answer to synonymy in basic LM models
  Lets you generate query words that are not in a document by translating to alternate terms with similar meaning
  Provides a basis for executing cross-language IR
- 46. Translation Model – Issues (251)
  Computationally intensive
  Need to build the model using outside resources:
  - A thesaurus
  - A bilingual dictionary
  - A statistical machine translation system's translation dictionary
- 47. Thanks for not throwing vegetables!
  Questions?
