Language Models for Information Retrieval

1. Introduction to Information Retrieval: Language models for information retrieval, by C.D. Manning, P. Raghavan, and H. Schütze. Presentation by Dustin Smith, The University of Texas at Austin School of Information, dustin.smith@utexas.edu. 10/3/2011. INF384H / CS395T: Concepts of Information Retrieval

2. Christopher Manning – background: BA, Australian National University, 1989 (majors in mathematics, computer science and linguistics); PhD, Stanford, Linguistics, 1995; Asst Professor, Carnegie Mellon University, Computational Linguistics Program, 1994-96; Lecturer, University of Sydney, Dept of Linguistics, 1996-99; Asst Professor, Stanford University, Depts of Computer Science and Linguistics, 1999-2006; Current: Assoc Professor, Stanford University, Depts of Linguistics and Computer Science

3. Prabhakar Raghavan – background: Undergraduate degree in electrical engineering from IIT Madras; PhD in computer science from UC Berkeley; Current: works at Yahoo! Labs and is a Consulting Professor of Computer Science at Stanford University

4. Hinrich Schütze – background: Technical University of Braunschweig, Vordiplom Mathematik and Vordiplom Informatik; University of Stuttgart, Diplom Informatik (MSCS); Stanford University, Ph.D., Computational Linguistics; Current: Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing at the University of Stuttgart

5. Chapter/Presentation Outline: Introduction to the concept of Language Models (finite automata and language models; types of language models; multinomial distributions over words); Description of the Query Likelihood Model (using query likelihood language models in IR; estimating the query generation probability; Ponte and Croft’s experiments); Comparison of the language modeling approach to IR against other approaches to IR; Description of various extensions to the language modeling approach

6. Language Models: Based on the idea that a document is a good match for a query if the document model is likely to generate the query. An alternative to the traditional approach of modeling the query-document relevance probability directly.

8. The process is analogous for a document model

9. Figure 12.2 represents a single node with a single distribution over terms such that $\sum_{t \in V} P(t) = 1$. Language Models

11. This calculation is shown with stop probabilities, but in practice these are left out. Language Models

13. Given a query s = “frog said that toad likes that dog”, the probability under each of our two models is calculated by simply multiplying the probabilities that the model assigns to each term.

14. It’s evident why $P(s \mid M_1)$ scores higher than $P(s \mid M_2)$: $M_1$ assigns higher probabilities to the query's terms overall, so the product is greater. Language Models
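The multiplication described in slides 13 and 14 can be sketched in a few lines of Python. The term probabilities below are illustrative assumptions, since the figure with the two models' tables is not reproduced in this transcript; `string_probability` is a hypothetical helper name, and stop probabilities are left out, as slide 11 notes is usual in practice.

```python
from math import prod

# Illustrative term probabilities (assumed; the figure is not in this transcript).
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001, "likes": 0.04, "dog": 0.01}


def string_probability(text, model):
    """Multiply the unigram probability of every token; unseen tokens get 0."""
    return prod(model.get(token, 0.0) for token in text.split())


s = "frog said that toad likes that dog"
p1 = string_probability(s, M1)
p2 = string_probability(s, M2)
print(p1 > p2)
```

With these assumed values the product under M1 is larger because M1 gives "frog" and "toad" far more probability mass than M2 does, which outweighs M2's advantage on "likes" and "dog".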

16. Bigram Language Model

17. Section Conclusion

18. Which $M_d$ to use? Chain rule. Language Models
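For reference, the chain rule and the two simplifications named in this part of the deck can be written out for a four-term sequence. This is the standard textbook form, offered as a sketch because the slide's own equations are missing from the transcript; the subscripts uni and bi are labels introduced here.

```latex
\begin{align*}
P(t_1 t_2 t_3 t_4) &= P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3) && \text{(chain rule)} \\
P_{\mathrm{uni}}(t_1 t_2 t_3 t_4) &= P(t_1)\,P(t_2)\,P(t_3)\,P(t_4) && \text{(unigram model)} \\
P_{\mathrm{bi}}(t_1 t_2 t_3 t_4) &= P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3) && \text{(bigram model)}
\end{align*}
```

The unigram model drops all conditioning context, while the bigram model conditions each term only on the one before it.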

19. Using query likelihood language models in IR (242-243). Using Bayes rule: P(d|q) = P(q|d)P(d)/P(q). P(q) is the same for all documents and P(d) is treated as uniform across documents, so ranking by P(d|q) is equivalent to ranking by P(q|d). In the query likelihood model we construct a language model $M_d$ from each document. Goal: to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. The Query Likelihood Model

20. Using query likelihood language models in IR (242-243). Multinomial unigram language model: $P(q \mid M_d) = K_q \prod_{t \in V} P(t \mid M_d)^{tf_{t,d}}$. $K_q$ is dropped as it is constant for a particular query. Query generation process: 1. Infer a LM for each document. 2. Estimate $P(q \mid M_d)$, the probability of generating the query according to each one of these document models. 3. Rank the documents according to these probabilities. The Query Likelihood Model
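The three-step query generation process can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes whitespace tokenization, estimates each document model by unsmoothed relative frequency, and multiplies one factor per query token with the constant $K_q$ omitted. The function names and the toy documents are hypothetical.

```python
from collections import Counter
from math import prod


def build_mle_model(text):
    """Step 1: infer a unigram model for one document as tf / document length."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: tf / len(tokens) for term, tf in counts.items()}


def query_likelihood(query, model):
    """Step 2: estimate P(q|M_d) as a product over query tokens (K_q omitted)."""
    return prod(model.get(term, 0.0) for term in query.lower().split())


def rank(query, docs):
    """Step 3: sort document ids by descending query likelihood."""
    scores = {
        doc_id: query_likelihood(query, build_mle_model(text))
        for doc_id, text in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection (assumed for illustration only).
docs = {
    "d1": "the frog said that the toad likes the dog",
    "d2": "the cat said that the monkey likes the cat",
}
ranking = rank("frog likes dog", docs)
print(ranking)
```

Note that d2 scores exactly zero because it lacks "frog" and "dog"; a single missing query term wipes out the whole product, which is the motivation for smoothing the estimates.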

22. $tf_{t,d}$ is the raw term frequency of term t in document d

23. $L_d$ is the number of tokens in document d. The Query Likelihood Model
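The equation these two definitions belong to is missing from the transcript. They match the usual maximum likelihood estimate of the query generation probability, which, as an assumed reconstruction rather than the slide's exact formula, reads:

```latex
\hat{P}(q \mid M_d) \;=\; \prod_{t \in q} \hat{P}_{\mathrm{mle}}(t \mid M_d) \;=\; \prod_{t \in q} \frac{tf_{t,d}}{L_d}
```

Any query term with $tf_{t,d} = 0$ makes the whole product zero under this estimate.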

25. Bayesian Smoothing

26. Note: MLE = maximum likelihood estimate

27. Conceptually the same: The probability estimate for a word present in the document combines a discounted MLE and a fraction of the estimate of its prevalence in the whole collection.
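Two smoothing schemes that fit this description can be sketched side by side: linear interpolation between the document and collection estimates, and the Bayesian form in which the collection model acts as a prior. The function names, the parameter defaults (`lam=0.5`, `alpha=2.0`), and the toy counts are assumptions for illustration; the slide's own formulas are not in the transcript.

```python
from collections import Counter


def jelinek_mercer(term, doc_counts, coll_counts, lam=0.5):
    """Linear interpolation: lam * P_mle(t|M_d) + (1 - lam) * P_mle(t|M_c)."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    p_doc = doc_counts[term] / doc_len if doc_len else 0.0
    p_coll = coll_counts[term] / coll_len
    return lam * p_doc + (1 - lam) * p_coll


def bayesian(term, doc_counts, coll_counts, alpha=2.0):
    """Bayesian updating: (tf + alpha * P(t|M_c)) / (L_d + alpha)."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    p_coll = coll_counts[term] / coll_len
    return (doc_counts[term] + alpha * p_coll) / (doc_len + alpha)


# Toy counts (assumed): "dog" occurs in the collection but not in the document.
doc = Counter("frog said toad".split())
collection = Counter("frog said toad dog said dog".split())

print(jelinek_mercer("dog", doc, collection))
print(bayesian("dog", doc, collection))
```

In both schemes a term absent from the document still receives nonzero probability from the collection, and the estimates over the collection vocabulary still sum to one.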

29. First experiments on the language modeling approach to IR

30. Performed on TREC topics 202-250 over TREC disks 2 and 3. LM performed much better than tf-idf, especially at higher recall levels. The Query Likelihood Model

31. LM vs. BIM vs. XML retrieval (249). Language models and the most successful XML retrieval models approach relevance modeling in a roundabout way, as opposed to the BIM, which evaluates relevance directly. LM initially appears not to include relevance modeling. The most successful XML retrieval models assume that queries and documents are objects of the same type. The BIM has relevance as the central variable that is evaluated. Language Modeling Versus Other Approaches in IR

32. LM vs. traditional tf-idf (249). The LM has significant relations to tf-idf models: both directly use term frequency; both have a method of mixing document frequency and collection frequency to produce probabilities; both treat terms independently. They differ on a more conceptual level: LM intuitions are more probabilistic than geometric; LM mathematical models are more principled than heuristic; LM differs in its use of tf and df.

35. Easier to incorporate relevance feedback by expanding the query with terms from relevant documents. Extended Language Modeling Approaches

37. Using a document model to produce a relevant query

38. Document likelihood

39. Using a query model to produce a relevant document

40. Model comparison

41. Comparing these models. Extended Language Modeling Approaches

43. Outperforms query and document likelihood models

44. But scores are not comparable across queries. Extended Language Modeling Approaches
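One common way to carry out model comparison is Kullback-Leibler divergence between a query model and each document model. The transcript does not show the formula used on the slides, so the sketch below is an assumption: it scores a document by the sum over terms of P(t|M_q) log(P(t|M_q) / P(t|M_d)), with lower values ranking higher, and it assumes the document models are already smoothed so that no query term has zero probability. All names and numbers are hypothetical.

```python
from math import log


def kl_divergence(query_model, doc_model):
    """Sum over query-model terms of P(t|M_q) * log(P(t|M_q) / P(t|M_d))."""
    return sum(
        p_q * log(p_q / doc_model[term])
        for term, p_q in query_model.items()
        if p_q > 0
    )


# Toy models (assumed); document models must give every query term nonzero mass.
query_model = {"frog": 0.5, "dog": 0.5}
doc_models = {
    "d1": {"frog": 0.4, "dog": 0.4, "cat": 0.2},
    "d2": {"frog": 0.1, "dog": 0.1, "cat": 0.8},
}

# A smaller divergence means the document model is closer to the query model.
ranking = sorted(doc_models, key=lambda d: kl_divergence(query_model, doc_models[d]))
print(ranking)
```

Because the divergence depends on the query model's own entropy, the raw values are only meaningful for ordering documents within one query, which is consistent with the caveat on slide 44.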

45. Translation Model – Features (251). Answer to synonymy in basic LM models. Lets you generate query words that are not in a document by translating to alternate terms with similar meaning. Provides a basis for executing cross-language IR. Extended Language Modeling Approaches

46. Translation Model – Issues (251). Computationally intensive. Need to build the model using outside resources: a thesaurus, a bilingual dictionary, or a statistical machine translation system’s translation dictionary. Extended Language Modeling Approaches
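A translation-based query likelihood can be sketched as below. It assumes the usual formulation in which each query term t is generated by first drawing a term v from the document model and then translating v into t, so that P(q|M_d) is the product over query terms of the sum over v of P(v|M_d) T(t|v). The translation table, its values, and the function name are hypothetical; in practice the table would come from one of the outside resources listed above.

```python
def translation_likelihood(query_terms, doc_model, translation):
    """P(q|M_d) = product over query terms t of sum_v P(v|M_d) * T(t|v)."""
    likelihood = 1.0
    for t in query_terms:
        likelihood *= sum(
            p_v * translation.get(v, {}).get(t, 0.0)
            for v, p_v in doc_model.items()
        )
    return likelihood


# Hypothetical document model and translation table, stored as translation[v][t] = T(t|v).
doc_model = {"car": 0.5, "engine": 0.5}
translation = {
    "car": {"car": 0.8, "automobile": 0.2},
    "engine": {"engine": 0.9, "motor": 0.1},
}

# "automobile" never occurs in the document but is reachable through "car".
score = translation_likelihood(["automobile"], doc_model, translation)
print(score)
```

The inner sum runs over every document term for every query term, which illustrates why the approach is computationally intensive on real vocabularies.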

47. Thanks for not throwing vegetables! Questions?
