Introduction to Information Retrieval: Language models for information retrieval, by C.D. Manning, P. Raghavan, and H. Schütze. Presentation by Dustin Smith, The University of Texas at Austin, School of Information, dustin.smith@utexas.edu. 10/3/2011, INF384H / CS395T: Concepts of Information Retrieval.
Christopher Manning – background: BA, Australian National University, 1989 (majors in mathematics, computer science and linguistics); PhD, Stanford Linguistics, 1995; Asst Professor, Carnegie Mellon University Computational Linguistics Program, 1994-96; Lecturer, University of Sydney Dept of Linguistics, 1996-99; Asst Professor, Stanford University Depts of Computer Science and Linguistics, 1999-2006; Current: Assoc Professor, Stanford University Depts of Linguistics and Computer Science.
Prabhakar Raghavan – background: undergraduate degree in electrical engineering from IIT Madras; PhD in computer science from UC Berkeley; Current: working at Yahoo! Labs and Consulting Professor of Computer Science at Stanford University.
Hinrich Schütze – background: Technical University of Braunschweig, Vordiplom Mathematik and Vordiplom Informatik; University of Stuttgart, Diplom Informatik (MSCS); Stanford University, Ph.D., Computational Linguistics; Current: Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing at the University of Stuttgart.
Chapter/Presentation Outline: Introduction to the concept of Language Models; Finite automata and language models; Types of language models; Multinomial distributions over words; Description of the Query Likelihood Model; Using query likelihood language models in IR; Estimating the query generation probability; Ponte and Croft's experiments; Comparison of the language modeling approach to IR against other approaches to IR; Description of various extensions to the language modeling approach.
Language Models: based on the concept that a document is a good match for a query if the document model is likely to generate the query. An alternative to the straightforward query-document probability model (the traditional approach).
Finite automata and language models (238): In Figure 12.1 the alphabet is {"I", "wish"} and the language produced by the model is {"I wish", "I wish I wish", "I wish I wish I wish I wish", etc.}.
The process is analogous for a document model.
Figure 12.2 represents a single node with a single probability distribution over terms such that $\sum_{t \in V} P(t) = 1$.
Calculating phrase probability with stop/continue probability included (238): the resulting probability values are very small.
This calculation is shown with stop probabilities, but in practice these are left out: with a fixed stop probability, omitting it does not affect how the document models compare.
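A minimal Python sketch of that point (the term probabilities and the 0.8 continue probability below are illustrative, not the book's figures): a fixed continue/stop probability multiplies every sequence of a given length by the same constant, so leaving it out does not change how document models compare.

```python
# Minimal sketch of a one-state unigram model with an optional continue/stop
# probability. All numbers are made up for illustration.

def seq_prob(terms, model, p_continue=None):
    """P(terms) under a unigram model; optionally include continue/stop events."""
    p = 1.0
    for i, t in enumerate(terms):
        p *= model.get(t, 0.0)
        if p_continue is not None:
            # continue after every term except the last, then stop
            p *= p_continue if i < len(terms) - 1 else (1.0 - p_continue)
    return p

model = {"i": 0.4, "wish": 0.35, "frog": 0.25}   # hypothetical distribution
q = ["i", "wish", "i", "wish"]
print(seq_prob(q, model))                  # without stop probability
print(seq_prob(q, model, p_continue=0.8))  # scaled by 0.8^3 * 0.2 for any model
```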
Comparison of document models (239-240): In theory these models represent different documents, different alphabets, and different languages.
Given a query s = "frog said that toad likes that dog", our two model probabilities are calculated by simply multiplying the term probabilities.
It is evident why P(s|M1) scores higher than P(s|M2): most of the query terms have higher probability under M1, so the product, and hence the overall probability, is greater.
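To make the comparison concrete, here is a small sketch assuming two hypothetical unigram models M1 and M2 (the probabilities below are made up, not the values from the book's Figure 12.3): the model that assigns the query terms higher probabilities yields the larger product and therefore the higher rank.

```python
# Compare two hypothetical unigram document models on the example query.
# Probabilities are illustrative; each distribution sums to 1 over a tiny vocabulary.

def query_prob(query, model):
    p = 1.0
    for t in query.split():
        p *= model.get(t, 0.0)
    return p

M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01,
      "likes": 0.02, "dog": 0.005, "cat": 0.885}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001,
      "likes": 0.04, "dog": 0.007, "cat": 0.8827}

s = "frog said that toad likes that dog"
print(query_prob(s, M1), query_prob(s, M2))   # M1 > M2 here, so its document ranks higher
```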
Types of language models (240): the chain rule builds probabilities over sequences of terms; the Unigram Language Model and the Bigram Language Model are two specific models built this way. Section conclusion: which M_d to use?
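A brief sketch of the two model types, using hypothetical probability tables p_uni and p_bi (my own illustration, not the book's): the unigram model multiplies context-free term probabilities, while the bigram model applies the chain rule and conditions each term on the one before it.

```python
# Unigram vs. bigram sequence probability (illustrative helper functions).

def unigram_prob(terms, p_uni):
    # P(t1) * P(t2) * ... with no conditioning on context
    p = 1.0
    for t in terms:
        p *= p_uni.get(t, 0.0)
    return p

def bigram_prob(terms, p_uni, p_bi):
    # chain rule with a one-term history: P(t1) * P(t2|t1) * P(t3|t2) * ...
    p = p_uni.get(terms[0], 0.0)
    for prev, t in zip(terms, terms[1:]):
        p *= p_bi.get((prev, t), 0.0)
    return p

p_uni = {"i": 0.5, "wish": 0.5}                      # hypothetical tables
p_bi = {("i", "wish"): 0.9, ("wish", "i"): 0.6}
print(unigram_prob(["i", "wish", "i"], p_uni))       # 0.5 * 0.5 * 0.5 = 0.125
print(bigram_prob(["i", "wish", "i"], p_uni, p_bi))  # 0.5 * 0.9 * 0.6 = 0.27
```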
Using query likelihood language models in IR (242-243): Using Bayes' rule, P(d|q) = P(q|d)P(d)/P(q). P(q) is the same for every document and P(d) is treated as uniform across documents, so ranking by P(d|q) is equivalent to ranking by P(q|d). In the query likelihood model we construct a language model M_d from each document. Goal: to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query.
Using query likelihood language models in IR (242-243): Multinomial unigram language model: $P(q|M_d) = K_q \prod_{t \in V} P(t|M_d)^{tf_{t,d}}$. The multinomial coefficient $K_q$ is a constant for a particular query (the same for every document), so it is dropped for ranking purposes. Query generation process: 1. Infer a language model M_d for each document. 2. Estimate P(q|M_d), the probability of generating the query according to each of these document models. 3. Rank the documents according to these probabilities.
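The three-step process can be sketched as a small ranking loop (an outline of my own, not the book's code; estimate_query_likelihood is a hypothetical helper that the next slide makes concrete with the maximum-likelihood estimate):

```python
# Sketch of the ranking loop behind the query likelihood model:
# 1. infer a language model M_d per document, 2. estimate P(q|M_d),
# 3. rank documents by that probability.

def rank_documents(query, documents, estimate_query_likelihood):
    scored = [(estimate_query_likelihood(query, doc), doc_id)
              for doc_id, doc in documents.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]
```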
Estimating the query generation probability (244): the query generation probability P(q|M_d) is estimated with maximum likelihood as $\hat{P}(q|M_d) = \prod_{t \in q} \hat{P}_{mle}(t|M_d) = \prod_{t \in q} \frac{tf_{t,d}}{L_d}$, where M_d is the language model of document d, tf_{t,d} is the raw term frequency of term t in document d, and L_d is the number of tokens in document d.
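A minimal sketch of that maximum-likelihood estimate (my own code; documents and queries are assumed to be whitespace-tokenized strings): each query term contributes tf_{t,d} / L_d, multiplied together. It also exposes the zero-probability problem that motivates the smoothing discussed in the speaker notes below.

```python
from collections import Counter

def mle_query_likelihood(query, document):
    """P(q|M_d) estimated as the product over query terms of tf_{t,d} / L_d (no smoothing)."""
    tf = Counter(document.split())      # raw term frequencies tf_{t,d}
    L_d = sum(tf.values())              # number of tokens in the document
    p = 1.0
    for t in query.split():
        p *= tf[t] / L_d                # 0.0 if t never occurs in d
    return p

doc = "the frog said that the toad likes that dog"
print(mle_query_likelihood("frog likes dog", doc))   # (1/9)^3
print(mle_query_likelihood("frog likes cat", doc))   # 0.0 -> the zero-probability problem
```

A function like this is the kind of helper that could be passed as the estimate_query_likelihood argument in the ranking sketch above.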


Editor's Notes

  • #2 --- Rah-guh-vun
  • #3 --- appears to be the most social and outgoing of our nerdy authors
  • #6 Just read it
  • #7 ---Q-d model: modeling the relevance of a document to a query
  • #8 Alright, so: what is a document model? And how does it generate the query? They use the concept of automata to help explain what is meant by a language or document model. For any given document you have an alphabet w.r.t. that document and a language produced by that alphabet. Probability is distributed over terms such that the sum of all probabilities is equal to 1. Straightforward.
  • #9 --- I didn’t quite understand where the 0.8 stop/continue probability came from. ---Left out because, given a fixed STOP probability, leaving it out does not affect the results when comparing models. Now we will compare models.
  • #10 Next we look at probability over sequences of terms.
  • #11 ---By using the chain rule, we can build probabilities over sequences of terms. ---Two specific models that use the chain rule are the unigram and bigram models. Describe images. ---The fundamental question in language modeling is which doc-model to use?
  • #12 ---Now we formally introduce the model representing the initial concepts of LM for IR.
  • #13 The most common way to achieve the goal of the query likelihood model is to use the multinomial unigram language model. The query generation process is random. Next: estimating P(q|M_d).
  • #14 Basically we are counting how often each word occurs and dividing by the total # of words in the document. Notice the ^, which indicates that this probability is an estimate. Therein lies the issue with language models, which leads to the recurring issue of “zero probabilities”, which then leads to the much-used approach of “smoothing”, which we will see a lot of, in detail, in the next two presentations.
  • #15 The initial idea behind smoothing was to allow non-occurring terms to appear in a query generated by the document model. GIVE example: say you have a document about tigers that doesn’t contain the word cat, but a user queries “big striped cats”. One of the important points in this section is that smoothing is essential for the overall good properties of LMs (see the linear-interpolation sketch after these notes).
  • #16 ---But, as Dr. Lease has mentioned… it’s easy to get good results when you are comparing against standard tf-idf. ---NEXT: comparison of language models to other IR approaches
  • #17 But they mention that LM can be thought to indirectly include relevance modeling by viewing documents and information needs as the same type of object and analyzing them with NLP. BIM = binary independence model.
  • #18 -Both use tf. -Both use df and cf to produce probabilities. -Both treat terms independently. ------NEXT: document model
  • #19 Downsides: both downsides stem from there being less text to estimate with. NEXT: all three approaches
  • #20 --- So far we’ve addressed query likelihood and document likelihood; now they focus on comparing these models. Next: model comparison
  • #21 Q -- What will we use to compare models? One example would be the notorious KL-divergence. Comment -- Some prior results show that comparing models outperforms both query and document likelihood models. Comment -- Not bad for ad hoc queries, but bad for topic tracking. NEXT: translation model
  • #22 -- Synonymy: using similar, but not the same, words to say the same thing. ---I believe synonymy is still a pretty big issue.
  • #23 -- More computationally intensive than basic LM approaches. -- All of these extended language models have been shown to improve on basic LM approaches.
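Following up note #15, a minimal sketch of the linear-interpolation smoothing idea from the chapter (assumptions: whitespace tokenization and an arbitrary lambda of 0.5): the document model is mixed with a collection-wide model, so query terms absent from the document, like "cat" in the tiger example, no longer zero out the score.

```python
from collections import Counter

def smoothed_query_likelihood(query, document, collection, lam=0.5):
    """Mix document and collection estimates: lam * P_mle(t|M_d) + (1 - lam) * P_mle(t|M_c)."""
    tf_d, tf_c = Counter(document.split()), Counter(collection.split())
    L_d, L_c = sum(tf_d.values()), sum(tf_c.values())
    p = 1.0
    for t in query.split():
        p_doc = tf_d[t] / L_d           # MLE from the document
        p_col = tf_c[t] / L_c           # MLE from the whole collection
        p *= lam * p_doc + (1 - lam) * p_col
    return p
```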