Introduction to Information Retrieval: Language models for information retrieval, by C.D. Manning, P. Raghavan, and H. Schütze. Presentation by Dustin Smith, The University of Texas at Austin, School of Information, dustin.smith@utexas.edu. 10/3/2011, INF384H / CS395T: Concepts of Information Retrieval.
Christopher Manning – background: BA, Australian National University, 1989 (majors in mathematics, computer science and linguistics); PhD, Stanford Linguistics, 1995; Asst Professor, Carnegie Mellon University Computational Linguistics Program, 1994-96; Lecturer, University of Sydney Dept of Linguistics, 1996-99; Asst Professor, Stanford University Depts of Computer Science and Linguistics, 1999-2006; Current: Assoc Professor, Stanford University Depts of Linguistics and Computer Science.
Prabhakar Raghavan – background: undergraduate degree in electrical engineering from IIT Madras; PhD in computer science from UC Berkeley; Current: working at Yahoo! Labs and Consulting Professor of Computer Science at Stanford University.
Hinrich Schütze – background: Technical University of Braunschweig, Vordiplom Mathematik and Vordiplom Informatik; University of Stuttgart, Diplom Informatik (MSCS); Stanford University, Ph.D., Computational Linguistics; Current: Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing at the University of Stuttgart.
Chapter/Presentation Outline: Introduction to the concept of Language Models; Finite automata and language models; Types of language models; Multinomial distributions over words; Description of the Query Likelihood Model; Using query likelihood language models in IR; Estimating the query generation probability; Ponte and Croft's experiments; Comparison of the language modeling approach to IR against other approaches to IR; Description of various extensions to the language modeling approach.
Language Models: based on the concept that a document is a good match for a query if the document model is likely to generate the query. An alternative to the straightforward query-document probability model (the traditional approach).
Finite automata and language models (238): In Figure 12.1 the alphabet is {"I", "wish"} and the language produced by the model is {"I wish", "I wish I wish", "I wish I wish I wish I wish", etc.}.
The process is analogous for a document model.
Figure 12.2 represents a single node with a single probability distribution over terms such that $\sum_{t \in V} P(t) = 1$.
Calculating phrase probability with stop/continue probability included (238): the resulting probability values are very small.
This calculation is shown with stop probabilities, but in practice these are left out: with a fixed stop probability, omitting it does not affect how the document models compare.
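A minimal Python sketch of that point (the term probabilities and the 0.8 continue probability below are illustrative, not the book's figures): a fixed continue/stop probability multiplies every sequence of a given length by the same constant, so leaving it out does not change how document models compare.

```python
# Minimal sketch of a one-state unigram model with an optional continue/stop
# probability. All numbers are made up for illustration.

def seq_prob(terms, model, p_continue=None):
    """P(terms) under a unigram model; optionally include continue/stop events."""
    p = 1.0
    for i, t in enumerate(terms):
        p *= model.get(t, 0.0)
        if p_continue is not None:
            # continue after every term except the last, then stop
            p *= p_continue if i < len(terms) - 1 else (1.0 - p_continue)
    return p

model = {"i": 0.4, "wish": 0.35, "frog": 0.25}   # hypothetical distribution
q = ["i", "wish", "i", "wish"]
print(seq_prob(q, model))                  # without stop probability
print(seq_prob(q, model, p_continue=0.8))  # scaled by 0.8^3 * 0.2 for any model
```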
Comparison of document models (239-240): In theory these models represent different documents, different alphabets, and different languages.
Given a query s = "frog said that toad likes that dog", our two model probabilities are calculated by simply multiplying the term probabilities.
It is evident why P(s|M1) scores higher than P(s|M2): most of the query terms have higher probability under M1, so the product, and hence the overall probability, is greater.
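To make the comparison concrete, here is a small sketch assuming two hypothetical unigram models M1 and M2 (the probabilities below are made up, not the values from the book's Figure 12.3): the model that assigns the query terms higher probabilities yields the larger product and therefore the higher rank.

```python
# Compare two hypothetical unigram document models on the example query.
# Probabilities are illustrative; each distribution sums to 1 over a tiny vocabulary.

def query_prob(query, model):
    p = 1.0
    for t in query.split():
        p *= model.get(t, 0.0)
    return p

M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01,
      "likes": 0.02, "dog": 0.005, "cat": 0.885}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001,
      "likes": 0.04, "dog": 0.007, "cat": 0.8827}

s = "frog said that toad likes that dog"
print(query_prob(s, M1), query_prob(s, M2))   # M1 > M2 here, so its document ranks higher
```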
Types of language models (240): the chain rule builds probabilities over sequences of terms; the Unigram Language Model and the Bigram Language Model are two specific models built this way. Section conclusion: which M_d to use?
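A brief sketch of the two model types, using hypothetical probability tables p_uni and p_bi (my own illustration, not the book's): the unigram model multiplies context-free term probabilities, while the bigram model applies the chain rule and conditions each term on the one before it.

```python
# Unigram vs. bigram sequence probability (illustrative helper functions).

def unigram_prob(terms, p_uni):
    # P(t1) * P(t2) * ... with no conditioning on context
    p = 1.0
    for t in terms:
        p *= p_uni.get(t, 0.0)
    return p

def bigram_prob(terms, p_uni, p_bi):
    # chain rule with a one-term history: P(t1) * P(t2|t1) * P(t3|t2) * ...
    p = p_uni.get(terms[0], 0.0)
    for prev, t in zip(terms, terms[1:]):
        p *= p_bi.get((prev, t), 0.0)
    return p

p_uni = {"i": 0.5, "wish": 0.5}                      # hypothetical tables
p_bi = {("i", "wish"): 0.9, ("wish", "i"): 0.6}
print(unigram_prob(["i", "wish", "i"], p_uni))       # 0.5 * 0.5 * 0.5 = 0.125
print(bigram_prob(["i", "wish", "i"], p_uni, p_bi))  # 0.5 * 0.9 * 0.6 = 0.27
```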
Using query likelihood language models in IR (242-243): Using Bayes' rule, P(d|q) = P(q|d)P(d)/P(q). P(q) is the same for every document and P(d) is treated as uniform across documents, so ranking by P(d|q) is equivalent to ranking by P(q|d). In the query likelihood model we construct a language model M_d from each document. Goal: to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query.
Using query likelihood language models in IR (242-243): Multinomial unigram language model: $P(q|M_d) = K_q \prod_{t \in V} P(t|M_d)^{tf_{t,d}}$. The multinomial coefficient $K_q$ is a constant for a particular query (the same for every document), so it is dropped for ranking purposes. Query generation process: 1. Infer a language model M_d for each document. 2. Estimate P(q|M_d), the probability of generating the query according to each of these document models. 3. Rank the documents according to these probabilities.
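The three-step process can be sketched as a small ranking loop (an outline of my own, not the book's code; estimate_query_likelihood is a hypothetical helper that the next slide makes concrete with the maximum-likelihood estimate):

```python
# Sketch of the ranking loop behind the query likelihood model:
# 1. infer a language model M_d per document, 2. estimate P(q|M_d),
# 3. rank documents by that probability.

def rank_documents(query, documents, estimate_query_likelihood):
    scored = [(estimate_query_likelihood(query, doc), doc_id)
              for doc_id, doc in documents.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]
```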
Estimating the query generation probability (244): the query generation probability P(q|M_d) is estimated with maximum likelihood as $\hat{P}(q|M_d) = \prod_{t \in q} \hat{P}_{mle}(t|M_d) = \prod_{t \in q} \frac{tf_{t,d}}{L_d}$, where M_d is the language model of document d, tf_{t,d} is the raw term frequency of term t in document d, and L_d is the number of tokens in document d.
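A minimal sketch of that maximum-likelihood estimate (my own code; documents and queries are assumed to be whitespace-tokenized strings): each query term contributes tf_{t,d} / L_d, multiplied together. It also exposes the zero-probability problem that motivates the smoothing discussed in the speaker notes below.

```python
from collections import Counter

def mle_query_likelihood(query, document):
    """P(q|M_d) estimated as the product over query terms of tf_{t,d} / L_d (no smoothing)."""
    tf = Counter(document.split())      # raw term frequencies tf_{t,d}
    L_d = sum(tf.values())              # number of tokens in the document
    p = 1.0
    for t in query.split():
        p *= tf[t] / L_d                # 0.0 if t never occurs in d
    return p

doc = "the frog said that the toad likes that dog"
print(mle_query_likelihood("frog likes dog", doc))   # (1/9)^3
print(mle_query_likelihood("frog likes cat", doc))   # 0.0 -> the zero-probability problem
```

A function like this is the kind of helper that could be passed as the estimate_query_likelihood argument in the ranking sketch above.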


Editor's Notes

  • #2 --- Rah-guh-vun
  • #3 --- appears to be the most social and outgoing of our nerdy authors
  • #6 Just read it
  • #7 ---Q-d model: modeling the relevance of a document to a query
  • #8 Alright, so: what is a document model? And how does it generate the query? They use the concept of automata to help explain what is meant by a language or document model. For any given document you have an alphabet w.r.t. that document and a language produced by that alphabet. Probability is distributed over terms such that the sum of all probabilities is equal to 1. Straightforward.
  • #9 --- I didn’t quite understand where the 0.8 stop/continue probability came from. ---Left out because, given a fixed STOP probability, leaving it out does not affect the results when comparing models. Now we will compare models.
  • #10 Next we look at probability over sequences of terms.
  • #11 ---By using the chain rule, we can build probabilities over sequences of terms. ---Two specific models that use the chain rule are the unigram and bigram models. Describe images. ---The fundamental question in language modeling is which doc-model to use?
  • #12 ---Now we formally introduce the model representing the initial concepts of LM for IR.
  • #13 The most common way to achieve the goal of the query likelihood model is to use the multinomial unigram language model. The query generation process is random. Next: estimating P(q|M_d).
  • #14 Basically we are counting how often each word occurs and dividing by the total # of words in the document. Notice the ^, which indicates that this probability is an estimate. Therein lies the issue with language models, which leads to the recurring issue of “zero probabilities”, which then leads to the much-used approach of “smoothing”, which we will see a lot of, in detail, in the next two presentations.
  • #15 The initial idea behind smoothing was to allow non-occurring terms to appear in a query generated by the document model. GIVE example: say you have a document about tigers that doesn’t contain the word cat, but a user queries “big striped cats”. One of the important points in this section is that smoothing is essential for the overall good properties of LMs (see the linear-interpolation sketch after these notes).
  • #16 ---But, as Dr. Lease has mentioned… it’s easy to get good results when you are comparing against standard tf-idf. ---NEXT: comparison of language models to other IR approaches
  • #17 But they mention that LM can be thought to indirectly include relevance modeling by viewing documents and information needs as the same type of object and analyzing them with NLP. BIM = binary independence model.
  • #18 -Both use tf. -Both use df and cf to produce probabilities. -Both treat terms independently. ------NEXT: document model
  • #19 Downsides: both downsides stem from there being less text to estimate with. NEXT: all three approaches
  • #20 --- So far we’ve addressed query likelihood and document likelihood; now they focus on comparing these models. Next: model comparison
  • #21 Q -- What will we use to compare models? One example would be the notorious KL-divergence. Comment -- Some prior results show that comparing models outperforms both query and document likelihood models. Comment -- Not bad for ad hoc queries, but bad for topic tracking. NEXT: translation model
  • #22 -- Synonymy: using similar, but not the same, words to say the same thing. ---I believe synonymy is still a pretty big issue.
  • #23 -- More computationally intensive than basic LM approaches. -- All of these extended language models have been shown to improve on basic LM approaches.
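Following up note #15, a minimal sketch of the linear-interpolation smoothing idea from the chapter (assumptions: whitespace tokenization and an arbitrary lambda of 0.5): the document model is mixed with a collection-wide model, so query terms absent from the document, like "cat" in the tiger example, no longer zero out the score.

```python
from collections import Counter

def smoothed_query_likelihood(query, document, collection, lam=0.5):
    """Mix document and collection estimates: lam * P_mle(t|M_d) + (1 - lam) * P_mle(t|M_c)."""
    tf_d, tf_c = Counter(document.split()), Counter(collection.split())
    L_d, L_c = sum(tf_d.values()), sum(tf_c.values())
    p = 1.0
    for t in query.split():
        p_doc = tf_d[t] / L_d           # MLE from the document
        p_col = tf_c[t] / L_c           # MLE from the whole collection
        p *= lam * p_doc + (1 - lam) * p_col
    return p
```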