Language Models for Information Retrieval

  • Pronunciation: Rah-guh-vun
  • Appears to be the most social and outgoing of our nerdy authors
  • Just read it
  • Q-d model: modeling the relevance of a document to a query
  • Alright, so: what is a document model? And how does it generate the query? They use the concept of automata to help explain what is meant by a language or document model. For any given document you have an alphabet with respect to that document and a language produced by that alphabet. Probability is distributed over terms such that the sum of all probabilities is equal to 1. Straightforward.
  • I didn’t quite understand where the 0.8 stop/continue probability came from. The stop probabilities are left out because, given a fixed STOP probability, omitting them does not affect the results when comparing models. Now we will compare models.
  • Next we look at probability over sequences of terms.
  • By using the chain rule, we can build probabilities over sequences of terms. Two specific models that use the chain rule are the unigram and bigram models. Describe images. The fundamental question in language modeling is: which document model to use?
  • Now we formally introduce the model that represents the initial concepts of LM for IR
  • The most common way to achieve the goal of the query likelihood model is to use the multinomial unigram language model. The query generation process is random. Next: estimating P(q|M_d)
  • Basically we are counting how often each word occurs and dividing by the total number of words in the document. Notice the ^, which indicates that this probability is an estimate. Therein lies the issue with language models, which leads to the recurring issue of “zero probabilities”, which in turn leads to the much-used approach of “smoothing”, which we will see in detail in the next two presentations.
  • The initial idea behind smoothing was to allow terms that do not occur in the document to appear in a query generated by the document model. Example: say you have a document about tigers that doesn’t contain the word “cat”, but a user queries “big striped cats”. One of the important points in this section is that smoothing is essential for the overall good properties of LMs.
  • But, as Dr. Lease has mentioned, it’s easy to get good results when you are comparing to standard tf-idf. Next: comparison of language models to other IR approaches
  • But they mention that LM can be thought of as indirectly including relevance modeling, by viewing documents and information needs as the same type of object and analyzing them with NLP. BIM = binary independence model
  • Both use tf. Both use df and cf to produce probabilities. Both treat terms independently. Next: document model
  • Downsides: both downsides stem from there being less text to estimate with. Next: all three approaches
  • So far we’ve addressed query likelihood and document likelihood; now they focus on comparing these models. Next: model comparison
  • Q: What will we use to compare models? One example would be the notorious KL divergence. Comment: some prior results show that comparing models outperforms both query and document likelihood models. Comment: not bad for ad hoc queries, but bad for topic tracking. Next: translation model
  • Synonymy: using similar, but not the same, words to say the same thing. I believe synonymy is still a pretty big issue
  • More computationally intensive than basic LM approaches. All of these extended language models have been shown to improve on basic LM approaches

Transcript

  • 1. Introduction to Information Retrieval: Language models for information retrieval, by C.D. Manning, P. Raghavan, and H. Schütze.
    Presentation by Dustin Smith
    The University of Texas at Austin
    School of Information
    dustin.smith@utexas.edu
    10/3/2011
    1
    INF384H / CS395T: Concepts of Information Retrieval
  • 2. Christopher Manning – background
    BA Australian National University 1989 (majors in mathematics, computer science and linguistics)
    PhD Stanford Linguistics 1995
    Asst Professor Carnegie Mellon University Computational Linguistics Program 1994-96
    Lecturer University of Sydney Dept of Linguistics 1996-99
    Asst Professor Stanford University Depts of Computer Science and Linguistics 1999-2006
    Current: Assoc Professor Stanford University Depts of Linguistics and Computer Science
  • 3. Prabhakar Raghavan – background
    Undergraduate degree in electrical engineering from IIT Madras
    PhD in computer science from UC Berkeley
    Current: Works at Yahoo! Labs and is a Consulting Professor of Computer Science at Stanford University
  • 4. Hinrich Schütze – background
    Technical University of Braunschweig
    Vordiplom Mathematik
    Vordiplom Informatik
    University of Stuttgart, Diplom Informatik (MSCS)
    Stanford University, Ph.D., Computational Linguistics
    Current: Chair of Theoretical Computational Linguistics, Institute for Natural Language Processing at the University of Stuttgart
  • 5. Chapter/Presentation Outline
    Introduction to the concept of Language Models
    Finite automata and language models
    Types of language models
    Multinomial distributions over words
    Description of the Query Likelihood Model
    Using query likelihood language models in IR
    Estimating the query generation probability
    Ponte and Croft’s experiments
    Comparison of the language modeling approach to IR against other approaches to IR
    Description of various extensions to the language modeling approach
  • 6. Language Models
    Based on the concept that a document is a good match for a query if the document model is likely to generate the query.
    An alternative to the traditional approach of directly modeling the probability that a document is relevant to a query.
  • 7. Finite automata and language models (238)
    • In Figure 12.1 the alphabet is {“I”, “wish”} and the language produced by the model is {“I wish”, “I wish I wish”, “I wish I wish I wish I wish”, etc.}
    • The process is analogous for a document model
    • Figure 12.2 represents a single node with a single distribution over terms such that $\sum_{t \in V} P(t) = 1$.
    Language Models
  • 8. Calculating phrase probability with stop/continue probability included (238)
    • The resulting probabilities are very small.
    • This calculation is shown with stop probabilities, but in practice these are left out.
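A minimal sketch of the calculation this slide describes, assuming a one-state model that emits a term, then either continues or stops. The term probabilities and the 0.2 stop probability are illustrative assumptions, not values taken from the slide's figure, and `phrase_probability` is a hypothetical helper name.

```python
import math


def phrase_probability(terms, term_probs, p_stop):
    """Probability of a term sequence, including continue/stop decisions."""
    prob = 1.0
    for i, term in enumerate(terms):
        prob *= term_probs[term]  # emit the term
        is_last = i == len(terms) - 1
        # After the last term the model stops; after every other term it continues.
        prob *= p_stop if is_last else (1.0 - p_stop)
    return prob


# Illustrative (assumed) emission probabilities.
toy_model = {"frog": 0.01, "said": 0.03, "that": 0.04}
phrase = ["frog", "said", "that"]

emission_only = math.prod(toy_model[t] for t in phrase)
with_stop = phrase_probability(phrase, toy_model, p_stop=0.2)

print(emission_only)  # product of the term probabilities alone
print(with_stop)      # the same product scaled by 0.8 * 0.8 * 0.2
```

For a fixed stop probability and a fixed phrase length, the extra factor is the same under every model, which is why it can be dropped when models are only being compared.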
  • 9. Comparison of document models (239-240)
    • In theory these models represent different documents, different alphabets, and different languages.
    • Given a query s = “frog said that toad likes that dog”, each model's probability is calculated by multiplying the probabilities that the model assigns to the individual terms.
    • $P(s \mid M_1)$ scores higher than $P(s \mid M_2)$ because the query terms are, on the whole, more probable under $M_1$, so their product is greater.
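The comparison can be sketched as below, assuming two unigram models given as plain dictionaries. The probability values are assumptions chosen for illustration and may not match the figure shown on the slide; log probabilities are summed rather than multiplying raw probabilities, a common way to avoid numerical underflow with products this small.

```python
import math


def log_likelihood(terms, term_probs):
    """Sum of log term probabilities (stop probabilities omitted)."""
    return sum(math.log(term_probs[t]) for t in terms)


# Illustrative (assumed) term distributions for two document models.
model_1 = {"frog": 0.01, "said": 0.03, "that": 0.04,
           "toad": 0.01, "likes": 0.02, "dog": 0.005}
model_2 = {"frog": 0.0002, "said": 0.03, "that": 0.04,
           "toad": 0.0001, "likes": 0.04, "dog": 0.01}

query = "frog said that toad likes that dog".split()

ll_1 = log_likelihood(query, model_1)
ll_2 = log_likelihood(query, model_2)

# The model with the larger log-likelihood is the better match for the query.
best = "M1" if ll_1 > ll_2 else "M2"
print(best, ll_1, ll_2)
```

With these assumed values, `model_2` is actually better on “likes” and “dog”, but `model_1` is so much better on “frog” and “toad” that it wins overall.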
  • 10. Types of language models (240)
    • Unigram Language Model
    • Bigram Language Model
    • Section Conclusion
    • Which $M_d$ to use?
    Chain rule
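The slide's equations were images and did not survive in the transcript. As a hedged reconstruction, the standard chain rule and the unigram and bigram simplifications for a four-term sequence can be written as follows; the subscripts "uni" and "bi" are notation assumed here.

```latex
% Chain rule: each term is conditioned on everything before it
P(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2)\, P(t_4 \mid t_1 t_2 t_3)

% Unigram model: drop all conditioning context
P_{\mathrm{uni}}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2)\, P(t_3)\, P(t_4)

% Bigram model: condition only on the previous term
P_{\mathrm{bi}}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2)\, P(t_4 \mid t_3)
```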
  • 11. Using query likelihood language models in IR (242-243)
    Using Bayes rule:
    P(d|q) = P(q|d)P(d)/P(q)
    P(q) is the same for every document, and P(d) is treated as uniform across documents, so neither affects the ranking:
    => P(d|q) ∝ P(q|d)
    In the query likelihood model we construct a language model $M_d$ from each document
    Goal: to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query
    The Query Likelihood Model
  • 12. Using query likelihood language models in IR (242-243)
    Multinomial unigram language model
    $P(q \mid M_d) = K_q \prod_{t \in V} P(t \mid M_d)^{tf_{t,d}}$
    $K_q$ is the multinomial coefficient for the query q; it is dropped because it is constant for a given query and so does not change the ranking of documents
    Query generation process:
    1. Infer a LM for each document
    2. Estimate $P(q \mid M_d)$, the probability of generating the query according to each one of these document models
    3. Rank the documents according to these probabilities
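The three-step process above can be sketched as follows. This is a minimal illustration that assumes whitespace-tokenized documents and unsmoothed relative frequencies as the document model; the function names and the two toy documents are hypothetical.

```python
from collections import Counter


def build_model(document_tokens):
    """Step 1: infer a unigram model from one document (relative frequencies)."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {term: count / total for term, count in counts.items()}


def query_likelihood(query_tokens, model):
    """Step 2: P(q | M_d) as a product of per-term probabilities."""
    prob = 1.0
    for term in query_tokens:
        # A term the document never contains gets probability 0 here,
        # because this sketch applies no smoothing.
        prob *= model.get(term, 0.0)
    return prob


def rank(query_tokens, documents):
    """Step 3: order document ids by descending query likelihood."""
    scores = {doc_id: query_likelihood(query_tokens, build_model(tokens))
              for doc_id, tokens in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "d1": "the tiger is a big striped animal".split(),
    "d2": "the tiger hunts at night and the tiger sleeps by day".split(),
}
ranking = rank(["tiger"], docs)
print(ranking)
```

Because unseen terms score zero, a single missing query term zeroes out a document's whole score in this version.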
  • 13. Estimating the query generation probability (244)
    • Query generation probability = $P(q \mid M_d)$
    • $M_d$ is the language model of document d
    • $tf_{t,d}$ is the raw term frequency of term t in document d
    • $L_d$ is the number of tokens in document d
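The estimate itself appeared as an image on the slide. Using the symbols defined above, the standard maximum likelihood estimate under the unigram assumption can be written as below; the subscript "mle" is notation assumed here.

```latex
\hat{P}(q \mid M_d)
  = \prod_{t \in q} \hat{P}_{\mathrm{mle}}(t \mid M_d)
  = \prod_{t \in q} \frac{tf_{t,d}}{L_d}
```

Any query term with $tf_{t,d} = 0$ makes the whole product zero, which is the "zero probabilities" problem that smoothing addresses.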
  • 14. Smoothing Methods (245-246)
    • Linear Interpolation
    • Bayesian Smoothing
    • Note: MLE = maximum likelihood estimate
    Conceptually the same:
    The probability estimate for a word present in the document combines a discounted MLE with a fraction of the estimate of its prevalence in the whole collection.
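A minimal sketch of the two smoothing methods named on this slide, assuming the usual forms: linear interpolation mixes the document MLE with the collection MLE using a weight `lam`, and Bayesian smoothing adds `alpha` pseudo-counts distributed according to the collection model. The function names, parameter defaults, and toy counts are assumptions for illustration.

```python
def mle(term, counts, length):
    """Maximum likelihood estimate: relative frequency of the term."""
    return counts.get(term, 0) / length


def linear_interpolation(term, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """lam * P_mle(t | M_d) + (1 - lam) * P_mle(t | M_c)."""
    return (lam * mle(term, doc_counts, doc_len)
            + (1.0 - lam) * mle(term, coll_counts, coll_len))


def bayesian_smoothing(term, doc_counts, doc_len, coll_counts, coll_len, alpha=2.0):
    """(tf_{t,d} + alpha * P(t | M_c)) / (L_d + alpha)."""
    prior = mle(term, coll_counts, coll_len)
    return (doc_counts.get(term, 0) + alpha * prior) / (doc_len + alpha)


# Toy counts (assumed): a 4-token document inside a 10-token collection.
doc_counts, doc_len = {"tiger": 3, "striped": 1}, 4
coll_counts, coll_len = {"tiger": 5, "striped": 2, "cat": 3}, 10

# "cat" never occurs in the document but still gets a non-zero probability.
cat_interp = linear_interpolation("cat", doc_counts, doc_len, coll_counts, coll_len)
cat_bayes = bayesian_smoothing("cat", doc_counts, doc_len, coll_counts, coll_len)
print(cat_interp, cat_bayes)
```

In both cases a term seen in the document has its MLE discounted, and the freed probability mass goes to terms according to how common they are in the collection.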
  • 15. Ponte and Croft’s Experiments (246)
    • 1998 experiments
    • First experiments on the language modeling approach to IR
    • Performed on TREC topics 202-250 over TREC disks 2 and 3.
    LM much better than tf-idf (especially at higher recall levels)
  • 16. LM vs. BIM vs. XML retrieval (249)
    Language models and the most successful XML retrieval models approach relevance modeling in a roundabout way, as opposed to the BIM (binary independence model), which evaluates relevance directly.
    LM initially appears not to include relevance modeling
    The most successful XML retrieval models assume that queries and documents are objects of the same type
    The BIM has relevance as the central variable that is evaluated
    Language Modeling Versus Other Approaches in IR
  • 17. LM vs. traditional tf-idf (249)
    The LM has significant relations to tf-idf models
    They differ on a more conceptual level
    Both directly use term frequency
    Both have a method of mixing document frequency and collection frequency to produce probabilities
    Both treat terms independently
    LM intuitions are probabilistic rather than geometric
    LM mathematical models are principled rather than heuristic
    LM differs in its use of tf and df
  • 18. Document Likelihood Model (250)
    Downsides:
    • Takes much more smoothing
    • Results in worse estimates
    Features:
    • Uses a query language model ($M_q$) to generate the document
    • Easier to incorporate relevance feedback by expanding the query with terms from relevant documents
    Extended Language Modeling Approaches
  • 19. Three language model approaches (250)
    • Query likelihood: using a document model to produce a relevant query
    • Document likelihood: using a query model to produce a relevant document
    • Model comparison: comparing the query and document models
  • 20. Kullback-Leibler (KL) divergence (251)
    • KL divergence is an asymmetric divergence measure originating in information theory, which measures how bad the probability distribution $M_q$ is at modeling $M_d$ (pg. 251)
    • Reported to outperform both the query likelihood and document likelihood approaches
    • But scores are not comparable across queries
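A minimal sketch of ranking by model comparison. It assumes the score is the sum over terms of P(t|M_q) log(P(t|M_q) / P(t|M_d)), with a lower divergence meaning a better match, and that the document models are already smoothed so every query term has non-zero probability. The distributions and names are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_t p(t) * log(p(t) / q(t)), skipping terms with p(t) = 0.

    Requires q(t) > 0 wherever p(t) > 0.
    """
    return sum(p_t * math.log(p_t / q[t]) for t, p_t in p.items() if p_t > 0)


# Illustrative (assumed) query model and two smoothed document models.
query_model = {"tiger": 0.5, "cat": 0.5}
doc_a = {"tiger": 0.6, "cat": 0.3, "the": 0.1}
doc_b = {"tiger": 0.1, "cat": 0.1, "the": 0.8}

div_a = kl_divergence(query_model, doc_a)
div_b = kl_divergence(query_model, doc_b)

# Rank documents by ascending divergence from the query model.
ranking = sorted({"doc_a": div_a, "doc_b": div_b}.items(), key=lambda kv: kv[1])
print(ranking)
```

The divergence is measured against each query's own model, so the raw values depend on the query, which is one way to see why scores are not comparable across queries.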
  • 21. Translation Model – Features (251)
    Answer to synonymy in basic LM models
    Lets you generate query words that are not in a document by translating to alternate terms with similar meaning
    Provides a basis for executing cross-language IR
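A minimal sketch of the translation idea, assuming the usual form in which each query term t is generated by first drawing a document term v from the document model and then translating it: P(q|M_d) is the product over t in q of the sum over v of P(v|M_d) T(t|v). The translation table `translation` and all probability values are hypothetical.

```python
def translation_query_likelihood(query_tokens, doc_model, translation):
    """Product over query terms of sum_v P(v | M_d) * T(t | v)."""
    prob = 1.0
    for t in query_tokens:
        prob *= sum(p_v * translation.get(v, {}).get(t, 0.0)
                    for v, p_v in doc_model.items())
    return prob


# Illustrative (assumed) document model: the document never contains "cat".
doc_model = {"tiger": 0.5, "striped": 0.5}

# Hypothetical translation table T(t | v): rows are document terms v.
translation = {
    "tiger": {"tiger": 0.8, "cat": 0.2},
    "striped": {"striped": 1.0},
}

p_cat = translation_query_likelihood(["cat"], doc_model, translation)
p_tiger_cat = translation_query_likelihood(["tiger", "cat"], doc_model, translation)
print(p_cat, p_tiger_cat)
```

The inner sum runs over the document's vocabulary for every query term, which is where the extra computational cost comes from.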
  • 22. Translation Model – Issues (251)
    Computationally intensive
    Need to build the model using outside resources:
    • Thesaurus
    • Bilingual dictionary
    • Statistical machine translation system’s translation dictionary
  • 23. Thanks for not throwing vegetables!
    Questions?