
Information Retrieval Models Part I

Tutorial on foundations of Information Retrieval Models by Thomas Roelleke and Ingo Frommholz presented at the Information Retrieval and Foraging Autumn School at Schloß Dagstuhl, Germany, September 2014.


  1. IR Models Part I | Foundations. Thomas Roelleke, Queen Mary University of London; Ingo Frommholz, University of Bedfordshire. Autumn School for Information Retrieval and Foraging, Schloss Dagstuhl, September 2014.
  2. Acknowledgements. The knowledge presented in this tutorial and the Morgan & Claypool book is the result of many, many discussions with colleagues. People involved in the production and reviewing: Gianna Amati and Djoerd Hiemstra (the experts), Diane Cerra and Gary Marchionini (Morgan & Claypool), Ricardo Baeza-Yates, Norbert Fuhr, and Mounia Lalmas. Thomas' PhD students (who had no choice): Jun Wang, Hengzhi Wu, Fred Forst, Hany Azzam, Sirvan Yahyaei, Marco Bonzanini, Miguel Martinez-Alvarez. Many more IR experts, including Fabio Crestani, Keith van Rijsbergen, Stephen Robertson, Fabrizio Sebastiani, Arjen de Vries, Tassos Tombros, Hugo Zaragoza, ChengXiang Zhai. And non-IR experts Fabrizio Smeraldi, Andreas Kaltenbrunner and Norman Fenton.
  3. Table of Contents: 1. Introduction; 2. Foundations of IR Models.
  4. Introduction. Overview: Warming Up; Background: Time-Line of IR Models; Notation.
  5. Introduction | Warming Up.
  6. Information Retrieval Conceptual Model [Fuhr, 1992]. [Diagram after Fuhr, 1992: queries Q and documents D, their representations and descriptions, relevance judgements, and the relevance relation R.]
  7. Vector Space Model, Term Space. The Vector Space Model (VSM) is still one of the most prominent IR frameworks. A term space is a vector space where each dimension represents one term in our vocabulary. If we have n terms in our collection, we get an n-dimensional term or vector space. Each document and each query is represented by a vector in the term space.
  8. Formal Description. Set of terms in our vocabulary: T = {t_1, ..., t_n}. T spans an n-dimensional vector space. Document d_j is represented by a vector of document term weights; query q is represented by a vector of query term weights.
  9.-10. Document Vector. Document d_j is represented by a vector of document term weights d_ji ∈ R, where d_ji is the weight of term t_i in document d_j: d_j = (d_j1, d_j2, ..., d_jn)^T. Document term weights can be computed, e.g., using tf and idf (see below).
  11. Query Vector. Like documents, a query q is represented by a vector of query term weights q_i ∈ R: q = (q_1, q_2, ..., q_n)^T. q_i denotes the query term weight of term t_i. q_i is 0 if the term does not appear in the query; q_i may be set to 1 if it does. Further query term weights are possible, for example 2 if the term is important and 1 if the term is just "nice to have".
  12. Retrieval Function. The retrieval function computes a retrieval status value (RSV) using a vector similarity measure, e.g. the scalar product: RSV(d_j, q) = d_j · q = Σ_{i=1}^{n} d_ji · q_i. [Figure: two-dimensional term space with axes t_1 and t_2, showing a query vector q and document vectors d_1 and d_2.] Documents are ranked by decreasing RSV.
  13. Example Query. Query: side effects of drugs on memory and cognitive abilities.
     term                query q   d1    d2    d3    d4
     side effect         2         1     0.5   1     1
     drug                2         1     1     1     1
     memory              1         1     0     1     0
     cognitive ability   1         0     1     1     0.5
     RSV                           5     4     6     4.5
     This produces the ranking d3 > d1 > d4 > d2 (reproduced in the sketch below).
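To make the scalar-product retrieval function concrete, here is a minimal Python sketch (not part of the slides; the term names and weights are taken from the example table above) that reproduces the ranking:

    # Minimal vector-space retrieval sketch: documents and the query are
    # sparse vectors over the term space; the RSV is their scalar product.
    query = {"side effect": 2, "drug": 2, "memory": 1, "cognitive ability": 1}
    docs = {
        "d1": {"side effect": 1,   "drug": 1, "memory": 1},
        "d2": {"side effect": 0.5, "drug": 1, "cognitive ability": 1},
        "d3": {"side effect": 1,   "drug": 1, "memory": 1, "cognitive ability": 1},
        "d4": {"side effect": 1,   "drug": 1, "cognitive ability": 0.5},
    }

    def rsv(doc, q):
        """Retrieval status value: scalar product of document and query vector."""
        return sum(w * doc.get(t, 0) for t, w in q.items())

    # Rank documents by decreasing RSV: d3 (6), d1 (5), d4 (4.5), d2 (4).
    for name in sorted(docs, key=lambda d: rsv(docs[d], query), reverse=True):
        print(name, rsv(docs[name], query))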
  14. Term Weights: Example Text. "In his address to the CBI, Mr Cameron is expected to say: 'Scotland does twice as much trade with the rest of the UK than with the rest of the world put together, trade that helps to support one million Scottish jobs.' Meanwhile, Mr Salmond has set out six job-creating powers for Scotland that he said were guaranteed with a Yes vote in the referendum. During their televised BBC debate on Monday, Mr Salmond had challenged Better Together head Alistair Darling to name three job-creating powers that were being offered to the Scottish Parliament by the pro-UK parties in the event of a No vote." Source: http://www.bbc.co.uk/news/uk-scotland-scotland-politics-28952197. What are good descriptors for the text? Which are more, which are less important? Which are informative? Which are good discriminators? How can a machine answer these questions?
  15. Frequencies. The answer is counting, under different assumptions. (1) The more frequently a term appears in a document, the more suitable it is to describe its content: a location-based count. Think of term positions or locations; in how many locations of a text do we observe the term? The term 'scotland' appears in 2 out of 138 locations in the example text. (2) The fewer documents a term occurs in, the more discriminative or informative it is: a document-based count. In how many documents do we observe the term? Think of stop-words like 'the', 'a', etc. Location- and document-based frequencies are the building blocks of all (probabilistic) models to come (see the sketch below).
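A minimal Python sketch of the two counts (not part of the slides; the toy documents are made up for illustration, and naive whitespace tokenisation is assumed):

    from collections import Counter

    # Toy collection; real systems would tokenise, lowercase and stop properly.
    docs = ["scotland trade scotland jobs",
            "the referendum vote",
            "scottish parliament vote"]

    # Location-based count n_L(t, d): in how many locations (positions)
    # of document d does term t occur?
    n_L = [Counter(d.split()) for d in docs]
    print(n_L[0]["scotland"])   # 2 of the 4 locations of the first document

    # Document-based count n_D(t, c): in how many documents does t occur?
    n_D = Counter(t for d in docs for t in set(d.split()))
    print(n_D["vote"])          # occurs in 2 of the 3 documents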
  16. Introduction | Background: Time-Line of IR Models.
  17. Timeline of IR Models: 50s, 60s and 70s. Zipf and Luhn: distribution of document frequencies; [Croft and Harper, 1979]: BIR without relevance; [Robertson and Sparck-Jones, 1976]: BIR; [Salton, 1971, Salton et al., 1975]: VSM, TF-IDF; [Rocchio, 1971]: relevance feedback; [Maron and Kuhns, 1960]: On Relevance, Probabilistic Indexing, and IR.
  18. Timeline of IR Models: 80s. [Cooper, 1988, Cooper, 1991, Cooper, 1994]: Beyond Boole, Probability Theory in IR: An Encumbrance; [Dumais et al., 1988, Deerwester et al., 1990]: latent semantic indexing; [van Rijsbergen, 1986, van Rijsbergen, 1989]: P(d → q); [Bookstein, 1980, Salton et al., 1983]: fuzzy and extended Boolean models.
  19. Timeline of IR Models: 90s. [Ponte and Croft, 1998]: LM; [Brin and Page, 1998, Kleinberg, 1999]: PageRank and HITS; [Robertson et al., 1994, Singhal et al., 1996]: pivoted document length normalisation; [Wong and Yao, 1995]: P(d → q); [Robertson and Walker, 1994, Robertson et al., 1995]: 2-Poisson, BM25; [Margulis, 1992, Church and Gale, 1995]: Poisson; [Fuhr, 1992]: probabilistic models in IR; [Turtle and Croft, 1990, Turtle and Croft, 1991]: PINs; [Fuhr, 1989]: models for probabilistic indexing.
  20. Timeline of IR Models: 00s. ICTIR 2009 and ICTIR 2011; [Roelleke and Wang, 2008]: TF-IDF Uncovered; [Luk, 2008, Robertson, 2005]: event spaces; [Roelleke and Wang, 2006]: parallel derivation of models; [Fang and Zhai, 2005]: axiomatic approach; [He and Ounis, 2005]: TF in BM25 and DFR; [Metzler and Croft, 2004]: LM and PINs; [Robertson, 2004]: understanding IDF; [Sparck-Jones et al., 2003]: LM and relevance; [Croft and Lafferty, 2003, Lafferty and Zhai, 2003]: LM book; [Zaragoza et al., 2003]: Bayesian extension to LM; [Bruza and Song, 2003]: probabilistic dependencies in LM; [Amati and van Rijsbergen, 2002]: DFR; [Lavrenko and Croft, 2001]: relevance-based LM; [Hiemstra, 2000]: TF-IDF and LM; [Sparck-Jones et al., 2000]: probabilistic model: status.
  21. Timeline of IR Models: 2010 and Beyond. Models for interactive and dynamic IR (e.g. the iPRP [Fuhr, 2008]); quantum models [van Rijsbergen, 2004, Piwowarski et al., 2010].
  22. Introduction | Notation.
  23. Notation. A tedious start ... but a must-have: sets, locations, documents, terms, probabilities.
  24. Notation: Sets.
     Notation        Description of events, sets, and frequencies
     t, d, q, c, r   term t, document d, query q, collection c, relevant r
     D_c, D_r        D_c = {d_1, ...}: set of Documents in collection c; D_r: relevant documents
     T_c, T_r        T_c = {t_1, ...}: set of Terms in collection c; T_r: terms that occur in relevant documents
     L_c, L_r        L_c = {l_1, ...}: set of Locations in collection c; L_r: locations in relevant documents
  25. Notation: Locations.
     Notation     Description                                              Traditional notation
     n_L(t, d)    number of Locations at which term t occurs in document d   tf, tf_d
     N_L(d)       number of Locations in document d (document length)        dl
     n_L(t, q)    number of Locations at which term t occurs in query q      qtf, tf_q
     N_L(q)       number of Locations in query q (query length)              ql
  26. Notation: Locations (continued).
     Notation     Description                                                  Traditional notation
     n_L(t, c)    number of Locations at which term t occurs in collection c    TF, cf(t)
     N_L(c)       number of Locations in collection c
     n_L(t, r)    number of Locations at which term t occurs in the set L_r
     N_L(r)       number of Locations in the set L_r
  27. Notation: Documents.
     Notation     Description                                                                      Traditional notation
     n_D(t, c)    number of Documents in which term t occurs in the set D_c of collection c         n_t, df(t)
     N_D(c)       number of Documents in the set D_c of collection c                                N
     n_D(t, r)    number of Documents in which term t occurs in the set D_r of relevant documents   r_t
     N_D(r)       number of Documents in the set D_r of relevant documents                          R
  28. Notation: Terms.
     Notation     Description
     n_T(d, c)    number of Terms in document d in collection c
     N_T(c)       number of Terms in collection c
  29. Notation: Average and Pivoted Length. Let u denote a collection associated with a set of documents, for example u = c, u = r, or u = r̄.
     Notation      Description                                                                                    Traditional notation
     avgdl(u)      average document length: avgdl(u) = N_L(u) / N_D(u) (avgdl if the collection is implicit)        avgdl
     pivdl(d, u)   pivoted document length: pivdl(d, u) = N_L(d) / avgdl(u) = dl / avgdl(u) (pivdl(d) if implicit)  pivdl
     λ(t, u)       average term frequency over all documents in D_u: n_L(t, u) / N_D(u)
     avgtf(t, u)   average term frequency over elite documents in D_u: n_L(t, u) / n_D(t, u)
  30. Notation: Location-based Probabilities.
     Notation                           Description                                         Traditional notation
     P_L(t|d) := n_L(t, d) / N_L(d)     Location-based within-document term probability     P(t|d) = tf_d / |d|, |d| = dl = N_L(d)
     P_L(t|q) := n_L(t, q) / N_L(q)     Location-based within-query term probability        P(t|q) = tf_q / |q|, |q| = ql = N_L(q)
     P_L(t|c) := n_L(t, c) / N_L(c)     Location-based within-collection term probability   P(t|c) = tf_c / |c|, |c| = N_L(c)
     P_L(t|r) := n_L(t, r) / N_L(r)     Location-based within-relevance term probability
     Event space of P_L: Locations (LM, TF).
  31. Notation: Document-based Probabilities.
     Notation                               Description                                         Traditional notation
     P_D(t|c) := n_D(t, c) / N_D(c)         Document-based within-collection term probability   P(t) = n_t / N, N = N_D(c)
     P_D(t|r) := n_D(t, r) / N_D(r)         Document-based within-relevance term probability    P(t|r) = r_t / R, R = N_D(r)
     P_T(d|c) := n_T(d, c) / N_T(c)         Term-based document probability
     P_avg(t|c) := avgtf(t, c) / avgdl(c)   probability that t occurs in a document of average length
     Event space of P_D: Documents (BIR, IDF).
  32. Toy Example.
     Collection:  N_L(c) = 20, N_D(c) = 10, avgdl(c) = 20/10 = 2
     Documents:     doc1   doc2   doc3
     N_L(d)         2      3      3
     pivdl(d, c)    2/2    3/2    3/2
     Terms:         sailing  boats
     n_L(t, c)      8        6
     n_D(t, c)      6        5
     P_L(t|c)       8/20     6/20
     P_D(t|c)       6/10     5/10
     λ(t, c)        8/10     6/10
     avgtf(t, c)    8/6      6/5
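A small Python sketch (not part of the slides; the counts are taken from the toy example above) that recomputes these quantities from the two basic counts:

    # Collection-wide counts from the toy example.
    N_L_c, N_D_c = 20, 10                 # locations / documents in collection c
    avgdl = N_L_c / N_D_c                 # average document length: 2.0

    doc_len = {"doc1": 2, "doc2": 3, "doc3": 3}
    pivdl = {d: n / avgdl for d, n in doc_len.items()}   # 1.0, 1.5, 1.5

    # Per-term counts n_L(t, c) and n_D(t, c).
    n_L = {"sailing": 8, "boats": 6}
    n_D = {"sailing": 6, "boats": 5}

    for t in n_L:
        P_L_t_c = n_L[t] / N_L_c          # location-based probability P_L(t|c)
        P_D_t_c = n_D[t] / N_D_c          # document-based probability P_D(t|c)
        lam     = n_L[t] / N_D_c          # average frequency lambda(t, c)
        avgtf   = n_L[t] / n_D[t]         # average frequency in elite documents
        print(t, P_L_t_c, P_D_t_c, lam, avgtf)
    # sailing: 0.4 0.6 0.8 1.33...   boats: 0.3 0.5 0.6 1.2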
  33. Foundations of IR Models. Overview: TF-IDF; PRF: The Probability of Relevance Framework; BIR: Binary Independence Retrieval; Poisson and 2-Poisson; BM25; LM: Language Modelling; PINs: Probabilistic Inference Networks; Relevance-based Models; Foundations: Summary.
  34. Foundations of IR Models | TF-IDF.
  35. TF-IDF. Still a very popular model. Best known outside IR research, and very intuitive. TF-IDF is not a model; it is just a weighting scheme in the vector space model. TF-IDF is purely heuristic; it has no probabilistic roots. But: TF-IDF and LM are dual models that can be shown to be derived from the same root. A simplified version of BM25.
  36. TF Variants: TF(t, d).
     TF_total(t, d) := lf_total(t, d) := n_L(t, d)   (= tf_d)
     TF_sum(t, d) := lf_sum(t, d) := n_L(t, d) / N_L(d) = P_L(t|d) = tf_d / dl
     TF_max(t, d) := lf_max(t, d) := n_L(t, d) / n_L(t_max, d)
     TF_log(t, d) := lf_log(t, d) := log(1 + n_L(t, d))   (= log(1 + tf_d))
     TF_frac,K(t, d) := lf_frac,K(t, d) := n_L(t, d) / (n_L(t, d) + K_d) = tf_d / (tf_d + K_d)
     TF_BM25,k1,b(t, d) := n_L(t, d) / (n_L(t, d) + k1 · (b · pivdl(d, c) + (1 − b)))
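The variants translate directly into code; a minimal Python sketch (not part of the slides; the defaults k1 = 1.2 and b = 0.75 are common choices, not values from the slide):

    import math

    def tf_total(tf_d):               # raw count
        return tf_d

    def tf_sum(tf_d, dl):             # count normalised by document length
        return tf_d / dl

    def tf_max(tf_d, tf_max_d):       # count normalised by the most frequent term
        return tf_d / tf_max_d

    def tf_log(tf_d):                 # logarithmic dampening
        return math.log(1 + tf_d)

    def tf_frac(tf_d, K_d):           # fractional (saturating) form
        return tf_d / (tf_d + K_d)

    def tf_bm25(tf_d, pivdl, k1=1.2, b=0.75):
        # BM25 TF: the fractional form with K_d = k1 * (b * pivdl + (1 - b)).
        return tf_frac(tf_d, k1 * (b * pivdl + (1 - b)))

    # A term occurring 3 times in a document of average length (pivdl = 1):
    print(tf_bm25(3, 1.0))            # 3 / (3 + 1.2) = 0.714...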
