
Introduction to Information Retrieval



Broad introduction to information retrieval and web search, used for teaching at the Yahoo Bangalore Summer School 2013. Slides are a mash-up from my own and other people's presentations.



  1. 1. Introduction to Information Retrieval June, 2013 Roi Blanco
  2. 2. Acknowledgements • Many of these slides were taken from other presentations – P. Raghavan, C. Manning, H. Schutze IR lectures – Mounia Lalmas’s personal stash – Other random slide decks • Textbooks – Ricardo Baeza-Yates, Berthier Ribeiro Neto – Raghavan, Manning, Schutze – … among other good books • Many online tutorials, many online tools available (full toolkits) 2
  3. 3. Big Plan • What is Information Retrieval? – Search engine history – Examples of IR systems (you might not have known!) • Is IR hard? – Users and human cognition – What is it like to be a search engine? • Web Search – Architecture – Differences between Web search and IR – Crawling 3
  4. 4. • Representation – Document view – Document processing – Indexing • Modeling – Vector space – Probabilistic – Language Models – Extensions • Others – Distributed – Efficiency – Caching – Temporal issues – Relevance feedback – … 4
  6. 6. Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze Introduction to Information Retrieval 6 6
  7. 7. Information Retrieval (II) • What do we understand by documents? How do we decide what is a document and what is not? • What is an information need? What types of information needs can we satisfy automatically? • What is a large collection? Which environments are suitable for IR? 7 7
  8. 8. Basic assumptions of Information Retrieval • Collection: A set of documents – Assume it is a static collection • Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task 8
  9. 9. Key issues • How to describe information resources or information-bearing objects in ways that they can be effectively used by those who need to use them ? – Organizing/Indexing/Storing • How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs – Retrieving / Accessing / Filtering 9
  10. 10. Unstructured data? Structured query: SELECT * FROM HOTELS WHERE city = "Bangalore" AND $$$ < 2 vs. keyword query: "Cheap hotels in Bangalore"
      HOTELS: CITY       $$$  name
              Bangalore  1.5  Cheapo one
              Barcelona  1    EvenCheapoer
      10
  11. 11. Unstructured (text) vs. structured (database) data in the mid-nineties 11
  12. 12. Unstructured (text) vs. structured (database) data today
  14. 14. Search Engine Index Square Pants! 14
  16. 16. Timeline 1990 1991 1993 1994 1998 ... 16
  17. 17. ... 1995 1996 1997 1998 1999 2000 17
  18. 18. 2009 2005 ... 2008 18
  19. 19. 2001 2003 2002 2003 2003 2003 2003 2010 2010 2003 19
  21. 21. Your ads here! 21
  35. 35. Usability We also fail at using the technology, sometimes
  37. 37. Applications • Text Search • Ad search • Image/Video search • Email Search • Question Answering systems • Recommender systems • Desktop Search • Expert Finding • .... Jobs Prizes Products News Source code Videogames Maps Partners Mashups ... 37
  38. 38. Types of search engines • Q&A engines • Collaborative • Enterprise • Web • Metasearch • Semantic • NLP • ... 38
  40. 40. IR issues • Find out what the user needs … and do it quickly • Challenges: user intention, accessibility, volatility, redundancy, lack of structure, low quality, different data sources, volume, scale • The main bottleneck is human cognition, not computation 41
  41. 41. IR is mostly about relevance • Relevance is the core concept in IR, but nobody has a good definition • Relevance = useful • Relevance = topically related • Relevance = new • Relevance = interesting • Relevance = ??? • However we still want relevant information 42
  42. 42. • Information needs must be expressed as a query – But users don’t often know what they want • Problems – Verbalizing information needs – Understanding query syntax – Understanding search engines 43
  43. 43. Understanding(?) the user I am a hungry tourist in Barcelona, and I want to find a place to eat; however I don’t want to spend a lot of money I want information on places with cheap food in Barcelona Info about bars in Barcelona Bar celona Misconception Mistranslation Misformulation 44
  44. 44. Why is this hard? • Documents/images/video/speech/etc. are complex. We need some representation • Semantics – What do words mean? • Natural language – How do we say things? • ☹ Computers cannot deal with these easily 45
  45. 45. … and even harder • Context • Opinion Funny? Talented? Honest? 46
  46. 46. Semantics Bank Note River Bank Bank 47 Blood bank
  47. 47. What is it like to be a search engine? • How can we figure out what you’re trying to do? • The signal can sometimes be weak! [ jaguar ] [ iraq ] [ latest release Thinkpad drivers touchpad ] [ ebay ] [ first ] [ google ] [ brittttteny spirs ] 48
  48. 48. Search is a multi-step process • Session search – Verbalize your query – Look for a document – Find your information there – Refine • Teleporting – Go directly to the site you like – Formulating the query is too hard, you trust the final site more, etc. 49
  49. 49. • Someone told me that in the mid-1800’s, people often would carry around a special kind of notebook. They would use the notebook to write down quotations that they heard, or copy passages from books they’d read. The notebook was an important part of their education, and it had a particular name. – What was the name of the notebook? 50 Examples from Dan Russel
  50. 50. Naming the un-nameable • What’s this thing called? 51
  51. 51. More tasks … • Going beyond a search engine – Using images / multimedia content – Using maps – Using other sources • Think of how to express things differently (synonyms) – A friend told me that there is an abandoned city in the waters of San Francisco Bay. Is that true? If it IS true, what was the name of the supposed city? • Exploring a topic further in depth • Refining a question – Suppose you want to buy a unicycle for your Mom or Dad. How would you find it? • Looking for lists of information – Can you find a list of all the groups that inhabited California at the time of the missions? 52
  52. 52. IR tasks • Known-item finding – You want to retrieve some data that you know exists – What year was Peter Mika born? • Exploratory seeking – You want to find some information through an iterative process – Not a single answer to your query • Exhaustive search – You want to find all the information possible about a particular issue – Issuing several queries to cover the user information need • Re-finding – You want to find an item you have found already 53
  53. 53. Scale • >300TB of print data produced per year – +Video, speech, domain-specific information (>600PB per year) • IR has to be fast + scalable • Information is dynamic – News, web pages, maps, … – Queries are dynamic (you might even change your information needs while searching) • Cope with data and searcher change – This introduces tensions in every component of a search engine 54
  54. 54. Methodology • Experimentation in IR • Three fundamental types of IR research: – Systems (efficiency) – Methods (effectiveness) – Applications (user utility) • Empirical evaluation plays a critical role across all three types of research 55
  55. 55. Methodology (II) • Information retrieval (IR) is a highly applied scientific discipline • Experimentation is a critical component of the scientific method • Poor experimental methodologies are not scientifically sound and should be avoided 56
  57. 57. [diagram: Task → Info need → Verbal form → Query → Search engine (over Corpus) → Results, with a query refinement loop] 58
  58. 58. User Interface Query interpretation Document Collection Crawling Text Processing Indexing General Voodoo Matching Ranking Metadata Index Document Interpretation 59
  59. 59. Crawler NLP pipeline Indexer Documents Tokens Index Query System 60
  60. 60. Broker DNS Cluster Cluster cache server partition replication 61
  61. 61. <a href= • Web pages are linked – AKA Web Graph • We can walk through the graph to crawl • We can rank using the graph 62
  62. 62. Web pages are connected 63
  63. 63. Web Search • Basic search technology shared with IR systems – Representation – Indexing – Ranking • Scale (in terms of data and users) changes the game – Efficiency/architectural design decisions • Link structure – For data acquisition (crawling) – For ranking (PageRank, HITS) – For spam detection – For extending document representations (anchor text) • Adversarial IR • Monetization 64
  64. 64. User Needs • Need – Informational – want to learn about something (~40% / 65%) – Navigational – want to go to that page (~25% / 15%) – Transactional – want to do something (web-mediated) (~35% / 20%) • Access a service • Downloads • Shop – Gray areas • Find a good hub • Exploratory search “see what’s there” Low hemoglobin United Airlines Seattle weather Mars surface images Canon S410 Car rental Brasil 65
  65. 65. How far do people look for results? (Source: WhitePaper_2006_SearchEngineUserBehavior.pdf) 66
  66. 66. Users’ empirical evaluation of results • Quality of pages varies widely – Relevance is not enough – Other desirable qualities (non IR!!) • Content: Trustworthy, diverse, non-duplicated, well maintained • Web readability: display correctly & fast • No annoyances: pop-ups, etc. • Precision vs. recall – On the web, recall seldom matters • What matters – Precision at 1? Precision above the fold? – Comprehensiveness – must be able to deal with obscure queries • Recall matters when the number of matches is very small • User perceptions may be unscientific, but are significant over a large aggregate 67
  67. 67. Users’ empirical evaluation of engines • Relevance and validity of results • UI – Simple, no clutter, error tolerant • Trust – Results are objective • Coverage of topics for ambiguous queries • Pre/Post process tools provided – Mitigate user errors (auto spell check, search assist,…) – Explicit: Search within results, more like this, refine ... – Anticipative: related searches • Deal with idiosyncrasies – Web specific vocabulary • Impact on stemming, spell-check, etc. – Web addresses typed in the search box • “The first, the last, the best and the worst …” 68
  68. 68. The Web document collection • No design/co-ordination • Distributed content creation, linking, democratization of publishing • Content includes truth, lies, obsolete information, contradictions … • Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases)… • Scale much larger than previous text collections … but corporate records are catching up • Growth – slowed down from initial “volume doubling every few months” but still expanding • Content can be dynamically generated The Web 69
  69. 69. Basic crawler operation • Begin with known “seed” URLs • Fetch and parse them –Extract URLs they point to –Place the extracted URLs on a queue • Fetch each URL on the queue and repeat 70
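The seed/fetch/parse/enqueue loop above can be sketched in a few lines. `fetch_and_extract` and the `toy_web` dict below are hypothetical stand-ins for real HTTP fetching and link extraction:

```python
from collections import deque

def crawl(seeds, fetch_and_extract, max_pages=100):
    """Breadth-first crawl sketch. `fetch_and_extract(url)` stands in
    for 'fetch the page and parse out the URLs it points to'."""
    frontier = deque(seeds)   # the URL frontier (queue)
    seen = set(seeds)         # URLs already discovered
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)   # fetching + parsing happens here
        for link in fetch_and_extract(url):
            if link not in seen:      # place unseen URLs on the queue
                seen.add(link)
                frontier.append(link)
    return crawled

# toy "web": a dict standing in for real HTTP fetches
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
print(crawl(["a"], lambda u: toy_web.get(u, [])))  # ['a', 'b', 'c', 'd']
```

A real crawler distributes this loop over many machines and orders the frontier by priority rather than strict FIFO, as the following slides discuss.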
  70. 70. Crawling picture Web URLs frontier Unseen Web URLs crawled and parsed Seed pages 71
  71. 71. Simple picture – complications • Web crawling isn’t feasible with one machine – All of the above steps distributed • Malicious pages – Spam pages – Spider traps – including dynamically generated • Even non-malicious pages pose challenges – Latency/bandwidth to remote servers vary – Webmasters’ stipulations • How “deep” should you crawl a site’s URL hierarchy? – Site mirrors and duplicate pages • Politeness – don’t hit a server too often 72
  72. 72. What any crawler must do • Be Polite: Respect implicit and explicit politeness considerations – Only crawl allowed pages – Respect robots.txt • Be Robust: Be immune to spider traps and other malicious behavior from web servers –Be efficient 73
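Respecting robots.txt can be done with Python's standard `urllib.robotparser`. Here the robots.txt body is supplied inline so the sketch stays self-contained; normally it is downloaded from the site's /robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Politeness sketch: consult robots.txt before fetching a URL.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
# record that the rules have been read: can_fetch() assumes
# nothing is allowed until the file has been fetched
rp.modified()

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/p1"))  # False
```

The "don't hit a server too often" half of politeness is separate: the crawler must also rate-limit its requests per host.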
  73. 73. What any crawler should do • Be capable of distributed operation: designed to run on multiple distributed machines • Be scalable: designed to increase the crawl rate by adding more machines • Performance/efficiency: permit full use of available processing and network resources 74
  74. 74. What any crawler should do • Fetch pages of “higher quality” first • Continuous operation: Continue fetching fresh copies of a previously fetched page • Extensible: Adapt to new data formats, protocols 75
  75. 75. Updated crawling picture URLs crawled and parsed Unseen Web Seed Pages URL frontier Crawling thread 76
  76. 76. 77
  77. 77. Document views sailing greece mediterranean fish sunset Author = “B. Smith” Crdate = “14.12.96” Ladate = “11.07.02” Sailing in Greece B. Smith content view head title author chapter section section structure view data view layout view 78
  78. 78. What is a document: document views • Content view is concerned with representing the content of the document; that is, what is the document about. • Data view is concerned with factual data associated with the document (e.g. author names, publishing date) • Layout view is concerned with how documents are displayed to the users; this view is related to user interface and visualization issues. • Structure view is concerned with the logical structure of the document, (e.g. a book being composed of chapters, themselves composed of sections, etc.) 79
  79. 79. Indexing language • An indexing language: – Is the language used to describe the content of documents (and queries) – And it usually consists of index terms that are derived from the text (automatic indexing), or arrived at independently (manual indexing), using a controlled or uncontrolled vocabulary – Basic operation: is this query term present in this document? 80
  80. 80. Generating document representations • The building of the indexing language, that is generating the document representation, is done in several steps: – Character encoding – Language recognition – Page segmentation (boilerplate detection) – Tokenization (identification of words) – Term normalization – Stopword removal – Stemming – Others (doc. Expansion, etc.) 81
  81. 81. Generating document representations: overview documents tokens stop-words stems terms (index terms) tokenization remove noisy words reduce to stems + others: e.g. - thesaurus - more complex processing 82
  82. 82. Parsing a document • What format is it in? – pdf/word/excel/html? • What language is it in? • What character set is in use? – (ISO-8859, UTF-8, …) But these tasks are often done heuristically … 83
  83. 83. Complications: Format/language • Documents being indexed can include docs from many different languages – A single index may contain terms from many languages. • Sometimes a document or its components can contain multiple languages/formats – French email with a German pdf attachment. – French email quote clauses from an English-language contract • There are commercial and open source libraries that can handle a lot of this stuff 84
  84. 84. Complications: What is a document? We return from our query “documents” but there are often interesting questions of grain size: What is a unit document? – A file? – An email? (Perhaps one of many in a single mbox file) • What about an email with 5 attachments? – A group of files (e.g., PPT or LaTeX split over HTML pages) 85
  85. 85. Tokenization • Input: “Friends, Romans and Countrymen” • Output: Tokens – Friends – Romans – Countrymen • A token is an instance of a sequence of characters • Each such token is now a candidate for an index entry, after further processing • But what are valid tokens to emit? 86
  86. 86. Tokenization • Issues in tokenization: – Finland’s capital → Finland AND s? Finlands? Finland’s? – Hewlett-Packard → Hewlett and Packard as two tokens? • state-of-the-art: break up hyphenated sequence. • co-education • lowercase, lower-case, lower case ? • It can be effective to get the user to put in possible hyphens – San Francisco: one token or two? • How do you decide it is one token? 87
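One concrete policy among the many possible answers to these questions, as a sketch: lowercase the text and keep maximal runs of letters and digits, so apostrophes and hyphens both act as token boundaries:

```python
import re

def tokenize(text):
    # lowercase, then keep maximal runs of letters/digits;
    # punctuation, apostrophes and hyphens split tokens
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
print(tokenize("Finland's capital"))  # ['finland', 's', 'capital']
print(tokenize("Hewlett-Packard"))    # ['hewlett', 'packard']
```

Whatever policy is chosen, queries must be tokenized the same way as documents, or matching fails.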
  87. 87. Numbers • 3/20/91 Mar. 12, 1991 20/3/91 • 55 B.C. • B-52 • My PGP key is 324a3df234cb23e • (800) 234-2333 • Often have embedded spaces • Older IR systems may not index numbers But often very useful: think about things like looking up error codes/stacktraces on the web • Will often index “meta-data” separately Creation date, format, etc. 88
  88. 88. Tokenization: language issues • French – L'ensemble → one token or two? • L ? L’ ? Le ? • Want l’ensemble to match with un ensemble – Until at least 2003, it didn’t on Google » Internationalization! • German noun compounds are not segmented – Lebensversicherungsgesellschaftsangestellter – ‘life insurance company employee’ – German retrieval systems benefit greatly from a compound splitter module – Can give a 15% performance boost for German 89
  89. 89. Tokenization: language issues • Chinese and Japanese have no spaces between words: – 莎拉波娃现在居住在美国东南部的佛罗里达。 – Not always guaranteed a unique tokenization • Further complicated in Japanese, with multiple alphabets intermingled – Dates/amounts in multiple formats フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) Katakana Hiragana Kanji Romaji End-user can express query entirely in hiragana! 90
  90. 90. Tokenization: language issues • Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right • Words are separated, but letter forms within a word form complex ligatures ← → ← → ← start ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’ • With Unicode, the surface presentation is complex, but the stored form is straightforward 91
  91. 91. Stop words • With a stop list, you exclude from the dictionary entirely the commonest words. Intuition: – They have little semantic content: the, a, and, to, be – There are a lot of them: ~30% of postings for top 30 words • But the trend is away from doing this: – Good compression techniques means the space for including stop words in a system can be small – Good query optimization techniques mean you pay little at query time for including stop words. – You need them for: • Phrase queries: “King of Denmark” • Various song titles, etc.: “Let it be”, “To be or not to be” • “Relational” queries: “flights to London” 92
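A minimal stop-word filter illustrates both the space saving and the caveat above: an all-stop-word query like "to be or not to be" disappears entirely. The stop list here is a tiny made-up sample:

```python
# tiny illustrative stop list; classic systems used ~30-300 words
STOP_WORDS = {"the", "a", "an", "and", "to", "be", "or", "not", "of"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["king", "of", "denmark"]))  # ['king', 'denmark']
# the slide's caveat: a phrase query made of stop words vanishes
print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))  # []
```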
  92. 92. Normalization to terms • Want: matches to occur despite superficial differences in the character sequences of the tokens • We may need to “normalize” words in indexed text as well as query words into the same form – We want to match U.S.A. and USA • Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary • We most commonly implicitly define equivalence classes of terms by, e.g., – deleting periods to form a term • U.S.A., USA → USA – deleting hyphens to form a term • anti-discriminatory, antidiscriminatory → antidiscriminatory 93
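The equivalence-classing rules above (delete periods, delete hyphens; case folding from a later slide is included too) can be sketched as:

```python
def normalize(token):
    # equivalence classing as in the examples: delete periods
    # and hyphens, and case-fold, to form the index term
    return token.replace(".", "").replace("-", "").lower()

print(normalize("U.S.A."))               # usa
print(normalize("USA"))                  # usa
print(normalize("anti-discriminatory"))  # antidiscriminatory
```

Both documents and queries pass through the same function, so U.S.A. and USA land in the same dictionary entry.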
  93. 93. Normalization: other languages • Accents: e.g., French résumé vs. resume. • Umlauts: e.g., German: Tuebingen vs. Tübingen – Should be equivalent • Most important criterion: – How are your users likely to write their queries for these words? • Even in languages that standardly have accents, users often may not type them – Often best to normalize to a de-accented term • Tuebingen, Tübingen, Tubingen → Tubingen 94
  94. 94. Case folding • Reduce all letters to lower case – exception: upper case in mid-sentence? • e.g., General Motors • Fed vs. fed • SAIL vs. sail – Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization… • Longstanding Google example: [fixed in 2011…] – Query C.A.T. – #1 result is for “cats” (well, Lolcats) not Caterpillar Inc. 95
  95. 95. Normalization to terms • An alternative to equivalence classing is to do asymmetric expansion • An example of where this may be useful – Enter: window Search: window, windows – Enter: windows Search: Windows, windows, window – Enter: Windows Search: Windows • Potentially more powerful, but less efficient 96
  96. 96. Thesauri and soundex • Do we handle synonyms and homonyms? – E.g., by hand-constructed equivalence classes • car = automobile color = colour – We can rewrite to form equivalence-class terms • When the document contains automobile, index it under car-automobile (and vice-versa) – Or we can expand a query • When the query contains automobile, look under car as well • What about spelling mistakes? – One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics 97
  97. 97. Lemmatization • Reduce inflectional/variant forms to base form • E.g., – am, are, is → be – car, cars, car's, cars' → car • the boy's cars are different colors → the boy car be different color • Lemmatization implies doing “proper” reduction to dictionary headword form 98
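A toy sketch of the idea. A real lemmatizer does "proper" reduction to dictionary headwords with a morphological analyzer; the lookup table here is purely illustrative:

```python
# stand-in lemma table covering only the slide's examples
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car",
          "boy's": "boy", "colors": "color"}

def lemmatize(token):
    # unknown tokens pass through unchanged
    return LEMMAS.get(token, token)

print([lemmatize(t) for t in "the boy's cars are different colors".split()])
# ['the', 'boy', 'car', 'be', 'different', 'color']
```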
  98. 98. Stemming • Reduce terms to their “roots” before indexing • “Stemming” suggests crude affix chopping – language dependent – e.g., automate(s), automatic, automation all reduced to automat. for example compressed and compression are both accepted as equivalent to compress. for exampl compress and compress ar both accept as equival to compress 99
  99. 99. – Affix removal • remove the longest affix: {sailing, sailor} => sail • simple and effective stemming • a widely used such stemmer is Porter’s algorithm – Dictionary-based using a look-up table • look for stem of a word in table: play + ing => play • space is required to store the (large) table, so often not practical 100
  100. 100. Stemming: some issues • Detect equivalent stems: – {organize, organise}: e as the longest affix leads to {organiz, organis}, which should lead to one stem: organis – Heuristics are therefore used to deal with such cases. • Over-stemming: – {organisation, organ} reduced into org, which is incorrect – Again heuristics are used to deal with such cases. 101
  101. 101. Porter’s algorithm • Commonest algorithm for stemming English – Results suggest it’s at least as good as other stemming options • Conventions + 5 phases of reductions – phases applied sequentially – each phase consists of a set of commands – sample convention: Of the rules in a compound command, select the one that applies to the longest suffix. 102
  102. 102. Typical rules in Porter • sses → ss • ies → i • ational → ate • tional → tion 103
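These four sample rules can be applied with the stated convention (select the rule that matches the longest suffix). This sketch covers only these rules, not the full five-phase algorithm:

```python
# the four sample rewrite rules from the slide
RULES = [("sses", "ss"), ("ational", "ate"),
         ("tional", "tion"), ("ies", "i")]

def apply_rules(word):
    # Porter's convention: of the matching rules, apply the one
    # with the longest suffix
    best = None
    for suffix, repl in RULES:
        if word.endswith(suffix):
            if best is None or len(suffix) > len(best[0]):
                best = (suffix, repl)
    if best:
        word = word[: -len(best[0])] + best[1]
    return word

print(apply_rules("caresses"))     # caress
print(apply_rules("ponies"))       # poni
print(apply_rules("relational"))   # relate   ('ational' beats 'tional')
print(apply_rules("conditional"))  # condition
```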
  103. 103. Language-specificity • The above methods embody transformations that are – Language-specific, and often – Application-specific • These are “plug-in” addenda to the indexing process • Both open source and commercial plug-ins are available for handling these 104
  104. 104. Does stemming help? • English: very mixed results. Helps recall for some queries but harms precision on others – E.g., operative (dentistry) ⇒ oper • Definitely useful for Spanish, German, Finnish, … – 30% performance gains for Finnish! 105
  105. 105. Others: Using a thesaurus • A thesaurus provides a standard vocabulary for indexing (and searching) • More precisely, a thesaurus provides a classified hierarchy for broadening and narrowing terms bank: 1. Finance institute 2. River edge – if a document is indexed with bank, then index it with “finance institute” or “river edge” – need to disambiguate the sense of bank in the text: e.g. if money appears in the document, then choose “finance institute” • A widely used online thesaurus: WordNet 106
  106. 106. Information storage • Whole topic on its own • How do we keep fresh copies of the web manageable by a cluster of computers, and answer millions of queries in milliseconds? – Inverted indexes – Compression – Caching – Distributed architectures – … and a lot of tricks • Inverted indexes: cornerstone data structure of IR systems – For each term t, we must store a list of all documents that contain t. – Identify each doc by a docID, a document serial number – Index construction is tricky (can’t hold all the information needed in memory) 107
  107. 107. Term-document incidence matrix (docs × terms):
       docs  t1  t2  t3
       D1     1   0   1
       D2     1   0   0
       D3     0   1   1
       D4     1   0   0
       D5     1   1   1
       D6     1   1   0
       D7     0   1   0
       D8     0   1   0
       D9     0   1   1
       D10    0   1   1
       The same data transposed (terms × docs, first four docs):
       Terms  D1  D2  D3  D4
       t1      1   1   0   1
       t2      0   0   1   0
       t3      1   0   1   0
       108
  108. 108. • Most basic form: – Document frequency – Term frequency – Document identifiers
       term  term id  df  postings (docID, tf)
       a     1        4   (1,2), (2,5), (10,1), (11,1)
       as    2        3   (1,3), (3,4), (20,1)
       109
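An in-memory sketch of building such postings lists, with (docID, tf) pairs as in the example; df is just the length of each list. Real index construction has to work out-of-core, since the whole index does not fit in memory:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docID: text}. Returns {term: sorted list of
    (docID, tf) pairs}; df(term) == len(postings)."""
    counts = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():   # trivial tokenization
            counts[token][doc_id] = counts[token].get(doc_id, 0) + 1
    # postings sorted by docID, ready for merge-based query processing
    return {t: sorted(d.items()) for t, d in counts.items()}

index = build_index({1: "a as a", 2: "as a", 3: "b"})
print(index["a"])   # [(1, 2), (2, 1)]  -> df = 2
print(index["as"])  # [(1, 1), (2, 1)]
```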
  109. 109. • Indexes contain more information – Position in the document • Useful for “phrase queries” or “proximity queries” – Fields in which the term appears in the document – Metadata … – All that can be used for ranking 110 (1,2, [1, 1], [2,10]), … Field 1 (title), position 1
  110. 110. Queries • How do we process a query? • Several kinds of queries – Boolean •Chicken AND salt • Gnome OR KDE • Salt AND NOT pepper – Phrase queries – Ranked 111
  111. 111. List Merging •“Exact match” queries – Chicken AND curry – Locate Chicken in the dictionary – Fetch its postings – Locate curry in the dictionary –Fetch its postings –Merge both postings 112
  112. 112. Intersecting two postings lists 113
  113. 113. List Merging Walk through the postings in O(x+y) time
       salt:   3 → 22 → 23 → 25
       pepper: 3 → 5 → 22 → 25 → 36
       result: 3 → 22 → 25
       114
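The O(x+y) walk is the classic two-pointer merge over sorted postings lists:

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists by walking both
    with two pointers: O(x + y) comparisons for lists of length x, y."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:            # docID in both lists: keep it
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:           # advance the pointer behind
            i += 1
        else:
            j += 1
    return answer

# the salt/pepper example from the slide
print(intersect([3, 22, 23, 25], [3, 5, 22, 25, 36]))  # [3, 22, 25]
```

This is why postings are kept sorted by docID: sorted lists make AND queries a linear merge instead of repeated lookups.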
  115. 115. Models of information retrieval • A model: – abstracts away from the real world – uses a branch of mathematics – possibly: uses a metaphor for searching 116
  116. 116. Short history of IR modelling • Boolean model (±1950) • Document similarity (±1957) • Vector space model (±1970) • Probabilistic retrieval (±1976) • Language models (±1998) • Linkage-based models (±1998) • Positional models (±2004) • Fielded models (±2005) 117
  117. 117. The Boolean model (±1950) • Exact matching: data retrieval (instead of information retrieval) – A term specifies a set of documents – Boolean logic to combine terms / document sets – AND, OR and NOT: intersection, union, and difference 118
  118. 118. Statistical similarity between documents (±1957) • The principle of similarity “The more two representations agree in given elements and their distribution, the higher would be the probability of their representing similar information” (Luhn 1957) “It is here proposed that the frequency of word [term] occurrence in an article [document] furnishes a useful measurement of word [term] significance” 119
  119. 119. Zipf’s law (figure: frequency of terms f plotted against terms in rank order r) 120
  120. 120. Zipf’s law • Relative frequencies of terms. • In natural language, there are a few very frequent terms and very many very rare terms. • Zipf’s law: The ith most frequent term has frequency proportional to 1/i. • cf_i ∝ 1/i, i.e. cf_i = K/i where K is a normalizing constant • cf_i is the collection frequency: the number of occurrences of the term t_i in the collection. • Zipf’s law holds for different languages 121
  121. 121. Zipf consequences • If the most frequent term (the) occurs cf_1 times – then the second most frequent term (of) occurs cf_1/2 times – the third most frequent term (and) occurs cf_1/3 times … • Equivalent: cf_i = K/i where K is a normalizing factor, so – log cf_i = log K − log i – Linear relationship between log cf_i and log i • Another power-law relationship 122
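The linear log-log relationship is easy to check numerically; the constant K below is a hypothetical frequency for the most frequent term, chosen only for illustration:

```python
import math

K = 60_000  # hypothetical frequency of the most frequent term
# Zipf: the i-th most frequent term occurs roughly K / i times
cf = {i: K / i for i in range(1, 6)}

# log cf_i = log K - log i, so consecutive points in log-log space
# are joined by segments of slope -1
slopes = [
    (math.log(cf[i + 1]) - math.log(cf[i])) / (math.log(i + 1) - math.log(i))
    for i in range(1, 5)
]
print(slopes)  # every slope is -1.0 (up to floating point)
```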
  122. 122. Zipf’s law in action 123
  123. 123. Luhn’s analysis - Observation (figure: frequency of terms f and resolving power plotted against terms in rank order r; an upper and a lower cut-off separate common terms and rare terms from the significant terms in between) • Resolving power of significant terms: ability of terms to discriminate document content • Peaks at the rank-order position halfway between the two cut-offs 124
  124. 124. Luhn’s analysis - Implications • Common terms are not good at representing document content – partly implemented through the removal of stop words • Rare words are also not good at representing document content – usually nothing is done – Not true for every “document” • Need a means to quantify the resolving power of a term: – associate weights to index terms – tf×idf approach 125
  125. 125. Ranked retrieval • Boolean queries are good for expert users with precise understanding of their needs and the collection. – Also good for applications: Applications can easily consume 1000s of results. • Not good for the majority of users. – Most users incapable of writing Boolean queries (or they are, but they think it’s too much work). – Most users don’t want to wade through 1000s of results. • This is particularly true of web search.
  126. 126. Feast or Famine • Boolean queries often result in either too few (=0) or too many (1000s) results. • Query 1: “standard user dlink 650” → 200,000 hits • Query 2: “standard user dlink 650 no card found”: 0 hits • It takes a lot of skill to come up with a query that produces a manageable number of hits. – AND gives too few; OR gives too many
  127. 127. Ranked retrieval models • Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query • Free text queries: Rather than a query language of operators and expressions, the user’s query is just one or more words in a human language • In principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa 128
  128. 128. Feast or famine: not a problem in ranked retrieval • When a system produces a ranked result set, large result sets are not an issue – Indeed, the size of the result set is not an issue – We just show the top k ( ≈ 10) results – We do not overwhelm the user – Premise: the ranking algorithm works
  129. 129. Scoring as the basis of ranked retrieval • We wish to return in order the documents most likely to be useful to the searcher • How can we rank-order the documents in the collection with respect to a query? • Assign a score – say in [0, 1] – to each document • This score measures how well document and query “match”.
  130. 130. Query-document matching scores • We need a way of assigning a score to a query/document pair • Let’s start with a one-term query • If the query term does not occur in the document: score should be 0 • The more frequent the query term in the document, the higher the score (should be) • We will look at a number of alternatives for this.
  131. 131. Bag of words model • Vector representation does not consider the ordering of words in a document • John is quicker than Mary and Mary is quicker than John have the same vectors • This is called the bag of words model.
  132. 132. Term frequency tf • The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d. • We want to use tf when computing query-document match scores. But how? • Raw term frequency is not what we want: – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term. – But not 10 times more relevant. • Relevance does not increase proportionally with term frequency.
  133. 133. Log-frequency weighting • The log frequency weight of term t in d is – w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. • Score for a document-query pair: sum over terms t in both q and d: – score(q,d) = Σ_{t ∈ q∩d} (1 + log10 tf_{t,d}) • The score is 0 if none of the query terms is present in the document.
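The log-frequency weight and the resulting query-document score can be sketched directly; the term counts below are made up for illustration:

```python
import math

def log_tf_weight(tf):
    # w = 1 + log10(tf) if tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    # sum over terms occurring in both the query and the document
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

doc_tf = {"chicken": 10, "curry": 2}
print(score(["chicken", "curry"], doc_tf))  # ≈ 3.301
```

Note how ten occurrences contribute a weight of 2, not ten times the weight of a single occurrence: relevance grows sub-linearly with term frequency.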
  134. 134. Document frequency • Rare terms are more informative than frequent terms – Recall stop words • Consider a term in the query that is rare in the collection (e.g., arachnocentric) • A document containing this term is very likely to be relevant to the query arachnocentric • → We want a high weight for rare terms like arachnocentric.
  135. 135. Document frequency, continued • Frequent terms are less informative than rare terms • Consider a query term that is frequent in the collection (e.g., high, increase, line) • A document containing such a term is more likely to be relevant than a document that does not • But it’s not a sure indicator of relevance. • → For frequent terms, we want high positive weights for words like high, increase, and line • But lower weights than for rare terms. • We will use document frequency (df) to capture this.
  136. 136. idf weight • df_t is the document frequency of t: the number of documents that contain t – df_t is an inverse measure of the informativeness of t – df_t ≤ N • We define the idf (inverse document frequency) of t by – idf_t = log10(N/df_t) – We use log10(N/df_t) instead of N/df_t to “dampen” the effect of idf.
  137. 137. Effect of idf on ranking • Does idf have an effect on ranking for one-term queries, like – iPhone • idf has no effect on ranking one term queries – idf affects the ranking of documents for queries with at least two terms – For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person. 138
  138. 138. tf-idf weighting • The tf-idf weight of a term is the product of its tf weight and its idf weight: – w_{t,d} = log(1 + tf_{t,d}) × log10(N/df_t) • Best known weighting scheme in information retrieval – Note: the “-” in tf-idf is a hyphen, not a minus sign! – Alternative names: tf.idf, tf x idf • Increases with the number of occurrences within a document • Increases with the rarity of the term in the collection
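A minimal sketch of this weighting, using the log(1 + tf) variant from the slide; the collection size and document frequencies are invented for the example:

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = log10(1 + tf_{t,d}) * log10(N / df_t); 0 if t is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df)

# Rare term vs frequent term in a hypothetical 1,000,000-document collection
print(tf_idf(tf=3, df=100, N=1_000_000))      # rare term: high weight
print(tf_idf(tf=3, df=500_000, N=1_000_000))  # frequent term: low weight
```

The same tf contributes very differently depending on how rare the term is in the collection, which is exactly the resolving-power idea from Luhn's analysis.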
  139. 139. Score for a document given a query • Score(q,d) = Σ_{t ∈ q∩d} tf.idf_{t,d} • There are many variants – How “tf” is computed (with/without logs) – Whether the terms in the query are also weighted – … 140
  140. 140. Documents as vectors • So we have a |V|-dimensional vector space • Terms are axes of the space • Documents are points or vectors in this space • Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine • These are very sparse vectors - most entries are zero.
  141. 141. Statistical similarity between documents (±1957) • Vector product: score(q,d) = Σ_{k ∈ matching terms} q_k · d_k – If the vectors have binary components, the product measures the number of shared terms – Vector components might be “weights”
  142. 142. Why distance is a bad idea The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
  143. 143. Vector space model (±1970) • Documents and queries are vectors in a high-dimensional space • Geometric measures (distances, angles)
  144. 144. Vector space model (±1970) • Cosine of an angle: – close to 1 if angle is small – 0 if vectors are orthogonal • cos(d,q) = Σ_{k=1}^m d_k·q_k / ( √(Σ_{k=1}^m d_k²) · √(Σ_{k=1}^m q_k²) ) = n(d) · n(q), where n(v)_k = v_k / √(Σ_i v_i²) is the length-normalized vector
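The cosine measure above can be sketched over sparse vectors represented as dictionaries; the term weights are illustrative:

```python
import math

def cosine(d, q):
    """cos(d, q) = (d . q) / (|d| |q|) over sparse term-weight vectors."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

d = {"gossip": 3.0, "jealous": 4.0}
q = {"gossip": 3.0, "jealous": 4.0}
print(cosine(d, q))  # 1.0: same direction, regardless of vector length
```

Because both vectors are length-normalized, a long document is not favored over a short one with the same term distribution, which is why cosine is preferred over Euclidean distance.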
  145. 145. Vector space model (±1970) • PRO: Nice metaphor, easily explained; Mathematically sound: geometry; Great for relevance feedback • CON: Need term weighting (tf-idf); Hard to model structured queries
  146. 146. Probabilistic IR • An IR system has an uncertain understanding of user’s queries and makes uncertain guesses on whether a document satisfies a query or not. • Probability theory provides a principled foundation for reasoning under uncertainty. • Probabilistic models build upon this foundation to estimate how likely it is that a document is relevant for a query. 147
  147. 147. Event Space • Query representation • Document representation • Relevance • Event space • Conceptually there might be pairs with the same q and d, but different r • Sometimes also include user u, context c, etc. 148
  148. 148. Probability Ranking Principle • Robertson (1977) – “If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.” • Basis for probabilistic approaches for IR 149
  149. 149. Dissecting PRP • Probability of relevance • Estimated accurately • Based on whatever data are available • Best possible accuracy – The perfect IR system! – Assumes relevance is independent of other documents in the collection 150
  150. 150. Relevance? • What is relevance? – Isn’t it decided by the user, by her opinion? • User doesn’t mean a human being! – We are working with representations – ... or parts of the reality available to us • 2/3 keywords, no profile, no context ... – relevance is uncertain • depends on what the system sees • may be marginalized over all the unseen context/profiles 151
  151. 151. Retrieval as binary classification • For every (q,d), r takes two values – Relevant and non-relevant documents – can be extended to multiple values • Retrieve using Bayes’ decision – PRP is related to the Bayes error rate (lowest possible error rate for a class) – How do we estimate this probability? 152
  152. 152. PRP ranking • How to represent the random variables? • How to estimate the model’s parameters? 153
  153. 153. • d is a binary vector • Multiple Bernoulli variables • Under MB, we can decompose into a product of probabilities, with likelihoods: 154
  154. 154. If the terms are not in the query: Otherwise we need estimates for them! 155
  155. 155. Estimates • Assign new weights for query terms based on relevant/non-relevant documents • Give higher weights to important terms. Contingency table: – Documents with t: r relevant, n−r non-relevant (n in total) – Documents without t: R−r relevant, N−n−R+r non-relevant (N−n in total) – Totals: R relevant, N−R non-relevant (N documents) 156
  156. 156. Robertson–Spärck Jones weight • w_t = log [ (r + 0.5) / (R − r + 0.5) ] ÷ [ (n − r + 0.5) / (N − n − R + r + 0.5) ] – r: relevant docs with t; R − r: relevant docs without t – n − r: non-relevant docs with t; N − n − R + r: non-relevant docs without t 157
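A sketch of the Robertson–Spärck Jones weight with the usual +0.5 smoothing; the counts passed in at the bottom are invented for illustration:

```python
import math

def rsj_weight(r, n, R, N):
    """Robertson-Sparck Jones weight with +0.5 smoothing.
    r: relevant docs containing t, n: docs containing t,
    R: relevant docs in total, N: collection size."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# With no relevance information (r = R = 0) the weight reduces to an
# idf-like quantity: log((N - n + 0.5) / (n + 0.5))
print(rsj_weight(r=0, n=100, R=0, N=10_000))
```

This matches the slide on estimates without relevance info: absent feedback, non-relevant documents are approximated by the collection as a whole, and the weight behaves like idf.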
  157. 157. Estimates without relevance info • If we pick a relevant document, words are equally likely to be present or absent • Non-relevant documents can be approximated with the collection as a whole 158
  158. 158. Modeling term frequencies 159
  159. 159. Modeling TF • Naïve estimation: a separate probability for every outcome • BIR had only two parameters; now we have plenty (roughly as many as there are tf outcomes) • We can plug in a parametric estimate for the term frequencies • For instance, a Poisson mixture 160
  160. 160. Okapi BM25 • Same ranking function as before but with new estimates. Models term frequencies and document length. • Words are generated by a mixture of two Poissons • Assumes an eliteness variable (elite ~ word occurs unusually frequently, non-elite ~ word occurs as expected by chance). 161
  161. 161. BM25 • As a graphical model 162
  162. 162. BM25 • To approximate this model, Robertson and Walker came up with a closed-form ranking function • Two model parameters • Very effective • The more words in common with the query the better • Repetitions are less important than different query words – But more important if the document is relatively long 163
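A sketch of a BM25-style scorer, using the common variant of the formula; the document, document frequencies, collection size, and average document length are all invented toy values, and k1/b are the two tunable parameters the slide mentions:

```python
import math

def bm25_score(query, doc, df, N, avgdl, k1=1.2, b=0.75):
    """BM25 sketch: idf times a saturating tf component with
    document-length normalization (parameters k1 and b)."""
    dl = sum(doc.values())  # document length in tokens
    score = 0.0
    for t in query:
        if t not in doc or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        tf = doc[t]
        # tf saturates: doubling tf adds less and less to the score
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

doc = {"chicken": 3, "curry": 1}
print(bm25_score(["chicken", "curry"], doc,
                 df={"chicken": 50, "curry": 20}, N=10_000, avgdl=100))
```

The saturation in the tf term is why repetitions matter less than matching a new query word, and the (1 − b + b·dl/avgdl) factor softens tf for long documents.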
  163. 163. Generative Probabilistic Language Models • The generative approach – A generator which produces events/tokens with some probability – Probability distribution over strings of text – Urn metaphor: a bucket of different colour balls (10 red, 5 blue, 3 yellow, 2 white) • What is the probability of drawing a yellow ball? 3/20 • What is the probability of drawing (with replacement) a red ball and a white ball? 1/2 × 1/10 = 1/20 – IR metaphor: documents are urns, full of tokens (balls) of different terms (colors)
  164. 164. What is a language model? • How likely is a string of words in a “language”? – P1(“the cat sat on the mat”) – P2(“the mat sat on the cat”) – P3(“the cat sat en la alfombra”) – P4(“el gato se sentó en la alfombra”) • Given a model M and a observation s we want – Probability of getting s through random sampling from M – A mechanism to produce observations (strings) legal in M • User thinks of a relevant document and then picks some keywords to use as a query 165
  165. 165. Generative Probabilistic Models • What is the probability of producing the query from a document? p(q|d) • Referred to as query-likelihood • Assumptions: • The probability of a document being relevant is strongly correlated with the probability of a query given a document, i.e. p(d|r) is correlated with p(q|d) • User has a reasonable idea of the terms that are likely to appear in the “ideal” document • User’s query terms can distinguish the “ideal” document from the rest of the corpus • The query is generated as a representative of the “ideal” document • System’s task is to estimate, for each of the documents in the collection, which is most likely to be the “ideal” document
  166. 166. Language Models (1998/2001) • Let’s assume we point blindly, one at a time, at 3 words in a document – What is the probability that I, by accident, pointed at the words “Master”, “computer” and “Science”? – Compute the probability, and use it to rank the documents. • Words are “sampled” independently of each other – Joint probability decomposed into a product of marginals – Estimation of probabilities just by counting • Higher models or unigrams? – Parameter estimation can be very expensive
  167. 167. Standard LM Approach • Assume that query terms are drawn identically and independently from a document
  168. 168. Estimating language models • Usually we don’t know M • Maximum Likelihood Estimate of p(t|M_d) – Simply use the number of times the query term occurs in the document divided by the total number of term occurrences. • Zero probability (frequency) problem: a single query term absent from the document makes the whole query probability zero 169
  169. 169. Document Models • Solution: infer a language model for each document • Then we can estimate p(q|d) under that model • Standard approach is to use the probability of a term in the collection to smooth the document model. • Interpolate the ML estimator with general language expectations
  170. 170. Estimating Document Models • Basic Components – Probability of a term given a document (maximum likelihood estimate) – Probability of a term given the collection – tf(t,d) is the number of times term t occurs in document d (term frequency)
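The two components above can be combined with Jelinek-Mercer style interpolation; the toy document, collection counts, and the mixing weight lam are illustrative assumptions:

```python
import math

def query_likelihood(query, doc_tf, coll_tf, coll_len, lam=0.5):
    """Smoothed query likelihood: interpolate the ML document model
    p(t|d) = tf(t,d) / |d| with the collection model p(t|C)."""
    dl = sum(doc_tf.values())
    log_p = 0.0
    for t in query:
        p_d = doc_tf.get(t, 0) / dl if dl else 0.0
        p_c = coll_tf.get(t, 0) / coll_len
        p = lam * p_d + (1 - lam) * p_c
        if p == 0:                 # term unseen even in the collection
            return float("-inf")
        log_p += math.log(p)       # independent sampling: product of marginals
    return log_p

doc = {"master": 2, "computer": 1, "science": 1}
coll = {"master": 10, "computer": 50, "science": 30, "the": 500}
print(query_likelihood(["master", "computer", "science"], doc, coll, coll_len=1000))
```

Smoothing solves the zero-frequency problem: a query term missing from the document still gets a small probability from the collection model instead of zeroing out the whole score.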
  171. 171. Language Models • Implementation
  172. 172. Implementation as vector product • Component models: p(t|D) = tf(t,D) / Σ_t' tf(t',D) and p(t) = df(t) / Σ_t' df(t') • Recall: score(q,d) = Σ_k tf(k,q) · (tf.idf of term k in document d) • The smoothed language model score has the same shape: score(q,d) = Σ_{matching terms t} tf(t,q) · log( 1 + (λ · tf(t,d) · Σ_t' df(t')) / ((1 − λ) · df(t) · Σ_t' tf(t',d)) ) – λ/(1−λ): odds of the probability of term importance – 1/Σ_t' tf(t',d): inverse length of d – Σ_t' df(t')/df(t): term importance 173
  173. 173. Document length normalization • Probabilistic models assume causes for documents differing in length – Scope – Verbosity • In practice, document length softens the term frequency contribution to the final score – We’ve seen it in BM25 and LMs – Usually with a tunable parameter that regulates the amount of softening – Can be a function of the deviation from the average document length – Can be incorporated into vanilla tf-idf 174
  174. 174. Other models • Modeling term dependencies (positions) in the language modeling framework – Markov Random Fields • Modeling matches (occurrences of words) in different parts of a document -> fielded models – BM25F – Markov Random Fields can account for this as well 175
  175. 175. More involved signals for ranking • From document understanding to query understanding • Query rewrites (gazetteers, spell correction), named entity recognition, query suggestions, query categories, query segmentation ... • Detecting query intent, triggering verticals – directing users straight to answers – richer interfaces 176
  176. 176. Signals for Ranking • Signals for ranking: matches of query terms in documents, query-independent quality measures, CTR, among others • Probabilistic IR models are all about counting – occurrences of terms in documents, in sets of documents, etc. • How to aggregate efficiently a large number of “different” counts – coming from the same terms – no double counts! 177
  177. 177. Searching for food • New York’s greatest pizza ‣ New OR York’s OR greatest OR pizza ‣ New AND York’s AND greatest AND pizza ‣ New OR York OR great OR pizza ‣ “New York” OR “great pizza” ‣ “New York” AND “great pizza” ‣ York < New AND great OR pizza • among many more. 178
  178. 178. “Refined” matching • Extract a number of virtual regions in the document that match some version of the query (operators) – Each region provides a different evidence of relevance (i.e. signal) • Aggregate the scores over the different regions • Ex.: “at least any two words in the query appear either consecutively or with an extra word between them” 179
  179. 179. Probability of Relevance 180
  180. 180. Remember BM25 • Term (tf) independence • Vague Prior over terms not appearing in the query • Eliteness - topical model that perturbs the word distribution • 2-poisson distribution of term frequencies over relevant and non-relevant documents 181
  181. 181. Feature dependencies • Class-linearly dependent (or affine) features – add no extra evidence/signal – model overfitting (vs capacity) • Still, it is desirable to enrich the model with more involved features • Some features are surprisingly correlated • Positional information requires a large number of parameters to estimate • Potentially up to 182
  182. 182. Query concept segmentation • Queries are made up of basic conceptual units, comprising many words – “Indian summer victor herbert” • Spurious matches: “san jose airport” -> “san jose city airport” • Model to detect segments based on generative language models and Wikipedia • Relax matches using factors of the max ratio between span length and segment length 183
  183. 183. Virtual regions • Different parts of the document provide different evidence of relevance • Create a (finite) set of (latent) artificial regions and re-weight 184
  184. 184. Implementation • An operator maps a query to a set of queries, which could match a document • Each operator has a weight • The average term frequency in a document is 185
  185. 185. Remarks • Different saturation (eliteness) function? – learn the real functional shape! – log-logistic is good if the class-conditional distributions are drawn from an exp. family • Positions as variables? – kernel-like method or exp. #parameters • Apply operators on a per query or per query class basis? 186
  186. 186. Operator examples • BOW: maps a raw query to the set of queries whose elements are the single terms • p-grams: set of all p-gram of consecutive terms • p-and: all conjunctions of p arbitrary terms • segments: match only the “concepts” • Enlargement: some words might sneak in between the phrases/segments 187
  187. 187. How does it work in practice? 188
  188. 188. ... not that far away • term frequency • link information • query intent information • editorial information • click-through information • geographical information • language information • user preferences • document length • document fields • other gazillion sources of information 189
  189. 189. Dictionaries • Fast look-up – Might need specific structures to scale up • Hash tables • Trees • Tolerant retrieval (prefixes) – Spell checking • Document correction (OCR) • Query misspellings (did you mean … ?) • (Weighted) edit distance – dynamic programming • Jaccard overlap (index character k-grams) • Context sensitive – Wild-card queries • Permuterm index • K-gram indexes 190
  190. 190. Hardware basics • Access to data in memory is much faster than access to data on disk. • Disk seeks: No data is transferred from disk while the disk head is being positioned. • Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks. • Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). • Block sizes: 8KB to 256 KB. 191
  191. 191. Hardware basics • Many design decisions in information retrieval are based on the characteristics of hardware • Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB. • Available disk space is several (2-3) orders of magnitude larger. • Fault tolerance is very expensive: It is much cheaper to use many regular machines rather than one fault tolerant machine. 192
  192. 192. Data flow (figure: MapReduce index construction — a master assigns input splits to parsers in the map phase; parsers emit segment files partitioned into term ranges a-f, g-p, q-z; inverters in the reduce phase merge each partition into postings) 193
  193. 193. MapReduce • The index construction algorithm we just described is an instance of MapReduce. • MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing … • … without having to write code for the distribution part. • They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce. • Open source implementation Hadoop – Widely used throughout industry 194
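The index-construction phase can be sketched as a toy map/reduce pipeline: mappers emit (term, doc_id) pairs, the framework groups by term (the shuffle), and reducers build postings lists. The function names and tiny collection are illustrative, and the shuffle is simulated in-process rather than distributed:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # a mapper emits one (term, doc_id) pair per token
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(term, doc_ids):
    # a reducer turns grouped pairs into a sorted postings list
    return term, sorted(set(doc_ids))

docs = {1: "chicken curry", 2: "chicken salt", 3: "salt pepper"}

# Shuffle: group intermediate pairs by term (done by the framework in practice)
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_phase(doc_id, text):
        grouped[term].append(d)

index = dict(reduce_phase(t, ids) for t, ids in grouped.items())
print(index["chicken"])  # [1, 2]
```

The appeal of the framework is exactly what the slide says: the per-record map and reduce logic stays this simple, while distribution, grouping, and fault tolerance are handled elsewhere.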
  194. 194. MapReduce • Index construction was just one phase. • Another phase: transforming a term-partitioned index into a document-partitioned index. – Term-partitioned: one machine handles a subrange of terms – Document-partitioned: one machine handles a subrange of documents • Most search engines use a document-partitioned index for better load balancing, etc. 195
  195. 195. Distributed IR • Basic process – All queries sent to a director machine – Director then sends messages to many index servers • Each index server does some portion of the query processing – Director organizes the results and returns them to the user • Two main approaches – Document distribution • by far the most popular – Term distribution 196
  196. 196. Distributed IR (II) • Document distribution – each index server acts as a search engine for a small fraction of the total collection – director sends a copy of the query to each of the index servers, each of which returns the top k results – results are merged into a single ranked list by the director • Collection statistics should be shared for effective ranking 197
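The director's merge step under document distribution can be sketched as a k-way merge of per-shard top-k lists; the shard results and scores below are invented for illustration:

```python
import heapq

def merge_results(shard_results, k=3):
    """Merge per-shard result lists (each already sorted by descending
    score) into a single ranked top-k list, as the director would."""
    merged = heapq.merge(*shard_results, key=lambda sd: -sd[0])
    return list(merged)[:k]

shard1 = [(0.9, "d3"), (0.4, "d7")]                # index server 1
shard2 = [(0.8, "d12"), (0.7, "d5"), (0.1, "d9")]  # index server 2
print(merge_results([shard1, shard2]))
# [(0.9, 'd3'), (0.8, 'd12'), (0.7, 'd5')]
```

For the merged scores to be comparable across shards, collection statistics such as df must be shared, which is the point of the last bullet above.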
  197. 197. Caching • Query distributions are similar to Zipf • About half of the queries each day are unique, but some are very popular – Caching can significantly improve efficiency • Cache popular query results • Cache common inverted lists – Inverted list caching can help with unique queries – Cache must be refreshed to prevent stale data 198
  198. 198. Others • Efficiency (compression, storage, caching, distribution) • Novelty and diversity • Evaluation • Relevance feedback • Learning to rank • User models – Context, personalization • Sponsored Search • Temporal aspects • Social aspects 199