Information Retrieval, by Hadi Mohammadzadeh

Transcript

    • 1. Seminar on Information Retrieval (IR). By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm, 3 Nov. 2009.
    • 2. Information Retrieval: definition. Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (query) from within large collections (usually stored on computers).
    • 3. Basic assumptions of Information Retrieval. Collection: a fixed set of documents. Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.
    • 4. Search Methods for Finding Documents.
    • 5. Searching methods: the grep method; the term-document incidence matrix (Boolean retrieval); the inverted index; the inverted index with skip pointers/skip lists; positional postings (for phrase queries).
    • 6. Term-document incidence matrix: 1 if the play contains the word, 0 otherwise.

                 Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
      Antony            1               1              0          0       0        1
      Brutus            1               1              0          1       0        0
      Caesar            1               1              0          1       1        1
      Calpurnia         0               1              0          0       0        0
      Cleopatra         1               0              0          0       0        0
      mercy             1               0              1          1       1        1
      worser            1               0              1          1       1        0
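A minimal sketch of how this matrix answers Boolean queries (using the tiny corpus above; the variable names are illustrative): each term is a bit vector over plays, and Brutus AND Caesar AND NOT Calpurnia is answered by bitwise operations on those vectors.

```python
# Toy term-document incidence matrix: one bit vector per term.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
incidence = {  # 1 if the play contains the word, 0 otherwise
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

# Brutus AND Caesar AND NOT Calpurnia: combine the bit vectors.
bits = [b & c & (1 - p) for b, c, p in zip(incidence["Brutus"],
                                           incidence["Caesar"],
                                           incidence["Calpurnia"])]
print([play for play, bit in zip(plays, bits) if bit])
# ['Antony and Cleopatra', 'Hamlet']
```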
    • 7. Inverted index (Sec. 1.2). For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this? Example postings: Brutus → 2, 4, 8, 16, 32, 64, 128; Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34; Caesar → 13, 16. What happens if the word Caesar is added to document 14?
    • 8. Inverted index (Sec. 1.2). Linked lists are generally preferred to arrays: dynamic space allocation and easy insertion of new postings, at the cost of the space overhead of pointers. Each item in a postings list is a posting; the dictionary maps each term to its postings list (Brutus → 2, 4, 8, 16, 32, 64, 128; Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34; Caesar → 13, 16). A build-and-intersect sketch follows.
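A minimal in-memory sketch (illustrative names and toy corpus): build the dictionary and sorted postings lists, then intersect two postings lists with the standard linear merge used for AND queries.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Linear merge of two sorted postings lists (AND query)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = {1: "caesar was ambitious", 2: "brutus killed caesar",
        4: "brutus and caesar"}
index = build_index(docs)
print(intersect(index["brutus"], index["caesar"]))  # [2, 4]
```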
    • 9. Augment postings with skip pointers (at indexing time). For example, the list 2, 4, 8, 41, 48, 64, 128 with skip pointers to 41 and 128, or 1, 2, 3, 8, 11, 17, 21, 31 with skip pointers to 11 and 31. Why? To skip postings that will not figure in the search results. Where do we place skip pointers?
    • 10. Where do we place skips? Tradeoff: more skips → shorter skip spans ⇒ a skip is more likely to succeed, but many comparisons against skip pointers; fewer skips → few pointer comparisons, but long skip spans ⇒ few successful skips. See the intersection sketch below.
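A sketch of intersection with skip pointers, using the common heuristic of roughly √L evenly spaced skips for a postings list of length L (the lists are the toy data from slide 9):

```python
import math

def skip_table(postings):
    """Skip pointers every ~sqrt(L) positions: {position: skip target}."""
    step = max(1, int(math.sqrt(len(postings))))
    return {i: min(i + step, len(postings) - 1)
            for i in range(0, len(postings), step)}

def intersect_with_skips(p1, p2):
    s1, s2 = skip_table(p1), skip_table(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it does not overshoot p2[j].
            i = s1[i] if i in s1 and p1[s1[i]] <= p2[j] else i + 1
        else:
            j = s2[j] if j in s2 and p2[s2[j]] <= p1[i] else j + 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```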
    • 11. Positional index example: <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>. Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
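A simplified sketch of answering a two-term phrase query with positional postings (toy data mirroring the slide; a full positional intersection generalizes this across all query terms):

```python
def phrase_docs(pos_index, term1, term2):
    """Docs in which term2 occurs at the position right after term1."""
    hits = []
    for doc in pos_index[term1].keys() & pos_index[term2].keys():
        first = set(pos_index[term1][doc])
        if any(p - 1 in first for p in pos_index[term2][doc]):
            hits.append(doc)
    return sorted(hits)

# Toy positional index: term -> {doc_id: [positions]}.
pos_index = {
    "to": {1: [4, 9], 4: [16, 190], 5: [100]},
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
}
print(phrase_docs(pos_index, "to", "be"))  # [4] — e.g. positions 16, 17
```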
    • 12. Steps of inverted index construction (Sec. 1.2). Documents to be indexed ("Friends, Romans, countrymen.") → Tokenizer → token stream (Friends Romans Countrymen) → Linguistic modules → modified tokens (friend roman countryman) → Indexer → inverted index (friend → 2, 4; roman → 1, 2; countryman → 13, 16).
    • 13. Parts of an inverted index. Dictionary: commonly kept in memory. Postings lists: commonly kept on disk.
    • 14. Inverted index construction: preprocessing to form the term vocabulary. Tokenization (problem cases: hyphens, apostrophes, compounds, Chinese, numbers). Dropping stop words (but you need them for phrase queries, various song titles, and relational queries). Normalization (term equivalence classing): numbers; case folding (reduce all letters to lower case); stemming (e.g. Porter's algorithm), which reduces terms to their "roots"; lemmatization, which reduces variant forms to a base form. A toy pipeline is sketched below.
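A toy version of this pipeline (the stopword list and the crude suffix stripper are illustrative stand-ins; a real system would use a full stopword list and a proper stemmer such as Porter's):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # tiny stand-in

def crude_stem(term):
    """Toy stand-in for Porter stemming: strip a few common suffixes."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize + case-fold
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Friends, Romans, countrymen: lend me your ears!"))
# ['friend', 'roman', 'countrymen', 'lend', 'me', 'your', 'ear']
```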
    • 15. Index construction. Blocked sort-based indexing (BSBI): accumulate postings for each block, sort, and write to disk; then merge the blocks (external sorting) into one long sorted order. Distributed indexing using MapReduce: break indexing into two sets of parallel tasks, parsers and inverters; break the input document corpus into splits; the master assigns a split to an idle parser machine; a parser reads one document at a time and emits (term, doc) pairs, writing the pairs into j partitions, where each partition covers a range of terms' first letters; an inverter collects all (term, doc) pairs for one term partition, sorts them, and writes the postings lists. Dynamic indexing.
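A compact BSBI sketch (illustrative: real implementations write each sorted run to disk and map terms to ids; here the "blocks" are in-memory lists and heapq.merge plays the role of the external k-way merge):

```python
import heapq
from itertools import groupby

def invert_block(block):
    """One BSBI pass: collect and sort (term, doc_id) pairs for a block."""
    return sorted({(term, doc_id)
                   for doc_id, text in block
                   for term in text.lower().split()})

def merge_blocks(runs):
    """Merge step: k-way merge of sorted runs into final postings lists."""
    merged = heapq.merge(*runs)  # streams all runs in sorted order
    return {term: [doc for _, doc in group]
            for term, group in groupby(merged, key=lambda pair: pair[0])}

blocks = [[(1, "new home sales"), (2, "home sales rise")],
          [(3, "rise in new home sales")]]
index = merge_blocks([invert_block(b) for b in blocks])
print(index["home"])  # [1, 2, 3]
```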
    • 16. Index construction data flow with MapReduce (figure): the master assigns splits to parsers; in the map phase, parsers emit (term, doc) pairs into segment files partitioned by term range (a–f, g–p, q–z); in the reduce phase, one inverter per partition collects its pairs and writes the postings.
    • 17. Search structures for the dictionary: a naïve dictionary; hash tables; trees (binary trees, B-trees).
    • 18. Index compression. Dictionary compression for Boolean indexes: array of fixed-width entries (wasteful); dictionary as a string; blocking; front coding. Postings compression: gap encoding using prefix-unique codes; variable-byte codes; gamma codes (seldom used in practice). A variable-byte sketch follows.
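A sketch of gap encoding with variable-byte (VB) codes: each byte carries 7 payload bits, and the high bit marks the final byte of a number.

```python
def vb_encode_number(n):
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                # high bit marks the last byte
    return bytes(out)

def vb_encode_postings(postings):
    """Encode the gaps between successive doc ids, not the ids."""
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data):
    docs, n, doc = [], 0, 0
    for byte in data:
        n = n * 128 + (byte % 128)
        if byte >= 128:           # final byte of this gap
            doc += n              # undo the gap encoding
            docs.append(doc)
            n = 0
    return docs

encoded = vb_encode_postings([824, 829, 215406])  # 6 bytes total
print(vb_decode(encoded))                         # [824, 829, 215406]
```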
    • 19. Dictionary compression for Boolean indexes: dictionary-as-a-string. Store the dictionary as one long string (…systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…) and keep, per term, a frequency, a postings pointer, and a term pointer into the string. Total string length = 400K × 8 B = 3.2 MB; term pointers must resolve 3.2M positions: log2 3.2M = 22 bits = 3 bytes.
    • 20. Dictionary compression for Boolean indexes: blocking. Keep a term pointer only for every k-th term and prefix each term with its length: …7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo… With k = 4 we save 9 bytes on 3 term pointers, but lose 4 bytes on the 4 term lengths.
    • 21. Dictionary compression for Boolean indexes: front coding. Sorted words commonly share a long common prefix, so store the differences only (for the last k−1 terms in a block of k): 8automata 8automate 9automatic 10automation → 8automat*a 1◊e 2◊ic 3◊ion, where 8automat* encodes the shared prefix automat and each number gives the extra length beyond it. A sketch follows.
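A sketch of encoding one block with front coding (the ◊ and * markers follow the slide; real layouts also store frequencies and postings pointers alongside):

```python
import os

def front_code_block(terms):
    """Front-code a sorted block: the first term in full (with its length
    and the shared prefix marked by *), then length-diamond-suffix for the rest."""
    prefix = os.path.commonprefix(terms)
    first = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    rest = [f"{len(t) - len(prefix)}\u25ca{t[len(prefix):]}"
            for t in terms[1:]]
    return first + "".join(rest)

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion
```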
    • 22. Information Retrieval: Ranked Retrieval.
    • 23. Ranked retrieval. Thus far, our queries have all been Boolean. That is good for expert users, and also good for applications, which can easily consume thousands of results. It is not good for the majority of users: most users are incapable of writing Boolean queries (or they are capable, but think it is too much work), and most users don't want to wade through thousands of results. This is particularly true of web search.
    • 24. Term weighting: term frequency and inverse document frequency. Log-frequency TF weight: w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise. IDF: with df_t the number of docs in the collection that contain term t, idf_t = log10(N / df_t). tf-idf weighting: the tf-idf weight of a term is the product of its tf weight and its idf weight, w_{t,d} = (1 + log10(tf_{t,d})) × log10(N / df_t). tf-idf is the best-known weighting scheme in information retrieval.
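These formulas directly in code (log base 10, as on the slide; N is the collection size, df the term's document frequency):

```python
import math

def log_tf(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(N, df):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df)

def tf_idf(tf, N, df):
    return log_tf(tf) * idf(N, df)

# A term occurring 10 times in a doc and in 100 of 1,000,000 docs:
print(tf_idf(10, 1_000_000, 100))  # (1 + 1) * 4 = 8.0
```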
    • 25. Vector space model for scoring. Represent the query as a weighted tf-idf vector; represent each document as a weighted tf-idf vector; compute the cosine similarity score for the query vector and each document vector: cos(q, d) = (q · d) / (|q| |d|) = Σ_{i=1..V} q_i d_i / (sqrt(Σ_{i=1..V} q_i²) · sqrt(Σ_{i=1..V} d_i²)). Rank documents with respect to the query by score. The score increases with the number of occurrences of a query term within a document, and with the rarity of the term in the collection.
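A minimal sketch of cosine ranking over dense weight vectors (illustrative; real engines compute scores term-at-a-time over the inverted index rather than materializing dense vectors):

```python
import math

def cosine(q, d):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

def rank(query_vec, doc_vecs):
    """Sort documents by cosine score against the query, best first."""
    return sorted(doc_vecs,
                  key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                  reverse=True)

docs = {1: [0.0, 2.3, 1.1], 2: [1.8, 0.0, 0.4], 3: [0.9, 0.9, 0.9]}
print(rank([1.0, 0.5, 0.0], docs))
```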
    • 26. Heuristic methods for speeding up vector space scoring and ranking. Many of these heuristics achieve their speed at the risk of not finding quite the top K documents matching a query. Efficient scoring and ranking: 1. inexact top-K document retrieval; 2. index elimination; 3. champion lists; 4. static quality scores. We want top-ranking documents to be both relevant and authoritative: relevance is modeled by cosine scores, while authority is typically a query-independent property of a document, so assign a query-independent quality score in [0, 1] to each document d, denoted g(d).
    • 27. Heuristic methods for speeding up vector space scoring and ranking (cont.). 5. Cluster pruning. Preprocessing: pick √N docs at random and call these leaders; for every other doc, precompute the nearest leader. Docs attached to a leader are its followers; likely, each leader has about √N followers. Query processing: given query q, find its nearest leader L, then seek the K nearest docs from among L's followers (sketch below). Net score for a document d: combine cosine relevance and authority, e.g. net-score(q, d) = g(d) + cosine(q, d); there are fast methods for finding the top K by net score.
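A sketch of cluster pruning (cosine as above; leaders are chosen at random, per the slide, and the document vectors are illustrative):

```python
import math, random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_preprocess(doc_vecs):
    """Pick ~sqrt(N) random leaders; attach each doc to its nearest leader."""
    leaders = random.sample(list(doc_vecs), max(1, int(math.sqrt(len(doc_vecs)))))
    followers = {ld: [] for ld in leaders}
    for doc_id, vec in doc_vecs.items():
        nearest = max(leaders, key=lambda ld: cosine(vec, doc_vecs[ld]))
        followers[nearest].append(doc_id)
    return followers

def cluster_query(q, doc_vecs, followers, k):
    """Find the nearest leader, then the top k among its followers."""
    leader = max(followers, key=lambda ld: cosine(q, doc_vecs[ld]))
    return sorted(followers[leader],
                  key=lambda d: cosine(q, doc_vecs[d]), reverse=True)[:k]

docs = {i: [random.random() for _ in range(3)] for i in range(100)}
print(cluster_query([1.0, 0.0, 0.0], docs, cluster_preprocess(docs), k=5))
```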
    • 28. Cluster pruning (figure): the query is routed to its nearest leader, then scored against that leader's followers.
    • 29. Parametric and zone indexes. Documents in fact have multiple parts, some with special semantics: author, title, date of publication, language, format, etc. These constitute the metadata about a document, and we sometimes wish to search by them. A field or parametric index keeps postings for each field value; a field query is typically treated as a conjunction. A zone is a region of the doc that can contain an arbitrary amount of text, e.g. title, abstract, references; build inverted indexes on zones as well to permit querying them.
    • 30. Example zone indexes (figure): zones can be encoded in the dictionary vs. in the postings.
    • 31. Tiered indexes. Break postings up into a hierarchy of lists, from most important to least important; this can be done by g(d) or another measure. The inverted index is thus broken up into tiers of decreasing importance. At query time, use the top tier unless it fails to yield K docs; if so, drop to lower tiers.
    • 32. Example tiered index (figure).
    • 33. A Complete Search System.
    • 34. Evaluating Search Engines (Ranked Retrieval Methods).
    • 35. Measures for a search engine. Which parameters are most important in a search engine? How fast does it index; how fast does it search; the expressiveness of its query language; an uncluttered user interface (UI); is it free?
    • 36. The key measure: user happiness. Useless answers won't make a user happy, so we need a way of quantifying user happiness. Issue: who is the user we are trying to make happy? A web engine, an eCommerce site, and an enterprise (company/government/academic) each have different users. Happiness is elusive to measure.
    • 37. Evaluation of unranked retrieval. Precision: the fraction of retrieved docs that are relevant = P(relevant | retrieved). Recall: the fraction of relevant docs that are retrieved = P(retrieved | relevant). Contingency table: retrieved and relevant = tp; retrieved and nonrelevant = fp; not retrieved but relevant = fn; not retrieved and nonrelevant = tn. Precision P = tp / (tp + fp); Recall R = tp / (tp + fn).
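The two measures computed straight from the contingency table:

```python
def precision(tp, fp):
    """Fraction of retrieved docs that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant docs that are retrieved."""
    return tp / (tp + fn)

# 20 docs retrieved, 12 of them relevant; 40 relevant docs exist in total:
tp, fp, fn = 12, 8, 28
print(precision(tp, fp))  # 0.6
print(recall(tp, fn))     # 0.3
```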
    • 38. Evaluation of unranked retrieval (cont.). What about accuracy? The accuracy of an engine is the fraction of classifications (relevant/nonrelevant) that are correct. Accuracy is a measure used in machine-learning classification work. Why is it not a very useful evaluation measure in IR? Consider how to build a 99.9999% accurate search engine on a low budget: since almost all documents are nonrelevant, an engine that retrieves nothing is almost always "correct".
    • 39. Evaluation of unranked retrieval (cont.). F measure: a combined measure that assesses the precision/recall tradeoff, defined as a weighted harmonic mean: F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R). People usually use the balanced F1 measure, i.e. β = 1 (α = 1/2). For F1, the best value is 1 and the worst value is 0.
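The weighted harmonic mean in code (β = 1 gives the balanced F1):

```python
def f_measure(p, r, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if p == 0 or r == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.6, 0.3))  # F1 = 2 * 0.6 * 0.3 / (0.6 + 0.3) = 0.4
```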
    • 40. Evaluation of ranked retrieval. By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve. We can determine a value between the measured points using interpolation, e.g. 11-point interpolated average precision (sketch below). Other methods: mean average precision (MAP) and R-precision.
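A sketch of 11-point interpolated average precision, where interpolated precision at recall level r is the maximum precision at any recall ≥ r (the curve points are illustrative):

```python
def interpolated_precision(pr_points, r):
    """pr_points: list of (recall, precision). Max precision at recall >= r."""
    candidates = [p for rec, p in pr_points if rec >= r]
    return max(candidates) if candidates else 0.0

def eleven_point_average(pr_points):
    levels = [i / 10 for i in range(11)]  # recall 0.0, 0.1, ..., 1.0
    return sum(interpolated_precision(pr_points, r) for r in levels) / 11

curve = [(0.1, 1.0), (0.2, 0.8), (0.4, 0.67), (0.6, 0.5), (1.0, 0.3)]
print(round(eleven_point_average(curve), 3))
```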
    • 41. A precision-recall curve (figure): precision (y-axis) against recall (x-axis), both ranging from 0.0 to 1.0.
    • 42. Typical (good) 11-point precisions (figure): SabIR/Cornell 8A1 11-point precision from TREC 8 (1999), precision against recall.
    • 43. Relevance Feedback (RF) for Query Refinement in Search Engines.
    • 44. Relevance feedback: user feedback on the relevance of docs in an initial set of results. The user issues a (short, simple) query; the user marks some results as relevant or non-relevant; the system computes a better representation of the information need based on the feedback. Relevance feedback can go through one or more iterations. Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.
    • 45. Relevance feedback example: an image search engine, http://nayana.ece.ucsb.edu/imsearch/imsearch.html
    • 46. Results for the initial query (figure).
    • 47. Relevance feedback (figure).
    • 48. Results after relevance feedback (figure).
    • 49. Key concept: centroid. The centroid is the center of mass of a set of points; recall that we represent documents as points in a high-dimensional space. Definition: μ(C) = (1/|C|) Σ_{d ∈ C} d, where C is a set of documents.
    • 50. Rocchio algorithm. The Rocchio algorithm uses the vector space model to pick a relevance-fed-back query. Rocchio seeks the query q_opt that maximizes q_opt = argmax_q [cos(q, μ(C_r)) − cos(q, μ(C_nr))], i.e. it tries to separate the docs marked relevant from those marked non-relevant: q_opt = (1/|C_r|) Σ_{d_j ∈ C_r} d_j − (1/|C_nr|) Σ_{d_j ∉ C_r} d_j. Problem: we don't know the truly relevant docs.
    • 51. Rocchio 1971 algorithm (SMART: Cornell's (Salton) IR system of the 1970s to 1990s). Used in practice: q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j, where D_r is the set of known relevant doc vectors and D_nr the set of known irrelevant doc vectors (different from C_r and C_nr!); q_m is the modified query vector, q_0 the original query vector, and α, β, γ are weights (hand-chosen or set empirically). The new query moves toward relevant documents and away from non-relevant documents (see the sketch below).
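The practical update as a sketch over dense vectors (the weights are illustrative; values around α = 1, β = 0.75, γ = 0.15 are commonly cited, and negative components are usually clipped to 0):

```python
def centroid(vectors):
    """Component-wise mean of a set of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    cr = centroid(relevant) if relevant else [0.0] * len(q0)
    cnr = centroid(nonrelevant) if nonrelevant else [0.0] * len(q0)
    return [max(0.0, alpha * q + beta * r - gamma * nr)  # clip negatives
            for q, r, nr in zip(q0, cr, cnr)]

q0 = [1.0, 0.0, 0.5]                            # original query vector
relevant = [[0.9, 0.8, 0.0], [0.7, 0.6, 0.1]]   # known relevant docs (Dr)
nonrelevant = [[0.0, 0.0, 1.0]]                 # known irrelevant docs (Dnr)
print(rocchio(q0, relevant, nonrelevant))       # moves toward Dr, away from Dnr
```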
    • 52. The theoretically best query (figure): the optimal query vector separates the relevant documents (o) from the non-relevant documents (x).
    • 53. Relevance feedback on an initial query (figure): the revised query moves from the initial query toward the known relevant documents (o) and away from the known non-relevant documents (x).
    • 54. Relevance feedback in vector spaces. We can modify the query based on relevance feedback and apply the standard vector space model, using only the docs that were marked. Relevance feedback can improve recall and precision; it is most useful for increasing recall in situations where recall is important, since users can be expected to review results and to take time to iterate.
    • 55. Relevance feedback revisited. In relevance feedback, the user marks a number of documents as relevant/nonrelevant; we then try to use this information to return better search results. Suppose we just tried to learn a filter for nonrelevant documents: this is an instance of a text classification problem, with two "classes" (relevant, nonrelevant) and, for each document, a decision about whether it is relevant or nonrelevant.
    • 56. Text Classification.
    • 57. Classification methods #1: manual classification. Used by Yahoo! (originally; now present but downplayed), LookSmart, about.com, ODP, PubMed. Very accurate when the job is done by experts, and consistent when the problem size and team are small, but difficult and expensive to scale.
    • 58. Classification methods #2: automatic document classification with hand-coded rule-based systems. One technique used by the CS department's spam filter, Reuters, the CIA, etc.; companies (e.g. Verity) provide an "IDE" for writing such rules. Accuracy is often very high if a rule has been carefully refined over time by a subject expert, but building and maintaining these rules is expensive.
    • 59. Classification methods #3: supervised learning of a document-label assignment function. Many systems partly rely on machine learning: k-nearest neighbors (simple, powerful), Naive Bayes (simple, a common method), support vector machines (newer, more powerful). No free lunch: these require hand-classified training data, but the data can be built up (and refined) by amateurs.
    • 60. References: Introduction to Information Retrieval (Manning, Raghavan & Schütze, 2008); Managing Gigabytes (Witten, Moffat & Bell, 1999).