Slides

  1. Algorithms for Large Data Sets. Ziv Bar-Yossef. Lecture 2, March 26, 2006. http://www.ee.technion.ac.il/courses/049011
  2. Information Retrieval
  3. Information Retrieval Setting. The user has an information need: "I want information about Michael Jordan, the machine learning expert." She issues the query +"Michael Jordan" -basketball to the IR system, which searches the document collection and returns a ranked list of retrieved documents (Michael I. Jordan's homepage, NBA.com, Michael Jordan on TV). The user gives feedback (No. 1 is good, the rest are bad), and the system returns a revised ranked list (Michael I. Jordan's homepage, M.I. Jordan's pubs, Graphical Models).
  4. Information Retrieval vs. Data Retrieval.
     - Information Retrieval system: a system that allows a user to retrieve documents that match her "information need" from a large corpus.
       - Ex: get documents about Michael Jordan, the machine learning expert.
     - Data Retrieval system: a system that allows a user to retrieve all documents that match her query from a large corpus.
       - Ex: SELECT doc
             FROM corpus
             WHERE (doc.text CONTAINS "Michael Jordan") AND
                   NOT (doc.text CONTAINS "basketball")
  5. Information Retrieval vs. Data Retrieval.
     - Data: free text, unstructured (IR) vs. database tables, structured (data retrieval)
     - Queries: keywords, natural language (IR) vs. SQL, relational algebras (data retrieval)
     - Results (matching): approximate matches (IR) vs. exact matches (data retrieval)
     - Results (ordering): ordered by relevance (IR) vs. unordered (data retrieval)
     - Accessibility: non-expert humans (IR) vs. knowledgeable users or automatic processes (data retrieval)
  6. Information Retrieval Systems. [Diagram: IR system architecture. The user's query goes through a query processor to become a system query; raw docs from the corpus go through a text processor into tokenized docs, which an indexer turns into postings stored in the index; the ranking procedure uses the index to produce retrieved docs and returns ranked retrieved docs to the user.]
  7. Search Engines. [Diagram: search engine architecture, the same as the IR system with the Web in place of a fixed corpus. A crawler fetches raw docs into a repository, a text processor produces tokenized docs, an indexer builds postings and the index, a global analyzer enriches the index, and the query processor and ranking procedure return ranked retrieved docs to the user.]
  8. Classical IR vs. Web IR.
     - Documents: text (classical IR) vs. hypertext (Web IR)
     - # of matches: small (classical IR) vs. large (Web IR)
     - Data accessibility: accessible (classical IR) vs. partially accessible (Web IR)
     - Volume: large (classical IR) vs. huge (Web IR)
     - IR techniques: content-based (classical IR) vs. link-based (Web IR)
     - Format diversity: homogeneous (classical IR) vs. widely diverse (Web IR)
     - Data change rate: infrequent (classical IR) vs. in flux (Web IR)
     - Data quality: clean, no dups (classical IR) vs. noisy, dups (Web IR)
  9. Outline.
     - Abstract formulation
     - Models for relevance ranking
     - Retrieval evaluation
     - Query languages
     - Text processing
     - Indexing and searching
  10. Abstract Formulation.
     - Ingredients:
       - D: document collection
       - Q: query space
       - f: D × Q → R: relevance scoring function
       - For every q in Q, f induces a ranking (partial order) ≤_q on D
     - Functions of an IR system:
       - Preprocess D and create an index I
       - Given q in Q, use I to produce a permutation π on D
     - Goals:
       - Accuracy: π should be "close" to ≤_q
       - Compactness: the index should be compact
       - Response time: answers should be given quickly
  11. Document Representation.
     - T = {t_1, ..., t_k}: a "token space" (a.k.a. "feature space" or "term space")
       - Ex: all words in English
       - Ex: phrases, URLs, ...
     - A document: a real vector d in R^k
       - d_i: "weight" of token t_i in d
       - Ex: d_i = normalized # of occurrences of t_i in d
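A minimal sketch (not part of the original slides) of representing a document as a weight vector over a fixed token space, using normalized occurrence counts as the weights; the names `doc_vector` and `token_space` are illustrative:

```python
from collections import Counter

def doc_vector(tokens, token_space):
    """Represent a document (list of tokens) as a vector over token_space,
    where each weight is the normalized number of occurrences of the token."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[t] / total for t in token_space]

# Example: a tiny token space and document
token_space = ["michael", "jordan", "basketball", "learning"]
doc = ["michael", "jordan", "machine", "learning", "jordan"]
print(doc_vector(doc, token_space))  # [0.2, 0.4, 0.0, 0.2]
```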
  12. Classic IR (Relevance) Models.
     - The Boolean model
     - The Vector Space Model (VSM)
  13. The Boolean Model.
     - A document: a boolean vector d in {0,1}^k
       - d_i = 1 iff t_i belongs to d
     - A query: a boolean formula q over tokens
       - q: {0,1}^k → {0,1}
       - Ex: "Michael Jordan" AND (NOT basketball)
       - Ex: +"Michael Jordan" -basketball
     - Relevance scoring function: f(d,q) = q(d)
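A minimal sketch of Boolean relevance scoring under these definitions; the query is hard-coded as a Python predicate rather than parsed from a query language, and all names are illustrative:

```python
def boolean_score(doc_terms, query):
    """f(d, q) = q(d): the query is a predicate over the set of terms in d."""
    return 1 if query(doc_terms) else 0

# Query: +"michael jordan" -basketball
query = lambda terms: ("michael jordan" in terms) and ("basketball" not in terms)

d1 = {"michael jordan", "graphical models", "berkeley"}
d2 = {"michael jordan", "nba", "basketball"}
print(boolean_score(d1, query), boolean_score(d2, query))  # 1 0
```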
  14. 14. The Boolean Model: Pros & Cons <ul><li>Advantages: </li></ul><ul><ul><li>Simplicity for users </li></ul></ul><ul><li>Disadvantages: </li></ul><ul><ul><li>Relevance scoring is too coarse </li></ul></ul>
  15. The Vector Space Model (VSM).
     - A document: a real vector d in R^k
       - d_i = weight of t_i in d (usually the TF-IDF score)
     - A query: a real vector q in R^k
       - q_i = weight of t_i in q
     - Relevance scoring function: f(d,q) = sim(d,q), a "similarity" between d and q
  16. Popular Similarity Measures.
     - L_1 or L_2 distance: d and q are first normalized to have unit norm
     - Cosine similarity: sim(d,q) = (d · q) / (||d|| ||q||) = cos θ, where θ is the angle between d and q
     [Figure: the vectors d and q, their difference d - q, and the angle θ between them.]
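A minimal sketch of cosine similarity between two term-weight vectors (plain Python, no external libraries; names are illustrative):

```python
import math

def cosine_similarity(d, q):
    """sim(d, q) = (d . q) / (||d|| * ||q||); returns 0.0 for a zero vector."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_similarity([0.2, 0.4, 0.0], [1.0, 1.0, 0.0]))  # ~0.949
```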
  17. TF-IDF Score: Motivation.
     - Motivating principle: a term t_i is relevant to a document d if:
       - t_i occurs many times in d relative to other terms that occur in d
       - t_i occurs many times in d relative to its number of occurrences in other documents
     - Examples:
       - 10 out of 100 terms in d are "java" (high term frequency)
       - 10 out of 10,000 terms in d are "java" (low term frequency)
       - 10 out of 100 terms in d are "the" (high term frequency, but "the" appears in nearly every document, so it says little about d)
  18. TF-IDF Score: Definition.
     - n(d,t_i) = # of occurrences of t_i in d
     - N = Σ_i n(d,t_i) (# of tokens in d)
     - D_i = # of documents containing t_i
     - D = # of documents in the collection
     - TF(d,t_i): "Term Frequency"
       - Ex: TF(d,t_i) = n(d,t_i) / N
       - Ex: TF(d,t_i) = n(d,t_i) / max_j { n(d,t_j) }
     - IDF(t_i): "Inverse Document Frequency"
       - Ex: IDF(t_i) = log(D / D_i)
     - TFIDF(d,t_i) = TF(d,t_i) × IDF(t_i)
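A minimal sketch of these definitions over a toy corpus, using TF(d,t) = n(d,t)/N and IDF(t) = log(D/D_t) as on the slide; the corpus, function, and variable names are illustrative:

```python
import math
from collections import Counter

def tf_idf(doc, corpus):
    """Return the TF-IDF score of every term in doc.
    doc is a list of tokens; corpus is a list of such documents (doc included)."""
    counts = Counter(doc)
    n_tokens = len(doc)
    n_docs = len(corpus)
    scores = {}
    for term, n in counts.items():
        tf = n / n_tokens
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(n_docs / df)
        scores[term] = tf * idf
    return scores

corpus = [["michael", "jordan", "graphical", "models"],
          ["michael", "jordan", "nba", "basketball"],
          ["inference", "bayesian", "networks"]]
print(tf_idf(corpus[0], corpus))
# "graphical" and "models" score higher than "michael" and "jordan",
# which appear in two of the three documents
```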
  19. 19. VSM: Pros & Cons <ul><li>Advantages: </li></ul><ul><ul><li>Better granularity in relevance scoring </li></ul></ul><ul><ul><li>Good performance in practice </li></ul></ul><ul><ul><li>Efficient implementations </li></ul></ul><ul><li>Disadvantages: </li></ul><ul><ul><li>Assumes term independence </li></ul></ul>
  20. Retrieval Evaluation.
     - Notations:
       - D: document collection
       - D_q: documents in D that are "relevant" to query q (ex: f(d,q) is above some threshold)
       - L_q: list of results on query q
     - Recall = |D_q ∩ L_q| / |D_q|
     - Precision = |D_q ∩ L_q| / |L_q|
  21. 21. Recall & Precision: Example <ul><li>Recall(A) = 80% </li></ul><ul><li>Precision(A) = 40% </li></ul><ul><li>d 123 </li></ul><ul><li>d 84 </li></ul><ul><li>d 56 </li></ul><ul><li>d 6 </li></ul><ul><li>d 8 </li></ul><ul><li>d 9 </li></ul><ul><li>d 511 </li></ul><ul><li>d 129 </li></ul><ul><li>d 187 </li></ul><ul><li>d 25 </li></ul>List A Relevant docs: d 123 , d 56 , d 9 , d 25 , d 3 <ul><li>d 81 </li></ul><ul><li>d 74 </li></ul><ul><li>d 56 </li></ul><ul><li>d 123 </li></ul><ul><li>d 511 </li></ul><ul><li>d 25 </li></ul><ul><li>d 9 </li></ul><ul><li>d 129 </li></ul><ul><li>d 3 </li></ul><ul><li>d 5 </li></ul>List B <ul><li>Recall(B) = 100% </li></ul><ul><li>Precision(B) = 50% </li></ul>
  22. Precision@k and Recall@k.
     - Notations:
       - D_q: documents in D that are "relevant" to q
       - L_q,k: top k results on the list
     - Recall@k = |D_q ∩ L_q,k| / |D_q|
     - Precision@k = |D_q ∩ L_q,k| / |L_q,k|
  23. Precision@k: Example.
     - List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
     - List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
     [Figure: precision@k of the two lists as a function of k.]
  24. Recall@k: Example.
     - List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
     - List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
     [Figure: recall@k of the two lists as a function of k.]
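A minimal sketch computing precision@k and recall@k for the example lists above (function names are illustrative); at k = 10 it reproduces the numbers from the Recall & Precision example:

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(results[:k]) & relevant) / k

def recall_at_k(results, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    return len(set(results[:k]) & relevant) / len(relevant)

relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]

print(precision_at_k(list_a, relevant, 10), recall_at_k(list_a, relevant, 10))  # 0.4 0.8
print(precision_at_k(list_b, relevant, 10), recall_at_k(list_b, relevant, 10))  # 0.5 1.0
```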
  25. "Interpolated" Precision.
     - Notations:
       - D_q: documents in D that are "relevant" to q
       - r: a recall level (e.g., 20%)
       - k(r): first k such that recall@k >= r
     - Interpolated precision at recall level r = max { precision@k : k >= k(r) }
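A minimal, self-contained sketch of this definition (an illustration, not the course's implementation); the maximum is taken over all k up to the length of the result list:

```python
def interpolated_precision(results, relevant, r):
    """Max precision@k over all k at which recall@k has reached level r."""
    n = len(results)
    rel = set(relevant)
    def prec(k): return len(set(results[:k]) & rel) / k
    def rec(k): return len(set(results[:k]) & rel) / len(rel)
    ks = [k for k in range(1, n + 1) if rec(k) >= r]  # candidates for k(r)
    if not ks:
        return 0.0  # recall level r is never reached within the list
    return max(prec(k) for k in range(ks[0], n + 1))

relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
print(interpolated_precision(list_a, relevant, 0.2))  # 1.0 (precision@1, where recall is already 20%)
```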
  26. Precision vs. Recall: Example.
     - List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
     - List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
     [Figure: precision as a function of recall for the two lists.]
  27. Query Languages: Keyword-Based.
     - Single-word queries
       - Ex: Michael Jordan machine learning
     - Context queries
       - Phrases. Ex: "Michael Jordan" "machine learning"
       - Proximity. Ex: "Michael Jordan" at distance of at most 10 words from "machine learning"
     - Boolean queries
       - Ex: +"Michael Jordan" -basketball
     - Natural language queries
       - Ex: "Get me pages about Michael Jordan, the machine learning expert."
  28. Query Languages: Pattern Matching.
     - Prefixes. Ex: prefix:comput
     - Suffixes. Ex: suffix:net
     - Regular expressions. Ex: [0-9]+th world-wide web conference
  29. Text Processing.
     - Lexical analysis & tokenization: split text into words; downcase letters; filter out punctuation marks, digits, hyphens
     - Stopword elimination: better retrieval accuracy, more compact index. Ex: "to be or not to be"
     - Stemming. Ex: "computer", "computing", "computation" → comput
     - Index term selection: keywords vs. full text
  30. Inverted Index.
     - d1: Michael(1) Jordan(2), the(3) author(4) of(5) "graphical(6) models(7)", is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).
     - d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).
     - Vocabulary and postings:
       - author: (d1,4)
       - berkeley: (d1,13)
       - date: (d2,9)
       - famous: (d2,2)
       - graphical: (d1,6)
       - jordan: (d1,2), (d2,6)
       - legend: (d2,4)
       - like: (d2,7)
       - michael: (d1,1), (d2,5)
       - model: (d1,7), (d2,10)
       - nba: (d2,3)
       - professor: (d1,10)
       - uc: (d1,12)
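A minimal sketch of building such a positional inverted index (term to list of (doc id, position) postings). It skips the stemming and stopword handling that produce the exact vocabulary above, so its output differs slightly from the slide; all names are illustrative:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, position), ...]},
    with terms lowercased and positions counted from 1, as on the slide."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().replace(",", " ").replace(".", " ").split()
        for pos, term in enumerate(tokens, start=1):
            index[term].append((doc_id, pos))
    return dict(index)

docs = {"d1": "Michael Jordan, the author of graphical models, is a professor at UC Berkeley",
        "d2": "The famous NBA legend Michael Jordan liked to date models"}
index = build_index(docs)
print(index["jordan"])   # [('d1', 2), ('d2', 6)]
print(index["michael"])  # [('d1', 1), ('d2', 5)]
```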
  31. Inverted Index Structure.
     - Vocabulary file (term1, term2, ...): usually fits in main memory
     - Postings file (postings list 1, postings list 2, ...): stored on disk
  32. Searching an Inverted Index.
     - Given:
       - t1, t2: query terms
       - L1, L2: corresponding posting lists
     - Need to get a ranked list of the docs in the intersection of L1 and L2
     - Solution 1: if L1 and L2 are comparable in size, "merge" L1 and L2 to find the docs in their intersection, and then order them by rank. (Running time: O(|L1| + |L2|).)
     - Solution 2: if L1 is considerably shorter than L2, binary search each posting of L1 in L2 to find the intersection, and then order them by rank. (Running time: O(|L1| × log |L2|).)
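A minimal sketch of the two intersection strategies on posting lists sorted by doc id (the lists here carry doc ids only, without positions; names are illustrative):

```python
import bisect

def intersect_merge(l1, l2):
    """Solution 1: linear merge of two sorted posting lists, O(|L1| + |L2|)."""
    i = j = 0
    out = []
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            out.append(l1[i]); i += 1; j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_binary_search(l1, l2):
    """Solution 2: binary search each doc of the shorter list L1 in L2, O(|L1| log |L2|)."""
    out = []
    for doc in l1:
        k = bisect.bisect_left(l2, doc)
        if k < len(l2) and l2[k] == doc:
            out.append(doc)
    return out

l1, l2 = [2, 7, 31], [1, 2, 3, 5, 7, 11, 13, 17, 19, 23]
print(intersect_merge(l1, l2), intersect_binary_search(l1, l2))  # [2, 7] [2, 7]
```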
  33. Search Optimization.
     - Improvement: order the docs in the posting lists by static rank (e.g., PageRank). Then the top matches can be output without scanning the whole lists.
  34. Index Construction.
     - Given a stream of documents, store (did, tid, pos) triplets in a file
     - Sort and group the file by tid
     - Extract posting lists
  35. Index Maintenance.
     - Naïve updates of an inverted index can be very costly: they require random access, and a single change may cause many insertions/deletions
     - Batch updates
     - Two indices:
       - Main index (created in batch, large, compressed)
       - "Stop-press" index (incremental, small, uncompressed)
  36. Index Maintenance.
     - If a page d is inserted/deleted, the "signed" postings (did, tid, pos, I/D) are added to the stop-press index.
     - Given a query term t, fetch its list L_t from the main index, and the two lists L_t,+ and L_t,- from the stop-press index.
     - The result is (L_t ∪ L_t,+) \ L_t,-.
     - When the stop-press index grows too large, it is merged into the main index.
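A minimal sketch of answering a query against the main index plus stop-press index, computing (L_t ∪ L_t,+) \ L_t,- over sets of doc ids (an illustration of the idea under simplified data structures, not the course's implementation):

```python
def query_with_stop_press(term, main_index, stop_press_add, stop_press_del):
    """Combine main-index postings with incremental insertions and deletions."""
    l_t = main_index.get(term, set())
    l_plus = stop_press_add.get(term, set())   # postings signed "I" (inserted)
    l_minus = stop_press_del.get(term, set())  # postings signed "D" (deleted)
    return (l_t | l_plus) - l_minus

main_index = {"jordan": {"d1", "d2"}}
stop_press_add = {"jordan": {"d7"}}   # d7 was inserted after the last batch build
stop_press_del = {"jordan": {"d2"}}   # d2 was deleted after the last batch build
print(query_with_stop_press("jordan", main_index, stop_press_add, stop_press_del))
# {'d1', 'd7'}
```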
  37. Index Compression.
     - Delta compression: store the gap from the previous doc id instead of the doc id itself
       - Saves a lot for popular terms
       - Doesn't save much for rare terms (but these don't take much space anyway)
     - Ex:
       michael: (1000007,5), (1000009,12), (1000013,77), (1000035,88), ...
       becomes
       michael: (1000007,5), (2,12), (4,77), (22,88), ...
  38. Variable Length Encodings.
     - How to encode gaps succinctly?
       - Option 1: fixed-length binary encoding.
         - Effective when all gap lengths are equally likely
         - No savings over storing doc ids
       - Option 2: unary encoding.
         - Gap x is encoded by x-1 1's followed by a 0
         - Effective when large gaps are very rare (Pr(x) = 1/2^x)
       - Option 3: gamma encoding.
         - Gap x is encoded by (λ_x, β_x), where β_x is the binary encoding of x and λ_x is the length of β_x, encoded in unary
         - Encoding length: about 2 log(x)
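A minimal sketch of gamma encoding exactly as defined on the slide: the length of the binary encoding of x in unary (length minus one 1's followed by a 0), then the full binary encoding of x. (A common variant drops the leading 1 bit of the binary part; this sketch keeps it for readability.)

```python
def unary(n):
    """Unary code from the slide: n is encoded as (n-1) ones followed by a zero."""
    return "1" * (n - 1) + "0"

def gamma_encode(x):
    """Gamma code from the slide: the length of the binary encoding of x in unary,
    followed by the binary encoding of x itself (about 2*log2(x) bits in total)."""
    binary = bin(x)[2:]          # binary encoding of x, e.g. 22 -> '10110'
    return unary(len(binary)) + binary

for gap in [1, 2, 4, 22]:
    print(gap, gamma_encode(gap))
# prints: 1 01, 2 1010, 4 110100, 22 1111010110
```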
  39. End of Lecture 2
