Snipets by FrancoSH
Final work for EDA, Snippets

Published in Technology
Transcript

  • 1. Fast Generation of Result Snippets in Web Search. Franco Sánchez Huertas (UCSP). EDA – June, 2010. 21/06/2010 UCSP -FASH
  • 2. Overview • What are snippets? • Research questions • Rationale • Baseline • Compressed Token System • Document caching for snippet generation • Sentence reordering • Conclusion
  • 3. What are snippets?
  • 5. Research question Which fast strategies can we use to generate snippets for web search results?
  • 7. Rationale • Two main reasons: – Snippet extraction is an integral part of the query evaluation process, and speeding it up will reduce the overall time (and resources) required to process a query – No prior literature discusses how to generate snippets efficiently
  • 12. (Pipeline diagram, example query "sigir": identify relevant documents → strip sentences into a bag of sentences → collect stats on sentences → sentence ranker → pick 2-3 sentences per document → generate result page.)
  • 13. Sentence ranking • a-priori (without queries): ai – sentence position (titles, leading sentences) – term/sentence significance • Query time (with queries): query term count ci – unique query term count ui – query term proximity li • Using all the above features, sentence i can be ranked using some function f(ci, ui, li, ai)
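The slide leaves f(ci, ui, li, ai) abstract. A minimal Python sketch of one way to combine the four features; the linear weights and the proximity measure (longest run of consecutive query terms) are illustrative assumptions, not something the talk specifies:

```python
def rank_sentence(ci, ui, li, ai, w=(1.0, 2.0, 1.5, 0.5)):
    """Combine query-term count (ci), unique query-term count (ui),
    query-term proximity (li) and the a-priori score (ai).
    The weights w are an assumption for illustration."""
    return w[0] * ci + w[1] * ui + w[2] * li + w[3] * ai

def score_sentences(sentences, query_terms, a_priori):
    """Score each sentence (a list of lowercase tokens) against the query."""
    scored = []
    for sent, ai in zip(sentences, a_priori):
        ci = sum(1 for t in sent if t in query_terms)   # total query-term hits
        ui = len(query_terms.intersection(sent))        # distinct query terms
        li = run = 0                                    # longest run of
        for t in sent:                                  # consecutive query terms
            run = run + 1 if t in query_terms else 0
            li = max(li, run)
        scored.append((rank_sentence(ci, ui, li, ai), sent))
    return sorted(scored, reverse=True)
```

The pipeline then keeps the top two/three scored sentences per document for the snippet.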
  • 17. Baseline. Indexing time: strip out HTML, add an <eos> marker after each sentence, and compress each document using gzip. Query time: for each document in the results list, decompress, run a string matcher over the text (full word matching) against the query, rank sentences with f(ci, ui, li, ai), and return two/three sentences.
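The baseline store can be sketched in a few lines of Python; the tag stripping, sentence splitter, and <eos> byte are simplified assumptions:

```python
import gzip
import re

EOS = "\x01"  # assumed end-of-sentence marker byte

def index_document(html):
    """Baseline indexing: strip HTML tags, mark sentence ends, gzip-compress."""
    text = re.sub(r"<[^>]+>", " ", html)      # crude tag stripping
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"(?<=[.!?]) ", EOS, text)  # naive sentence splitter
    return gzip.compress((text + EOS).encode("utf-8"))

def snippet_sentences(blob, query_terms):
    """Baseline query time: decompress the whole document, then keep
    sentences containing at least one query word (full word matching)."""
    sentences = gzip.decompress(blob).decode("utf-8").split(EOS)
    return [s for s in sentences
            if any(t in s.lower().split() for t in query_terms)]
```

The full decompress-then-scan at query time is exactly the cost that CTS, on the next slides, is designed to cut.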
  • 21. Compressed Token System (CTS). Indexing time: strip out HTML; pass 1 builds a lexicon (the → 1, of → 2, in → 3, …); pass 2 rewrites all documents into a single file, plus an offset map that is required to tell where a document starts. Terms in the lexicon are replaced with an integer; those not in the lexicon are spelt out as ESC-length-word (e.g. ESC-7-britney). The end of each sentence is marked, and the result is compressed using an integer compression scheme (vbyte).
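The ESC-length-word rule plus vbyte can be sketched as follows; the choice of 0 as the ESC id and spelling unknown words one character-code per integer are illustrative assumptions:

```python
ESC = 0  # assumed reserved integer marking an out-of-lexicon word

def vbyte_encode(n):
    """Variable-byte encode a non-negative integer, 7 bits per byte;
    the high bit flags the final byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Decode a vbyte stream back into a list of integers."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

def encode_document(tokens, lexicon):
    """CTS encoding: lexicon terms become integers; unknown words are
    spelt out as ESC, their length, then one integer per character."""
    out = bytearray()
    for t in tokens:
        if t in lexicon:
            out += vbyte_encode(lexicon[t])
        else:  # ESC-length-word, e.g. ESC-7-britney
            out += vbyte_encode(ESC) + vbyte_encode(len(t))
            for ch in t:
                out += vbyte_encode(ord(ch))
    return bytes(out)
```

Because common terms get small integers, vbyte stores most of the document in one byte per word.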
  • 29. Compressed Token System (CTS). Query time: the vocabulary, offset mapping, and compressed collection are loaded; the query is converted to integer tokens (e.g. 1 33 57); for each document in the results list, the offset mapping locates its integer form; the matcher works directly on the compressed integers; f(ci, ui, li, ai) picks two/three sentences, which are then converted back to text (integer + ESC sequence).
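Matching on integer tokens, as the slide describes, might look like this sketch; the reserved end-of-sentence id is an assumption, passed in explicitly here:

```python
def query_to_ints(query, lexicon):
    """Convert query words to lexicon integers (the slide's 1 33 57);
    words outside the lexicon cannot match an in-lexicon token."""
    return {lexicon[w] for w in query.split() if w in lexicon}

def match_sentences(int_doc, query_ints, eos_id):
    """Scan the integer document, splitting on the (assumed) reserved
    end-of-sentence id, and keep sentences containing a query integer.
    Text is only reconstructed for the few chosen sentences."""
    hits, sent = [], []
    for tok in int_doc:
        if tok == eos_id:
            if query_ints.intersection(sent):
                hits.append(sent)
            sent = []
        else:
            sent.append(tok)
    return hits
```

Comparing small integers instead of character strings is where CTS saves its in-memory processing time.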
  • 30. Data set and results. We used the TREC WT10g and WT100g collections; WT50g is a 50 GB collection randomly sampled from WT100g. (Chart: compression effectiveness as a percentage of full collection size, Baseline vs CTS, on WT10g, WT50g, WT100g.)
  • 33. Efficiency comparison. We generated snippets for 10,000 queries using CTS and the baseline (top 10 docs) and measured snippet generation time for each document; a caching effect was noticed for the first few queries, so we take the average over the last 7,000 queries. (Chart: time in msec split into disk access (seek) and in-memory processing, Baseline vs CTS, with marks at 4.5, 7, and 16 msec.) So can we get away with performing no disk access?
  • 36. Document caching. With CTS the average document size drops from 5.7 KB to 1.2 KB (compression + no HTML), and WT100g has around 18.6 million documents. The snippet machine has 4 GB of RAM: 1 GB is used by the lexicon and document offset mapping, leaving 2-3 GB that can be dedicated to caching. In theory, using WT100g, over 250k docs can be cached in memory, 5-7% of the collection size. But, in reality, how much disk access would we actually save? We simulate this by caching the top 20 documents for > 500k queries; simulation allows us to control memory usage and count exact hits and misses.
  • 37. Caching simulation. We processed > 500k queries and cached the top 20 documents for each query. The score of a document is half its Okapi BM25 score and half a query-independent score (similar effect to PageRank). We used two cache eviction policies: Static, where once the cache is populated and full no entries are evicted, and LRU (least recently used), where once the cache is full documents are evicted based on the recency of their access. What is Q? Search engines cache results of most popular queries.
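The two eviction policies from the slide can be sketched with an `OrderedDict`; the class name and hit/miss counters are illustrative, matching the simulation's need for exact hit and miss counts:

```python
from collections import OrderedDict

class DocCache:
    """Document cache with the slide's two eviction policies:
    'static' (never evict once full) and 'lru' (evict least recently used)."""

    def __init__(self, capacity, policy="lru"):
        self.capacity, self.policy = capacity, policy
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, doc_id, fetch):
        """Return the cached document, calling fetch() on a miss."""
        if doc_id in self.store:
            self.hits += 1
            if self.policy == "lru":
                self.store.move_to_end(doc_id)   # refresh recency
            return self.store[doc_id]
        self.misses += 1
        doc = fetch()                            # disk access happens here
        if len(self.store) < self.capacity:
            self.store[doc_id] = doc
        elif self.policy == "lru":
            self.store.popitem(last=False)       # evict least recently used
            self.store[doc_id] = doc
        return doc                               # static: full cache unchanged
```

Replaying the query log through such a cache gives exactly the hit-rate curve shown on the next slide.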
  • 38. Caching simulation (results). (Chart: percentage of doc requests that hit the cache vs cache size as a percentage of the collection, WT100g.) A cache of 1% of the collection yields 80% hits, and caching 3% accounts for more than 97% of hits.
  • 39. Caching simulation. (Chart: Baseline 16 msec, split into seek and process; CTS 7 msec; CTS + caching 3.4 msec.) How can we further enhance doc cache performance? Smaller cache entries mean more documents can fit in cache, so do we need to keep entire documents in cache? Perhaps not.
  • 40. Sentence reordering. Captain Feathersword is the friendliest Pirate on the open seas. He loves a good party, and making people giggle. It's lucky that he has a feather for a sword, which he can use to tickle everyone. Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship. You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!" Blue terms = significant. Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
  • 41. Sentence reordering (each sentence labelled with its rank, natural order kept): [4] Captain Feathersword is the friendliest Pirate on the open seas. [5] He loves a good party, and making people giggle. [2] It's lucky that he has a feather for a sword, which he can use to tickle everyone. [1] Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship. [3] You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!" Blue terms = significant.
  • 43. Sentence reordering (sentences reordered by rank): [1] Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship. [2] It's lucky that he has a feather for a sword, which he can use to tickle everyone. [3] You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!" (keep the top-ranked sentences in cache) [4] Captain Feathersword is the friendliest Pirate on the open seas. [5] He loves a good party, and making people giggle. Blue terms = significant.
  • 46. Sentence reordering methods • Significant sentence (ST): sentences that contain significant terms are good • Natural order: the original order of sentences in a document • Query log (Qlt): sentences that contain previously queried terms • Query log (Qlu): same as Qlt, but repeated query terms in a sentence are counted only once • But where do we draw the cut-off line? – A trade-off between efficiency gains (more documents in cache) and effectiveness loss
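A Qlu-style reordering can be sketched as below; the scoring (distinct previously-queried terms per sentence, ties broken by natural order) and the function names are illustrative assumptions:

```python
def reorder_qlu(sentences, logged_terms):
    """Qlu-style reordering sketch: score each sentence by the number of
    distinct previously-queried terms it contains; ties keep the natural
    document order (the slide's 'Natural order' fallback)."""
    scored = sorted(
        enumerate(sentences),
        key=lambda p: (-len(logged_terms.intersection(p[1].lower().split())),
                       p[0]),
    )
    return [s for _, s in scored]

def cache_prefix(reordered, cutoff):
    """The cut-off trade-off from the slide: keeping fewer sentences per
    document means smaller cache entries (so more documents fit in cache)
    at some loss of snippet quality."""
    return reordered[:cutoff]
```

Because the best candidate sentences now sit at the front, only this prefix needs to stay in the document cache.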
  • 48. Conclusion • We proposed a practical document storage scheme for snippet extraction (CTS) • Compared to the baseline defined, CTS halves the in-memory processing time needed to generate a snippet • Using a document cache, we have shown that 80% of seeks can also be averted by caching only 1% of the collection • Document caching can be further enhanced by retaining only the important parts of a document through sentence reordering
  • 49. Questions