Fast Generation of Result Snippets in Web Search


                    Franco S...
Snipets by FrancoSH


Final work for EDA, Snippets

Published in: Technology

  1. Fast Generation of Result Snippets in Web Search. Franco Sánchez Huertas (UCSP), EDA, June 2010. 21/06/2010 UCSP -FASH
  2. Overview
      • What are snippets?
      • Research questions
      • Rationale
      • Baseline
      • Compressed Token System
      • Document caching for snippet generation
      • Sentence reordering
      • Conclusion
  3. What are snippets?
  4. What are snippets?
  5. Research question: Which fast strategies can we use to generate snippets for web search results?
  6-7. Rationale
      • Two main reasons:
        – Snippet extraction is an integral part of the query evaluation process, and speeding it up will reduce the overall time (and resources) required to process a query
        – No prior literature exists which discusses how to efficiently generate snippets
  8-12. Snippet generation pipeline for the example query "sigir" (diagram, built up step by step)
      Search → Identify relevant documents → Strip sentences → Bag of sentences → Collect stats on sentences → Sentence ranker → Pick 2-3 sentences per document → … generate result page
  13. Sentence ranking
      • A priori features (without queries), ai:
        – sentence position (titles, leading sentences)
        – term/sentence significance
      • Query-time features (with queries):
        – query term count ci
        – unique query term count ui
        – query term proximity li
      • Using all the above features, sentence i can be ranked using some function f(ci, ui, li, ai)
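The slide leaves f unspecified. As a minimal sketch, assuming a plain weighted sum, the features could be computed and combined as below. The weights, the helper names, the 1/(1+position) a-priori score, and the use of the longest run of adjacent query terms as the proximity feature li are all illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class SentenceFeatures:
    c: int    # ci: total query term occurrences in the sentence
    u: int    # ui: number of distinct query terms in the sentence
    l: int    # li: longest run of adjacent query terms (assumed proximity proxy)
    a: float  # ai: a-priori (query-independent) score


def features(sentence_tokens, query_terms, a_priori):
    """Compute (ci, ui, li, ai) for one tokenized sentence."""
    query = set(query_terms)
    c = sum(1 for t in sentence_tokens if t in query)
    u = len(query & set(sentence_tokens))
    longest = run = 0
    for t in sentence_tokens:
        run = run + 1 if t in query else 0
        longest = max(longest, run)
    return SentenceFeatures(c, u, longest, a_priori)


def f(feat, weights=(1.0, 2.0, 1.5, 0.5)):
    """Assumed scoring function: a weighted sum with hypothetical weights."""
    wc, wu, wl, wa = weights
    return wc * feat.c + wu * feat.u + wl * feat.l + wa * feat.a


def top_sentences(sentences, query_terms, k=2):
    """Return the k best-scoring sentences, restored to document order."""
    scored = []
    for pos, sentence in enumerate(sentences):
        a = 1.0 / (1 + pos)  # assumption: leading sentences get a higher prior
        tokens = sentence.lower().split()
        scored.append((f(features(tokens, query_terms, a)), pos))
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:k]
    return [sentences[pos] for _, pos in sorted(best, key=lambda item: item[1])]
```

Returning the chosen sentences in document order is a design choice of this sketch so that a two-sentence snippet reads naturally; the slides do not say how the selected sentences are ordered.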
  14-17. Baseline (diagram, built up step by step)
      • Indexing time: for each HTML document, strip out HTML, add an EOS (end-of-sentence) marker after each sentence, and compress the resulting text using gzip
      • Query time: for each document in the results list, decompress the text, run a string matcher (full word matching) against the query, score the sentences with f(ci, ui, li, ai), and return the top two/three sentences
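The baseline steps above can be sketched roughly as follows. This is only an illustration: the sentence splitter (a regular expression on ., ! and ?), the literal `<eos>` marker string, and the function names are assumptions, and the scoring step is left out so that only the store/decompress/match path is shown.

```python
import gzip
import re
from html.parser import HTMLParser

EOS = "<eos>"  # end-of-sentence marker (assumed literal form)


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def index_document(html):
    """Indexing time: strip HTML, add EOS markers, compress with gzip."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    # Naive sentence splitting, an assumption of this sketch.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    marked = "".join(s + " " + EOS + " " for s in sentences)
    return gzip.compress(marked.encode("utf-8"))


def matching_sentences(blob, query_terms):
    """Query time: decompress, then keep sentences containing a full query word."""
    text = gzip.decompress(blob).decode("utf-8")
    sentences = [s.strip() for s in text.split(EOS) if s.strip()]
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in query_terms
    ]
    return [s for s in sentences if any(p.search(s) for p in patterns)]
```

Note that every query has to decompress the whole document and match on strings, which is the in-memory cost the Compressed Token System aims to reduce.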
  18-22. Compressed Token System (CTS): indexing time (diagram, built up step by step)
      • Strip out HTML from each document
      • Pass 1: build the lexicon, assigning an integer to each term (the → 1, of → 2, in → 3, …)
      • Pass 2: encode all documents into a single file
        – Terms in lexicon are replaced with an integer
        – Those not in lexicon are spelt out as ESC-length-word, e.g. ESC-7-britney
        – Mark end of each sentence
        – Compress using integer compression scheme (vbyte)
      • An offset map is required to tell where a document starts
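A minimal sketch of the pass 2 encoding, under stated assumptions: codes 0 and 1 are reserved here for ESC and end-of-sentence (so lexicon codes must start at 2, unlike the numbering shown on the slide), and the vbyte variant sets the high bit on the final byte of each integer. Both conventions, and all function names, are illustrative choices rather than details confirmed by the slides.

```python
ESC, EOS = 0, 1  # reserved codes (assumption): escape and end-of-sentence


def vbyte_encode(n):
    """Encode a non-negative int, 7 bits per byte, high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    n = shift = 0
    while True:
        b = data[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if b & 0x80:
            return n, pos
        shift += 7


def encode_document(sentences, lexicon):
    """sentences: list of word lists; lexicon: word -> integer code (>= 2)."""
    out = bytearray()
    for sentence in sentences:
        for word in sentence:
            code = lexicon.get(word)
            if code is None:
                raw = word.encode("utf-8")  # spelt out as ESC-length-word
                out += vbyte_encode(ESC) + vbyte_encode(len(raw)) + raw
            else:
                out += vbyte_encode(code)
        out += vbyte_encode(EOS)  # mark end of sentence
    return bytes(out)


def decode_document(data, words_by_code):
    """Inverse of encode_document; words_by_code maps integer code -> word."""
    sentences, current, pos = [], [], 0
    while pos < len(data):
        code, pos = vbyte_decode(data, pos)
        if code == EOS:
            sentences.append(current)
            current = []
        elif code == ESC:
            length, pos = vbyte_decode(data, pos)
            current.append(data[pos:pos + length].decode("utf-8"))
            pos += length
        else:
            current.append(words_by_code[code])
    return sentences


def build_collection(docs, lexicon):
    """Concatenate encoded documents into one blob plus an offset map."""
    blob, offsets = bytearray(), []
    for sentences in docs:
        offsets.append(len(blob))  # where this document starts
        blob += encode_document(sentences, lexicon)
    return bytes(blob), offsets
```

With the offset map, document i occupies the bytes from `offsets[i]` up to `offsets[i + 1]` (or the end of the blob for the last document), so a single seek into the single file is enough to fetch it.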
  23-29. Compressed Token System (CTS): query time (diagram, built up step by step)
      • Inputs: the compressed collection, the vocabulary (the → 1, of → 2, in → 3, …), the offset mapping, the query, and the results list
      • Convert the query to integer tokens using the vocabulary, e.g. 1 33 57
      • Use the offset mapping to fetch each document in the results list as an integer document
      • Run the integer + ESC sequence matcher over the integer documents (use compressed integer matching) and score the sentences with f(ci, ui, li, ai)
      • Convert the top two/three sentences back to text
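The query-time idea can be sketched as follows: the query is converted to integers once, matching is done on integers, and only the chosen sentences are turned back into text. In this sketch an integer document is a list of sentences, each a list of lexicon codes, with out-of-lexicon words kept as plain strings to stand in for ESC sequences. That representation, the example lexicon, the scoring by simple match count, and the function names are all assumptions for illustration.

```python
def query_to_tokens(query, lexicon):
    """Convert query words to integer tokens; words not in the lexicon are skipped."""
    return {lexicon[w] for w in query.lower().split() if w in lexicon}


def best_sentence_indexes(int_sentences, query_tokens, k=2):
    """Rank sentences by the number of matching integer tokens (ties: earliest first)."""
    scored = []
    for index, sentence in enumerate(int_sentences):
        score = sum(1 for token in sentence if token in query_tokens)
        if score > 0:
            scored.append((-score, index))
    return [index for _, index in sorted(scored)[:k]]


def to_text(int_sentence, words_by_code):
    """Convert one integer sentence back to text; strings are ESC-spelt words."""
    words = [
        token if isinstance(token, str) else words_by_code[token]
        for token in int_sentence
    ]
    return " ".join(words)


def snippet(int_sentences, query, lexicon, words_by_code, k=2):
    """Match on integers, then convert only the selected sentences to text."""
    tokens = query_to_tokens(query, lexicon)
    chosen = best_sentence_indexes(int_sentences, tokens, k)
    return [to_text(int_sentences[i], words_by_code) for i in chosen]
```

The point of the design is that the per-token comparison is an integer comparison and the text conversion touches two or three sentences rather than the whole document.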
  30. Data set and results
      - We used the TREC WT10g and WT100g collections.
      - WT50g is a 50 GB collection randomly sampled from WT100g
      [Bar chart: Compression effectiveness. y-axis: percentage of full collection size (0 to 30); x-axis: WT10g, WT50g, WT100g; series: Baseline, CTS]
  31-33. Efficiency comparison
      - Generated snippets for 10,000 queries using CTS and baseline (top 10 docs)
      - Measured snippet generation time for each document
      - A caching effect was noticed for the first few queries, so we take the average of the last 7,000 queries
      [Bar chart: time (msec) for Baseline and CTS, split into disk access (seek) and in-memory processing; values shown on the time axis: 4.5, 7, 16]
      So can we get away with performing no disk access?
  34-36. Document caching
      - With CTS, avg doc size: 5.7 KB → 1.2 KB (compression + no HTML)
      - WT100g has around 18.6 million documents
      - Snippet machine has 4 GB of RAM
        - 1 GB is used by lexicon and document offset mapping
        - 2-3 GB can be dedicated for caching
      - In theory, using WT100g should be able to cache over 250k docs in memory
      - This is 5-7% of the collection size
      - But, in reality, how much disk access would we actually save?
      - We simulate this by caching the top 20 documents for > 500 k queries
      - Simulation allows us to control memory usage and exact hit and miss counts
  37. Caching simulation
      - We processed > 500 k queries and cached the top 20 documents for each query
      - The score of documents is half Okapi BM25 score and half query-independent score (similar effect as PageRank)
      - We used two cache eviction policies:
        - Static: once the cache is populated and full, no entries are evicted
        - LRU (least recently used): once the cache is full, documents are evicted based on the recency of their access
      - What is Q ?
      - Search engines cache results of most popular queries
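The two eviction policies can be simulated with a few lines of code. The sketch below is an assumption-laden simplification: the cache capacity is counted in documents rather than bytes, `requests` is a hypothetical stream of document ids (for example, the top 20 ids of each query in order), and the function names are made up for illustration.

```python
from collections import OrderedDict


def simulate_static(requests, capacity):
    """Static cache: fill until full, then never evict. Returns the hit count."""
    cache = set()
    hits = 0
    for doc_id in requests:
        if doc_id in cache:
            hits += 1
        elif len(cache) < capacity:
            cache.add(doc_id)
    return hits


def simulate_lru(requests, capacity):
    """LRU cache: when full, evict the least recently used document."""
    if capacity <= 0:
        return 0
    cache = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for doc_id in requests:
        if doc_id in cache:
            hits += 1
            cache.move_to_end(doc_id)      # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used entry
            cache[doc_id] = True
    return hits


def hit_rate(hits, requests):
    """Fraction of document requests served from the cache."""
    return hits / len(requests) if requests else 0.0
```

Running both functions over the same request stream with a range of capacities gives the kind of hit-rate-versus-cache-size comparison the simulation slides describe.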
  38. Caching simulation (results)
      [Chart: % of doc requests that hit cache vs. cache size (% of collection, WT100g)]
      • Cache of 1% of collection yields 80% hits and caching 3% accounts for more than 97% of hits
  39. Caching simulation
      [Bar chart: time (msec), split into Seek and Process, for Baseline, CTS, and CTS + caching; values shown on the time axis: 3.4, 7, 16]
      How can we further enhance doc cache performance?
      – Smaller cache entries mean more documents can fit in cache, so do we need to keep entire documents in cache? Perhaps not
  40. Sentence reordering (blue terms = significant)
      Captain Feathersword is the friendliest Pirate on the open seas. He loves a good party, and making people giggle. It's lucky that he has a feather for a sword, which he can use to tickle everyone. Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship. You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
      Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
  41. Sentence reordering: each sentence labelled with its rank
      4  Captain Feathersword is the friendliest Pirate on the open seas.
      5  He loves a good party, and making people giggle.
      2  It's lucky that he has a feather for a sword, which he can use to tickle everyone.
      1  Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
      3  You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
  42-43. Sentence reordering: sentences stored in rank order
      1  Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
      2  It's lucky that he has a feather for a sword, which he can use to tickle everyone.
      (Keep in cache)
      3  You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
      4  Captain Feathersword is the friendliest Pirate on the open seas.
      5  He loves a good party, and making people giggle.
  44-46. Sentence reordering methods
      • Significant sentence (ST): sentences that contain significant terms are good
      • Natural order: the original order of sentences in a document
      • Query log (Qlt): sentences that contain previously queried terms
      • Query log (Qlu): same as Qlt, but repeated query terms in a sentence are only counted once
      • But where do we draw the cut-off line?
        – A trade-off between efficiency gains (more documents in cache) and effectiveness loss
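As a rough illustration of the two query-log methods and the cut-off, the sketch below scores each sentence against a set of previously queried terms, reorders the document, and keeps only a leading fraction for the cache. Treating the query log as a plain set of terms, the tokenizer, the `keep_fraction` parameter, and the function names are all assumptions; the slides do not give the exact scoring or the cut-off value.

```python
import math
import re


def tokenize(sentence):
    """Lowercase word tokens (a simple assumption for this sketch)."""
    return re.findall(r"\w+", sentence.lower())


def score_qlt(sentence, logged_terms):
    """Qlt-style score: every occurrence of a previously queried term counts."""
    return sum(1 for token in tokenize(sentence) if token in logged_terms)


def score_qlu(sentence, logged_terms):
    """Qlu-style score: each distinct previously queried term counts once."""
    return len(set(tokenize(sentence)) & logged_terms)


def reorder(sentences, score, logged_terms):
    """Sort best-first; the sort is stable, so ties keep their natural order."""
    return sorted(sentences, key=lambda s: -score(s, logged_terms))


def truncate_for_cache(sentences, keep_fraction):
    """Keep only the leading part of a reordered document (at least one sentence)."""
    keep = max(1, math.ceil(len(sentences) * keep_fraction))
    return sentences[:keep]
```

A smaller `keep_fraction` lets more documents fit in the cache but raises the chance that the sentence a later query needs has been cut, which is the efficiency versus effectiveness trade-off noted above.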
  48. Conclusion
      • We proposed a practical document storage system for snippet extraction (CTS)
      • Compared to the baseline, CTS halves the in-memory processing time needed to generate a snippet
      • Using a document cache, we have shown that 80% of seeks can also be averted by caching only 1% of the collection
      • Caching documents can be further enhanced by retaining only the important parts of a document through sentence reordering
  49. Questions