5. Research question
Which fast strategies can we use to generate snippets for web search results?
21/06/2010 UCSP -FASH 5
7. Rationale
• Two main reasons:
– Snippet extraction is an integral part of the query evaluation process, and speeding it up reduces the overall time (and resources) required to process a query
– No prior literature discusses how to generate snippets efficiently
12. [Diagram, snippet generation pipeline: for the query "sigir", search identifies relevant documents; sentences are stripped from each document into a bag of sentences; statistics are collected on the sentences; a sentence ranker orders them; 2-3 sentences are picked per document to generate the result page.]
13. Sentence ranking
• A-priori features (computed without queries), ai
– sentence position (titles, leading sentences)
– term/sentence significance
• Query-time features (computed with queries)
– query term count, ci
– unique query term count, ui
– query term proximity, li
• Using all the above features, sentence i can be ranked using some function f(ci, ui, li, ai)
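The slides leave the combining function f unspecified. As a rough illustration only, here is a minimal Python sketch that computes ci, ui and li for one sentence and folds in the a-priori score ai with a weighted sum; the weights and the particular proximity measure are assumptions, not the authors' actual choices.

```python
# Illustrative sketch of f(ci, ui, li, ai); the weighted sum and the
# proximity formula are assumptions, not the method from the slides.

def score_sentence(words, query_terms, a_i, weights=(1.0, 2.0, 1.0, 0.5)):
    """Score one sentence (a list of tokens) against a set of query terms.
    a_i is the precomputed query-independent (a-priori) score."""
    positions = [p for p, w in enumerate(words) if w in query_terms]
    c_i = len(positions)                                # query term count
    u_i = len({w for w in words if w in query_terms})   # unique query terms
    # proximity: larger when matching terms sit close together
    l_i = 1.0 / (positions[-1] - positions[0] + 1) if len(positions) > 1 else 0.0
    wc, wu, wl, wa = weights
    return wc * c_i + wu * u_i + wl * l_i + wa * a_i

def top_sentences(scored_sentences, query_terms, k=3):
    """scored_sentences: list of (words, a_i) pairs.
    Return the word lists of the k highest-scoring sentences."""
    ranked = sorted(scored_sentences,
                    key=lambda p: score_sentence(p[0], query_terms, p[1]),
                    reverse=True)
    return [words for words, _ in ranked[:k]]
```

A sentence mentioning two distinct query terms close together will outrank one that repeats a single term, which matches the intent of counting ui and li separately from ci.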
17. Baseline
[Diagram, indexing time: HTML is stripped out, an <eos> marker is added at the end of each sentence, and the plain text is compressed using gzip.
Query time: for each document in the results list, the stored text is decompressed, a string matcher (full word matching) locates query terms, sentences are scored with f(ci, ui, li, ai), and two/three sentences are returned as the snippet.]
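The baseline pipeline can be sketched in a few lines of Python: strip tags, insert an end-of-sentence marker, compress; at query time, decompress and do full-word matching. This is an illustrative sketch only, with zlib standing in for gzip and deliberately naive tag stripping and sentence splitting.

```python
import re
import zlib

EOS = "<eos>"  # end-of-sentence marker inserted at indexing time

def index_document(html):
    """Baseline indexing: strip HTML tags, mark sentence ends, compress."""
    text = re.sub(r"<[^>]+>", " ", html)              # crude tag stripping
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"(?<=[.!?]) ", f" {EOS} ", text)   # naive sentence marking
    return zlib.compress(text.encode("utf-8"))

def snippet(compressed, query_terms, k=2):
    """Baseline query time: decompress, then keep the first k sentences
    containing any query term (full-word string matching)."""
    text = zlib.decompress(compressed).decode("utf-8")
    sentences = [s.strip() for s in text.split(EOS)]
    hits = [s for s in sentences
            if query_terms & set(re.findall(r"\w+", s.lower()))]
    return hits[:k]
```

The cost structure the later slides measure is visible here: every query pays for a full decompression and a text scan before any sentence can be scored.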
21. Compressed Token System (CTS)
[Diagram, indexing time: HTML is stripped out; pass 1 builds a lexicon mapping terms to integers (the → 1, of → 2, in → 3, …); pass 2 writes all documents into a single file of integer tokens; an offset map is required to tell where a document starts.]
• Terms in the lexicon are replaced with an integer
• Terms not in the lexicon are spelt out as ESC-length-word, e.g. ESC-7-britney
• The end of each sentence is marked
• The token stream is compressed using an integer compression scheme (vbyte)
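As an illustration of the two CTS building blocks named above, here is a Python sketch of vbyte coding (7 data bits per byte with a stop bit, one common vbyte convention) and of the ESC-length-word spelling for out-of-lexicon terms. The exact byte layout and the choice of 0 as the ESC id are assumptions; only the general scheme comes from the slides.

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 data bits per byte,
    high bit set on the final byte of each integer."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80          # stop bit on the last byte
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    out, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:               # final byte of this integer
            out.append(n)
            n = 0
    return out

ESC = 0  # reserved id for out-of-lexicon words (assumption)

def tokenize(words, lexicon):
    """Map words to lexicon ids; spell unknown words out as ESC, length,
    then one integer per character (an illustrative ESC-length-word)."""
    ids = []
    for w in words:
        if w in lexicon:
            ids.append(lexicon[w])
        else:
            ids.extend([ESC, len(w)])
            ids.extend(ord(c) for c in w)
    return ids
```

Small ids for frequent terms like "the" compress to a single byte, which is where the space savings over gzip-on-text come from.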
29. Compressed Token System (CTS)
[Diagram, query time: the query is converted to integer tokens via the vocabulary (e.g. 1 33 57); the offset mapping locates each document of the results list in the compressed collection; the integer documents are scanned by an integer + ESC-sequence matcher and sentences are scored with f(ci, ui, li, ai); the chosen two/three sentences are converted back to text.]
– Uses compressed integer matching
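Query-time matching can thus operate directly on integer tokens, converting nothing back to text until the chosen sentences are rendered. A minimal sketch of that idea, assuming a reserved sentence-marker id (here -1 for readability; a real system would reserve a small positive integer):

```python
EOS_ID = -1  # sentence-boundary token id (assumption)

def split_sentences(int_doc):
    """Split an integer-token document on the sentence marker."""
    sent, out = [], []
    for tok in int_doc:
        if tok == EOS_ID:
            if sent:
                out.append(sent)
            sent = []
        else:
            sent.append(tok)
    if sent:
        out.append(sent)
    return out

def best_sentences(int_doc, query_ids, k=2):
    """Rank sentences by query-token occurrences, working entirely on
    integers; only the winners would be converted back to text."""
    q = set(query_ids)
    sents = split_sentences(int_doc)
    sents.sort(key=lambda s: sum(1 for t in s if t in q), reverse=True)
    return sents[:k]
```

Comparing small integers is cheaper than string comparison, which is one reason the in-memory processing time drops relative to the baseline.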
30. Data set and results
- We used the TREC WT10g and WT100g collections
- WT50g is a 50 GB collection randomly sampled from WT100g
[Bar chart, compression effectiveness: compressed size as a percentage of the full collection size (axis 0-30%) for the Baseline and CTS on WT10g, WT50g and WT100g.]
33. Efficiency comparison
- Generated snippets for 10,000 queries using CTS and the baseline (top 10 docs per query)
- Measured snippet generation time for each document
- A caching effect was noticed for the first few queries, so we take the average over the last 7,000 queries
[Bar chart, time per document (msec), each bar split into disk access (seek) and in-memory processing; axis marks at 4.5, 7 and 16 ms, with CTS well ahead of the Baseline.]
So can we get away with performing no disk access?
36. Document caching
- With CTS, the average doc size drops from 5.7 KB to 1.2 KB (compression + no HTML)
- WT100g has around 18.6 million documents
- The snippet machine has 4 GB of RAM:
  - 1 GB is used by the lexicon and document offset mapping
  - 2-3 GB can be dedicated to caching
- In theory, using WT100g, we should be able to cache over 250 k docs in memory
  - This is 5-7% of the collection size
- But, in reality, how much disk access would we actually save?
  - We simulate this by caching the top 20 documents for > 500 k queries
  - Simulation lets us control memory usage and count exact hits and misses
37. Caching simulation
- We processed > 500 k queries and cached the top 20 documents for each query
- Each document's score is half Okapi BM25 and half a query-independent score (similar in effect to PageRank)
- We used two cache eviction policies:
  - Static: once the cache is populated and full, no entries are evicted
  - LRU (least recently used): once the cache is full, documents are evicted based on the recency of their access
- What is Q ?
  - Search engines cache the results of the most popular queries
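The LRU policy above is straightforward to simulate; this Python sketch replays a stream of document requests through a bounded LRU cache and reports the hit rate. The framework is illustrative, not the authors' simulation code.

```python
from collections import OrderedDict

def simulate_lru(requests, capacity):
    """Replay document-id requests through an LRU cache of the given
    capacity (number of documents) and return the hit rate."""
    cache = OrderedDict()
    hits = total = 0
    for doc in requests:
        total += 1
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)          # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[doc] = True
    return hits / total if total else 0.0
```

Because web queries are heavily skewed toward popular documents, even a small capacity captures most requests, which is what the results on the next slide show.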
38. Caching simulation (results)
[Plot: percentage of document requests that hit the cache vs. cache size as a percentage of the collection (WT100g).]
• A cache of 1% of the collection yields 80% hits, and caching 3% accounts for more than 97% of hits
39. Caching simulation
[Bar chart, time per document (msec), split into seek and process time: Baseline ~16 ms, CTS ~7 ms, CTS + caching ~3.4 ms.]
How can we further enhance doc cache performance?
– Smaller cache entries mean more documents can fit in cache, so do we need to keep entire documents in cache? Perhaps not
40. Sentence reordering
Captain Feathersword is the friendliest Pirate on the open seas. He loves a good party, and making people giggle. It's lucky that he has a feather for a sword, which he can use to tickle everyone. Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
Blue terms = significant
Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
43. Sentence reordering
1. Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
2. It's lucky that he has a feather for a sword, which he can use to tickle everyone.
[Cut-off: keep the sentences above this line in cache]
3. You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
4. Captain Feathersword is the friendliest Pirate on the open seas.
5. He loves a good party, and making people giggle.
Blue terms = significant
Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
46. Sentence reordering methods
• Significant sentence (ST): sentences that contain significant terms are good
• Natural order: the original order of sentences in a document
• Query log (QLt): sentences that contain previously queried terms
• Query log (QLu): same as QLt, but repeated query terms in a sentence are counted only once
• But where do we draw the cut-off line?
– A trade-off between efficiency gains (more documents in cache) and effectiveness loss
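The cut-off idea above can be sketched as: sort a document's sentences by a precomputed importance score (from any of the reordering methods) and cache only the prefix that fits a size budget. The character-count budget and the score input are assumptions chosen for illustration.

```python
def reorder_for_cache(sentences, scores, budget_chars):
    """Rank sentences by importance score and keep the prefix that fits
    the cache budget; everything past the cut-off is dropped from the
    cached copy of the document."""
    ranked = sorted(zip(scores, sentences), key=lambda p: p[0], reverse=True)
    kept, used = [], 0
    for score, sent in ranked:
        if used + len(sent) > budget_chars:
            break                # cut-off reached: stop caching sentences
        kept.append(sent)
        used += len(sent)
    return kept
```

Shrinking each cache entry this way lets more documents fit in the same memory, at the cost of occasionally discarding a sentence the ranker would have chosen at query time.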
48. Conclusion
• We proposed a practical document storage scheme for snippet extraction systems (CTS)
• Compared to the defined baseline, CTS halves the in-memory processing time needed to generate a snippet
• Using a document cache, we showed that 80% of seeks can be averted by caching only 1% of the collection size
• Document caching can be further enhanced by retaining only the important parts of a document through sentence reordering