5. Research question
Which fast strategies can we use to generate snippets for web search results?
21/06/2010 UCSP -FASH 5
7. Rationale
• Two main reasons:
– Snippet extraction is an integral part of the query evaluation process, and speeding it up reduces the overall time (and resources) required to process a query
– No prior literature discusses how to generate snippets efficiently
12. [Diagram, snippet generation pipeline: for the query "sigir", search identifies relevant documents; sentences are stripped from each document into a bag of sentences; statistics are collected on the sentences; a sentence ranker orders them; 2-3 sentences are picked per document to generate the result page.]
13. Sentence ranking
• A-priori features (computed without queries), ai
– sentence position (titles, leading sentences)
– term/sentence significance
• Query-time features (computed with queries)
– query term count, ci
– unique query term count, ui
– query term proximity, li
• Using all the above features, sentence i can be ranked using some function f(ci, ui, li, ai)
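The slides leave the combining function f unspecified. As a rough illustration only, here is a minimal Python sketch that computes ci, ui and li for one sentence and folds in the a-priori score ai with a weighted sum; the weights and the particular proximity measure are assumptions, not the authors' actual choices.

```python
# Illustrative sketch of f(ci, ui, li, ai); the weighted sum and the
# proximity formula are assumptions, not the method from the slides.

def score_sentence(words, query_terms, a_i, weights=(1.0, 2.0, 1.0, 0.5)):
    """Score one sentence (a list of tokens) against a set of query terms.
    a_i is the precomputed query-independent (a-priori) score."""
    positions = [p for p, w in enumerate(words) if w in query_terms]
    c_i = len(positions)                                # query term count
    u_i = len({w for w in words if w in query_terms})   # unique query terms
    # proximity: larger when matching terms sit close together
    l_i = 1.0 / (positions[-1] - positions[0] + 1) if len(positions) > 1 else 0.0
    wc, wu, wl, wa = weights
    return wc * c_i + wu * u_i + wl * l_i + wa * a_i

def top_sentences(scored_sentences, query_terms, k=3):
    """scored_sentences: list of (words, a_i) pairs.
    Return the word lists of the k highest-scoring sentences."""
    ranked = sorted(scored_sentences,
                    key=lambda p: score_sentence(p[0], query_terms, p[1]),
                    reverse=True)
    return [words for words, _ in ranked[:k]]
```

A sentence mentioning two distinct query terms close together will outrank one that repeats a single term, which matches the intent of counting ui and li separately from ci.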
17. Baseline
[Diagram, indexing time: HTML is stripped out, an <eos> marker is added at the end of each sentence, and the plain text is compressed using gzip.
Query time: for each document in the results list, the stored text is decompressed, a string matcher (full word matching) locates query terms, sentences are scored with f(ci, ui, li, ai), and two/three sentences are returned as the snippet.]
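The baseline pipeline can be sketched in a few lines of Python: strip tags, insert an end-of-sentence marker, compress; at query time, decompress and do full-word matching. This is an illustrative sketch only, with zlib standing in for gzip and deliberately naive tag stripping and sentence splitting.

```python
import re
import zlib

EOS = "<eos>"  # end-of-sentence marker inserted at indexing time

def index_document(html):
    """Baseline indexing: strip HTML tags, mark sentence ends, compress."""
    text = re.sub(r"<[^>]+>", " ", html)              # crude tag stripping
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"(?<=[.!?]) ", f" {EOS} ", text)   # naive sentence marking
    return zlib.compress(text.encode("utf-8"))

def snippet(compressed, query_terms, k=2):
    """Baseline query time: decompress, then keep the first k sentences
    containing any query term (full-word string matching)."""
    text = zlib.decompress(compressed).decode("utf-8")
    sentences = [s.strip() for s in text.split(EOS)]
    hits = [s for s in sentences
            if query_terms & set(re.findall(r"\w+", s.lower()))]
    return hits[:k]
```

The cost structure the later slides measure is visible here: every query pays for a full decompression and a text scan before any sentence can be scored.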
21. Compressed Token System (CTS)
[Diagram, indexing time: HTML is stripped out; pass 1 builds a lexicon mapping terms to integers (the → 1, of → 2, in → 3, …); pass 2 writes all documents into a single file of integer tokens; an offset map is required to tell where a document starts.]
• Terms in the lexicon are replaced with an integer
• Terms not in the lexicon are spelt out as ESC-length-word, e.g. ESC-7-britney
• The end of each sentence is marked
• The token stream is compressed using an integer compression scheme (vbyte)
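As an illustration of the two CTS building blocks named above, here is a Python sketch of vbyte coding (7 data bits per byte with a stop bit, one common vbyte convention) and of the ESC-length-word spelling for out-of-lexicon terms. The exact byte layout and the choice of 0 as the ESC id are assumptions; only the general scheme comes from the slides.

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 data bits per byte,
    high bit set on the final byte of each integer."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80          # stop bit on the last byte
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    out, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:               # final byte of this integer
            out.append(n)
            n = 0
    return out

ESC = 0  # reserved id for out-of-lexicon words (assumption)

def tokenize(words, lexicon):
    """Map words to lexicon ids; spell unknown words out as ESC, length,
    then one integer per character (an illustrative ESC-length-word)."""
    ids = []
    for w in words:
        if w in lexicon:
            ids.append(lexicon[w])
        else:
            ids.extend([ESC, len(w)])
            ids.extend(ord(c) for c in w)
    return ids
```

Small ids for frequent terms like "the" compress to a single byte, which is where the space savings over gzip-on-text come from.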
29. Compressed Token System (CTS)
[Diagram, query time: the query is converted to integer tokens via the vocabulary (e.g. 1 33 57); the offset mapping locates each document of the results list in the compressed collection; the integer documents are scanned by an integer + ESC-sequence matcher and sentences are scored with f(ci, ui, li, ai); the chosen two/three sentences are converted back to text.]
– Uses compressed integer matching
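Query-time matching can thus operate directly on integer tokens, converting nothing back to text until the chosen sentences are rendered. A minimal sketch of that idea, assuming a reserved sentence-marker id (here -1 for readability; a real system would reserve a small positive integer):

```python
EOS_ID = -1  # sentence-boundary token id (assumption)

def split_sentences(int_doc):
    """Split an integer-token document on the sentence marker."""
    sent, out = [], []
    for tok in int_doc:
        if tok == EOS_ID:
            if sent:
                out.append(sent)
            sent = []
        else:
            sent.append(tok)
    if sent:
        out.append(sent)
    return out

def best_sentences(int_doc, query_ids, k=2):
    """Rank sentences by query-token occurrences, working entirely on
    integers; only the winners would be converted back to text."""
    q = set(query_ids)
    sents = split_sentences(int_doc)
    sents.sort(key=lambda s: sum(1 for t in s if t in q), reverse=True)
    return sents[:k]
```

Comparing small integers is cheaper than string comparison, which is one reason the in-memory processing time drops relative to the baseline.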
30. Data set and results
- We used the TREC WT10g and WT100g collections
- WT50g is a 50 GB collection randomly sampled from WT100g
[Bar chart, compression effectiveness: compressed size as a percentage of the full collection size (axis 0-30%) for the Baseline and CTS on WT10g, WT50g and WT100g.]
33. Efficiency comparison
- Generated snippets for 10,000 queries using CTS and the baseline (top 10 docs per query)
- Measured snippet generation time for each document
- A caching effect was noticed for the first few queries, so we take the average over the last 7,000 queries
[Bar chart, time per document (msec), each bar split into disk access (seek) and in-memory processing; axis marks at 4.5, 7 and 16 ms, with CTS well ahead of the Baseline.]
So can we get away with performing no disk access?
36. Document caching
- With CTS, the average doc size drops from 5.7 KB to 1.2 KB (compression + no HTML)
- WT100g has around 18.6 million documents
- The snippet machine has 4 GB of RAM:
  - 1 GB is used by the lexicon and document offset mapping
  - 2-3 GB can be dedicated to caching
- In theory, using WT100g, we should be able to cache over 250 k docs in memory
  - This is 5-7% of the collection size
- But, in reality, how much disk access would we actually save?
  - We simulate this by caching the top 20 documents for > 500 k queries
  - Simulation lets us control memory usage and count exact hits and misses
37. Caching simulation
- We processed > 500 k queries and cached the top 20 documents for each query
- Each document's score is half Okapi BM25 and half a query-independent score (similar in effect to PageRank)
- We used two cache eviction policies:
  - Static: once the cache is populated and full, no entries are evicted
  - LRU (least recently used): once the cache is full, documents are evicted based on the recency of their access
- What is Q ?
  - Search engines cache the results of the most popular queries
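The LRU policy above is straightforward to simulate; this Python sketch replays a stream of document requests through a bounded LRU cache and reports the hit rate. The framework is illustrative, not the authors' simulation code.

```python
from collections import OrderedDict

def simulate_lru(requests, capacity):
    """Replay document-id requests through an LRU cache of the given
    capacity (number of documents) and return the hit rate."""
    cache = OrderedDict()
    hits = total = 0
    for doc in requests:
        total += 1
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)          # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[doc] = True
    return hits / total if total else 0.0
```

Because web queries are heavily skewed toward popular documents, even a small capacity captures most requests, which is what the results on the next slide show.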
38. Caching simulation (results)
[Plot: percentage of document requests that hit the cache vs. cache size as a percentage of the collection (WT100g).]
• A cache of 1% of the collection yields 80% hits, and caching 3% accounts for more than 97% of hits
39. Caching simulation
[Bar chart, time per document (msec), split into seek and process time: Baseline ~16 ms, CTS ~7 ms, CTS + caching ~3.4 ms.]
How can we further enhance doc cache performance?
– Smaller cache entries mean more documents can fit in cache, so do we need to keep entire documents in cache? Perhaps not
40. Sentence reordering
Captain Feathersword is the friendliest Pirate on the open seas. He loves a good party, and making people giggle. It's lucky that he has a feather for a sword, which he can use to tickle everyone. Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
Blue terms = significant
Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
43. Sentence reordering
1. Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
2. It's lucky that he has a feather for a sword, which he can use to tickle everyone.
[Cut-off: keep the sentences above this line in cache]
3. You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
4. Captain Feathersword is the friendliest Pirate on the open seas.
5. He loves a good party, and making people giggle.
Blue terms = significant
Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
46. Sentence reordering methods
• Significant sentence (ST): sentences that contain significant terms are good
• Natural order: the original order of sentences in a document
• Query log (QLt): sentences that contain previously queried terms
• Query log (QLu): same as QLt, but repeated query terms in a sentence are counted only once
• But where do we draw the cut-off line?
– A trade-off between efficiency gains (more documents in cache) and effectiveness loss
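The cut-off idea above can be sketched as: sort a document's sentences by a precomputed importance score (from any of the reordering methods) and cache only the prefix that fits a size budget. The character-count budget and the score input are assumptions chosen for illustration.

```python
def reorder_for_cache(sentences, scores, budget_chars):
    """Rank sentences by importance score and keep the prefix that fits
    the cache budget; everything past the cut-off is dropped from the
    cached copy of the document."""
    ranked = sorted(zip(scores, sentences), key=lambda p: p[0], reverse=True)
    kept, used = [], 0
    for score, sent in ranked:
        if used + len(sent) > budget_chars:
            break                # cut-off reached: stop caching sentences
        kept.append(sent)
        used += len(sent)
    return kept
```

Shrinking each cache entry this way lets more documents fit in the same memory, at the cost of occasionally discarding a sentence the ranker would have chosen at query time.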
48. Conclusion
• We proposed a practical document storage scheme for snippet extraction systems (CTS)
• Compared to the defined baseline, CTS halves the in-memory processing time needed to generate a snippet
• Using a document cache, we showed that 80% of seeks can be averted by caching only 1% of the collection size
• Document caching can be further enhanced by retaining only the important parts of a document through sentence reordering