Low-cost Management of Inverted
Files for Online Full-Text Search
Giorgos Margaritis , Stergios V. Anastasiadis
Systems Re...
Introduction
• Prices of hard disks drop
 Amount of stored data increases
 Need for automated search
• Full-text search
...
Inverted Index
• For each term store an inverted (or posting) list
 list of pointers to documents containing term
 point...
Build time vs. Search time Trade-off
• Contiguous disk storage of inverted lists:
 improves access efficiency
 needs com...
Inverted List Contiguity
• Recent methods
 Maintain a (relatively small) number of partial indexes
 Retrieve an inverted...
Inverted List Contiguity
• Hybrid Logarithmic Merge [Büttcher et al., 2006]
 Create inverted index for 426GB
November 200...
Design Objectives
• 1. Keep small lists contiguous
• 2. Keep large lists contiguous, or in few large fragments
• 3. Keep b...
Outline
• Motivation
• Background
• Proposed Algorithm
• Experiments
• Conclusions
Giorgos Margaritis - University of Ioan...
<it, 10, 1>
<was, 10, 2>
<a, 10, 3>
<dragon, 10, 4>
Creating an Inverted Index
November 2009 Giorgos Margaritis - Universi...
Postings Flushing: Re-Merge
• Re-Merge [Lester et al., 2004]
 Inverted lists stored in lexicographic order
 For each lis...
Postings Flushing: In-Place Update
• In-Place Update [Tomasic et al., 1994]
 Create list: pre-allocate some empty space a...
Postings Flushing: Hybrid Merge
• Hybrid Merge [Büttcher et al., 2008]
 Categorizes terms into short (rare) and long (fre...
Outline
• Motivation
• Background
• Proposed Algorithm
• Experiments
• Conclusions
Giorgos Margaritis - University of Ioan...
System Architecture
• Organize disk index in fixed-size blocks
• Categorize terms in short or long, based on their list si...
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
new postings
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
"a" – "lost" ...
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
"a" – "lost" ...
Parameters
• Preference Factor (F)
 Prefer to flush long term T instead of range R:
 flush long term: cheap – single dis...
"Selective Range Flush" Algorithm
November 2009 Giorgos Margaritis - University of Ioannina, Greece
1. Parse document into...
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
"a" – "lost" ...
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - Univ...
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - Univ...
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - Univ...
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - Univ...
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - Univ...
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - Univ...
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - Univ...
Block Overflow
Block of range R overflows:
 allocate block
 move half of postings to new block
 split R according to th...
Innovative Work
• Simplify disk management by splitting postings into disk blocks
 Relax requirement of full contiguity f...
Outline
• Motivation
• Background
• Proposed Algorithm
• Experiments
• Conclusions
Giorgos Margaritis - University of Ioan...
Experiments
• Experimental environment
 Nodes: quad-core at 2.33Ghz, 3GB memory, 2 SATA disks (500GB each)
 Text collect...
Parameters
Parameter Default value
Block size 8MB
Postings memory 1GB
Flush memory 20MB
Preference factor 3
Long list thre...
List Retrieval
• Proteus short list retrieval times similar to Hybrid Merge
 both systems keep short lists contiguous
• P...
Inverted Index Construction
• Proteus:
 30% reduction in build time comparing to H.M.
• Proteus CNT:
 3% more time (14 m...
• Increase block size
 increase range flushing time:
 more postings per range block
 more bytes transferred between mem...
• Small values:
 don’t create sufficient space to accumulate new postings
• Large values:
 flush ranges and long terms w...
• Small values:
 don’t create sufficient space to accumulate new postings
• Large values:
 flush ranges and long terms w...
• Small values:
 underestimate cost of flushing ranges
• Large values:
 underestimate cost of flushing long terms
Prefer...
• Small values:
 underestimate cost of flushing ranges
• Large values:
 underestimate cost of flushing long terms
Prefer...
Conclusions & Future work
• Fragmentation can harm retrieval times for short lists
• Proteus:
 manage inverted index in b...
Upcoming SlideShare
Loading in …5
×

Low-cost Management of Inverted Files for Online Full-Text Search

460 views
409 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
460
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Low-cost Management of Inverted Files for Online Full-Text Search

  1. 1. Low-cost Management of Inverted Files for Online Full-Text Search Giorgos Margaritis , Stergios V. Anastasiadis Systems Research Group, University of Ioannina, Greece CIKM ‘09, Hong Kong, November 2 - 6, 2009
  2. 2. Introduction • Prices of hard disks drop  Amount of stored data increases  Need for automated search • Full-text search  User: defines terms and operands (e.g. logical)  System: finds documents that contain terms and satisfy operands  usually sorts returned documents by "relevance" • Online full-text search  Index updates concurrent with document searches  Need to constantly maintain index up-to-date Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  3. 3. Inverted Index • For each term store an inverted (or posting) list  list of pointers to documents containing term  pointers specify exact position in document (e.g. <doc, offset>)  each pointer is called posting • Lexicon  associates terms to their inverted lists November 2009 Giorgos Margaritis - University of Ioannina, Greece Document 2Document 1 and dark did had …….. where Lexicon Inverted List . . .
  4. 4. Build time vs. Search time Trade-off • Contiguous disk storage of inverted lists:  improves access efficiency  needs complex dynamic storage management  requires frequent or bulky relocations • List fragmentation  improves build time  degrades search time Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  5. 5. Inverted List Contiguity • Recent methods  Maintain a (relatively small) number of partial indexes  Retrieve an inverted list: retrieve its fragments from all partial indexes • Majority of inverted lists  Disk size of hundreds of kilobytes • Fragmentation can really harm retrieval times of small lists  More than 99% of terms have list sizes < 10KB (426GB text collection)  Typical disk: 12ms positional time + 0.1ms to sequentially read 10KBb  N fragments N positional times N x (contiguous list retrieval time) Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  6. 6. Inverted List Contiguity • Hybrid Logarithmic Merge [Büttcher et al., 2006]  Create inverted index for 426GB November 2009 Giorgos Margaritis - University of Ioannina, Greece 0 1 2 3 4 5 6 7 0 19 39 58 77 97 116 136 155 174 194 213 232 252 271 290 310 329 349 368 387 407 426 Gigabytes parsed #Partialindexes
  7. 7. Design Objectives • 1. Keep small lists contiguous • 2. Keep large lists contiguous, or in few large fragments • 3. Keep build time low • 4. Retrieve lists fast November 2009 Giorgos Margaritis - University of Ioannina, Greece
  8. 8. Outline • Motivation • Background • Proposed Algorithm • Experiments • Conclusions Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  9. 9. <it, 10, 1> <was, 10, 2> <a, 10, 3> <dragon, 10, 4> Creating an Inverted Index November 2009 Giorgos Margaritis - University of Ioannina, Greece <9:2,7> <10:2>Parser doc: 10 It was a dragon. For each document: 1. Parse document into postings 2. Accumulate postings into memory 3. If memory full, flush postings to disk, updating inverted lists was: <3:8> <5:2> <9:2,7> <10:2> was memory full
  10. 10. Postings Flushing: Re-Merge • Re-Merge [Lester et al., 2004]  Inverted lists stored in lexicographic order  For each list: 1. Read in memory, 2. Add new postings, 3. Write to disk • Cost  Efficient for lists with few postings (sequential read)  Retrieve ALL lists from disk and write them ALL back to disk November 2009 Giorgos Margaritis - University of Ioannina, Greece Memory Disk
  11. 11. Postings Flushing: In-Place Update • In-Place Update [Tomasic et al., 1994]  Create list: pre-allocate some empty space after each it  For each list: append new postings on disk. If not enough space, relocate list in position with sufficient space • Cost  Efficient for lists with lots of postings (append instead of read + write)  Lists can't be kept in lexicographical order random seeks for each list November 2009 Giorgos Margaritis - University of Ioannina, Greece Memory Disk
  12. 12. Postings Flushing: Hybrid Merge • Hybrid Merge [Büttcher et al., 2008]  Categorizes terms into short (rare) and long (frequent), based on their inverted lists length  Use Re-Merge method for short terms and In-Place Update for long • Combines the benefits of the two methods  Re-Merge: proper for updating a big number of short lists  In-Place Update: proper for updating a small number of long lists • Two problems  Cost of reading and writing back all short lists for Re-Merge  Cost of relocations for In-Place Update November 2009 Giorgos Margaritis - University of Ioannina, Greece
  13. 13. Outline • Motivation • Background • Proposed Algorithm • Experiments • Conclusions Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  14. 14. System Architecture • Organize disk index in fixed-size blocks • Categorize terms in short or long, based on their list size • Group short terms into disjoint lexicographical ranges  Lists of a range must fit into one block • A block may contain:  a group of short lists, updated using "read lists, add postings, write lists"  postings of a long list, updated using appends • When memory is full:  selectively flush some postings from memory to disk  update only those blocks that can be efficiently updated November 2009 Giorgos Margaritis - University of Ioannina, Greece
  15. 15. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK new postings
  16. 16. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable new postings
  17. 17. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable new postings
  18. 18. Parameters • Preference Factor (F)  Prefer to flush long term T instead of range R:  flush long term: cheap – single disk append  flush range: costly – read all lists, add postings, write all lists  Caveat: if T has too few memory postings does not worth flushing  if T has F times less postings than R, flush R • Flush Memory (M)  Flush only a small portion of memory (e.g. M = 20MB)  Update only blocks that can be efficiently updated  Free some memory space November 2009 Giorgos Margaritis - University of Ioannina, Greece
  19. 19. "Selective Range Flush" Algorithm November 2009 Giorgos Margaritis - University of Ioannina, Greece 1. Parse document into postings 2. When memory is full: While ( size of flushed postings < M ) R := range with max postings in memory ; T := long term with max postings in memory ; If ( R.mem_postings < F  T.mem_postings ) then flush T ; /* append new postings on disk */ Else flush R ; /* read lists from block, add new postings, write lists back to block */ 3. Return to parsing phase
  20. 20. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable new postings F = 2 M = 15
  21. 21. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists R T MEMORY DISK new postings F = 2 M = 15
  22. 22. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists R T MEMORY DISK new postings F = 2 M = 15 postings of R < F  postings of T
  23. 23. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists R MEMORY DISK T F = 2 M = 15
  24. 24. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15
  25. 25. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15 postings of R > F  postings of T
  26. 26. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15
  27. 27. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15 we have flushed M postings
  28. 28. Block Overflow Block of range R overflows:  allocate block  move half of postings to new block  split R according to the lists each block contains • Last block of list of long term T overflows:  allocate block  move overflown postings to new block November 2009 Giorgos Margaritis - University of Ioannina, Greece "cat" – "dog" "cat" – "day" "daze" - dog"
  29. 29. Innovative Work • Simplify disk management by splitting postings into disk blocks  Relax requirement of full contiguity for long terms  Maintain requirement of full contiguity for each short term • Treat short terms in ranges for efficiency • Partially flush both short and long terms when memory full • Dynamically decide whether to flush long term or range November 2009 Giorgos Margaritis - University of Ioannina, Greece
  30. 30. Outline • Motivation • Background • Proposed Algorithm • Experiments • Conclusions Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  31. 31. Experiments • Experimental environment  Nodes: quad-core at 2.33Ghz, 3GB memory, 2 SATA disks (500GB each)  Text collection: 426GB GOV2 TREC Terabyte track  Search queries: Efficiency Topics query set of TREC 2005 Terabyte Track • Proteus: Selective Range Flush prototype implementation  Parser from Zettair search engine (RMIT University, Australia) • Compare with Hybrid Merge  Wumpus search engine (U. Waterloo, Canada) November 2009 Giorgos Margaritis - University of Ioannina, Greece
  32. 32. Parameters Parameter Default value Block size 8MB Postings memory 1GB Flush memory 20MB Preference factor 3 Long list threshold 1MB November 2009 Giorgos Margaritis - University of Ioannina, Greece
  33. 33. List Retrieval • Proteus short list retrieval times similar to Hybrid Merge  both systems keep short lists contiguous • Proteus long list retrieval times 2-3 times lower than Hybrid Merge  due to Wumpus implementation, not H.M. algorithm • Proteus "CNT" variant:  keeps long lists contiguous using relocations November 2009 Giorgos Margaritis - University of Ioannina, Greece 22,2 22,9 0 10 20 30 40 50 H.M. Wumpus Proteus milliseconds Short Terms: Average Retrieval Time 416 145 126 0 100 200 300 400 500 H.M. Wumpus Proteus Proteus CNT milliseconds Long Terms: Average Retrieval Time
  34. 34. Inverted Index Construction • Proteus:  30% reduction in build time comparing to H.M. • Proteus CNT:  3% more time (14 min) to keep all lists contiguous November 2009 Giorgos Margaritis - University of Ioannina, Greece 730 514 528 0 100 200 300 400 500 600 700 800 H.M. Wumpus Proteus Proteus CNT minutes Index build time
  35. 35. • Increase block size  increase range flushing time:  more postings per range block  more bytes transferred between memory and disk per flush  decrease long term flushing time:  same amount of bytes flushed to disk, fewer appends Sensitivity Analysis: Block size November 2009 Giorgos Margaritis - University of Ioannina, Greece Block size 0 200 400 600 800 2MB 4MB 8MB 16MB 32MB minutes Index Build Flush ranges Flush long terms Document parsing
  36. 36. • Small values:  don’t create sufficient space to accumulate new postings • Large values:  flush ranges and long terms with very few postings   disk head movement overhead Flush memory November 2009 Giorgos Margaritis - University of Ioannina, Greece Flush Memory 0 100 200 300 400 500 600 2MB 5MB 10MB 20MB 40MB 80MB minutes Index Build Flush ranges Flush long terms Document parsing
  37. 37. • Small values:  don’t create sufficient space to accumulate new postings • Large values:  flush ranges and long terms with very few postings   disk head movement overhead Flush memory November 2009 Giorgos Margaritis - University of Ioannina, Greece Flush Memory 0 100 200 300 400 500 600 2MB 5MB 10MB 20MB 40MB 80MB minutes Index Build Flush ranges Flush long terms Document parsing "Flush only a small percentage (2% - 3%) of total memory ..."
  38. 38. • Small values:  underestimate cost of flushing ranges • Large values:  underestimate cost of flushing long terms Preference Factor November 2009 Giorgos Margaritis - University of Ioannina, Greece 0 100 200 300 400 500 600 1 2 3 4 8 16 minutes Index Build Flush ranges Flush long terms Document parsing Preference Factor
  39. 39. • Small values:  underestimate cost of flushing ranges • Large values:  underestimate cost of flushing long terms Preference Factor November 2009 Giorgos Margaritis - University of Ioannina, Greece 0 100 200 300 400 500 600 1 2 3 4 8 16 minutes Index Build Flush ranges Flush long terms Document parsing Preference Factor "Postpone flushing of ranges, but not too much …"
  40. 40. Conclusions & Future work • Fragmentation can harm retrieval times for short lists • Proteus:  manage inverted index in blocks  keep short lists contiguous, long lists in large fragments  memory full: selectively flush some postings  update only blocks that can be efficiently updated  30% reduction in build time  constant 15% increase in retrieval times of lists > 8MB  3% increase in build time to keep all lists contiguous  relatively insensitive to parameter values • Future work:  different block sizes for short and long terms, auto adjust parameter values based on dataset / hardware, different cost models for flushing Giorgos Margaritis - University of Ioannina, GreeceNovember 2009

×