Loading…

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

Like this presentation? Why not share!

Low-cost Management of Inverted Files for Online Full-Text Search

on

  • 452 views

 

Statistics

Views

Total Views
452
Views on SlideShare
452
Embed Views
0

Actions

Likes
0
Downloads
0
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Low-cost Management of Inverted Files for Online Full-Text Search Low-cost Management of Inverted Files for Online Full-Text Search Presentation Transcript

  • Low-cost Management of Inverted Files for Online Full-Text Search Giorgos Margaritis , Stergios V. Anastasiadis Systems Research Group, University of Ioannina, Greece CIKM ‘09, Hong Kong, November 2 - 6, 2009
  • Introduction• Prices of hard disks drop  Amount of stored data increases  Need for automated search• Full-text search  User: defines terms and operands (e.g. logical)  System: finds documents that contain terms and satisfy operands  usually sorts returned documents by "relevance"• Online full-text search  Index updates concurrent with document searches  Need to constantly maintain index up-to-dateNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Inverted Index• For each term store an inverted (or posting) list  list of pointers to documents containing term  pointers specify exact position in document (e.g. <doc, offset>)  each pointer is called posting• Lexicon  associates terms to their inverted lists Lexicon Inverted List and ... dark did had …….. Document 1 Document 2 whereNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Build time vs. Search time Trade-off• Contiguous disk storage of inverted lists:  improves access efficiency  needs complex dynamic storage management  requires frequent or bulky relocations• List fragmentation  improves build time  degrades search timeNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Inverted List Contiguity• Recent methods  Maintain a (relatively small) number of partial indexes  Retrieve an inverted list: retrieve its fragments from all partial indexes• Majority of inverted lists  Disk size of hundreds of kilobytes• Fragmentation can really harm retrieval times of small lists  More than 99% of terms have list sizes < 10KB (426GB text collection)  Typical disk: 12ms positional time + 0.1ms to sequentially read 10KBb  N fragments N positional times N x (contiguous list retrieval time)November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Inverted List Contiguity• Hybrid Logarithmic Merge [Büttcher et al., 2006]  Create inverted index for 426GB 7 6 # Partial indexes 5 4 3 2 1 0 19 39 58 77 97 0 174 116 136 155 194 213 232 252 271 290 310 329 349 368 387 407 426 Gigabytes parsedNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Design Objectives• 1. Keep small lists contiguous• 2. Keep large lists contiguous, or in few large fragments• 3. Keep build time low• 4. Retrieve lists fastNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Outline • Motivation • Background • Proposed Algorithm • Experiments • ConclusionsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Creating an Inverted Index <it, 10, 1>doc: 10It was a <was, 10, 2> Parser was <9:2,7> <10:2>dragon. <a, 10, 3> <dragon, 10, 4> memory full For each document: 1. Parse document into postings 2. Accumulate postings into memory was: <3:8> <5:2> <9:2,7> <10:2> 3. If memory full, flush postings to disk, updating inverted listsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Postings Flushing: Re-Merge• Re-Merge [Lester et al., 2004]  Inverted lists stored in lexicographic order  For each list: 1. Read in memory, 2. Add new postings, 3. Write to disk• Cost  Efficient for lists with few postings (sequential read)  Retrieve ALL lists from disk and write them ALL back to disk Memory DiskNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Postings Flushing: In-Place Update• In-Place Update [Tomasic et al., 1994]  Create list: pre-allocate some empty space after each it  For each list: append new postings on disk. If not enough space, relocate list in position with sufficient space• Cost  Efficient for lists with lots of postings (append instead of read + write)  Lists cant be kept in lexicographical order random seeks for each list Memory DiskNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Postings Flushing: Hybrid Merge• Hybrid Merge [Büttcher et al., 2008]  Categorizes terms into short (rare) and long (frequent), based on their inverted lists length  Use Re-Merge method for short terms and In-Place Update for long• Combines the benefits of the two methods  Re-Merge: proper for updating a big number of short lists  In-Place Update: proper for updating a small number of long lists• Two problems  Cost of reading and writing back all short lists for Re-Merge  Cost of relocations for In-Place UpdateNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Outline • Motivation • Background • Proposed Algorithm • Experiments • ConclusionsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • System Architecture• Organize disk index in fixed-size blocks• Categorize terms in short or long, based on their list size• Group short terms into disjoint lexicographical ranges  Lists of a range must fit into one block• A block may contain:  a group of short lists, updated using "read lists, add postings, write lists"  postings of a long list, updated using appends• When memory is full:  selectively flush some postings from memory to disk  update only those blocks that can be efficiently updatedNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Parameters• Preference Factor (F)  Prefer to flush long term T instead of range R:  flush long term: cheap – single disk append  flush range: costly – read all lists, add postings, write all lists  Caveat: if T has too few memory postings does not worth flushing  if T has F times less postings than R, flush R• Flush Memory (M)  Flush only a small portion of memory (e.g. M = 20MB)  Update only blocks that can be efficiently updated  Free some memory spaceNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • "Selective Range Flush" Algorithm1. Parse document into postings2. When memory is full: While ( size of flushed postings < M ) R := range with max postings in memory ; T := long term with max postings in memory ; If ( R.mem_postings < F  T.mem_postings ) then flush T ; /* append new postings on disk */ Else flush R ; /* read lists from block, add new postings, write lists back to block */3. Return to parsing phaseNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings F=2 M = 15 rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings F=2 M = 15 R T rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY postings of R < F  postings of T new postings F=2 M = 15 R T rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings F=2 M = 15 R T rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings F=2 M = 15 R T rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY postings of R > F  postings of T new postings F=2 M = 15 R T rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY new postings F=2 M = 15 R T rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • ProteusMEMORY we have flushed M postings new postings F=2 M = 15 R T rangetable "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtable on-disk inverted listsDISKNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Block Overflow Block of range R overflows: "cat" – "dog"  allocate block  move half of postings to new block  split R according to the lists each block contains "cat" – "day" "daze" - dog"• Last block of list of long term T overflows:  allocate block  move overflown postings to new blockNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Innovative Work• Simplify disk management by splitting postings into disk blocks  Relax requirement of full contiguity for long terms  Maintain requirement of full contiguity for each short term• Treat short terms in ranges for efficiency• Partially flush both short and long terms when memory full• Dynamically decide whether to flush long term or rangeNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Outline • Motivation • Background • Proposed Algorithm • Experiments • ConclusionsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Experiments• Experimental environment  Nodes: quad-core at 2.33Ghz, 3GB memory, 2 SATA disks (500GB each)  Text collection: 426GB GOV2 TREC Terabyte track  Search queries: Efficiency Topics query set of TREC 2005 Terabyte Track• Proteus: Selective Range Flush prototype implementation  Parser from Zettair search engine (RMIT University, Australia)• Compare with Hybrid Merge  Wumpus search engine (U. Waterloo, Canada)November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Parameters Parameter Default value Block size 8MB Postings memory 1GB Flush memory 20MB Preference factor 3 Long list threshold 1MBNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • List Retrieval Short Terms: Average Retrieval Time Long Terms: Average Retrieval Time 50 500 416 milliseconds 40 400 milliseconds 30 300 22,2 22,9 145 126 200 20 100 10 0 0 H.M. Proteus Proteus CNT H.M. Wumpus Proteus Wumpus• Proteus short list retrieval times similar to Hybrid Merge  both systems keep short lists contiguous• Proteus long list retrieval times 2-3 times lower than Hybrid Merge  due to Wumpus implementation, not H.M. algorithm• Proteus "CNT" variant:  keeps long lists contiguous using relocationsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Inverted Index Construction Index build time 800 730 700 600 514 528 500 minutes 400 300 200 100 0 H.M. Wumpus Proteus Proteus CNT• Proteus:  30% reduction in build time comparing to H.M.• Proteus CNT:  3% more time (14 min) to keep all lists contiguousNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Sensitivity Analysis: Block size 800 Index Build 600 minutes Flush ranges 400 Flush long terms 200 Document parsing 0 2MB 4MB 8MB 16MB 32MB Block size• Increase block size  increase range flushing time:  more postings per range block  more bytes transferred between memory and disk per flush  decrease long term flushing time:  same amount of bytes flushed to disk, fewer appendsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Flush memory 600 Index Build 500 400 minutes Flush ranges 300 Flush long terms 200 Document parsing 100 0 2MB 5MB 10MB 20MB 40MB 80MB Flush Memory• Small values:  don’t create sufficient space to accumulate new postings• Large values:  flush ranges and long terms with very few postings   disk head movement overheadNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Flush memory 600 Index Build 500 400 minutes Flush ranges 300 Flush long terms 200 Document parsing 100 "Flush only a small percentage (2% - 3%) 0 2MB 5MB of total 20MB 40MB ..." 10MB memory 80MB Flush Memory• Small values:  don’t create sufficient space to accumulate new postings• Large values:  flush ranges and long terms with very few postings   disk head movement overheadNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Preference Factor 600 Index Build 500 400 minutes Flush ranges 300 Flush long terms 200 Document parsing 100 0 1 2 3 4 8 16 Preference Factor• Small values:  underestimate cost of flushing ranges• Large values:  underestimate cost of flushing long termsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Preference Factor 600 Index Build 500 400 minutes Flush ranges 300 Flush long terms 200 Document parsing 100 0 "Postpone flushing of ranges, 1 2 3 4 8 16 but not too much …" Preference Factor• Small values:  underestimate cost of flushing ranges• Large values:  underestimate cost of flushing long termsNovember 2009 Giorgos Margaritis - University of Ioannina, Greece
  • Conclusions & Future work• Fragmentation can harm retrieval times for short lists• Proteus:  manage inverted index in blocks  keep short lists contiguous, long lists in large fragments  memory full: selectively flush some postings  update only blocks that can be efficiently updated  30% reduction in build time  constant 15% increase in retrieval times of lists > 8MB  3% increase in build time to keep all lists contiguous  relatively insensitive to parameter values• Future work:  different block sizes for short and long terms, auto adjust parameter values based on dataset / hardware, different cost models for flushingNovember 2009 Giorgos Margaritis - University of Ioannina, Greece