Munching & crunching - Lucene index post-processing
Lucene EuroCon 10 presentation on index post-processing (splitting, merging, sorting, pruning), tiered search, bitwise search, and a few slides on MapReduce indexing models (I ran out of time to show them, but they are there...)

Presentation Transcript

  • Munching & crunching: Lucene index post-processing and applications
      Andrzej Białecki <andrzej.bialecki@lucidimagination.com>
  • Intro
      ◦ Started using Lucene in 2003 (1.2-dev?)
      ◦ Created Luke – the Lucene Index Toolbox
      ◦ Nutch, Hadoop committer, Lucene PMC member
      ◦ Nutch project lead
  • Munching and crunching? But really...
      ◦ Stir your imagination
      ◦ Think outside the box
      ◦ Show some unorthodox uses and practical applications
      ◦ Close ties to scalability, performance, distributed search and query latency
  • Agenda
      ◦ Post-processing
      ◦ Splitting, merging, sorting, pruning
      ◦ Tiered search
      ◦ Bit-wise search
      ◦ (Map-reduce indexing models)
    Apache Lucene EuroCon 20 May 2010
  • Why post-process indexes?
      ◦ Isn't it better to build them right from the start?
      ◦ Sometimes it's not convenient or feasible:
          ▪ Correcting the impact of unexpected common words
          ▪ Targeting a specific index size or composition: creating evenly-sized shards, re-balancing shards across servers, fitting indexes completely in RAM
      ◦ … and sometimes impossible to do it right:
          ▪ Trimming index size while retaining the quality of top-N results
  • Merging indexes
      ◦ It's easy to merge several small indexes into one
      ◦ A fundamental Lucene operation during indexing (SegmentMerger)
      ◦ Command-line utilities exist: IndexMergeTool
      ◦ API:
          ▪ IndexWriter.addIndexes(IndexReader...)
          ▪ IndexWriter.addIndexesNoOptimize(Directory...)
          ▪ Hopefully a more flexible API on the flex branch
      ◦ Solr: through CoreAdmin action=mergeindexes
      ◦ Note: the schemas must be compatible
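The core of a merge can be sketched in a few lines of plain Java. This is not the actual SegmentMerger code – postings are modeled as a simple map and all names are illustrative – but it shows the key step: doc IDs from the second index are offset by the first index's document count so the merged ID space stays contiguous.

```java
import java.util.*;

// Illustrative sketch of postings merging (what SegmentMerger does internally).
public class MergeSketch {
    // Posting lists modeled as: term -> sorted list of doc IDs.
    static Map<String, List<Integer>> merge(Map<String, List<Integer>> a, int aMaxDoc,
                                            Map<String, List<Integer>> b) {
        Map<String, List<Integer>> out = new TreeMap<>();
        // Index a keeps its doc IDs unchanged.
        a.forEach((t, docs) -> out.computeIfAbsent(t, k -> new ArrayList<>()).addAll(docs));
        // Index b's doc IDs are shifted into the merged, contiguous ID space.
        b.forEach((t, docs) -> {
            List<Integer> dst = out.computeIfAbsent(t, k -> new ArrayList<>());
            for (int d : docs) dst.add(d + aMaxDoc);
        });
        return out;
    }
}
```

The same offsetting is why merged indexes keep postings sorted by doc ID without any re-sorting.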
  • Splitting indexes
      ◦ IndexSplitter tool: moves whole segments (e.g. _0, _1, _2) to standalone indexes
      ◦ Pros: nearly no I/O or CPU involved – just rename & create a new SegmentInfos file
      ◦ Cons:
          ▪ Requires a multi-segment index!
          ▪ Very limited control over the content of the resulting indexes → MergePolicy
  • Splitting indexes, take 2
      ◦ MultiPassIndexSplitter tool:
          ▪ Uses an IndexReader that keeps the list of deletions in memory – the source index remains unmodified
          ▪ For each partition: marks all source documents not in the partition as deleted, writes a target split using IndexWriter.addIndexes(IndexReader) – IndexWriter knows how to skip deleted documents – then removes the "deleted" mark from all source documents
      ◦ Pros:
          ▪ Arbitrary splits possible (even partially overlapping)
          ▪ Source index remains intact
      ◦ Cons:
          ▪ Reads the complete index N times – I/O is O(N * indexSize)
          ▪ Takes twice as much space (the source index remains intact) … but maybe that's a feature?
  • Splitting indexes, take 3
      ◦ SinglePassSplitter:
          ▪ Uses the same processing workflow as SegmentMerger, only with multiple outputs
          ▪ Writes new SegmentInfos and FieldInfos
          ▪ Merges (passes through) stored fields, the term dictionary, postings with payloads, and term vectors, routed through a partitioner
          ▪ Renumbers document IDs on the fly to form a contiguous space in each output
      ◦ Pros: flexibility as with MultiPassIndexSplitter
      ◦ Status: work started, to be contributed soon...
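The single-pass scheme above – route each document to a partition, renumber IDs on the fly – can be sketched in plain Java. This is a toy model (a round-robin partitioner over an in-memory postings map; names are illustrative), not the actual SinglePassSplitter code:

```java
import java.util.*;

// Illustrative sketch of single-pass splitting with on-the-fly doc ID renumbering.
public class SplitSketch {
    // postings: term -> sorted doc IDs; returns one postings map per partition.
    static List<Map<String, List<Integer>>> split(Map<String, List<Integer>> postings,
                                                  int numDocs, int numParts) {
        int[] newId = new int[numDocs];   // old doc ID -> new doc ID in its partition
        int[] part = new int[numDocs];    // old doc ID -> partition number
        int[] next = new int[numParts];   // next free ID per partition
        for (int d = 0; d < numDocs; d++) {
            part[d] = d % numParts;       // round-robin "partitioner" stand-in
            newId[d] = next[part[d]]++;   // contiguous IDs within each partition
        }
        List<Map<String, List<Integer>>> out = new ArrayList<>();
        for (int p = 0; p < numParts; p++) out.add(new TreeMap<>());
        // Single pass over the postings: each entry goes to exactly one output.
        postings.forEach((term, docs) -> {
            for (int d : docs)
                out.get(part[d]).computeIfAbsent(term, k -> new ArrayList<>()).add(newId[d]);
        });
        return out;
    }
}
```

Swapping the round-robin line for a range, field-value, or frequency test gives the other split scenarios mentioned in the summary slide.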
  • Splitting indexes, summary
      ◦ SinglePassSplitter – the best tradeoff of flexibility / I/O / CPU
      ◦ Interesting scenarios with SinglePassSplitter:
          ▪ Split by ranges, round-robin, by field value, by frequency, to a target size, etc.
          ▪ "Extract" a handful of documents to a separate index
          ▪ "Move" documents between indexes: "extract" from the source, add to the target (merge), delete from the source
      ◦ Now the source index may reside on a network FS – the amount of I/O is O(1 * indexSize)
  • Index sorting – introduction
      ◦ An "early termination" technique: if full execution of a query takes too long, terminate and estimate
      ◦ Termination conditions:
          ▪ Number of documents – LimitedCollector in Nutch
          ▪ Time – TimeLimitingCollector (see also the extended TimeLimitingIndexReader in LUCENE-1720)
      ◦ Problems:
          ▪ Difficult to estimate total hits
          ▪ Important docs may not be collected if they have high doc IDs
  • Index sorting – details
      ◦ Define a global ordering of documents (e.g. PageRank, popularity, quality, etc.)
      ◦ Documents with a good rank should generally score higher
      ◦ Sort (internal) doc IDs by this ordering, descending
      ◦ Map from old to new IDs to follow this ordering
      ◦ Change the IDs in the postings
      ◦ (diagram: in the original index, ranks are scattered across doc IDs, so early termination == poor; in the sorted index, the best-ranked documents have the lowest doc IDs, so early termination == good)
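The steps above – derive an old-to-new ID mapping from a per-document rank and rewrite the postings – can be sketched as follows. This is a minimal plain-Java model, not Nutch's IndexSorter; the rank array and method names are illustrative:

```java
import java.util.*;

// Illustrative sketch of index sorting by a static per-document rank.
public class SortSketch {
    // Returns map[oldDocId] = newDocId, where rank-descending order wins low IDs.
    static int[] oldToNew(double[] rank) {
        Integer[] ids = new Integer[rank.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, (x, y) -> Double.compare(rank[y], rank[x])); // best rank first
        int[] map = new int[rank.length];
        for (int newId = 0; newId < ids.length; newId++) map[ids[newId]] = newId;
        return map;
    }

    // Rewrites one posting list into the new ID space (postings stay ID-sorted).
    static List<Integer> remap(List<Integer> docs, int[] map) {
        List<Integer> out = new ArrayList<>();
        for (int d : docs) out.add(map[d]);
        Collections.sort(out);
        return out;
    }
}
```

After remapping, a collector that stops at the first k hits has, by construction, already seen the highest-ranked candidates.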
  • Index sorting – summary
      ◦ Implementation in Nutch: IndexSorter
      ◦ Based on PageRank – sorts by decreasing page quality
      ◦ Uses FilterIndexReader
      ◦ NOTE: "early termination" will (significantly) reduce the quality of results with non-sorted indexes – use both or neither
  • Index pruning
      ◦ A quick refresher on index composition:
          ▪ Stored fields
          ▪ Term dictionary
          ▪ Term frequency data
          ▪ Positional data (postings), with or without payload data
          ▪ Term frequency vectors
      ◦ The number of documents may be in the millions
      ◦ The number of terms is commonly well into the millions
      ◦ Not to mention individual postings…
  • Index pruning & top-N retrieval
      ◦ N is usually << 1000
      ◦ Very often search quality is judged based on the top 20
      ◦ Question: do we really need to keep and process ALL terms and ALL postings for good-quality top-N search for common queries?
  • Index pruning hypothesis
      ◦ There should be a way to remove some of the less important data – while retaining the quality of top-N results!
      ◦ Question: what data is less important? Some answers:
          ▪ That of poorly-scoring documents
          ▪ That of common (less selective) terms
      ◦ Dynamic pruning skips less relevant data during query processing → runtime cost...
      ◦ But can we do this work in advance (static pruning)?
  • What do we need for top-N results?
      ◦ Work backwards. For each common query:
          ▪ Run it against the full index
          ▪ Record the top-N matching documents
      ◦ For each document in the results:
          ▪ Record the terms and term positions that contributed to the score
      ◦ Finally: remove all non-recorded postings and terms
      ◦ First proposed by D. Carmel (2001) for single-term queries
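For single-term queries, the backwards procedure above collapses to: per term, keep only the postings that would appear in that term's own top-N result list. A minimal sketch, using raw TF as a stand-in for the real TF-IDF score (the actual Carmel method scores with the full similarity; all names here are illustrative):

```java
import java.util.*;

// Illustrative sketch of Carmel-style term-level static pruning.
public class CarmelSketch {
    // postings: term -> (docId -> tf). Keeps only each term's top-N postings.
    static Map<String, Map<Integer, Integer>> prune(
            Map<String, Map<Integer, Integer>> postings, int n) {
        Map<String, Map<Integer, Integer>> out = new TreeMap<>();
        postings.forEach((term, docs) -> {
            docs.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue()) // best "score" first
                .limit(n)                                      // the term's top-N
                .forEach(e -> out.computeIfAbsent(term, k -> new TreeMap<>())
                                 .put(e.getKey(), e.getValue()));
        });
        return out;
    }
}
```

Any single-term query now returns the same top-N from the pruned index as from the full one, which is exactly the guarantee (and the limitation) discussed on the next slide.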
  • … but it's too simplistic
      ◦ (example: an index with terms "quick", "brown", "fox", before and after pruning – Query 1: brown → topN(full) == topN(pruned); Query 2: "brown fox" → topN(full) != topN(pruned))
      ◦ Hmm, what about less common queries? The 80/20 rule of "good enough"?
      ◦ Term-level pruning is too primitive:
          ▪ Document-centric pruning
          ▪ Impact-centric pruning
          ▪ Position-centric pruning
  • Smarter pruning
      ◦ Not all term positions are equally important
      ◦ Metrics of term and position importance:
          ▪ Plain in-document term frequency (TF)
          ▪ TF-IDF score obtained from the top-N results of a TermQuery (Carmel method)
          ▪ Residual IDF – a measure of term informativeness (selectivity)
          ▪ Key-phrase positions, or term clusters
          ▪ Kullback-Leibler divergence from a language model (corpus language model vs. document language model)
  • Applications
      ◦ Obviously, performance-related
      ◦ Some papers claim a modest impact on quality when pruning up to 60% of postings – see LUCENE-1812 for benchmarks confirming this claim
      ◦ Removal / restructuring of (some) stored content
      ◦ Legacy indexes, or ones created with a fossilized external chain
  • Stored field pruning
      ◦ Some stored data can be compacted, removed, or restructured
      ◦ Use case: source text for generating "snippets":
          ▪ Split the content into sentences
          ▪ Reorder sentences by a static "importance" score (e.g. how many rare terms they contain) – NOTE: this may use collection-wide statistics!
          ▪ Remove the bottom x% of sentences
  • LUCENE-1812: contrib/pruning tools and API
      ◦ Based on FilterIndexReader
      ◦ Produces output indexes via IndexWriter.addIndexes(IndexReader[])
      ◦ Design:
          ▪ PruningReader – a subclass of FilterIndexReader with the necessary boilerplate and hooks for pruning policies
          ▪ StorePruningPolicy – implements rules for modifying stored fields (and the list of field names)
          ▪ TermPruningPolicy – implements rules for modifying the term dictionary, postings and payloads
          ▪ PruningTool – a command-line utility to configure and run PruningReader
  • Details of LUCENE-1812
      ◦ IndexWriter consumes source data filtered via PruningReader: stored fields pass through StorePruningPolicy, and the term dictionary, postings and payloads pass through TermPruningPolicy, into IndexWriter.addIndexes(IndexReader...)
      ◦ Internal document IDs are preserved – suitable for bitset operations and retrieval by internal ID:
          ▪ If the source index has no deletions
          ▪ If the target index is empty
  • API: StorePruningPolicy
      ◦ May remove (some) fields from (some) documents
      ◦ May as well modify the values
      ◦ May rename / add fields
  • API: TermPruningPolicy
      ◦ Thresholds (in order of precedence): per term, per field, default
      ◦ Plain TF pruning – TFTermPruningPolicy: removes all postings for a term where the in-document term frequency (TF) is below a threshold
      ◦ Top-N term-level – CarmelTermPruningPolicy: runs a TermQuery search for the top-N docs and removes all postings for the term outside those top-N docs
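The plain TF policy is the simplest of the two: drop every posting whose in-document frequency falls below the threshold. A minimal plain-Java sketch (not the actual TFTermPruningPolicy code; the postings map and names are illustrative):

```java
import java.util.*;

// Illustrative sketch of plain TF pruning.
public class TfPruneSketch {
    // postings: term -> (docId -> tf). Drops postings with tf below minTf.
    static Map<String, Map<Integer, Integer>> prune(
            Map<String, Map<Integer, Integer>> postings, int minTf) {
        Map<String, Map<Integer, Integer>> out = new TreeMap<>();
        postings.forEach((term, docs) -> docs.forEach((docId, tf) -> {
            if (tf >= minTf)   // keep only sufficiently frequent postings
                out.computeIfAbsent(term, k -> new TreeMap<>()).put(docId, tf);
        }));
        return out;
    }
}
```

A real implementation would also apply the per-term and per-field threshold overrides before falling back to the default, in that order of precedence.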
  • Results so far...
      ◦ TF pruning:
          ▪ Term query recall very good
          ▪ Phrase query recall very poor – expected...
      ◦ Carmel pruning – slightly better term position selection, but still a heavy negative impact on phrase queries
      ◦ Recognizing and keeping key phrases would help:
          ▪ Use query logs for frequent-phrase mining?
          ▪ Use a collocation miner (Mahout)?
      ◦ Savings from pruning will be smaller, but quality will significantly improve
  • References
      ◦ Static Index Pruning for Information Retrieval Systems, Carmel et al., SIGIR '01
      ◦ A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM '06
      ◦ Locality-based pruning methods for web search, de Moura et al., ACM TOIS '08
      ◦ Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM '06
  • Index pruning applied...
      ◦ Index 1: a heavily pruned index that fits in RAM – excellent speed; poor search quality for many less-common query types
      ◦ Index 2: a slightly pruned index that fits partially in RAM – good speed and good quality for many common query types; still poor quality for some rarer query types
      ◦ Index 3: the full index on disk – slow, but excellent quality for all query types
      ◦ QUESTION: can we come up with a combined search strategy?
  • Tiered search
      ◦ (diagram: a predictor routes each query to search box 1 – RAM, 70% pruned; search box 2 – SSD, 30% pruned; or search box 3 – HDD, 0% pruned; an evaluator checks the answer)
      ◦ Can we predict the best tier without actually running the query?
      ◦ How do we evaluate whether the predictor was right?
  • Tiered search: tier selector and evaluator
      ◦ The best tier can be predicted (often enough):
          ▪ Carmel pruning yields excellent results for simple term queries
          ▪ Phrase-based pruning yields good results for phrase queries (though less often)
      ◦ Quality evaluator: when is the predictor wrong?
          ▪ Could be very complex, based on a gold standard and qrels
          ▪ Could be very simple: an acceptable number of results
      ◦ Fall-back strategy:
          ▪ Serial: poor latency, but minimizes load on the bulkier tiers
          ▪ Partially parallel: submit only the borderline queries to the next tier; pick the first acceptable answer – reduces latency
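The serial fall-back strategy with the simple evaluator ("acceptable number of results") fits in a few lines. A plain-Java sketch – tiers are modeled as functions from a query string to doc IDs, and every name is illustrative:

```java
import java.util.*;
import java.util.function.Function;

// Illustrative sketch of serial tiered search with a minimum-result-count evaluator.
public class TieredSketch {
    // tiers are ordered cheapest (most pruned) to most expensive (full index).
    static List<Integer> search(String query,
                                List<Function<String, List<Integer>>> tiers,
                                int minAcceptable) {
        List<Integer> results = Collections.emptyList();
        for (Function<String, List<Integer>> tier : tiers) {
            results = tier.apply(query);
            if (results.size() >= minAcceptable) return results; // evaluator accepts
        }
        return results; // the last (full) tier's answer is authoritative
    }
}
```

The partially parallel variant would instead submit borderline queries to the next tier concurrently and take the first acceptable answer, trading load for latency.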
  • Tiered versus distributed
      ◦ Both are applicable to indexes and query loads exceeding a single machine's capabilities
      ◦ Distributed sharded search:
          ▪ Increases latency for all queries (send + execute + integrate results from all shards)
          ▪ … plus replicas to increase QPS: increases hardware / management costs while not improving latency
      ◦ Tiered search:
          ▪ Excellent latency for common queries
          ▪ More complex to build and maintain
          ▪ Arguably lower hardware cost for comparable scale / QPS
  • Tiered search benefits
      ◦ The majority of common queries are handled by the first tier: RAM-based, high QPS, low latency
      ◦ Partially parallel mode reduces average latency for more complex queries
      ◦ The hardware investment is likely smaller than for a distributed search setup of comparable QPS / latency
  • Example Lucene API for tiered search
      ◦ Could be implemented as a Solr SearchComponent...
  • Lucene implementation details
  • References  Efficiency trade-offs in two-tier web search systems, Baeza- Yates et al., SIGIR'09  ResIn: A combination of results caching and index pruning for high-performance web search engines, Baeza-Yates et al, SIGIR'08  Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW'05 Apache Lucene EuroCon 20 May 2010
  • Bit-wise search  Given a bit pattern query: 1010 1001 0101 0001  Find documents with matching bit patterns in a field  Applications:  Permission checking  De-duplication  Plagiarism detection  Two variants: non-scoring (filtering) and scoring Apache Lucene EuroCon 20 May 2010
  • Non-scoring bitwise search (LUCENE-2460)  Builds a Filter from intersection of: 0 1 2 3 4 docID 0x01 0x02 0x03 0x04 0x05 flags  DocIdSet of documents matching a Query a b b a a type  Integer value and operation (AND, OR, XOR) “type:a”  “Value source” that caches integer values of a field (from FieldCache) 0x01 0x02 0x03 0x04 0x05 flags  Corresponding Solr field type and QParser: SOLR-1913 op=AND val=0x01  Useful for filtering (not scoring) Filter Apache Lucene EuroCon 20 May 2010
  • Scoring bitwise search (SOLR-1918)
      ◦ A BooleanQuery in disguise: the query value 1010 becomes Y-1000 | N-0100 | Y-0010 | N-0001
      ◦ Solr 32-bit BitwiseField:
          ▪ The analyzer creates the bitmask field (currently supports only a single value per field)
          ▪ Creates a BooleanQuery from the query's int value: Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001
      ◦ Example: doc flags D1=1010, D2=1011, D3=0011 → D1 matches 4 of 4 clauses (ranked #1), D2 matches 3 of 4 (#2), D3 matches 2 of 4 (#3)
      ◦ Useful when searching for the best-matching (ranked) bit patterns
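The score this BooleanQuery produces is, in essence, the number of bit positions where the document's flags agree with the query – each Y/N token is one clause, and matched clauses add up. That count can be computed directly with plain bit operations (this sketch reproduces the scoring idea, not the Solr tokenization):

```java
// Illustrative sketch: count bit positions where a doc's flags agree with the query.
public class BitwiseSketch {
    // width = number of bits in the field (e.g. 4 in the slide's example).
    static int matchingBits(int query, int docFlags, int width) {
        int agree = ~(query ^ docFlags);          // 1 wherever the bits are equal
        return Integer.bitCount(agree & ((1 << width) - 1)); // mask to field width
    }
}
```

With the slide's example (query 1010): D1=1010 scores 4, D2=1011 scores 3, D3=0011 scores 2 – the same ranking the BooleanQuery yields.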
  • Summary  Index post-processing covers a range of useful scenarios:  Merging and splitting, remodeling, extracting, moving ...  Pruning less important data  Tiered search + pruned indexes:  High performance  Practically unchanged quality  Less hardware  Bitwise search:  Filtering by matching bits  Ranking by best matching patterns Apache Lucene EuroCon 20 May 2010
  • Meta-summary  Stir your imagination  Think outside the box  Show some unorthodox use and practical applications  Close ties to scalability, performance, distributed search and query latency Apache Lucene EuroCon 20 May 2010
  • Q&A Apache Lucene EuroCon 20 May 2010
  • Thank you! Apache Lucene EuroCon 05/25/10
  • Massive indexing with map-reduce  Map-reduce indexing models  Google model  Nutch model  Modified Nutch model  Hadoop contrib/indexing model  Tradeoff analysis and recommendations Apache Lucene EuroCon 20 May 2010
  • Google model
      ◦ Map(): IN: <seq, docText>; terms = analyze(docText); foreach (term) emit(term, <seq, position>)
      ◦ Reduce(): IN: <term, list(<seq, pos>)>; foreach (<seq, pos>) docId = calculate(seq, taskId); Postings(term).append(docId, pos)
      ◦ Pros: analysis on the map side
      ◦ Cons:
          ▪ Too many tiny intermediate records → Combiner
          ▪ Doc ID synchronization across map and reduce tasks
          ▪ Lucene: very difficult (impossible?) to create an index this way
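The map/shuffle/reduce flow above can be simulated in-memory to show the data movement: map() emits one tiny (term, <doc, position>) record per token, the shuffle groups them by term, and reduce() appends to each term's posting list. This is a single-process sketch for illustration only (whitespace splitting stands in for real analysis; all names are illustrative):

```java
import java.util.*;

// Illustrative in-memory simulation of the Google-model inversion.
public class MRSketch {
    // Returns: term -> list of {docId, position} pairs, grouped as the shuffle would.
    static Map<String, List<int[]>> index(List<String> docs) {
        Map<String, List<int[]>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");   // "analysis" stand-in
            for (int pos = 0; pos < terms.length; pos++)      // map(): emit per token
                postings.computeIfAbsent(terms[pos], k -> new ArrayList<>())
                        .add(new int[]{docId, pos});          // reduce(): append
        }
        return postings;
    }
}
```

The per-token records are exactly the "too many tiny intermediate records" the slide warns about – in a real job a Combiner would pre-aggregate them on the map side.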
  • Nutch model (also in SOLR-1301)
      ◦ Map(): IN: <seq, docPart>; docId = docPart.get("url"); emit(docId, docPart)
      ◦ Reduce(): IN: <docId, list(docPart)>; doc = luceneDoc(list(docPart)); indexWriter.addDocument(doc)
      ◦ Pros: easy to build a Lucene index
      ◦ Cons:
          ▪ Analysis on the reduce side
          ▪ Many costly merge operations (large indexes built from scratch on the reduce side)
          ▪ (plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
  • Modified Nutch model (N/A...)
      ◦ Map(): IN: <seq, docPart>; docId = docPart.get("url"); ts = analyze(docPart); emit(docId, <docPart, ts>)
      ◦ Reduce(): IN: <docId, list(<docPart, ts>)>; doc = luceneDoc(list(<docPart, ts>)); indexWriter.addDocument(doc)
      ◦ Pros: analysis on the map side; easy to build a Lucene index
      ◦ Cons:
          ▪ Many costly merge operations (large indexes built from scratch on the reduce side)
          ▪ (plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
  • Hadoop contrib/indexing model
      ◦ Map(): IN: <seq, docText>; doc = luceneDoc(docText); indexWriter.addDocument(doc); emit(random, indexData)
      ◦ Reduce(): IN: <random, list(indexData)>; foreach (indexData) indexWriter.addIndexes(indexData)
      ◦ Pros:
          ▪ Analysis on the map side
          ▪ Many merges on the map side
          ▪ Also supports other operations (deletes, updates)
      ◦ Cons: serialization is costly; records are big and require more RAM to sort
  • Massive indexing – summary
      ◦ If you first need to collect document parts → the SOLR-1301 model
      ◦ If you use complex analysis → Hadoop contrib/index
      ◦ NOTE: there is no good integration yet of Solr and the Hadoop contrib/index module...