Lucene with Bloom filtered segments
A 2x performance improvement to low-frequency term searches, e.g. primary keys.

Presentation Transcript

    • Lucene and Bloom-Filtered Segments: performance improvements to be gained from "knowing what we don't know". Mark Harwood
    • Benefits
        - 2x speed-up on primary-key lookups
        - Small speed-up on general text searches (1.06x)
        - Optimised memory overhead
        - Minimal impact on indexing speeds
        - Minimal extra disk space
    • Approach: one appropriately sized bitset is held per segment, per Bloom-filtered field, e.g. 4 segments x 2 filtered fields = 8 bitsets (a layout sketch follows the transcript).
      [Diagram: a URL bitset and a PKey bitset of varying lengths for each of Segments 1-4]
    • Fail-fast searches: a modified TermInfosReader
        int hash = searchTerm.hashCode();
        int bitIndex = hash % bitsetSize;
        if (!bitset.contains(bitIndex)) return false;
        // term might be in index; continue as a normal search
      An unset bit guarantees the term is missing from the segment, so the search of that segment can be avoided. This is most effective on fields with many low doc-frequency terms, or in scenarios where query terms often don't exist in the index (a runnable version of this check is sketched after the transcript).
      [Diagram: Segment 1's URL and PKey bitsets, with the probed bit unset]
    • Memory efficiency: bitset sizes are automatically tuned according to:
        1. the volume of terms in the segment
        2. the desired saturation settings (more sparse = more accurate)
      [Diagram: URL and PKey bitsets of differing lengths across Segments 1-4]
    • Indexing: a modified TermInfosWriter. Term writes are gathered in a large working bitset. The final flush operation consolidates the information in the big bitset into a suitably compact bitset for storage on disk, based on how many set bits were accumulated. This re-mapping saves disk space and the RAM required when servicing queries (a flush-time sketch follows the transcript).
      [Diagram: a long, sparse working bitset folded down into a much shorter on-disk bitset]
    • Notes
        - See JIRA LUCENE-4069 for the patch to Lucene 3.6.
        - Core modifications pass the existing 3.6 JUnit tests (but without exercising any Bloom filtering logic).
        - Benchmarks contrasting Bloom-filtered indexes with non-filtered ones are here: http://goo.gl/X7QqU
      TODOs
        - Currently relies on a field naming convention to introduce a Bloom filter to the index (put "_blm" on the end of the indexed field name when writing; a usage sketch follows the transcript). How should the need for a Bloom filter be declared properly? Changes to IndexWriterConfig? A new Fieldable/FieldInfo setting? Dare I invoke the "schema" word?
        - Where to expose tuning settings, e.g. saturation preferences?
        - Can we give TermInfosWriter some up-front hints about the size of the segment being written, so the initial choice of bitset size can be reduced?
        - Formal JUnit tests are required to exercise Bloom-filtered indexes (no false negatives). Can this be covered as part of the existing random testing frameworks which exercise various index config options?
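
The per-segment layout described on the "Approach" slide can be pictured as a small map from field name to bitset, held once per segment. The sketch below is illustrative only: the class and method names are invented here rather than taken from the LUCENE-4069 patch, and it assumes Lucene 3.6's OpenBitSet as the bitset implementation.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.util.OpenBitSet;

    // Hypothetical per-segment holder: one bitset per Bloom-filtered field,
    // so 4 segments x 2 filtered fields = 8 bitsets overall.
    class SegmentBloomFilters {

        // field name ("URL", "PKey", ...) -> this segment's bitset
        private final Map<String, OpenBitSet> filters = new HashMap<String, OpenBitSet>();

        void register(String fieldName, OpenBitSet bits) {
            filters.put(fieldName, bits);
        }

        // Returns null when the field is not Bloom-filtered in this segment.
        OpenBitSet bitsetFor(String fieldName) {
            return filters.get(fieldName);
        }
    }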
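
A minimal, self-contained version of the fail-fast test from the "Fail-fast searches" slide, again assuming OpenBitSet. The wrapper class and mightContain() are hypothetical names; the patch itself performs this check inside a modified TermInfosReader, and its exact hashing may differ from the plain hashCode() shown on the slide.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.util.OpenBitSet;

    // Hypothetical wrapper around one segment's bitset for one field.
    class BloomFilteredTermLookup {

        private final OpenBitSet bitset;
        private final long bitsetSize;

        BloomFilteredTermLookup(OpenBitSet bitset) {
            this.bitset = bitset;
            this.bitsetSize = bitset.capacity();
        }

        // false => the term is guaranteed absent from this segment, so the
        //          dictionary lookup can be skipped entirely.
        // true  => the term might be present; continue as a normal search.
        boolean mightContain(Term searchTerm) {
            int hash = searchTerm.hashCode();
            long bitIndex = (hash & 0x7FFFFFFFL) % bitsetSize; // keep the index non-negative
            return bitset.get(bitIndex);
        }
    }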
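
A sketch of the flush-time consolidation described on the "Memory efficiency" and "Indexing" slides: set bits from the large working bitset are folded, by modulo, into a much smaller bitset whose size is chosen from the accumulated bit count and a saturation target. The power-of-two sizing rule and the class name are assumptions rather than the exact heuristics of the modified TermInfosWriter.

    import org.apache.lucene.util.OpenBitSet;

    // Illustrative flush-time downsizing of a sparse working bitset.
    final class BloomFilterDownsizer {

        static OpenBitSet rightSize(OpenBitSet working, double maxSaturation) {
            long setBits = working.cardinality();

            // Find the smallest power-of-two size that keeps (set bits / size)
            // at or below the desired saturation.
            long targetSize = Long.highestOneBit(working.capacity());
            while (targetSize > 1 && (double) setBits / (targetSize / 2) <= maxSaturation) {
                targetSize /= 2;
            }

            // Re-map every set bit from the big bitset into the compact one.
            OpenBitSet compact = new OpenBitSet(targetSize);
            for (long i = working.nextSetBit(0L); i >= 0; i = working.nextSetBit(i + 1)) {
                compact.set(i % targetSize);
            }
            return compact;
        }
    }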
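
Finally, a usage sketch of the interim "_blm" naming convention noted in the TODOs: ending an indexed field's name in "_blm" is what opts that field into Bloom filtering. Everything below is stock Lucene 3.6 API; it assumes a build with the LUCENE-4069 patch applied, and the field name and value are just examples.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class BloomFieldExample {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                    new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(dir, cfg);

            Document doc = new Document();
            // The "_blm" suffix tells the patched TermInfosWriter to build a
            // Bloom filter for this field's terms in every segment.
            doc.add(new Field("pkey_blm", "order-0001",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }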