Lucene with Bloom filtered segments

A 2x performance improvement to low-frequency term searches, e.g. primary keys



  1. Lucene and Bloom-Filtered Segments: performance improvements to be gained from "knowing what we don't know". Mark Harwood
  2. Benefits:
     - 2x speed-up on primary-key lookups
     - Small speed-up on general text searches (1.06x)
     - Optimised memory overhead
     - Minimal impact on indexing speeds
     - Minimal extra disk space
  3. Approach: one appropriately sized bitset is held per segment, per Bloom-filtered field, e.g. 4 segments x 2 filtered fields = 8 bitsets. [Figure: a URL bitset and a PKey bitset for each of Segments 1-4]
  4. Fail-fast searches: a modified TermInfosReader.
     int hash = searchTerm.hashCode();
     int bitIndex = hash % bitsetSize;
     if (!bitset.get(bitIndex)) return false;
     // term might be in index - continue as normal search
     An unset bit guarantees the term is missing from the segment, so the search can be avoided. This is most effective on fields with many low doc-frequency terms, or in scenarios where query terms often don't exist in the index. [Figure: Segment 1's URL and PKey bitsets]
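     The fail-fast check above can be sketched in plain Java with java.util.BitSet. The class and method names below are illustrative only, not the actual LUCENE-4069 patch; note that Math.floorMod is used because a plain hash % bitsetSize would yield a negative index for negative hashCode() values.

```java
import java.util.BitSet;

// Minimal sketch of a per-segment, per-field Bloom filter with a single
// hash function, as described on the slide. Illustrative names only.
public class SegmentBloomFilter {
    private final BitSet bitset;
    private final int bitsetSize;

    public SegmentBloomFilter(int bitsetSize) {
        this.bitset = new BitSet(bitsetSize);
        this.bitsetSize = bitsetSize;
    }

    // Writer side: record the bit for each indexed term.
    public void recordTerm(String term) {
        bitset.set(Math.floorMod(term.hashCode(), bitsetSize));
    }

    // Reader side: false means the term is definitely absent from this
    // segment, so the term-dictionary lookup can be skipped entirely;
    // true means "might be present" and the normal search continues.
    public boolean mightContain(String term) {
        return bitset.get(Math.floorMod(term.hashCode(), bitsetSize));
    }
}
```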
  5. Memory efficiency: bitset sizes are automatically tuned according to:
     1. the volume of terms in the segment
     2. desired saturation settings (more sparse = more accurate)
     [Figure: per-segment URL and PKey bitsets of varying sizes]
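     The tuning idea can be illustrated as follows (my own sketch of the principle, not the patch's exact heuristic): with one hash per term, a sparse m-bit set holding n distinct terms ends up with roughly n/m of its bits set, so sizing the bitset as m = n / targetSaturation keeps expected saturation, and hence the false-positive rate, near the target.

```java
// Illustrative saturation-driven sizing. A lower targetSaturation gives a
// sparser (larger, more accurate) bitset; a higher one saves memory at the
// cost of more false positives.
public class BloomSizing {
    public static int chooseBitsetSize(int numDistinctTerms, double targetSaturation) {
        return (int) Math.ceil(numDistinctTerms / targetSaturation);
    }
}
```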
  6. Indexing: a modified TermInfosWriter. Term writes are gathered in a large bitset. The final flush operation consolidates the information in the big bitset into a suitably compact bitset for storage on disk, based on how many set bits were accumulated. This re-mapping saves disk space and the RAM required when servicing queries.
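     The flush-time consolidation can be sketched like this (illustrative, not the actual TermInfosWriter change): each set bit in the large accumulation bitset is re-mapped by modulo into a compact bitset. Provided the small size divides the big size (e.g. both are powers of two), (h % bigSize) % smallSize == h % smallSize for any hash h, so the downsized filter introduces no false negatives.

```java
import java.util.BitSet;

// Sketch of consolidating a large in-memory accumulation bitset into a
// compact on-disk bitset by folding each set bit with modulo.
public class BloomDownsizer {
    public static BitSet downsize(BitSet big, int smallSize) {
        BitSet small = new BitSet(smallSize);
        // Walk only the set bits of the big bitset and fold each one in.
        for (int i = big.nextSetBit(0); i >= 0; i = big.nextSetBit(i + 1)) {
            small.set(i % smallSize);
        }
        return small;
    }
}
```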
  7. Notes:
     See JIRA LUCENE-4069 for a patch against Lucene 3.6.
     Core modifications pass the existing 3.6 JUnit tests (but without exercising any Bloom-filtering logic).
     Benchmarks contrasting Bloom-filtered indexes with non-filtered ones are here: http://goo.gl/X7QqU
     TODOs:
     - Currently relies on a field-naming convention to introduce a Bloom filter to the index (append "_blm" to the indexed field name when writing). How should the need for a Bloom filter be declared properly? Changes to IndexWriterConfig? A new Fieldable/FieldInfo setting? Dare I invoke the "schema" word?
     - Where to expose tuning settings, e.g. saturation preferences?
     - Can we give some up-front hints to TermInfosWriter about the size of the segment being written, so the initial choice of bitset size can be reduced?
     - Formal JUnit tests are required to exercise Bloom-filtered indexes (no false negatives). Can this be covered as part of the existing random-testing frameworks which exercise various index config options?
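     The field-naming convention from the first TODO can be illustrated with a tiny helper (the helper class is mine; only the "_blm" suffix comes from the slides): a writer opting in would index into a field named e.g. "mykey_blm" rather than "mykey".

```java
// Illustrative helper for the "_blm" opt-in convention described in the
// TODOs; the patch itself recognises the suffix internally when writing.
public class BloomFieldNaming {
    public static final String BLOOM_SUFFIX = "_blm";

    public static boolean wantsBloomFilter(String indexedFieldName) {
        return indexedFieldName.endsWith(BLOOM_SUFFIX);
    }
}
```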
