Updates from Lucene land
Robert Muir
{ } BEER-WARE r42
OUTLINE
• Introduction
• Query Execution
• Bitset Compression
• Index Compression
• Indexing Performance
• Index Safety
• Other Changes
Introduction to Lucene
INTRODUCTION
LUCENE
• Search engine library in Java
• Produced by the Apache Software Foundation
• 1999-Present
FULL-TEXT SEARCH
INVERTED INDEX
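The slide here is a diagram. As a rough illustration of the structure it names, the sketch below builds a toy inverted index that maps each term to the sorted list of document IDs containing it. The class name `ToyInvertedIndex` and its tokenization rule are assumptions made for this example; this is not how Lucene stores its postings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index (hypothetical example class): term -> sorted doc IDs. */
final class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Lowercases the text, splits on non-letter/digit runs, and records the doc ID per term. */
    int add(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            List<Integer> docs = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Doc IDs only increase, so appending keeps each list sorted;
            // skip the append when this document already recorded the term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the doc IDs containing the term, or an empty list if the term is unknown. */
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }
}
```

A term lookup is then a single map access, and multi-term queries reduce to combining the sorted lists.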
RELEASE TIMELINE
Query Execution
TWO PHASE INTERSECTION
[Bar chart: Wikipedia English, filtered phrase query. Throughput in QPS, Before vs. After, for the Dense, Medium, and Sparse cases. Bar values as extracted: 128.24, 67.53, 19.98 QPS; 93.39, 58.27, 21.19 QPS.]
FASTER PROHIBITED CLAUSES
[Bar chart: Wikipedia English, MUST_NOT. Throughput in QPS, Before vs. After, for the Dense, Medium, and Sparse cases. Bar values as extracted: 922.24, 183.47, 50.06 QPS; 62.74, 57.14, 33.71 QPS.]
OPTIMIZE QUERY FOR FILTER CLAUSES
[Bar chart: Wikipedia English, MUST_NOT. Throughput in QPS, Before vs. After, for the Dense, Medium, and Sparse cases. Bar values as extracted: 1,144.49, 205.45, 49.25 QPS; 959.96, 185.01, 49.49 QPS.]
OPTIMIZE QUERY FOR FILTER CLAUSES
[Bar chart: Wikipedia English, filtered sloppy phrase. Throughput in QPS, Before vs. After, for the Dense, Medium, and Sparse cases. Bar values as extracted: 59.38, 26.37, 5.14 QPS; 45.19, 21.3, 5.14 QPS.]
QUERY EXECUTION
• Merge Query and Filter
• Automatic caching
• Cost-based execution
• Two-phase intersection
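To make the last two bullets concrete, here is a minimal sketch of the general idea, not Lucene's implementation. It assumes doc IDs arrive as sorted `int[]` arrays without duplicates. The names `TwoPhaseSketch`, `approximation`, `filter`, and `confirm` are hypothetical. The loop is driven by the shorter list (a crude cost estimate), and the expensive per-document check, such as verifying phrase positions, runs only for documents that survive the cheap intersection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch: cheap intersection first, costly confirmation second. */
final class TwoPhaseSketch {
    /**
     * @param approximation sorted doc IDs that might match (e.g. contain all phrase terms)
     * @param filter        sorted doc IDs allowed by the filter
     * @param confirm       expensive check that decides whether a candidate really matches
     */
    static List<Integer> intersect(int[] approximation, int[] filter, IntPredicate confirm) {
        // Cost-based choice: iterate the shorter list and search in the longer one.
        int[] lead = approximation.length <= filter.length ? approximation : filter;
        int[] other = (lead == approximation) ? filter : approximation;

        List<Integer> hits = new ArrayList<>();
        int from = 0; // entries of 'other' before this index are already behind us
        for (int doc : lead) {
            int pos = Arrays.binarySearch(other, from, other.length, doc);
            if (pos < 0) {
                from = -pos - 1; // insertion point: first entry greater than doc
                if (from >= other.length) {
                    break; // 'other' is exhausted, nothing further can match
                }
                continue;
            }
            from = pos + 1;
            // Phase two: only now pay for the expensive confirmation.
            if (confirm.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With a sparse filter, most candidates are discarded in the first phase, so the costly check runs only a handful of times.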
Bitset Compression
COMPRESSED BITSETS
[Bar chart: memory usage at 0.1% density for Fixed, Sparse, and Roaring bitsets, on a 0% to 100% axis. Bar values as extracted: 2%, 12%, 100%.]
COMPRESSED BITSETS
[Bar chart: iteration speed at 0.1% density for Fixed, Sparse, and Roaring bitsets, on a 0x to 4x axis. Bar values as extracted: 3.9x, 2x, 1x.]
COMPRESSED BITSETS
• Cached Filters
• Range, Prefix, Wildcard query execution
• Nested Documents (join)
• Scoring Factors (norms)
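As a back-of-the-envelope illustration of why compressed bitsets help at low density, the sketch below estimates payload bytes for a fixed bitset and for a simplified roaring-style layout. The class `BitsetMemorySketch` and its methods are hypothetical. The layout (64Ki-document blocks, 2 bytes per document in sparse blocks, an 8 KiB bitmap in dense blocks) is a simplification that ignores object headers and per-block bookkeeping, and it is not Lucene's exact encoding.

```java
/** Hypothetical estimator comparing a fixed bitset with a roaring-style layout. */
final class BitsetMemorySketch {
    static final int BLOCK_SIZE = 1 << 16; // documents per block
    static final int ARRAY_LIMIT = 4096;   // 4096 docs * 2 bytes = 8 KiB, the size of a block bitmap

    /** One bit per document, rounded up to whole 64-bit words. */
    static long fixedBytes(int maxDoc) {
        return ((maxDoc + 63L) / 64) * 8;
    }

    /** Sums per-block costs over a sorted, duplicate-free doc ID array. */
    static long roaringStyleBytes(int[] sortedDocs) {
        long bytes = 0;
        int i = 0;
        while (i < sortedDocs.length) {
            int block = sortedDocs[i] >>> 16; // high bits select the block
            int count = 0;
            while (i < sortedDocs.length && (sortedDocs[i] >>> 16) == block) {
                i++;
                count++;
            }
            // Sparse block: store the low 16 bits of each doc. Dense block: store a bitmap.
            bytes += (count <= ARRAY_LIMIT) ? 2L * count : BLOCK_SIZE / 8;
        }
        return bytes;
    }
}
```

For one million documents with every thousandth document set (0.1% density), this estimator gives 125,000 bytes for the fixed bitset and 2,000 bytes for the roaring-style layout. These are estimates from the sketch, not the measurements shown in the charts.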
Index Compression
INDEX COMPRESSION
[Bar chart: fields storage (_source) for Apache logs, on a 0 MB to 16,000 MB axis. RAW DATA: 14,641 MB; BEST SPEED: 4,691 MB; BEST SIZE: 2,322 MB.]
INDEX COMPRESSION
[Bar chart: RAM usage (all Lucene features) on geonames.org for Lucene 4.8, Lucene 4.10, and Lucene 5, with Clean and Dirty series, on a 0 MB to 160 MB axis. Bar values as extracted: 41 MB, 89 MB, 160 MB; 28 MB, 42 MB, 160 MB.]
INDEX COMPRESSION
• “best space” option (archive/cold storage)
• optimized merge
• sparse normalization factors, docvalues
• patched compression for outliers, exceptions
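The speed-versus-size trade-off behind the first bullet can be tried with the JDK alone. The sketch below uses `java.util.zip.Deflater` levels as a stand-in; it does not reproduce Lucene's stored-fields codecs, and the class `CompressionTradeoffSketch` and its made-up log lines are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Hypothetical sketch: compare compressed sizes at two DEFLATE levels. */
final class CompressionTradeoffSketch {
    /** Returns the number of compressed bytes produced at the given Deflater level. */
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer); // output is discarded, only counted
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Builds repetitive, log-like sample text (invented lines, not the benchmark data). */
    static byte[] sampleLogs(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("127.0.0.1 - - [01/Jan/2015:00:00:").append(i % 60)
              .append(" +0000] \"GET /page/").append(i % 17)
              .append(" HTTP/1.1\" 200 ").append(512 + i % 97).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

Calling `compressedSize(sampleLogs(2000), Deflater.BEST_SPEED)` and the same with `Deflater.BEST_COMPRESSION` shows how much space a slower setting can save on repetitive data, which is the kind of choice an archive or cold-storage index would make.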
Indexing Performance
INDEXING PERFORMANCE
[Bar chart: indexing throughput in thousands of docs/sec on Apache logs, Lucene 4.10 vs. Lucene 5. Bar values as extracted: 18.7, 12.1.]
INDEXING PERFORMANCE
• Adaptive merge throttling
• Reduced cpu usage (stored fields data)
• Reduced memory usage
• SSD auto-detection in merge scheduler
Index Safety
INDEX SAFETY
• segment and commit identifiers
• atomic commits
• verify integrity at merge
• test filesystems
• faster checkindex
• improved error messages
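One common way to make integrity checks like "verify integrity at merge" possible is to store a checksum alongside the data and recompute it before trusting the bytes. The sketch below appends a CRC32 footer to a payload and verifies it. The class `ChecksumFooterSketch` and its 8-byte footer layout are assumptions for illustration, not Lucene's file format.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical sketch: payload followed by an 8-byte CRC32 footer. */
final class ChecksumFooterSketch {
    /** Returns payload bytes followed by their CRC32 value as a big-endian long. */
    static byte[] withFooter(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + Long.BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /** Recomputes the checksum over the payload and compares it with the stored footer. */
    static boolean verify(byte[] file) {
        if (file.length < Long.BYTES) {
            return false; // too short to even contain a footer
        }
        int payloadLength = file.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLength);
        long stored = ByteBuffer.wrap(file, payloadLength, Long.BYTES).getLong();
        return stored == crc.getValue();
    }
}
```

A merge that calls a check like `verify` on each input before reading it would notice a flipped bit instead of silently copying the corruption into the new segment.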
Other Changes
OTHER CHANGES
• Verbose memory reporting
• Improved parallel execution
• Result diversification support
• Faster index sorting
• …
Thank you
twitter.com/rcmuir
/*
* ---------------------------------------------------------------
* "THE BEER-WARE LICENSE" (Revision 42):
* <rmuir@apache.org> wrote this file. As long as you retain this notice you
* can do whatever you want with this stuff. If we meet some day, and you
* think this stuff is worth it, you can buy me a beer in return. Robert Muir
* ---------------------------------------------------------------
*/

Updates from Lucene land (2015)
