WHAT IS IN A LUCENE INDEX
Adrien Grand
@jpountz

Software engineer at Elasticsearch
Presented by Adrien Grand, Software Engineer, Elasticsearch

Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries makes Lucene a perfect fit for analytics applications and, for some use cases, even a credible replacement for a primary data store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can and cannot solve. This talk will explain those design decisions and give insights into how Lucene stores data on disk and how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.

Published in: Technology
What is in a Lucene index?

  1. WHAT IS IN A LUCENE INDEX
     Adrien Grand, @jpountz
     Software engineer at Elasticsearch
  2. About me
     • Lucene/Solr committer
     • Software engineer at Elasticsearch
     • I like changing the index file formats!
       – stored fields
       – term vectors
       – doc values
       – ...
  3. Why should I learn about Lucene internals?
  4. Why should I learn about Lucene internals?
     • Know the cost of the APIs
       – to build blazing fast search applications
       – don’t commit all the time
       – when to use stored fields vs. doc values
       – maybe Lucene is not the right tool
     • Understand index size
       – oh, term vectors are 1/2 of the index size!
       – I removed 20% of my documents and index size hasn’t changed
     • This is a lot of fun!
  5. Indexing
     • Make data fast to search
       – duplicate data if it helps
       – decide on how to index based on the queries
     • Trade update speed for search speed
       – Grep vs full-text indexing
       – Prefix queries vs edge n-grams
       – Phrase queries vs shingles
     • Indexing is fast
       – 220 GB/hour for 4K docs!
       – http://people.apache.org/~mikemccand/lucenebench/indexing.html
  6. Let’s create an index
     • Tree structure
       – sorted for range queries
       – O(log(n)) search
     [Figure: a tree of terms (data, index, Lucene, sql, term) pointing to the documents “Lucene in action” and “Databases”]
  7. Lucene doesn’t work this way
  8. Another index
     • Store terms and documents in arrays
       – binary search
     Terms and their doc ids:
       0  data    0,1
       1  index   0,1
       2  Lucene  0
       3  term    0
       4  sql     1
     Documents:
       0  Lucene in action
       1  Databases
  9. Another index
     • Store terms and documents in arrays
       – binary search
     Segment:
       term ordinal  terms dict  postings list
       0             data        0,1
       1             index       0,1
       2             Lucene      0
       3             term        0
       4             sql         1

       doc id  document
       0       Lucene in action
       1       Databases
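
The array layout on this slide can be sketched in a few lines of Python. This is only a toy model, not Lucene's implementation: the names `TERMS`, `POSTINGS`, `DOCS` and `lookup` are made up for illustration, and the terms are lowercased here (an assumption) so that the array is in plain sorted order for the binary search.

```python
from bisect import bisect_left

# Toy segment: parallel arrays indexed by term ordinal (hypothetical names).
TERMS = ["data", "index", "lucene", "sql", "term"]   # terms dict, sorted
POSTINGS = [[0, 1], [0, 1], [0], [1], [0]]           # postings list per ordinal
DOCS = ["Lucene in action", "Databases"]             # doc id -> document


def lookup(term):
    """Binary-search the terms dict and return the matching doc ids."""
    ord_ = bisect_left(TERMS, term)
    if ord_ < len(TERMS) and TERMS[ord_] == term:
        return POSTINGS[ord_]
    return []  # term not present in this segment


if __name__ == "__main__":
    for doc_id in lookup("index"):
        print(doc_id, DOCS[doc_id])
```

The position found by the binary search is the term ordinal, which is then used to address the postings array directly.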
  10. Insertions?
     • Insertion = write a new segment
     • Merge segments when there are too many of them
       – concatenate docs, merge terms dicts and postings lists (merge sort!)
     [Figure: two single-document segments, each with local doc id 0.
      Segment 1 (doc 0 = “Lucene in action”): data 0, index 0, Lucene 0, term 0.
      Segment 2 (doc 0 = “Databases”): data 0, index 0, sql 0.
      Merged segment (doc 0 = “Lucene in action”, doc 1 = “Databases”): data 0,1; index 0,1; Lucene 0; term 0; sql 1]
  11. Insertions?
     • Insertion = write a new segment
     • Merge segments when there are too many of them
       – concatenate docs, merge terms dicts and postings lists (merge sort!)
     [Figure: same as the previous slide, except that the doc ids of segment 2 are now remapped to 1 (data 1, index 1, sql 1; doc 1 = “Databases”) before being merged into: data 0,1; index 0,1; Lucene 0; term 0; sql 1]
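
The merge described on these two slides can be sketched as follows. It is a simplified illustration under stated assumptions: a segment is modelled as a `(docs, index)` pair where `index` maps each term to a sorted list of local doc ids, terms are lowercased, and `merge_segments` is a hypothetical helper rather than a Lucene API.

```python
def merge_segments(seg_a, seg_b):
    """Merge two toy segments; each is (docs, {term: sorted doc ids})."""
    docs_a, index_a = seg_a
    docs_b, index_b = seg_b
    base = len(docs_a)              # doc ids of seg_b are shifted by this much
    merged_docs = docs_a + docs_b   # concatenate docs
    terms_a, terms_b = sorted(index_a), sorted(index_b)
    merged = []                     # (term, postings) pairs in sorted term order
    i = j = 0
    # Classic merge step of merge sort over the two sorted terms dicts.
    while i < len(terms_a) or j < len(terms_b):
        ta = terms_a[i] if i < len(terms_a) else None
        tb = terms_b[j] if j < len(terms_b) else None
        if tb is None or (ta is not None and ta < tb):
            merged.append((ta, list(index_a[ta])))
            i += 1
        elif ta is None or tb < ta:
            merged.append((tb, [d + base for d in index_b[tb]]))
            j += 1
        else:
            # Same term in both segments: append the remapped postings.
            merged.append((ta, index_a[ta] + [d + base for d in index_b[tb]]))
            i += 1
            j += 1
    return merged_docs, merged


seg1 = (["Lucene in action"], {"data": [0], "index": [0], "lucene": [0], "term": [0]})
seg2 = (["Databases"], {"data": [0], "index": [0], "sql": [0]})
merged_docs, merged_index = merge_segments(seg1, seg2)
```

Because every doc id of the second segment is larger than every doc id of the first after remapping, concatenating the two postings lists keeps them sorted.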
  12. Deletions?
     • Deletion = turn a bit off
     • Ignore deleted documents when searching and merging (reclaims space)
     • Merge policies favor segments with many deletions
     Terms and their doc ids:
       0  data    0,1
       1  index   0,1
       2  Lucene  0
       3  term    0
       4  sql     1
     Documents and live docs (1 = live, 0 = deleted):
       0  Lucene in action  1
       1  Databases         0
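
A minimal sketch of the live-docs idea, assuming a toy `Segment` class (hypothetical, not Lucene's API) that keeps one bit per document in a Python integer. Deleting a document only clears its bit; the postings lists are left untouched and the deleted doc is filtered out at search time.

```python
class Segment:
    """Toy segment with an immutable postings map and a live-docs bitset."""

    def __init__(self, postings, num_docs):
        self.postings = postings                 # term -> sorted doc ids
        self.live_docs = (1 << num_docs) - 1     # all bits on: every doc is live

    def delete(self, doc_id):
        # Deletion = turn a bit off; nothing else in the segment changes.
        self.live_docs &= ~(1 << doc_id)

    def is_live(self, doc_id):
        return bool((self.live_docs >> doc_id) & 1)

    def search(self, term):
        # Deleted docs are still in the postings; they are skipped here.
        return [d for d in self.postings.get(term, []) if self.is_live(d)]


seg = Segment(
    {"data": [0, 1], "index": [0, 1], "lucene": [0], "term": [0], "sql": [1]},
    num_docs=2,
)
seg.delete(1)  # delete "Databases", as on the slide
```

The space is only reclaimed later, when a merge rewrites the segment without the deleted documents.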
  13. Pros/cons
     • Updates require writing a new segment
       – single-doc updates are costly, bulk updates preferred
       – writes are sequential
     • Segments are never modified in place
       – filesystem-cache-friendly
       – lock-free!
     • Terms are deduplicated
       – saves space for high-freq terms
     • Docs are uniquely identified by an ord
       – useful for cross-API communication
       – Lucene can use several indexes in a single query
     • Terms are uniquely identified by an ord
       – important for sorting: compare longs, not strings
       – important for faceting (more on this later)
  14. Lucene can use several indexes
     Many databases can’t
  15. Index intersection
     red:  1, 2, 10, 11, 20, 30, 50, 100
     shoe: 2, 20, 21, 22, 30, 40, 100
     [Figure: numbered steps 1 to 9 showing the two postings lists being advanced alternately]
     • Lucene’s postings lists support skipping that can be used to “leap-frog”
     • Many databases just pick the most selective index and ignore the other ones
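
The leap-frog intersection can be sketched as below. This is an illustration only: `advance` uses a binary search from the standard `bisect` module as a stand-in for the skip data that real postings lists carry, and both function names are invented for this example.

```python
from bisect import bisect_left


def advance(postings, pos, target):
    """Index of the first entry >= target, searching from pos onwards."""
    return bisect_left(postings, target, pos)


def intersect(a, b):
    """Leap-frog intersection of two sorted doc id lists."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])   # skip ahead in a up to b's current doc
        else:
            j = advance(b, j, a[i])   # skip ahead in b up to a's current doc
    return result


red = [1, 2, 10, 11, 20, 30, 50, 100]
shoe = [2, 20, 21, 22, 30, 40, 100]
```

Each list is only ever advanced to the other list's current doc id, so long runs of non-matching docs are skipped rather than visited one by one.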
  16. What else?
     • We just covered search
     • Lucene does more
       – term vectors
       – norms
       – numeric doc values
       – binary doc values
       – sorted doc values
       – sorted set doc values
  17. Term vectors
     • Per-document inverted index
     • Useful for more-like-this
     • Sometimes used for highlighting
     [Figure: the segment-wide inverted index (data 0,1; index 0,1; Lucene 0; term 0; sql 1) next to one small inverted index per document: doc 0 “Lucene in action” with data, index, Lucene, term; doc 1 “Databases” with data, index, sql]
  18. Numeric/binary doc values
     • Per-doc and per-field single numeric values, stored in a column-stride fashion
     • Useful for sorting and custom scoring
     • Norms are numeric doc values
       doc id  document          field_a  field_b
       0       Lucene in action  42       afc
       1       Databases         1        gce
       2       Solr in action    3        ppy
       3       Java              10       ccn
  19. Sorted (set) doc values
     • Ordinal-enabled per-doc and per-field values
       – sorted: single-valued, useful for sorting
       – sorted set: multi-valued, useful for faceting
     Ordinals:
       doc id  document          ordinals
       0       Lucene in action  1,2
       1       Databases         0
       2       Solr in action    0,1,2
       3       Java              1
     Terms dictionary for this dv field:
       0  distributed
       1  Java
       2  search
  20. Faceting
     • Compute value counts for docs that match a query
       – e.g. category counts on an ecommerce website
     • Naive solution
       – hash table: value to count
       – O(#docs) ordinal lookups
       – O(#docs) value lookups
     • 2nd solution
       – hash table: ord to count
       – resolve values in the end
       – O(#docs) ordinal lookups
       – O(#values) value lookups
     Since ordinals are dense, this can be a simple array
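
The second faceting solution can be sketched with the sorted set doc values from slide 19. The arrays `TERMS` and `DOC_ORDS` and the function `facet_counts` are hypothetical names used for illustration; the counting array relies on the fact that ordinals are dense.

```python
# Toy sorted-set doc values, taken from the slide 19 example.
TERMS = ["distributed", "Java", "search"]       # ordinal -> value
DOC_ORDS = [[1, 2], [0], [0, 1, 2], [1]]        # doc id -> ordinals


def facet_counts(matching_docs):
    """Count values for the matching docs, resolving ordinals only at the end."""
    counts = [0] * len(TERMS)        # dense ordinals: a plain array is enough
    for doc in matching_docs:        # O(#docs) ordinal lookups
        for ord_ in DOC_ORDS[doc]:
            counts[ord_] += 1
    # O(#values) value lookups, done once after counting.
    return {TERMS[o]: c for o, c in enumerate(counts) if c > 0}
```

For a query matching docs 0, 2 and 3, `facet_counts([0, 2, 3])` counts one hit for "distributed", three for "Java" and two for "search".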
  21. How can I use these APIs?
     • These are the low-level Lucene APIs; everything is built on top of them: searching, faceting, scoring, highlighting, etc.
       API                 Useful for                            Method
       Inverted index      Term -> doc ids, positions, offsets   AtomicReader.fields
       Stored fields       Summaries of search results           IndexReader.document
       Live docs           Ignoring deleted docs                 AtomicReader.liveDocs
       Term vectors        More like this                        IndexReader.termVectors
       Doc values / Norms  Sorting/faceting/scoring              AtomicReader.get*Values
  22. Wrap up
     • Data duplicated up to 4 times
       – not a waste of space!
       – easy to manage thanks to immutability
     • Stored fields vs doc values
       – Optimized for different access patterns
       – get many field values for a few docs: stored fields
       – get a few field values for many docs: doc values
     Stored fields (doc,field laid out doc by doc): 0,A 0,B 0,C 1,A 1,B 1,C 2,A 2,B 2,C
       At most 1 seek per doc
     Doc values (doc,field laid out field by field): 0,A 1,A 2,A 0,B 1,B 2,B 0,C 1,C 2,C
       At most 1 seek per doc per field, BUT more disk / file-system cache-friendly
  23. File formats
  24. Important rules
     • Save file handles
       – don’t use one file per field or per doc
     • Avoid disk seeks whenever possible
       – disk seek on spinning disk is ~10 ms
     • BUT don’t ignore the filesystem cache
       – random access in small files is fine
     • Light compression helps
       – less I/O
       – smaller indexes
       – filesystem-cache-friendly
  25. Codecs
     • File formats are codec-dependent
     • Default codec tries to get the best speed for little memory
       – To trade memory for speed, don’t use RAMDirectory:
       – MemoryPostingsFormat, MemoryDocValuesFormat, etc.
     • Detailed file formats available in javadocs
       – http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
  26. Compression techniques
     • Bit packing / vInt encoding
       – postings lists
       – numeric doc values
     • LZ4
       – code.google.com/p/lz4
       – lightweight compression algorithm
       – stored fields, term vectors
     • FSTs
       – conceptually a Map<String, ?>
       – keys share prefixes and suffixes
       – terms index
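
As an illustration of vInt encoding, here is a minimal sketch for non-negative integers. It follows the usual variable-length scheme (seven data bits per byte, low-order bits first, high bit set when more bytes follow); `write_vint` and `read_vint` are names chosen for this example and are not Lucene methods.

```python
def write_vint(value):
    """Encode a non-negative int using 7 bits per byte, low-order bits first."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)   # high bit set: more bytes follow
        value >>= 7
    out.append(value)                       # last byte has the high bit clear
    return bytes(out)


def read_vint(data, pos=0):
    """Decode one vInt starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if (b & 0x80) == 0:
            return result, pos
        shift += 7
```

Small values, which are common after delta-encoding a postings list, fit in a single byte; values from 128 to 16383 take two bytes.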
  27. What happens when I run a TermQuery?
  28. 1. Terms index
     • Lookup the term in the terms index
       – In-memory FST storing terms prefixes
       – Gives the offset to look at in the terms dictionary
       – Can fast-fail if no terms have this prefix
     [Figure: FST with weighted arcs b/2, l/4, a/1 and y/3, whose outputs sum along each path to br = 2, brac = 3, luc = 4, lyr = 7]
  29. 2. Terms dictionary
     • Jump to the given offset in the terms dictionary
       – compressed based on shared prefixes, similarly to a burst trie
       – called the “BlockTree terms dict”
     • Read sequentially until the term is found
     [prefix=luc]
       a, freq=1, offset=101        <- Jump here: not found
       as, freq=1, offset=149       <- Not found
       ene, freq=9, offset=205      <- Found
       ky, freq=7, offset=260
       rative, freq=5, offset=323
  30. 3. Postings lists
     • Jump to the given offset in the postings lists
     • Encoded using modified FOR (Frame of Reference) delta
       – 1. delta-encode
       – 2. split into blocks of N=128 values
       – 3. bit packing per block
       – 4. if remaining docs, encode with vInt
     Example with N=4:
       doc ids: 1,3,4,6,8,20,22,26,30,31
       deltas:  1,2,1,2,2,12,2,4,4,1
       blocks:  [1,2,1,2] (2 bits per value), [2,12,2,4] (4 bits per value), remaining 4, 1 (vInt-encoded)
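
The four encoding steps can be reproduced on the slide's example with a short sketch. It is a simplification under stated assumptions: `encode_postings` only computes the bit width chosen for each full block and returns the leftover deltas that would be vInt-encoded, and `pack` shows one possible way to pack a block into an integer. Neither is Lucene's actual on-disk layout.

```python
def delta_encode(doc_ids):
    """Step 1: replace each doc id by its gap from the previous one."""
    prev = 0
    deltas = []
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def encode_postings(doc_ids, block_size=128):
    """Steps 2-4: split into full blocks, pick bits per value, keep the tail."""
    deltas = delta_encode(doc_ids)
    full = len(deltas) - len(deltas) % block_size
    blocks = []
    for start in range(0, full, block_size):
        block = deltas[start:start + block_size]
        bits = max(block).bit_length()      # width needed by the largest delta
        blocks.append((bits, block))
    tail = deltas[full:]                    # remaining docs: vInt-encoded
    return blocks, tail


def pack(block, bits):
    """Pack the values of one block into a single int, `bits` bits each."""
    packed = 0
    for value in block:
        packed = (packed << bits) | value
    return packed


blocks, tail = encode_postings([1, 3, 4, 6, 8, 20, 22, 26, 30, 31], block_size=4)
```

With `block_size=4` this yields the two blocks and the two leftover values shown on the slide, using 2 and 4 bits per value respectively.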
  31. 4. Stored fields
     • In-memory index for a subset of the doc ids
       – memory-efficient thanks to monotonic compression
       – searched using binary search
     • Stored fields
       – stored sequentially
       – compressed (LZ4) in 16+KB blocks
     [Figure: docs 0 to 6 stored sequentially in 16KB blocks; the in-memory index holds docId=0 offset=42, docId=3 offset=127, docId=4 offset=199]
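
The stored-fields lookup can be sketched with the three index entries from the figure. This is a toy model: `BLOCK_START_DOCS`, `BLOCK_OFFSETS` and `block_offset` are hypothetical names, and decompressing the block to extract the document is left out.

```python
from bisect import bisect_right

# In-memory index: first doc id of each compressed block and its file offset.
BLOCK_START_DOCS = [0, 3, 4]
BLOCK_OFFSETS = [42, 127, 199]


def block_offset(doc_id):
    """Binary-search the index for the block that contains doc_id."""
    if doc_id < 0:
        raise ValueError("doc ids are non-negative")
    # Last block whose first doc id is <= doc_id.
    i = bisect_right(BLOCK_START_DOCS, doc_id) - 1
    return BLOCK_OFFSETS[i]
```

Only the first doc id of each block is kept in memory, so finding a document costs one binary search plus one seek to the returned offset.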
  32. Query execution
     • 2 disk seeks per field for search
     • 1 disk seek per doc for stored fields
     • It is common that the terms dict / postings lists fit into the file-system cache
     • “Pulse” optimization
       – For unique terms (freq=1), postings are inlined in the terms dict
       – Only 1 disk seek
       – Will always be used for your primary keys
  33. Quiz
  34. What is happening here?
     [Figure: qps plotted against #docs in the index, with two drops marked 1 and 2]
  35. What is happening here?
     [Figure: qps plotted against #docs in the index, with two drops marked 1 and 2]
     1: Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
  36. What is happening here?
     [Figure: qps plotted against #docs in the index, with two drops marked 1 and 2]
     1: Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
     2: Terms dict / postings lists not fully in the cache
  37. Thank you!