Elasticsearch - under the hood

Elasticsearch is quite a common tool nowadays, usually as part of the ELK stack, but in some cases it supports the main feature of a system as its search engine. Documentation on regular use cases and on general usage is pretty good, but how does it really work, and how does it behave beneath the surface of the API? This talk looks under the hood of Elasticsearch and dives deep into the largely unknown implementation details. It covers cluster behaviour, communication with Lucene, and Lucene internals, literally down to bits and pieces. Come and see Elasticsearch dissected.

  1. Elasticsearch: Under the hood (August 2018)
  2. Disclaimer
     ● Elasticsearch 6.3
     ● Lucene 7.3
  3. What is it? What is it good for?
     ● Full text search
     ● Scalable and robust full text search
  4. What is Lucene? What does it have to do with Elasticsearch?
     ● Full text search library
     ● Elasticsearch is just a wrapper around it which provides scalability, durability and a REST API
  5. API level
     Index document
     POST doc/default
     {
       "class": "ConfigurationParser",
       "method": "build",
       "description": "<p class='paragraph'>Creates an instance of <link>ObjectGenerator</link> based on provided configuration. Resulting <link>ObjectGenerator</link> will try to convert configured output to specified <code>objectType</code>.</p>"
     }
  6. API level
     Update document
     POST doc/default/1/_update
     {
       "doc": { "description": "New description value" }
     }
  7. API level
     Delete document
     DELETE doc/default/1
  8. API level
     Search
     GET doc/default/_search
     {
       "query": {
         "match": {
           "title": {
             "query": "ObjectGenerator"
           }
         }
       }
     }
  9. Cluster level
  10. What does it look like?
      [Diagram: an ES cluster named “Elasticsearch” made of four ES nodes. Each node has a UUID name and contains a REST layer, Lucene and a transaction log.]
  11. Are all nodes of the same type?
      ● Master node
      ● Data node
      ● Ingest node
      ● Tribe node
      ● Coordinator node
      ● Default case
  12. Shards
      [Diagram: an index holding Doc1 to Doc12 is split into five shards hosted on nodes.]
      Index - S1: Doc1, Doc4, Doc8
      Index - S2: Doc3, Doc5, Doc6
      Index - S3: Doc2, Doc11
      Index - S4: Doc7, Doc10
      Index - S5: Doc9, Doc12
  13. Replicas
      [Diagram: five nodes, each holding one copy of shard S1 (Doc1, Doc4, Doc8).]
      Index - S1 P
      Index - S1 R1
      Index - S1 R2
      Index - S1 R3
      Index - S1 R4
  14. Scaling
      ● Single node cluster with 3 shards and 1 replica
      ● Unassigned shards problem
      Single node cluster state:
        Node 1: S1 P, S2 P, S3 P
  15. Scaling out
      ● Two node cluster with 3 shards and 1 replica
      ● To prevent unassigned shards: number of nodes ≥ number of replicas + 1
      Two node cluster state:
        Node 1: S1 P, S2 P, S3 P
        Node 2: S1 R, S2 R, S3 R
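The unassigned shards problem from the scaling slides can be sketched in a few lines. This is a simplified model that assumes only one rule, namely that two copies of the same shard never share a node; the function name `unassigned_copies` is made up for illustration, and the real allocator weighs more factors than this.

```python
def unassigned_copies(nodes: int, primaries: int, replicas: int) -> int:
    """Count shard copies that cannot be placed anywhere.

    Simplified model: every shard has 1 primary + `replicas` replica copies,
    and no two copies of the same shard may live on the same node.
    """
    copies_per_shard = 1 + replicas
    placeable_per_shard = min(nodes, copies_per_shard)
    return primaries * (copies_per_shard - placeable_per_shard)


# 3 shards with 1 replica, as on the slides
print(unassigned_copies(nodes=1, primaries=3, replicas=1))  # all 3 replicas stay unassigned
print(unassigned_copies(nodes=2, primaries=3, replicas=1))  # every copy has a home
```

Under this model the cluster has nothing unassigned as soon as the node count reaches the number of replicas plus one.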
  16. Scaling out
      ● Three node cluster with 3 shards and 1 replica
      ● Load spread across all nodes
      Three node cluster state:
        Node 1: S1 P, S2 P
        Node 2: S2 R
        Node 3: S1 R
        S3 P, S3 R (split between Node 2 and Node 3)
  17. Scaling out
      ● Seven node cluster with 3 shards and 1 replica, one node unused
      ● Increase number of shards (not possible)
      ● Increase replication factor (possible on running cluster)
      Seven node cluster state:
        Node 1: S1 P
        Node 2: S2 P
        Node 3: S3 P
        Node 4: S1 R
        Node 5: S2 R
        Node 6: S3 R
        Node 7: (unused)
  18. Scaling in
      ● Make sure you have enough nodes to support the replication factor when a node is killed
      ● Wait for green status
      ● If necessary, lower the replication factor
  19. Replacing the node
      ● Same as scale out / scale in
  20. What does the write look like?
      Coordinator node:
      ● Generate doc ID if not present
      ● Hash ID to determine replication group (routing param)
      ES Cluster (P = primary, R = replica, ISR = in-sync replica):
        Node 1: S2 ISR, S3 P
        Node 2: S2 P, S3 R
        Node 3: S1 ISR, S2 ISR
        Node 4: S1 R, S4 P
        Node 5: S1 ISR, S4 ISR
        Node 6: S1 P, S3 ISR, S4 ISR
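The two coordinator steps on the write slide (generate an ID if missing, then hash it to pick the replication group) can be sketched as follows. This is a hedged illustration only: `route` and `NUM_PRIMARY_SHARDS` are hypothetical names, and `zlib.crc32` is a stand-in hash chosen because it ships with Python, not the hash function Elasticsearch uses.

```python
import uuid
import zlib
from typing import Optional, Tuple

NUM_PRIMARY_SHARDS = 4  # the slide's cluster has shards S1..S4


def route(doc_id: Optional[str], routing: Optional[str] = None) -> Tuple[str, int]:
    """Return (doc_id, zero-based shard number) for a write request."""
    if doc_id is None:
        doc_id = uuid.uuid4().hex  # stand-in for the auto-generated document ID
    # The routing parameter, when given, replaces the ID as the hash input.
    key = routing if routing is not None else doc_id
    # crc32 is an assumption for illustration, not the real hash function.
    shard = zlib.crc32(key.encode("utf-8")) % NUM_PRIMARY_SHARDS
    return doc_id, shard


doc_id, shard = route(None)
print(doc_id, "-> shard", shard)
```

Because the shard number depends on the modulo of the primary shard count, the same ID always lands in the same replication group, which is also why the slides list increasing the number of shards as not possible.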
  21. What does the read look like?
      Coordinator node:
      ● Resolve search request to relevant shards
      ● Combine the results
      ES Cluster (P = primary, R = replica, ISR = in-sync replica):
        Node 1: S2 ISR, S3 P
        Node 2: S2 P, S3 R
        Node 3: S1 ISR, S2 ISR
        Node 4: S1 R, S4 P
        Node 5: S1 ISR, S4 ISR
        Node 6: S1 P, S3 ISR, S4 ISR
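The "combine the results" step on the read slide amounts to merging per-shard hit lists into one global ranking. A minimal sketch, assuming each shard returns its hits already sorted by descending score; the `Hit` tuple and the `combine` function are hypothetical names for this example.

```python
import heapq
from itertools import islice
from typing import List, Tuple

Hit = Tuple[float, str]  # (score, doc_id)


def combine(shard_hits: List[List[Hit]], size: int) -> List[Hit]:
    """Merge per-shard hit lists (each sorted by descending score) into a global top `size`."""
    merged = heapq.merge(*shard_hits, key=lambda hit: -hit[0])
    return list(islice(merged, size))


hits_from_shards = [
    [(3.0, "a"), (1.0, "b")],   # hits returned by one shard
    [(2.5, "c"), (0.5, "d")],   # hits returned by another shard
]
print(combine(hits_from_shards, size=3))
```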
  22. What if something goes wrong?
      ● Network partition (one master)
      ● Network partition (two masters)
      ● Network partition (three masters)
      ● Primary shard node failure (in sync replicas available)
      ● Primary shard node failure (no in sync replicas available)
      ● Write replication (replica write failure)
      ● Node failure (read)
  23. What if something goes wrong?
      Network partition (one master)
      ● discovery.zen.no_master_block
      [Diagram: ES cluster with one Master and nodes holding Shard 1 P, Shard 1 R, Shard 2 P, Shard 2 R, Shard 2 R.]
  24. What if something goes wrong?
      Network partition (two masters)
      ● discovery.zen.minimum_master_nodes
      [Diagram: ES cluster with two Masters and nodes holding Shard 1 P, Shard 1 R, Shard 2 P, Shard 2 R.]
  25. What if something goes wrong?
      Network partition (three masters)
      ● discovery.zen.minimum_master_nodes
      [Diagram: ES cluster with three Masters and nodes holding Shard 1 P, Shard 1 R, Shard 2 P.]
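The `discovery.zen.minimum_master_nodes` setting named on the partition slides is usually set to a quorum of the master-eligible nodes, that is (master-eligible nodes / 2) + 1. The sketch below shows why that value avoids two masters after a partition; both function names are made up for this example.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: (N / 2) + 1, using integer division."""
    return master_eligible // 2 + 1


def can_elect_master(visible_master_eligible: int, minimum: int) -> bool:
    """A partition side may elect a master only if it sees at least `minimum` candidates."""
    return visible_master_eligible >= minimum


quorum = minimum_master_nodes(3)
print(quorum)                       # quorum for three master-eligible nodes
print(can_elect_master(2, quorum))  # majority side of a 2/1 split
print(can_elect_master(1, quorum))  # minority side of a 2/1 split
```

With three master-eligible nodes only the side holding two of them can elect a master. With two master-eligible nodes the quorum is also two, so neither side of a split can elect one.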
  26. What if something goes wrong?
      Primary shard node failure (in sync replicas available)
      [Diagram: ES cluster with Master and nodes holding Shard 1 P, Shard 1 ISR, Shard 1 R, Shard 2 P, Shard 2 ISR; one copy becomes Shard 1 P (new).]
  27. What if something goes wrong?
      Primary shard node failure (no in sync replicas available)
      ● allocate_stale_primary cmd
      [Diagram: ES cluster with Master and nodes holding Shard 1 P, Shard 1 R, Shard 1 R, Shard 1 ISR, Shard 2 P, Shard 2 ISR.]
  28. What if something goes wrong?
      Write replication (replica write failure)
      [Diagram: ES cluster with Coordinator node, Master and nodes holding Shard 1 P, Shard 1 ISR, Shard 1 ISR, Shard 1 R, Shard 1 R.]
  29. What if something goes wrong?
      Node failure (read)
      [Diagram: ES cluster with Coordinator node, Master and nodes holding Shard 1 P, Shard 1 R, Shard 2 P, Shard 2 R.]
  30. Lucene level
  31. How is write processed?
      Elasticsearch -> Lucene: Character filters -> Tokenizer -> Token filters -> File storage
  32. How is write processed?
      Character filters
      ● Mapping character filter
      ● HTML strip character filter
      ● Pattern replace character filter
      Input:
        <p class='paragraph'>Creates an instance of <link>ObjectGenerator</link> based on provided configuration. Resulting <link>ObjectGenerator</link> will try to convert configured output to specified <code>objectType</code>.</p>
      Output:
        Creates an instance of ObjectGenerator based on provided configuration. Resulting ObjectGenerator will try to convert configured output to specified objectType.
  33. How is write processed?
      Tokenizer
      ● Standard tokenizer
      ● Keyword tokenizer
      ● Letter tokenizer
      ● Lowercase tokenizer
      ● N gram tokenizer
      ● Edge n gram tokenizer
      ● Regular expression pattern tokenizer
      Input:
        Creates an instance of ObjectGenerator based on provided configuration. Resulting ObjectGenerator will try to convert configured output to specified objectType.
      Output:
        [Creates, an, instance, of, ObjectGenerator, based, on, provided, configuration, Resulting, ObjectGenerator, will, try, to, convert, configured, output, to, specified, objectType]
  34. How is write processed?
      Token filters
      ● Lower case filter
      ● English possessive filter
      ● Stop filter
      ● Synonym filter
      ● Reversed wildcard filter
      ● English minimal stem filter
      Input:
        [Creates, an, instance, of, ObjectGenerator, based, on, provided, configuration, Resulting, ObjectGenerator, will, try, to, convert, configured, output, to, specified, objectType]
      Output:
        [creates, instance, ObjectGenerator, based, provided, configuration, resulting, try, convert, configured, output, specified, objectType]
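The three analysis stages (character filters, tokenizer, token filters) chain together as plain function composition. The sketch below is a toy version in Python, not Lucene's implementation: the regular expressions are naive, only a lowercase and a stop filter are applied (to every token), and the `STOP_WORDS` set is an assumption chosen to match the words that disappear in the slide example.

```python
import re
from typing import List

# Assumed stop words, picked so the example sentence loses the same words as on the slide.
STOP_WORDS = {"an", "of", "on", "will", "to"}


def strip_html(text: str) -> str:
    """Character filter: drop anything that looks like a tag (very naive)."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> List[str]:
    """Tokenizer: keep runs of letters and digits, discard everything else."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filters(tokens: List[str]) -> List[str]:
    """Token filters: lowercase every token, then remove stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text: str) -> List[str]:
    """Run the stages in the order shown on the slides."""
    return token_filters(tokenize(strip_html(text)))


print(analyze("<p>Creates an instance of <link>ObjectGenerator</link> "
              "based on provided configuration.</p>"))
```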
  35. Inverted index
      File storage - Logical view
      ● Difference between forward and inverted index
      ● Doc1: “Peter has a brown dog and a white cat”
      ● Doc2: “Mike has a black dog”
      ● Doc3: “Rachel has a brown cat”
      Forward index
        Doc1   brown, cat, dog, peter, white
        Doc2   black, dog, mike
        Doc3   brown, cat, rachel
      Segment
        Inverted index
          term ordinal   terms dict   postings list
          0              black        1
          1              brown        0, 2
          2              cat          0, 2
          3              dog          0, 1
          4              mike         1
          5              peter        0
          6              rachel       2
          7              white        0
        Documents
          doc id   document
          0        Doc1 (Peter has a brown dog and a white cat.)
          1        Doc2 (Mike has a black dog.)
          2        Doc3 (Rachel has a brown cat.)
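Building the inverted index from the three example documents takes only a few lines. This is a minimal sketch: the words missing from the slide's terms dictionary ("has", "a", "and") are treated as an assumed stop list, and `build_inverted_index` is a name invented for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Words absent from the slide's terms dictionary, treated here as an assumed stop list.
STOP_WORDS = {"has", "a", "and"}

docs = [
    "Peter has a brown dog and a white cat",  # doc id 0 (Doc1)
    "Mike has a black dog",                   # doc id 1 (Doc2)
    "Rachel has a brown cat",                 # doc id 2 (Doc3)
]


def build_inverted_index(documents: List[str]) -> Dict[str, List[int]]:
    """Map each term to the sorted list of doc ids that contain it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for term in set(text.lower().split()) - STOP_WORDS:
            postings[term].append(doc_id)
    # Sort the terms so the result reads like the terms dictionary.
    return dict(sorted(postings.items()))


index = build_inverted_index(docs)
for ordinal, (term, doc_ids) in enumerate(index.items()):
    print(ordinal, term, doc_ids)
```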
  36. Inverted index
      File storage - Logical view (deleting documents)
      Segment
        Inverted index
          term ordinal   terms dict   postings list
          0              black        1
          1              brown        0, 2
          2              cat          0, 2
          3              dog          0, 1
          4              mike         1
          5              peter        0
          6              rachel       2
          7              white        0
        Documents
          doc id   document
          0        Doc1 (Peter has a brown dog and a white cat.)
          1        Doc2 (Mike has a black dog.)
          2        Doc3 (Rachel has a brown cat.)
        Live documents: 0 1 2
  37. Inverted index
      File storage - Logical view (merge segments)
      Segment 1
        Inverted index
          0   black    1
          1   brown    0, 2
          2   cat      0, 2
          3   dog      0, 1
          4   mike     1
          5   peter    0
          6   rachel   2
          7   white    0
        Documents
          0   Doc1
          1   Doc2
          2   Doc3
      Segment 2
        Inverted index
          0   balloon   3
          1   boy       3, 4
          2   brown     4
          3   cat       4
          4   little    3, 4
          5   red       3
        Documents
          3   Doc4
          4   Doc5
      Merged segment
        Inverted index
          0   balloon   3
          1   black     1
          2   boy       3, 4
          3   brown     2, 4
          4   cat       2, 4
          5   dog       1
          6   little    3, 4
          7   mike      1
          8   rachel    2
          9   red       3
        Documents
          1   Doc2
          2   Doc3
          3   Doc4
          4   Doc5
      Live documents (as listed on the slide): 1 2 3 4; 0 1 2; 1 2 3 4
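The merge on this slide can be read as "union the postings of both segments, keeping only live documents". A sketch under that reading follows. It assumes Doc1 (doc id 0) is the deleted document and, like the slide's logical view, it keeps the original doc ids; `merge_segments` is a hypothetical helper, not a Lucene API.

```python
from typing import Dict, List, Set

Postings = Dict[str, List[int]]


def merge_segments(segments: List[Postings], live_docs: Set[int]) -> Postings:
    """Union the postings of several segments, dropping deleted doc ids.

    Terms whose postings list becomes empty are left out of the result.
    """
    merged: Dict[str, Set[int]] = {}
    for postings in segments:
        for term, doc_ids in postings.items():
            kept = [doc_id for doc_id in doc_ids if doc_id in live_docs]
            if kept:
                merged.setdefault(term, set()).update(kept)
    return {term: sorted(ids) for term, ids in sorted(merged.items())}


segment_1 = {"black": [1], "brown": [0, 2], "cat": [0, 2], "dog": [0, 1],
             "mike": [1], "peter": [0], "rachel": [2], "white": [0]}
segment_2 = {"balloon": [3], "boy": [3, 4], "brown": [4], "cat": [4],
             "little": [3, 4], "red": [3]}

# Doc id 0 is assumed deleted, so only ids 1..4 are live.
merged = merge_segments([segment_1, segment_2], live_docs={1, 2, 3, 4})
for ordinal, (term, doc_ids) in enumerate(merged.items()):
    print(ordinal, term, doc_ids)
```

The terms "peter" and "white" occur only in the deleted document, so they vanish from the merged terms dictionary.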
  38. What does the write look like?
      Logical view (write path, refresh & commit)
      [Diagram: Elasticsearch writes each document to the Lucene in-mem buffer (Memory) and to the Transaction log (Disk). Segments Seg 1 to Seg 5 are shown in memory and on disk, with a flush step and a Commit point.]
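The write path, refresh and commit from this slide can be mimicked with a toy class. This is a hedged sketch of the general idea only: `ToyShard` and its attributes are invented for illustration, and it models three behaviours, a write goes to both the buffer and the transaction log, a refresh turns the buffer into a searchable segment, and a flush commits the segments and starts an empty transaction log.

```python
from typing import List


class ToyShard:
    """Toy model of the write path: buffer + transaction log, refresh, flush."""

    def __init__(self) -> None:
        self.buffer: List[str] = []               # in-memory buffer, not searchable yet
        self.translog: List[str] = []             # operations not covered by a commit
        self.open_segments: List[List[str]] = []  # searchable, not yet committed
        self.committed: List[List[str]] = []      # segments covered by the commit point

    def write(self, doc: str) -> None:
        """Record the document in the buffer and in the transaction log."""
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self) -> None:
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.open_segments.append(self.buffer)
            self.buffer = []

    def flush(self) -> None:
        """Commit all segments and start a fresh transaction log."""
        self.refresh()
        self.committed.extend(self.open_segments)
        self.open_segments = []
        self.translog = []

    def searchable(self) -> List[str]:
        """Documents visible to a search: everything that lives in a segment."""
        return [doc for segment in self.committed + self.open_segments for doc in segment]


shard = ToyShard()
shard.write("doc-1")
print(shard.searchable())  # nothing visible before a refresh
shard.refresh()
print(shard.searchable())  # visible after the refresh
```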
  39. What does the read look like?
      Logical view (read path)
      [Diagram: a read goes to Lucene, which reads segments Seg1 to Seg4 from Memory and Disk (Seg 1 and Seg 2 sit behind the Commit point), merges the per-segment results and returns the response.]
  40. Compaction
      [Diagram: on Disk, Lucene compacts Seg1, Seg2, Seg3 and Seg4 into Seg5, followed by a new Commit point.]
  41. Lucene low level
  42. Lucene codecs
      ● Abstraction over data format within files
      ● Keeps low level details away from Lucene
      ● File formats are codec-dependent
  43. File formats
      Per-index files
      ● Each index in a separate UUID-named dir (Elasticsearch’s doing; prevents index corruption when recreating an index)
      ● Segments file (segments_1, segments_2, …)
      ● Lock file (write.lock)
  44. File formats
      Per-segment files
      ● Segment info (.si) - Lucene ver, num of docs, OS, OS ver, Java ver, files included
      ● Term index (.tip) - Index into the Term Dictionary
      ● Term Dictionary (.tim) - Stores term info
      ● Postings (.doc, .pos) - Stores which documents contain each term (.doc) and the positions of the term within them (.pos)
      ● Field index (.fdx) - Index into Field Data
      ● Field Data (.fdt) - Stored fields for documents (real values)
      ● ...
  45. What does the read look like?
      Term query
      Term = tomato
  46. What does the read look like?
      File view - Term Index
      ● FST (Finite State Transducer)
      ● Stores term prefixes
      ● Map String -> Object
      Arcs (label / output): t / 1, b / 7, o / 1, r / 2, h / 3, e / 6
      Mapped prefixes: to = 2, thr = 6, br = 9, the = 10, be = 13
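The numbers on the Term Index slide are consistent with a transducer whose result is the sum of the outputs on the arcs it walks (for example t / 1 plus o / 1 gives to = 2). The sketch below encodes one topology that reproduces all five mappings. The state numbering and the shared state reached by both `h` and `b` are assumptions, since the flattened slide does not preserve the drawing.

```python
from typing import Dict, Optional, Tuple

# state -> {label: (next_state, output)}; state ids are made up for this sketch.
ARCS: Dict[int, Dict[str, Tuple[int, int]]] = {
    0: {"t": (1, 1), "b": (2, 7)},
    1: {"o": (3, 1), "h": (2, 3)},
    2: {"r": (3, 2), "e": (3, 6)},  # shared by "th" and "b", so suffix arcs are stored once
    3: {},
}
FINAL_STATES = {3}


def lookup(prefix: str) -> Optional[int]:
    """Walk the arcs for `prefix` and sum their outputs; None if it is not accepted."""
    state, total = 0, 0
    for label in prefix:
        arc = ARCS[state].get(label)
        if arc is None:
            return None
        state, output = arc
        total += output
    return total if state in FINAL_STATES else None


for prefix in ("to", "thr", "br", "the", "be"):
    print(prefix, "=", lookup(prefix))
```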
  47. What does the read look like?
      File view - Term Dictionary
      ● Jump to the given block (offset)
      ● Number of items within the block (25 - 48)
      [Prefix = to]
        Suffix   Frequency   Offset
        ad       1           102      <- Jump here, not found
        ast      1           135      Not found
        aster    2           167      Not found
        day      7           211      Not found
        e        2           233      Not found
        ilette   3           251      Not found
        mato     8           287      Found
        nic      5           309
        oth      3           355
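The lookup on the Term Dictionary slide is a sequential scan of one block: strip the shared prefix, then compare suffixes until the term is found. A minimal sketch using the rows from the slide follows; `find_in_block` is a hypothetical helper, and the early exit relies on the block being sorted, which holds for the rows shown.

```python
from typing import List, Optional, Tuple

# (suffix, frequency, offset) rows of the block for prefix "to", taken from the slide
BLOCK: List[Tuple[str, int, int]] = [
    ("ad", 1, 102), ("ast", 1, 135), ("aster", 2, 167), ("day", 7, 211),
    ("e", 2, 233), ("ilette", 3, 251), ("mato", 8, 287), ("nic", 5, 309),
    ("oth", 3, 355),
]


def find_in_block(prefix: str, block: List[Tuple[str, int, int]],
                  term: str) -> Optional[Tuple[int, int]]:
    """Scan the block for `term`; return (frequency, postings offset) or None."""
    if not term.startswith(prefix):
        return None
    wanted = term[len(prefix):]
    for suffix, frequency, offset in block:
        if suffix == wanted:
            return frequency, offset
        if suffix > wanted:  # rows are sorted, so the term cannot appear later
            return None
    return None


print(find_in_block("to", BLOCK, "tomato"))
```

The returned offset is the position used for the jump into the postings list.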
  48. What does the read look like?
      File view - Postings lists
      ● Jump to the given offset in the postings list
      ● Encoded using modified FOR (Frame of Reference) delta
        ○ delta-encode
        ○ split into blocks of N = 128 values
        ○ bit packing per block
        ○ if remaining, encode with vInt
      Example with N = 4
        Offset   Document ids
        ...      ...
        287      1, 3, 4, 6, 8, 20, 22, 26, 30, 31
        ...      ...
      Delta encode: 1, 2, 1, 2, 2, 12, 2, 4, 4, 1
      Split to blocks:
        [1, 2, 1, 2]    2 bits per value, total: 1 byte
        [2, 12, 2, 4]   4 bits per value, total: 2 bytes
        4, 1            vInt encoded, 1 byte per value, total: 2 bytes
      Uncompressed: 40 (10 * 4) bytes
      Compressed: 5 (1 + 2 + 2) bytes
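The arithmetic of the N = 4 example can be checked with a short script. It is a sketch of the size calculation only, not of Lucene's on-disk format: it ignores per-block headers and assumes every leftover delta fits into a single vInt byte (that is, it is below 128). Both function names are invented for the example.

```python
from typing import List


def delta_encode(doc_ids: List[int]) -> List[int]:
    """Replace each doc id by its distance from the previous one."""
    previous = 0
    deltas = []
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def compressed_size(doc_ids: List[int], block_size: int) -> int:
    """Bytes needed: bit-packed full blocks plus one byte per leftover value.

    Assumes every leftover delta fits into a single vInt byte (< 128).
    """
    deltas = delta_encode(doc_ids)
    full = len(deltas) - len(deltas) % block_size
    total_bytes = 0
    for start in range(0, full, block_size):
        block = deltas[start:start + block_size]
        bits = max(block).bit_length()  # bits per value for this block
        total_bytes += (bits * block_size + 7) // 8
    total_bytes += len(deltas) - full   # vInt-encoded tail, one byte each
    return total_bytes


doc_ids = [1, 3, 4, 6, 8, 20, 22, 26, 30, 31]
print(delta_encode(doc_ids))
print(compressed_size(doc_ids, block_size=4), "bytes instead of", 4 * len(doc_ids))
```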
  49. What does the read look like?
      File view - Field Index & Field Data
      ● Stored sequentially
      ● Compressed using LZ4 in 16+KB blocks
      Field Index
        Starting doc id   Offset
        0                 33
        4                 188
        5                 312
        7                 423
        12                605
        13                811
        20                934
        25                1084
      Field Data (16KB per block)
        [Offset = 33]    Doc 0, Doc 1, Doc 2, Doc 3
        [Offset = 188]   Doc 4
        [Offset = 312]   Doc 5, Doc 6
        [Offset = 423]   Doc 7, ...
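Finding the right Field Data block for a document is a search for the last index entry whose starting doc id is not greater than the requested id. A minimal sketch with the values from the slide follows; `block_offset` is a hypothetical helper, and a binary search is one natural way to do the lookup, not necessarily how Lucene implements it.

```python
import bisect
from typing import List, Tuple

# (starting doc id, offset) pairs of the Field Index, taken from the slide
FIELD_INDEX: List[Tuple[int, int]] = [
    (0, 33), (4, 188), (5, 312), (7, 423),
    (12, 605), (13, 811), (20, 934), (25, 1084),
]
START_DOC_IDS = [start for start, _ in FIELD_INDEX]


def block_offset(doc_id: int) -> int:
    """Return the offset of the Field Data block that contains `doc_id`."""
    if doc_id < 0:
        raise ValueError("doc id must be non-negative")
    # Last entry whose starting doc id is <= doc_id.
    position = bisect.bisect_right(START_DOC_IDS, doc_id) - 1
    return FIELD_INDEX[position][1]


print(block_offset(6))  # Doc 6 lives in the block that starts with Doc 5
```

Once the block is located it has to be decompressed as a whole before the single document can be read out of it.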
  50. References
      ● https://www.elastic.co/guide/en/elasticsearch/reference/6.3/index.html
      ● https://lucene.apache.org/core/7_3_1/index.html
      ● https://www.elastic.co/blog/tracking-in-sync-shard-copies
      ● https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
      ● https://www.youtube.com/watch?v=T5RmMNDR5XI
      ● https://www.youtube.com/watch?v=c9O5_a50aOQ
      ● https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html
      ● http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
  51. Thank you
