
Elasticsearch cluster deep dive

This is a 10-minute talk about how Elasticsearch manages its cluster. It covers master election, fault detection, the cluster state update protocol, network partitioning, shard allocation, and shard recovery.

Published in: Technology


  1. Elasticsearch Cluster deep dive
  2. NoSQL: Text Search and Document
  3. Elasticsearch cluster
  4. Cluster documentation
  5. Distributed: Client Nodes, Data Nodes, Master Nodes, Ingest Nodes
  6. Today's view of the cluster: Other Nodes, Master Nodes
  7. What happens when a node starts? Starting
  8. What happens when a node starts? E D A B C Starting 1. Get a list of nodes to ping from config Master
  9. What happens when a node starts? E D A B C Starting 1. Get a list of nodes to ping from config 2. Each response contains: a. cluster name b. node details c. master node details d. cluster state version
  10. What happens when a node starts? E D A B C Starting 1. Get a list of nodes to ping from config 2. Each response contains: a. cluster name b. node details c. master node details d. cluster state version 3. Only keep master-eligible responses, based on discovery.zen.master_election.ignore_non_master_pings
  11. What happens when a node starts? E D A B C Starting ● List of master nodes: [C, C] ● List of eligible master nodes: [A, B, C]
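The ping phase on slides 8-11 can be sketched as follows. This is a hypothetical model, not the real Elasticsearch code: the `PingResponse` fields mirror what the slides say each response contains, and the behavior of `ignore_non_master_pings` is modeled as simply dropping responses from non-master-eligible nodes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PingResponse:
    node: str
    cluster_name: str
    master: Optional[str]        # which node this responder currently sees as master
    master_eligible: bool
    cluster_state_version: int

def filter_pings(responses, ignore_non_master_pings=False):
    """Build the two lists of slide 11: active masters seen, and election candidates."""
    if ignore_non_master_pings:
        responses = [r for r in responses if r.master_eligible]
    active_masters = [r.master for r in responses if r.master is not None]
    candidates = [r for r in responses if r.master_eligible]
    return active_masters, candidates

# Illustrative data matching slide 11: A and B both report C as master.
responses = [
    PingResponse("A", "prod", "C", True, 17),
    PingResponse("B", "prod", "C", True, 17),
    PingResponse("C", "prod", None, True, 17),
    PingResponse("D", "prod", None, False, 17),
]
masters, candidates = filter_pings(responses)
```

With this data, `masters` comes out as `[C, C]` and the candidate list as `[A, B, C]`, matching the slide.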
  12. What happens when a node starts? E D A B C Starting 1. Join master node (C) sending: internal:discovery/zen/join
  13. What happens when a node starts? E D A B C Starting 1. Join master node (C) sending: internal:discovery/zen/join 2. Master validates the join sending: internal:discovery/zen/join/validate
  14. Cluster state update
  15. What happens when a node starts? E D A B C Starting 1. Join master node (C) sending: internal:discovery/zen/join 2. Master validates the join sending: internal:discovery/zen/join/validate 3. Master updates the cluster state with the new node
  16. What happens when a node starts? E D A B C Starting 1. Join master node (C) sending: internal:discovery/zen/join 2. Master validates the join sending: internal:discovery/zen/join/validate 3. Master updates the cluster state with the new node 4. Master waits for discovery.zen.minimum_master_nodes master-eligible nodes to respond
  17. What happens when a node starts? E D A B C Starting 1. Join master node (C) sending: internal:discovery/zen/join 2. Master validates the join sending: internal:discovery/zen/join/validate 3. Master updates the cluster state with the new node 4. Master waits for discovery.zen.minimum_master_nodes master-eligible nodes to respond 5. Change committed and confirmation sent
  19. What happens when a node starts? E D A B C Starting 1. New node checks the received state for: a. a new master node b. no master node in the state
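Steps 3-5 above are essentially a two-phase publish: the master sends the new cluster state, waits for enough master-eligible acknowledgements, and only then commits. A minimal sketch, assuming a caller-supplied `send` function (the function name and shape are illustrative, not the real API):

```python
def publish_cluster_state(new_state, master_eligible_nodes, minimum_master_nodes, send):
    """send(node, state) -> True if the node acked the publish.
    Commit only once minimum_master_nodes master-eligible nodes acked."""
    acks = sum(1 for node in master_eligible_nodes if send(node, new_state))
    if acks < minimum_master_nodes:
        raise RuntimeError("not enough master-eligible acks; state not committed")
    return "committed"

# Usage: three master-eligible nodes, one of them unreachable, quorum of 2.
reachable = {"A", "B"}
result = publish_cluster_state(
    {"version": 18}, ["A", "B", "C"], minimum_master_nodes=2,
    send=lambda node, state: node in reachable,
)
```

With two of three nodes reachable and a quorum of 2, the change is committed; raise the quorum to 3 and the publish fails instead.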
  20. Master fault detection E D F A B C Started ● Every discovery.zen.fd.ping_interval each node pings the master (default 1s) ● Timeout is discovery.zen.fd.ping_timeout (default 30s) ● Retries: discovery.zen.fd.ping_retries (default 3)
  21. Node fault detection E D F A B C Started ● Every discovery.zen.fd.ping_interval the master pings each node (default 1s) ● Timeout is discovery.zen.fd.ping_timeout (default 30s) ● Retries: discovery.zen.fd.ping_retries (default 3)
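Both fault-detection directions follow the same ping-and-retry pattern. A hedged sketch of the loop (not the real implementation; the real loop runs forever and handles timeouts asynchronously), using the retry default from the slides:

```python
def detect_failure(ping, ping_retries=3):
    """ping() -> True if the peer answered within ping_timeout.
    Returns the number of attempts made before either recovering or
    declaring the peer failed after ping_retries consecutive timeouts."""
    failures = 0
    attempts = 0
    while failures < ping_retries:
        attempts += 1
        if ping():
            # Peer is healthy; the real loop would sleep ping_interval and continue.
            return attempts
        failures += 1
    return attempts  # peer declared failed

# A peer that never answers is declared failed after 3 attempts.
attempts = detect_failure(lambda: False)
```

When the master is the peer declared failed, the node removes it from its candidate list and a new election starts (slide 26).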
  22. Master election
  23. Minimum of candidates required
  24. Master election E D F A B C
  25. Network partition E D F A B C Master election cannot happen; the master steps down
  26. Network partition E D F A B C Master fault detection triggers a new master election
  27. Master election 1. From the list of master-eligible nodes, it chooses, in priority order: a. the node with the highest cluster state version (part of the ping response) b. a master-eligible node c. sort the ids of the remaining nodes alphabetically and take the first 2. It sends a join to this new master; in the meantime it accumulates join requests. If the current node elected itself as master, it waits for the minimum number of join requests (discovery.zen.minimum_master_nodes) before declaring itself master. On master failure detection, each node removes the failed master from the candidates.
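The priority order in step 1 is a straightforward lexicographic comparison, which can be sketched as a single sort key (a simplified model of the selection, with candidates as plain tuples rather than the real node objects):

```python
def elect_master(candidates):
    """candidates: list of (node_id, cluster_state_version, master_eligible).
    Pick by: highest cluster state version, then master-eligible first,
    then lowest node id alphabetically."""
    return min(
        candidates,
        key=lambda c: (-c[1], not c[2], c[0]),
    )[0]

# B and C share the highest version and are both eligible; B wins on id.
winner = elect_master([("C", 20, True), ("A", 18, True), ("B", 20, True)])
```

Preferring the highest cluster state version reduces the chance of electing a master with a stale state, which is exactly the failure mode the "lost update" slides below illustrate.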
  28. Latest cluster version
  29. Lost update, partially fixed in 5.0, found by Jepsen tests E D F A B C v18 v18 v18
  30. Lost update, partially fixed in 5.0, found by Jepsen tests E D F A B C v18 v19 v19
  31. Lost update, partially fixed in 5.0, found by Jepsen tests E D F A B C v18 v20 v20
  32. Lost update, partially fixed in 5.0, found by Jepsen tests E D F A B C v18 v20 v20
  34. Lost update, partially fixed in 5.0, found by Jepsen tests E D F A B C v19 v20 v20
  35. Lost update, partially fixed in 5.0, found by Jepsen tests E D F A B C v19 v20 v20 Cannot become the master
  36. Shard allocation
  37. Shard assigned to a new node 1. Master rebalances shard allocation to get: a. the same average number of shards per node b. the same average number of shards per index per node, avoiding 2 shards with the same id on the same node 2. Uses deciders to decide which shard goes where, based on: a. Hot/Warm setup (time-based indices) b. Disk usage allocation (low watermark and high watermark) c. Throttling (node is already recovering; the master might retry later)
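The decider pattern in step 2 can be sketched as a chain of small functions that each vote YES, NO, or THROTTLE for a (shard, node) pair; a single NO vetoes the move, a THROTTLE defers it. The decider names, thresholds, and data shapes below are illustrative assumptions, not the real API:

```python
YES, THROTTLE, NO = "YES", "THROTTLE", "NO"

def disk_watermark_decider(shard, node):
    # Assumed field and threshold: refuse nodes above an 85% low watermark.
    return NO if node["disk_used_percent"] >= 85 else YES

def same_shard_decider(shard, node):
    # Never put two copies of the same shard id on one node.
    return NO if shard["id"] in node["shard_ids"] else YES

def can_allocate(shard, node, deciders=(disk_watermark_decider, same_shard_decider)):
    votes = [d(shard, node) for d in deciders]
    if NO in votes:
        return NO
    if THROTTLE in votes:
        return THROTTLE
    return YES

node = {"disk_used_percent": 40, "shard_ids": {"s0"}}
decision = can_allocate({"id": "s1"}, node)  # disk ok, no copy of s1 here
```

Here `decision` is YES; asking about `s0` instead would be vetoed by the same-shard decider.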
  38. Shard initialization (Primary) 1. Master communicates a new shard assignment through the cluster state 2. Node initializes an empty shard 3. Node notifies the master 4. Master marks the shard as started 5. If this is the first shard with a specific id, it is marked as primary and receives requests
  39. Shard initialization (Replica) 1. Master communicates a new shard assignment through the cluster state 2. Node initializes recovery from the primary 3. Node notifies the master 4. Master marks the replica as started 5. Node activates the replica
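Slides 38-39 differ only in step 2 (empty shard vs. recovery from the primary), which a compressed sketch makes explicit. All names here are illustrative, and the master-side steps are collapsed into the return value:

```python
def initialize_shard(role, notify_master, recover_from_primary=None):
    """Model of the node-side flow: build the shard, notify the master,
    and return the state the master then records for it."""
    if role == "primary":
        shard = {"docs": []}              # step 2 (primary): empty shard
    else:
        shard = recover_from_primary()    # step 2 (replica): copy from primary
    notify_master()                       # step 3: node -> master notification
    return shard, "STARTED"               # step 4: master marks it started

events = []
shard, state = initialize_shard(
    "replica",
    notify_master=lambda: events.append("notified"),
    recover_from_primary=lambda: {"docs": ["d1"]},
)
```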
  40. Shard recovery
  41. Anatomy of a shard: Lucene segments (S1, S2, S3) and a commit point on disk, an in-memory indexing buffer, and the translog
  42. Recovery from primary Node with Primary Node with Replica Start Recovery 1. Validate request 2. Prevent translog deletion 3. Snapshot Lucene
  43. Recovery from primary Node with Primary Node with Replica Start Recovery 1. Validate request 2. Prevent translog deletion 3. Snapshot Lucene Segments
  44. Recovery from primary Node with Primary Node with Replica Start Recovery 1. Validate request 2. Prevent translog deletion 3. Snapshot Lucene Segments Translog
  45. Recovery from primary Node with Primary Node with Replica Start Recovery 1. Validate request 2. Prevent translog deletion 3. Snapshot Lucene Segments Translog Notifies master
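The phases on slides 42-45 can be summarized in one sketch: validate, retain the translog, snapshot and copy the Lucene segments, replay the translog operations that accumulated during the copy, then notify the master. Function and field names are illustrative, not the real recovery API:

```python
def recover_replica(primary):
    """Model of replica recovery from a primary shard.
    primary: {"segments": [...], "translog": [...]}"""
    log = []
    log.append("validate request")
    log.append("retain translog")            # prevent translog deletion
    segments = list(primary["segments"])     # point-in-time Lucene snapshot
    log.append("copy segments")
    ops = list(primary["translog"])          # operations indexed since the snapshot
    log.append("replay translog")
    log.append("notify master")
    replica = {"segments": segments, "replayed_ops": ops}
    return replica, log

replica, log = recover_replica({"segments": ["s1", "s2"], "translog": ["op1"]})
```

Retaining the translog before the segment copy is what lets the replica catch up on writes accepted while the (potentially long) file transfer runs.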
  46. Thank you!
