
Understanding performance aspects of etcd and Raft

At CoreOS Fest '17

Published in: Engineering


  1. Understanding performance aspects of etcd and Raft. Hitoshi Mitake, NTT Laboratories. Copyright©2017 NTT Corp. All Rights Reserved.
  2. Self introduction
     • Who am I?
       • github.com/mitake
       • Software engineer at NTT Labs (a Japanese telecom)
       • Working on distributed storage systems for 5 years
       • A maintainer of the etcd project, mainly working on components related to authentication
     • In today's talk, I'll share
       • Lessons learned during etcd development
       • Some ongoing ideas
     • These should be useful for
       • Managing your etcd cluster
       • Developing your own applications based on the raft package (github.com/coreos/etcd/raft)
  3. Agenda
     1. Background of Raft and state machine replication (SMR) techniques
        • Why performance aspects of SMR-based systems are important
        • Especially, why state machines replicated by Raft matter
     2. A few tips for etcd management and raft package usage
        1. Performance impact of compaction
        2. How to reduce time spent in state machine execution: the case of etcd's bcrypt password checking
        3. An idea for, and ongoing work on, an optimization based on group commit
     3. Conclusion
  4. BACKGROUND OF RAFT AND STATE MACHINE REPLICATION TECHNIQUES
  5. Why do we need SMR techniques?
     • Assume you have a KVS like this
     • And you want the KVS to work in a highly available and consistent manner
     • What can we do to achieve that?
     [Figure: a KVS takes the requests Read(k) and Write(k, v), updates keys, and returns the responses Value and Ack]
  6. Why do we need SMR techniques?
     • Methodology 1: use a mainframe
       • Or any other reliable hardware
       • We can achieve reliability with expensive hardware
     [Figure: a KVS on a mainframe handling request and response; image from https://en.wikipedia.org/wiki/Mainframe_computer]
  7. Why do we need SMR techniques?
     • Methodology 2: replicate the KVS
     [Figure: three commodity servers each run a KVS and replicate among themselves; one of them handles request and response]
  8. Why do we need SMR techniques?
     • Methodology 2: replicate the KVS
     [Figure: the same three replicated commodity servers, with request and response handled by a different server]
  9. Why do we need SMR techniques?
     • Methodology 2: replicate the KVS
       • Achieve reliability with software
       • Replicating the system with software functionality enables high availability
       • But how should that functionality be designed and implemented?
         • If the system state changes, should the entire state be copied? Or should the working system forward its input to the replicas?
         • How should the cluster elect a leader? Or can multiple nodes work at once?
         • etc.
  10. Why do we need SMR techniques?
     • Our goal for availability
       • Even if f nodes fail at once, the entire system must survive as long as enough nodes are alive
       • And f should be configurable!
       • Failures include temporary failures (e.g. power outage, network disconnection) and permanent failures
     [Figure: a client talking to the servers while 1 node is failing, and while 2 nodes are failing]
  11. Why do we need SMR techniques?
     • Our goal for consistency
       • What is the ideal consistency goal?
         • Eventual consistency, causal consistency, external consistency, etc.
       • Our goal here: linearizability
         • A replicated system behaves like a non-replicated system
         • e.g. clients must not see stale state of the servers
     [Figure: from the client's perspective, "I'm talking with a single server"; in reality, the servers work as a team]
  12. Why do we need SMR techniques?
     • Our goal: both availability and consistency
       • If one of them can be sacrificed, things are not so complicated
         • Many reasonable alternatives can be found
       • How can we achieve both goals?
         • It seems to be difficult
  13. Why do we need SMR techniques?
     • SMR: replicate a service as a state machine
       • Model the KVS as a deterministic state machine
     [Figure: a state machine takes input, changes its internal state, and produces output; for the KVS, the inputs are Read(k) and Write(k, v), the state change is updating keys, and the outputs are Value and Ack]
  14. Why do we need SMR techniques?
     • SMR: replicate a service as a state machine
       • Wrap the state machine with an SMR framework
       • Every input that changes the state must be supplied by the framework
     [Figure: inside the SMR framework, the consensus module applies inputs to the state machine, which produces output]
  15. Why do we need SMR techniques?
     • SMR: replicate a service as a state machine
       • Replicate the framework
       • The consensus module decides which inputs are supplied and in what order
       • If the state machines are deterministic, the state of every replica should be identical
       • If one server goes down, another can take over
     [Figure: three replicas, each with a state machine, an SMR framework, and a consensus module; the consensus modules replicate inputs among themselves]
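The replication idea on the slides above can be sketched in a few lines of Go. This is a minimal illustration, not etcd code: the `Command` type, the `KVS` type, and the `replay` helper are hypothetical names, and the consensus module is reduced to "a log that every replica already agrees on".

```go
package main

import "fmt"

// Command is one agreed log entry: write Value to Key.
// Hypothetical type for illustration; not etcd's entry format.
type Command struct {
	Key, Value string
}

// KVS is a deterministic state machine: its state depends only on
// the sequence of commands applied to it.
type KVS struct {
	data map[string]string
}

func NewKVS() *KVS { return &KVS{data: make(map[string]string)} }

// Apply executes one command and returns the acknowledgement.
func (s *KVS) Apply(c Command) string {
	s.data[c.Key] = c.Value
	return "Ack"
}

// Read returns the current value for a key.
func (s *KVS) Read(key string) (string, bool) {
	v, ok := s.data[key]
	return v, ok
}

// replay applies an agreed log, in order, to a fresh replica.
func replay(log []Command) *KVS {
	s := NewKVS()
	for _, c := range log {
		s.Apply(c)
	}
	return s
}

func main() {
	log := []Command{{"k1", "a"}, {"k2", "b"}, {"k1", "c"}}
	r1, r2 := replay(log), replay(log)
	v1, _ := r1.Read("k1")
	v2, _ := r2.Read("k1")
	fmt.Println(v1, v2, v1 == v2)
}
```

Because `Apply` is deterministic, two replicas fed the same log in the same order end in the same state, which is the property the consensus module relies on.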
  16. Why do we need SMR techniques?
     • SMR: replicate a service as a state machine
       • Also, the inputs must be persisted on non-volatile media (e.g. HDD, SSD)
     [Figure: the consensus module appends inputs to a log on non-volatile media and applies them to the state machine]
  17. Why do we need SMR techniques?
     • When should the consensus module issue append and apply?
       • If a quorum of nodes agrees, the module can issue apply
     • Quorum: 𝒬 is a quorum system if it satisfies $\forall Q_1, Q_2 \in \mathcal{Q},\ Q_1 \cap Q_2 \neq \emptyset$
       • In natural language: if every pair of members of 𝒬 has a non-empty intersection, 𝒬 is a quorum system
  18. Why do we need SMR techniques?
     • To make progress, agreement of every node isn't required
       • Agreement of a quorum is enough
       • Any two quorums share at least one node
       • If a cluster has 2f + 1 nodes, it can tolerate f faults
     • Note that a quorum is not always equal to a majority: https://fpaxos.github.io/
     [Figure: two different quorums agree; the node in their intersection is highlighted]
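The 2f + 1 arithmetic and the intersection property can be checked with a small Go sketch. It assumes plain majority quorums, which is only one possible quorum system (the slide notes that others exist), and the helper names `majority`, `tolerated`, and `intersects` are illustrative.

```go
package main

import "fmt"

// majority returns the size of a majority quorum in an n-node cluster.
func majority(n int) int { return n/2 + 1 }

// tolerated returns how many simultaneous failures an n-node cluster
// survives while a majority quorum can still be formed.
func tolerated(n int) int { return (n - 1) / 2 }

// intersects reports whether two quorums (sets of node IDs) share a node.
func intersects(q1, q2 []int) bool {
	seen := make(map[int]bool)
	for _, id := range q1 {
		seen[id] = true
	}
	for _, id := range q2 {
		if seen[id] {
			return true
		}
	}
	return false
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerates f=%d\n", n, majority(n), tolerated(n))
	}
	// Two majorities of a 5-node cluster must overlap (here in node 3).
	fmt.Println(intersects([]int{1, 2, 3}, []int{3, 4, 5}))
}
```

With n = 2f + 1 nodes a majority has f + 1 members, so two majorities together name 2f + 2 members out of 2f + 1 nodes and must overlap in at least one.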
  19. Why do we need SMR techniques?
     • To make progress, agreement of every node isn't required
       • If the cluster cannot collect agreement from a quorum, it cannot make progress
       • With this idea, SMR techniques can handle network partitions
     [Figure: 3 nodes fail; the remaining nodes agree but cannot make progress]
  20. Why do we need SMR techniques?
     • Raft is a methodology for SMR-based systems
       • In Search of an Understandable Consensus Algorithm [Ongaro and Ousterhout, USENIX ATC 2014]
       • The performance and functionality of Raft are the same as those of Paxos (Multi-Paxos) [Lamport TOCS '98]
       • And Raft is more understandable: the details required by implementations are well specified
  21. Why do we need SMR techniques?
     • Raft is a methodology for SMR-based systems
       • Its important properties related to safety and liveness are specified and proven in TLA+ (some parts are proven manually)
       • Isn't that enough?
  22. Why do we need SMR techniques?
     • Now we have a methodology for replicating state machines
       • In a highly available and linearizable manner
     • Can we replicate any state machine easily?
       • According to Leslie Lamport, (it seems to be) yes
       • From The Part-Time Parliament [Lamport TOCS '98]
  23. Why do we need SMR techniques?
     • Can we replicate any state machine easily?
       • Unfortunately, that is a little hard to agree with from a practitioner's perspective ☹
         • Some state machines are tricky to replicate
         • Some are extremely hard to replicate
     • Let me explain with some examples
       • Idempotency
       • Probabilistic behaviour
       • Time-triggered actions
       • Non-deterministic state machines
  24. Properties of well understood SMR techniques
     • Idempotency of operations
       • Raft (and other SMR techniques) does not guarantee that a client request is delivered to its state machine exactly once
       • What happens with duplicate requests, e.g. from retries?
       • A serious problem for operations like locking
     [Figure: Figure 6.2 of Diego Ongaro's PhD dissertation]
  25. Properties of well understood SMR techniques
     • Solution: version-aware state machine
       • If the state machine and the client are version aware, duplicated requests can be rejected by the revision condition attached to the request
       • In the case of etcd, it forms the foundation of rich transactional operations: https://coreos.com/blog/transactional-memory-with-etcd3.html
       • Related discussion: https://github.com/coreos/etcd/issues/7062
     [Figure: the store holds (key1, value1, revision1), (key2, value2, revision2), and (key3, value3, revision3). The first try and the second try both send "update key1 to value1' if its revision == revision1". The second try fails because its assumption (revision == revision1) no longer holds]
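The revision-conditioned update described above can be sketched as follows. This is a toy model under stated assumptions, not etcd's mvcc or transaction API: `Store`, `UpdateIf`, and `ErrRevisionMismatch` are hypothetical names, and revisions are tracked per key to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a value together with the revision at which it was written.
type entry struct {
	value    string
	revision int64
}

// Store is a toy versioned KVS (per-key revisions, for brevity).
type Store struct {
	data map[string]entry
}

// ErrRevisionMismatch is returned when the caller's assumption is stale.
var ErrRevisionMismatch = errors.New("revision mismatch")

// UpdateIf writes value only if the key's current revision equals expected.
func (s *Store) UpdateIf(key, value string, expected int64) error {
	cur := s.data[key]
	if cur.revision != expected {
		return ErrRevisionMismatch
	}
	s.data[key] = entry{value: value, revision: cur.revision + 1}
	return nil
}

func main() {
	s := &Store{data: map[string]entry{"key1": {value: "value1", revision: 1}}}
	fmt.Println(s.UpdateIf("key1", "value1'", 1)) // first try
	fmt.Println(s.UpdateIf("key1", "value1'", 1)) // duplicated retry
}
```

The first call succeeds and bumps the revision, so the duplicated retry carrying the same condition is rejected instead of being applied twice.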
  26. Properties of well understood SMR techniques
     • A little bit tricky to replicate: probabilistic state machine
       • If state transitions are probabilistic, the states of the replicas can diverge
       • Also, replaying log entries can produce a different state
     [Figure: from state S, transitions to S'1, S'2, and S'3 occur with probabilities 20%, 30%, and 50%]
  27. Properties of well understood SMR techniques
     • A little bit tricky to replicate: probabilistic state machine
       • Solution: share the seed of the random number generator
       • e.g. an identical seed for random(3) should be copied to each replica
     [Figure: replica 1 calls initstate(seed) and then x = random(); replica 2 calls initstate(seed') and then x' = random(). If the seeds differ, the results of the random number generator will differ]
  28. Properties of well understood SMR techniques
     • A little bit tricky to replicate: probabilistic state machine
       • Solution: share the seed of the probabilistic state
       • e.g. an identical seed for random(3) should be copied to each replica
     [Figure: replica 1 and replica 2 both call initstate(seed) and then x = random(). The same seed produces identical random numbers]
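The slides use C's initstate(3) and random(3); the same point can be shown in Go with `math/rand`. In this sketch `draw` and `equal` are illustrative helpers, and each call to `draw` stands in for one replica that was handed the seed.

```go
package main

import (
	"fmt"
	"math/rand"
)

// draw returns n pseudo-random numbers from a generator seeded with seed.
// Each call stands in for one replica that received the seed.
func draw(seed int64, n int) []int {
	rng := rand.New(rand.NewSource(seed))
	out := make([]int, n)
	for i := range out {
		out[i] = rng.Intn(100)
	}
	return out
}

// equal reports whether two slices hold the same numbers in the same order.
func equal(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	replica1 := draw(42, 5)
	replica2 := draw(42, 5)
	fmt.Println(equal(replica1, replica2))
}
```

Using a private `rand.New(rand.NewSource(seed))` generator rather than the package-level functions keeps the sequence independent of anything else in the process that draws random numbers.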
  29. Properties of well understood SMR techniques
     • A little bit tricky to replicate: time-triggered state machine
       • How should state transitions triggered by the passage of time be handled?
       • e.g. the TTL of a key-value store
     [Figure: Key=value; after time passes, the key expires]
  30. Properties of well understood SMR techniques
     • A little bit tricky to replicate: time-triggered state machine
       • OS clocks aren't replicated by Raft
       • They can diverge
       • The divergence can propagate to the state machines
     [Figure: replica 1 and replica 2 each do something and then call gettimeofday(2); the OS clocks aren't synchronized by Raft]
  31. Properties of well understood SMR techniques
     • A little bit tricky to replicate: time-triggered state machine
       • Solution: logical time progress can be initiated by the leader node
     [Figure: as physical time progresses, the leader initiates logical time progress, so the transition can happen at the same logical time on the leader and the follower]
  32. Properties of well understood SMR techniques
     • A little bit tricky to replicate: time-triggered state machine
       • Solution: logical time progress can be initiated by the leader node
       • The solution introduces some subtle corner cases that must be handled
         • e.g. what happens if the leader is isolated from the other nodes after initiating the progress?
       • Anthony Romano taught me about these subtle points (the case of etcd's lease management)
         • https://github.com/coreos/etcd/issues/7320
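One way to picture leader-initiated logical time is to put the clock ticks into the replicated log itself. The sketch below is a simplified assumption for illustration only: it is not how etcd's lease management is implemented, and `Entry`, `TTLStore`, and the tick entries are hypothetical. It also ignores the corner cases mentioned above, such as a leader that is isolated after proposing a tick.

```go
package main

import "fmt"

// Entry is a replicated log entry. A Tick entry advances logical time and
// would be proposed by the leader; other entries put a key with a TTL.
// Hypothetical format for illustration.
type Entry struct {
	Tick bool
	Key  string
	TTL  int64 // lifetime in logical ticks
}

// TTLStore expires keys by logical time only; it never reads the OS clock.
type TTLStore struct {
	now     int64            // logical time, advanced only by Tick entries
	expires map[string]int64 // key -> logical expiry time
}

func NewTTLStore() *TTLStore {
	return &TTLStore{expires: make(map[string]int64)}
}

// Apply executes one log entry deterministically.
func (s *TTLStore) Apply(e Entry) {
	if !e.Tick {
		s.expires[e.Key] = s.now + e.TTL
		return
	}
	s.now++
	for k, at := range s.expires {
		if at <= s.now {
			delete(s.expires, k)
		}
	}
}

// Has reports whether the key is still alive.
func (s *TTLStore) Has(key string) bool {
	_, ok := s.expires[key]
	return ok
}

func main() {
	log := []Entry{{Key: "k", TTL: 2}, {Tick: true}, {Tick: true}}
	leader, follower := NewTTLStore(), NewTTLStore()
	for _, e := range log {
		leader.Apply(e)
		follower.Apply(e)
	}
	fmt.Println(leader.Has("k"), follower.Has("k"))
}
```

Since expiry depends only on entries in the log, every replica expires the key at the same log position regardless of how its local clock drifts.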
  33. Properties of well understood SMR techniques
     • Extremely difficult to replicate: non-deterministic state machines
       • If state transitions are non-deterministic (e.g. because of a multithreaded implementation), the states of the replicas can diverge
       • This means the state machines cannot exploit multicore parallelism!
     [Figure: on replica 1 and replica 2, state S may move to any of S'1, S'2, ..., S'n. Which state will be the next one?]
  34. Properties of well understood SMR techniques
     • Summary
       • Providing idempotent operations requires thought about the design of the state machine
       • In addition, please be careful when replicating with Raft if your state machine has
         • Probabilistic behaviour
         • Time-triggered actions
       • If your state machine is non-deterministic, replicating it with Raft will be quite challenging
         • I'll discuss a detailed example of this problem later
  35. TIPS FOR ETCD MANAGEMENT AND RAFT PACKAGE USAGE
  36. Why do performance aspects of etcd matter?
     • What is etcd?
       • No need to explain here…
       • A highly available and consistent KVS
       • A general-purpose configuration store
       • An open source clone of Google's Chubby [Burrows, OSDI '06]
       • The most important and interesting user would be Kubernetes
  37. Why do performance aspects of etcd matter?
     • Borg uses Chubby and the Paxos store
     • Kubernetes uses etcd as the alternative to them
     • Do they need to be high-performance?
     [Figure: diagram with Chubby labelled]
  38. Why do performance aspects of etcd matter?
     • The Paxos store needs to be high-performance
       • Borg is a distributed operating system, and the Paxos store is a kind of run queue for it
     [Figure: the case of a non-distributed OS (from Operating Systems: Three Easy Pieces, http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-intro.pdf) next to the case of Borg [Verma et al. EuroSys '15]]
  39. Why do performance aspects of etcd matter?
     • The run queue of an OS must be implemented efficiently
       • In a non-distributed OS: it can be accessed frequently by interrupt and exception handlers (on the order of milliseconds)
         • Locking and partitioning are very important
       • In a distributed OS: is the run queue accessed that frequently?
         • Do we execute `kubectl create -f something.yaml` thousands of times per second?
  40. Why do performance aspects of etcd matter?
     • The run queue of an OS must be implemented efficiently
       • Even for a distributed OS
       • As the number of nodes in the cluster grows, many scheduling events can be produced
     [Video: https://www.youtube.com/watch?v=OIsCwc7qfTU]
  41. Why do performance aspects of etcd matter?
     • The run queue of an OS must be implemented efficiently
       • In a distributed environment, the run queue needs to be distributed and persisted to handle machine failures
     [Reference: 2000 Nodes and Beyond: How We Scaled Kubernetes to 60,000-Container Clusters and Where We're Going Next, Marek Grabowski, KubeCon '16: http://sched.co/8K8w]
  42. Why do performance aspects of etcd matter?
     • So, performance aspects of etcd matter
       • For large-scale k8s clusters
       • For other deployments whose configuration changes frequently
     • Performance aspects of Raft itself matter
       • e.g. Spanner [Corbett et al. OSDI '12] clones that store large amounts of data with Raft
         • CockroachDB: https://github.com/cockroachdb/cockroach
         • TiDB: https://github.com/pingcap/tidb
  43. Why do performance aspects of etcd matter?
     • Performance aspects of etcd and Raft matter
       • For both the configuration store and the database use cases
     • What kinds of difficulties will we see in practice?
       • I'll provide just a few examples:
         1. Resource consumption and performance impact of compaction
            • Related to etcd management and raft package usage
         2. How to reduce time consumed in a state machine
            • Related to raft package usage
         3. An experiment on improving throughput
            • Related to raft package usage and etcd improvement?
  44. Simplified etcd overview
     [Figure: a clientv3 application (e.g. etcdctl) talks to etcdserver over gRPC; etcdserver sits on top of raft, wal, and SSD]
     • gRPC: unmarshal/marshal, send/receive requests and responses
     • etcdserver to raft: propose commands, receive commit decisions
     • raft: decides whether log entries can be applied or not
     • wal: appends committed commands and persists them on SSD
     • etcdserver to the mvcc, lease, auth, and alarm modules: apply commands, get results; these modules work as the state machine
  45. Simplified etcd overview
     [Figure: a clientv3 application (e.g. etcdctl) talks over gRPC to one of three etcdserver instances, each with its own raft module. Via rafthttp, the raft modules talk with each other (e.g. AppendEntries())]
  46. Resource consumption and performance impact from compaction
  47. How does compaction affect performance and resource consumption?
     • What is compaction in the context of Raft?
       • Raft manages the operations of its state machine in the form of a log
       • Newly arrived log entries are appended to the tail of the log
       • A log that grows without limit will exhaust the available space
       • The log needs to be compacted periodically
       • In addition, a snapshot needs to be created
     • When should the log be compacted?
     [Figure: etcdserver, raft with raft.MemoryStorage, and wal, which appends committed commands and persists them on SSD]
  48. How does compaction affect performance and resource consumption?
     • Generally speaking, the frequency of compaction introduces a tradeoff: throughput vs. recovery
     • How is etcd performance affected by this tradeoff?
     • The frequency of compaction can be controlled with --snapshot-count and --snapshot-size (WIP: https://github.com/coreos/etcd/pull/7782)
     [Figure: a spectrum from "Throughput: high, Recovery: slow" to "Throughput: low, Recovery: speedy"]
  49. How does compaction affect performance and resource consumption?
     • A little experiment on GCE
       • 4 VMs of n1-standard-4 (4 vCPUs, 15 GB of memory, SSD)
       • The X axis represents the value of --snapshot-count
     • Benchmark command: benchmark --target-leader --conns=1000 --clients=1000 put --total=1000000
     [Chart: IOPS (axis from 20000 to 26000) for --snapshot-count values 100, 1000, 10000, 100000 (default), 1000000, and 10000000 (virtually no compaction)]
  50. How does compaction affect performance and resource consumption?
     • The same experiment and chart as the previous slide
     • Why does this happen?
  51. How does compaction affect performance and resource consumption?
     • Profiled the leader node with go pprof
       • Two interesting functions were found: runtime.mallocgc (allocation) and runtime.scanobject (GC)
     [Chart: IOPS (axis from 20000 to 26000) together with time spent in runtime.mallocgc and runtime.scanobject (axis from 0 to 5000 milliseconds) for --snapshot-count values 100, 1000, 10000, 100000 (default), 1000000, and 10000000 (no compaction)]
  52. How does compaction affect performance and resource consumption?
     • How do these functions relate to compaction and performance?
       • The number of live in-memory objects (created in raft.MemoryStorage and managed by the Go runtime) can grow with the compaction interval
       • The increased number of objects can slow down the mark phase of the Go GC
         • runtime.scanobject
       • Also, infrequent reclamation of memory can slow down allocation
         • The miss ratio of the thread-local cache can increase
         • runtime.mallocgc
       • The analysis isn't complete, but it is consistent with the observed data
  53. How does compaction affect performance and resource consumption?
     • How about recovery speed?
       • etcd addresses the problem by limiting the DB size (2 GB by default, up to 8 GB)
         • https://github.com/coreos/etcd/blob/master/Documentation/faq.md#deployment
       • Interesting discussion including an answer from Xiang Li
         • https://groups.google.com/forum/#!topic/etcd-dev/vCeSLBKC_M8
       • So far, I haven't observed bad recovery performance either
  54. How does compaction affect performance and resource consumption?
     • Observations
       • Too frequent compaction is harmful for throughput
         • Quite natural
       • Too infrequent compaction is also harmful for throughput
         • Even though it consumes more memory!
         • A little bit tricky
       • Seeking the best parameters for your workload would be helpful
         • --snapshot-count and --snapshot-size
       • Buying expensive hardware (e.g. a CPU with lots of cores) isn't so helpful for improving the throughput of Raft-based systems and etcd
  55. How to reduce time consumed in a state machine
  56. A short history of the etcd auth functionality
     • etcd has provided access control based on the concept of users and roles since v2
     [Figure: a user is granted role1, role2, and role3; the roles are granted range permission1 through range permission4]
  57. A short history of the etcd auth functionality
     • etcd clients (including etcdctl) can be authenticated with HTTP basic authentication
     • bcrypt-based password checking is executed by etcdserver when the HTTP request arrives
     [Figure: a client application (e.g. etcdctl) talks to etcdserver over http; etcdserver proposes commands to raft and receives commit decisions, applies commands to storage and gets results; wal appends committed commands and persists them on SSD]
  58. A short history of the etcd auth functionality
     • etcd v2 executed bcrypt-based password checking at the API layer
       • Once the check passed, the authorized commands were sent to raft
     • In practice it wouldn't be problematic, but it can result in a TOCTOU (Time Of Check vs. Time Of Use) problem
       • Admins can update passwords concurrently with the requests
       • Requests can be processed even if the authorization is obsolete
     • To reduce the possibility of this problem, the auth design was changed in etcd v3
  59. A short history of the etcd auth functionality
     • Auth metadata updates and authentication are serialized by raft
     • Password checking is executed by the auth module, a part of the state machine
     [Figure: a clientv3 application (e.g. etcdctl) with an auth token talks to etcdserver over gRPC; etcdserver proposes commands to raft and receives commit decisions, applies commands to the mvcc, lease, auth, and alarm modules and gets results; wal appends committed commands and persists them on SSD]
  60. A short history of the etcd auth functionality
     • Now the TOCTOU problem won't happen
       • Happy ending?
     • There was another problem: the high cost of bcrypt password checking
       • https://godoc.org/golang.org/x/crypto/bcrypt
       • It requires almost 100 ms even on a modern CPU!
       • 100 ms of CPU time per check means etcd can authenticate only about 10 times per second
     • How should we solve this?
  61. Solution: version number validation
     1. Check the password in the etcdserver layer and save the version number of the auth store
     2. Once the password is checked, the authenticate request is sent to raft
     3. The rest of the authentication is executed in the state machine. The response includes the version number of the auth store
     4. Compare the saved version number with the number in the response
     [Figure: a clientv3 application (e.g. etcdctl) talks to etcdserver over gRPC; etcdserver proposes commands to raft and receives commit decisions, applies commands to the mvcc, lease, auth (versioned), and alarm modules and gets results; wal appends committed commands and persists them]
  62. Solution: version number validation
     • If the state machine has a version number, it can be used for the version number validation of OCC (Optimistic Concurrency Control)
       • The original idea was provided by Anthony Romano
       • Similar to multi-key transactions in database systems
     • As in the case of idempotency, a versioned structure is helpful!
       • It can save precious time in the state machine layer
     [Figure: etcdserver (bypassing raft) issues Read(k1) and gets val1, version1; issues Read(k2) and gets val2, version2; then asks the state machine to validate version1 and version2 and receives an ack: "My data is consistent!"]
  63. Solution: version number validation
     • Concurrent modification can be detected like this:
     [Figure: etcdserver thread1 (bypassing raft) reads k1 and gets val1, version1, then reads k2 and gets val2, version2. Meanwhile etcdserver thread2 (not bypassing raft) issues Write(k1, v1) and gets an Ack, changing version1 to version1'. When thread1 asks the state machine of raft to validate version1 and version2, it learns that version1 has been updated: "My data is inconsistent!"]
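The four steps of version number validation can be sketched in Go as below. This is a minimal model under several assumptions: `AuthStore`, `Authenticate`, and both error values are hypothetical names rather than etcd's auth package, the password is compared as plain text where etcd would run bcrypt, and raft is replaced by direct method calls. The `concurrent` callback only simulates another request that goes through raft between the check and the validation.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errBadPassword   = errors.New("bad password")
	errStaleRevision = errors.New("auth store changed since the password check")
)

// AuthStore is a toy auth store whose revision is bumped on every update.
type AuthStore struct {
	revision  int64
	passwords map[string]string // plain text only to keep the sketch short
}

// CheckPassword is the expensive step, executed outside the state machine.
// It returns the revision it observed together with the result.
func (a *AuthStore) CheckPassword(user, password string) (int64, bool) {
	p, found := a.passwords[user]
	return a.revision, found && p == password
}

// ApplySetPassword is applied through the state machine and bumps the revision.
func (a *AuthStore) ApplySetPassword(user, password string) {
	a.passwords[user] = password
	a.revision++
}

// ApplyAuthenticate is the cheap step inside the state machine; its
// response carries the current revision of the auth store.
func (a *AuthStore) ApplyAuthenticate() int64 { return a.revision }

// Authenticate follows steps 1 to 4 of the slide.
func Authenticate(a *AuthStore, user, password string, concurrent func()) error {
	saved, ok := a.CheckPassword(user, password) // step 1
	if !ok {
		return errBadPassword
	}
	if concurrent != nil {
		concurrent() // another request applied in between
	}
	current := a.ApplyAuthenticate() // steps 2 and 3
	if current != saved {            // step 4
		return errStaleRevision
	}
	return nil
}

func main() {
	a := &AuthStore{passwords: map[string]string{"alice": "secret"}}
	fmt.Println(Authenticate(a, "alice", "secret", nil))
	fmt.Println(Authenticate(a, "alice", "secret", func() {
		a.ApplySetPassword("alice", "changed")
	}))
}
```

The expensive check stays outside the serialized path, and the revision comparison is what detects a password update that raced with it, so the caller can retry instead of acting on a stale check.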
  64. An experiment about improving throughput
  65. SMR and parallelism
     • Replicating state machines with non-deterministic transitions is hard
       • The non-determinism introduces divergence among the replicas
       • Replicating state machines that exploit multicore parallelism is hard
       • Replicating state machines that exploit the high bandwidth of modern I/O devices is also hard
     [Figure: on replica 1 and replica 2, state S may move to any of S'1, S'2, ..., S'n]
  66. SMR and parallelism
     • What kinds of techniques are available?
       • EVE [Kapritsos et al. OSDI '12]
         • Treats divergence of state as the result of a Byzantine fault and fixes it in the agreement process
       • Rex [Guo et al. EuroSys '14]
         • Speculation-based replication for multicore-scalable systems
         • https://github.com/Microsoft/rDSN
       • Crane [Cui et al. SOSP '15]
         • Replication based on deterministic scheduling (originally established for debugging purposes)
         • POSIX applications can be replicated without modification
         • https://github.com/columbia/crane
     • All of them are research prototypes
       • Replicating multicore-scalable state machines is a cutting-edge research topic!
       • Very promising, but using them today would require significant engineering cost
  67. The case of etcd
     • How about etcd-specific optimization for this purpose?
       • etcd's main functionality is a KVS that supports transactional access
       • The core storage functionality is implemented in the mvcc package
         • BoltDB based
       • If keys are independent, update requests on them are commutative
     [Figure: etcdserver applies commands (e.g. single-key put, multi-key transaction) to mvcc (based on BoltDB), which persists them on SSD. "Put k1, Put k2" and "Put k2, Put k1" are commutative]
  68. The case of etcd
     • etcdserver applies commands supplied by raft
       • Iteration: apply a single command, go to the next one…
     [Figure: etcdserver obtains a log entry from raft, applies the command (e.g. single-key put, multi-key transaction) to mvcc (based on BoltDB), which persists it on SSD, then goes to the next entry]
  69. The case of etcd
     • Exploiting KVS-specific semantics?
       • KVS operations are commutative when their keys are independent
       • Individual commands (e.g. Put(key1) and Put(key2)) can be grouped into a single large transaction
     • Can this be performed effectively?
     [Figure: etcdserver grabs independent commands from raft, converts multiple commands into a single large txn, and issues multiple puts at once to mvcc (based on BoltDB), which persists them on SSD]
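The grouping step can be sketched as follows. This is an illustrative assumption about how independent commands might be batched, not the code on the group-commit branch: `Put` and `groupIndependent` are hypothetical names, and a batch is closed as soon as a key repeats so that no batch contains two commands on the same key.

```go
package main

import "fmt"

// Put is a single-key write taken from the committed log.
// Hypothetical type for illustration.
type Put struct {
	Key, Value string
}

// groupIndependent splits an ordered list of puts into batches in which
// no key appears twice. Batches keep the original log order, so applying
// them one after another gives the same result as applying the puts singly.
func groupIndependent(cmds []Put) [][]Put {
	var batches [][]Put
	var cur []Put
	seen := make(map[string]bool)
	for _, c := range cmds {
		if seen[c.Key] {
			// The key repeats: close the current batch and start a new one.
			batches = append(batches, cur)
			cur = nil
			seen = make(map[string]bool)
		}
		cur = append(cur, c)
		seen[c.Key] = true
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	cmds := []Put{{"k1", "a"}, {"k2", "b"}, {"k1", "c"}, {"k3", "d"}}
	for i, b := range groupIndependent(cmds) {
		fmt.Println(i, b)
	}
}
```

Each batch could then be written in one backend transaction with a single commit. With skewed access, where the same key repeats often, batches stay small and the benefit shrinks.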
  70. The case of etcd
     • Raft itself is friendly to batching: the RPC is AppendEntries(), not AppendEntry()
     • etcd sends entries in a batched manner
     • With 1000 concurrent clients, the peak number of batched entries can be 1000
     [Figure: a clientv3 application (e.g. etcdctl) talks to etcdserver over gRPC; via rafthttp, the raft modules talk with each other (e.g. AppendEntries())]
  71. The case of etcd
     • Benchmarking mvcc individually
       • Grouping multiple puts into a single transaction improves total IOPS
       • tools/benchmark: `benchmark mvcc put` can be used for this purpose
     • Benchmark command: benchmark mvcc put --total X --txn-ops Y --txn (X * Y = 1000000)
     [Chart: IOPS (axis from 0 to 200000) for 1 key/txn, 10 keys/txn, and 100 keys/txn]
     [Figure: without batching, each put has its own txn and commit; with batching, one txn issues put, put, put… and commits once. mvcc (based on BoltDB) persists to SSD]
  72. The case of etcd
     • Turning multiple puts into a single txn
       • https://github.com/mitake/etcd/commits/batch-append-group-commit
     • The performance improvement isn't great (almost 10% higher IOPS)
       • Keys need to be distributed
       • Skewed access cannot benefit from this strategy
     • Benchmark command: benchmark --target-leader --conns=1000 --clients=1000 put --total=1000000 --sequential-keys --key-space-size 1000000
     [Chart: IOPS (axis from 0 to 20000) for original etcd and for group commit]
  73. The case of etcd
     • Is the idea worth further investment?
       • I'm not sure
       • There is some room for improvement:
         1. Multicore-scalable backend: the current mvcc allows only one writer at a time
         2. Pipelining rafthttp: exploit network bandwidth more aggressively
     • If we face throughput problems in the future, revisiting it would be helpful
  74. CONCLUSION
  75. Conclusion
     • Raft is a solid foundation for highly available and consistent distributed storage systems
       • If you want to build your own system, etcd's raft package is a good starting point
     • However, that doesn't mean we can easily replicate any state machine with it
       • Probabilistic behaviour and time-triggered actions introduce some difficulties
       • A version-aware structure is helpful
       • Non-determinism is a serious problem
     • Not only the replication methodology but also the state machines themselves matter!
  76. Conclusion
     • Exploiting the performance of modern hardware in Raft-based systems is not easy
       • In particular, exploiting multicore parallelism and the bandwidth of I/O devices is difficult
       • etcd also has room to evolve
     • These are exciting technical challenges!
  77. Thanks for listening! Questions and comments are welcome.
     • email: mitake.hitoshi@lab.ntt.co.jp
     • github: @mitake
     • Twitter: @_3take
  78. APPENDIX
  79. References
     • [Ongaro and Ousterhout, USENIX ATC 2014]
       • https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
       • https://raft.github.io/ has other important materials
     • [Kapritsos et al. OSDI '12]
       • https://www.usenix.org/node/170851
     • [Guo et al. EuroSys '14]
       • https://www.microsoft.com/en-us/research/publication/rex-replication-at-the-speed-of-multi-core/
     • Crane [Cui et al. SOSP '15]
       • http://i.cs.hku.hk/~heming/papers/crane-sosp15.pdf
     • [Lamport TOCS '98]
       • http://lamport.azurewebsites.net/pubs/pubs.html#lamport-paxos
     • [Verma et al. EuroSys '15]
       • https://research.google.com/pubs/pub43438.html
  80. Other interesting papers and articles
     • Techniques for efficient SMR
       • https://www.usenix.org/conference/atc13/technical-sessions/presentation/bessani
       • https://fpaxos.github.io/
       • https://www.usenix.org/legacy/events/nsdi11/tech/full_papers/Bolosky.pdf
     • Chapters of the SRE book that include topics related to Paxos
       • https://landing.google.com/sre/book/chapters/managing-critical-state.html
       • https://landing.google.com/sre/book/chapters/distributed-periodic-scheduling.html
     • Comparison of etcd, ZooKeeper, and Consul
       • https://coreos.com/blog/performance-of-etcd.html
