Understanding performance aspects of etcd and Raft

Copyright©2017 NTT Corp. All Rights Reserved.
NTT Laboratories
Hitoshi Mitake

• Who am I?
• github.com/mitake
• Software engineer@NTT Labs (a Japanese telecom)
• Working on distributed storage systems for 5 years
• A maintainer of etcd project
• Mainly working on components related to authentication
• In today's talk, I'll share
• Lessons learned during etcd development
• Some ongoing ideas
• These would be useful for
• Managing your etcd cluster
• Developing your own applications based on the raft
package (github.com/coreos/etcd/raft)
Self introduction

1. Background of Raft and state machine
replication (SMR) techniques
• Why performance aspects of SMR based systems are
important
• Especially, why state machines replicated by Raft
matter
2. A few tips for etcd management and raft
package usage
1. Performance impact from compaction
2. How to reduce time spent in the state machine
execution: a case of bcrypt password checking of
etcd
3. An idea and ongoing work of optimization based on
group commit
3. Conclusion
Agenda

BACKGROUND OF RAFT AND
STATE MACHINE REPLICATION
TECHNIQUES

• Assume you have a KVS like this
• And you want the KVS to work in a highly available and
consistent manner
• What can we do for the purpose?
Why do we need SMR techniques?
[Figure: a KVS receives requests Read(k) and Write(k, v), updates keys, and returns responses Value and Ack]

• Methodology 1: use a mainframe
• Or any other reliable hardware
• We can achieve reliability with the expensive hardware
[Figure: a KVS running on a mainframe serves requests and responses. Image source: https://en.wikipedia.org/wiki/Mainframe_computer]

• Methodology 2: replicate the KVS
[Figure: three commodity servers each run a copy of the KVS; one server handles requests and responses and replicates to the other two]

• Achieve reliability with software
• Replicating the system with software functionality will
enable high availability
• But, how should the functionality be designed and
implemented?
• If the system state changes, should the entire state be copied?
• Or should the working system forward its input to the replicas?
• How should the cluster elect a leader?
• Or can multiple nodes work at once?
• etc.

• Our goal of availability
• Even if f nodes fail at once, the entire system must
survive as long as enough nodes are alive
• And f should be configurable!
• Failures include temporary failures (e.g. power
outage, network disconnection) and permanent failures
[Figure: a client keeps talking to the servers while 1 node is failing, and while 2 nodes are failing]

• Our goal of consistency
• What is the ideal goal of consistency?
• Eventual consistency, causal consistency, external
consistency, etc…
• Our goal here: linearizability
• A replicated system behaves like a non-replicated system
• e.g. Clients must not see stale state of the servers
[Figure: from the perspective of the client: "I'm talking with a single server". In reality, the client is talking to several servers: "We are a team"]

• Our goal of availability and consistency
• If one of them can be sacrificed, things are not so
complicated
• Many reasonable alternatives can be found
• How to achieve both goals?
• It seems to be difficult

• SMR: replicate a service as a state machine
[Figure: a state machine takes input, changes its internal state, and produces output. The KVS is modeled as a deterministic state machine: inputs Read(k) and Write(k, v), state change "update keys", outputs Value and Ack]

• Wrap the state machine with an SMR framework
• Every input that changes the state must be supplied by
the framework
[Figure: the SMR framework wraps the state machine; its consensus module applies inputs to the state machine, which produces output]

• Replicate the framework
• The consensus module decides which inputs should be
supplied and in what order
• If the state machines are deterministic, every state
machine ends up in an identical state
• If one server goes down, the others can take over
[Figure: three replicas, each an SMR framework whose consensus module applies inputs to its own state machine; the consensus modules replicate the inputs among themselves]
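The replication idea above can be illustrated with a minimal Go sketch. It assumes a toy KVS and a log that has already been agreed on; the names Command, KVS, and Replay are hypothetical and are not part of etcd or its raft package.

```go
package main

import "fmt"

// Command is a hypothetical log entry: a write of Value to Key.
type Command struct {
	Key, Value string
}

// KVS is a deterministic state machine: its next state depends only on
// its current state and the command being applied.
type KVS struct {
	data map[string]string
}

func NewKVS() *KVS { return &KVS{data: make(map[string]string)} }

// Apply executes one command taken from the replicated log.
func (s *KVS) Apply(c Command) { s.data[c.Key] = c.Value }

// Replay feeds an agreed log to a fresh replica, in log order.
func Replay(log []Command) *KVS {
	s := NewKVS()
	for _, c := range log {
		s.Apply(c)
	}
	return s
}

func main() {
	log := []Command{{"k1", "a"}, {"k2", "b"}, {"k1", "c"}}
	r1, r2 := Replay(log), Replay(log)
	// Both replicas saw the same inputs in the same order.
	fmt.Println(r1.data["k1"] == r2.data["k1"], r1.data["k2"] == r2.data["k2"])
}
```

Because Apply has no other source of input, two replicas that replay the same log cannot end up in different states.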

• Also, the inputs must be persisted on non-volatile
media (e.g. HDD, SSD)
[Figure: the consensus module appends inputs to a log on non-volatile media, then applies them to the state machine]

• When should the consensus module issue
append and apply?
• If a quorum of nodes agrees, the module can issue
apply
• Quorum: $\mathcal{Q}$ is a quorum system if it satisfies
$\forall Q_1, Q_2 \in \mathcal{Q},\ Q_1 \cap Q_2 \neq \emptyset$
• In natural language: if any two members of $\mathcal{Q}$ have a
non-empty intersection, $\mathcal{Q}$ is a quorum system

• For making progress, agreement of every
node isnʼt required
• Agreement of one quorum is enough
• Any two quorums share at least one node
• If a cluster has 2f + 1 nodes, it can tolerate f faults
[Figure: two different quorums agree; they intersect in at least one node]
Note that quorum is not always equal to majority: https://fpaxos.github.io/
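The 2f + 1 arithmetic above can be checked with a short Go sketch. It assumes majority quorums (which, as the note says, are not the only possible quorum system); the function names are hypothetical.

```go
package main

import "fmt"

// MajorityQuorum returns the smallest number of nodes that forms a
// majority of an n-node cluster.
func MajorityQuorum(n int) int { return n/2 + 1 }

// FaultsTolerated returns the largest f such that a majority of the
// n nodes is still alive after f failures.
func FaultsTolerated(n int) int { return (n - 1) / 2 }

// CanProgress reports whether the alive nodes still form a majority.
func CanProgress(n, failed int) bool { return n-failed >= MajorityQuorum(n) }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerates=%d\n", n, MajorityQuorum(n), FaultsTolerated(n))
	}
	// A 5-node cluster survives 2 failures but not 3.
	fmt.Println(CanProgress(5, 2), CanProgress(5, 3))
}
```

Any two majorities of the same cluster overlap in at least one node, which is the intersection property of the quorum definition.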

• For making progress, agreement of every
node isnʼt required
• If the cluster cannot collect agreement from a quorum of
nodes, it cannot make progress
• With this idea, SMR techniques can handle the problem
of network partition
[Figure: 3 nodes fail; the remaining nodes agree but cannot form a quorum, so the cluster cannot make progress]

• Raft is a methodology for SMR based systems
• In Search of an Understandable Consensus Algorithm
[Ongaro and Ousterhout, USENIX ATC 2014]
• The performance and functionality of Raft are the same as
those of Paxos (Multi-Paxos) [Lamport TOCS ʻ98]
• And Raft is more understandable: details required by
implementations are well specified

• Raft is a methodology for SMR based systems
• Its important properties related to safety and liveness
are specified and proven in TLA+ (some parts are
proven manually)
• Isnʼt it enough?

• Now we have the methodology for replicating
state machines
• In a highly available and linearizable manner
• Can we replicate any state machines easily?
• According to Leslie Lamport, (it seems to be) yes
• From Part-time Parliament [Lamport TOCS ʻ98]

• Can we replicate any state machines easily?
• Unfortunately, from a practitioner's perspective, it is a
little hard to agree :(
• Some of them are tricky to replicate
• Some of them are extremely hard to replicate
• Let me explain some examples
• Idempotency
• Probabilistic behaviour
• Time triggered action
• Non-deterministic state machines

• Idempotency of operations
• Raft (and other SMR techniques) does not guarantee that a
client request is delivered to its state machine exactly once
• What happens in the case of duplicate requests, e.g. retries?
• A serious problem for operations like locking
Properties of well understood SMR techniques
[Figure: Figure 6.2 of Diego Ongaro's PhD dissertation]

• Solution: version aware state machine
• If the state machine and client are version aware, duplicate
requests can be rejected because the request's assumption no longer holds
• In the case of etcd, it forms the foundation of rich transactional
operations: https://coreos.com/blog/transactional-memory-with-etcd3.html
• Related discussion: https://github.com/coreos/etcd/issues/7062
[Figure: the store holds key1, value1, revision1. The first try and the second try both send "Update key1 to value1' if its revision == revision1". The second try fails because its assumption (revision == revision1) isn't satisfied]
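The revision check described above can be sketched in Go with a toy in-memory store. This is only an illustration under simplifying assumptions: Store, Put, and UpdateIfRevision are hypothetical names, and the per-key counter is much simpler than etcd's actual revision model.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRevisionMismatch is returned when the caller's assumption about
// the key's revision no longer holds.
var ErrRevisionMismatch = errors.New("revision mismatch")

type item struct {
	value    string
	revision int64
}

// Store is a hypothetical version aware state machine.
type Store struct {
	items map[string]item
}

func NewStore() *Store { return &Store{items: make(map[string]item)} }

// Put writes unconditionally and bumps the key's revision.
func (s *Store) Put(key, value string) int64 {
	it := s.items[key]
	it.value, it.revision = value, it.revision+1
	s.items[key] = it
	return it.revision
}

// UpdateIfRevision writes only if the key is still at the expected revision.
func (s *Store) UpdateIfRevision(key, value string, expected int64) (int64, error) {
	if cur := s.items[key].revision; cur != expected {
		return cur, ErrRevisionMismatch
	}
	return s.Put(key, value), nil
}

func main() {
	s := NewStore()
	rev1 := s.Put("key1", "value1")
	_, err1 := s.UpdateIfRevision("key1", "value1'", rev1) // first try
	_, err2 := s.UpdateIfRevision("key1", "value1'", rev1) // duplicate retry
	fmt.Println(err1, err2)
}
```

The first try succeeds and moves the key to a new revision, so the retried request is rejected instead of being applied twice.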

• A little bit tricky to replicate: probabilistic
state machine
• If state transitions are probabilistic, the states of the
replicas can diverge
• Also, replaying log entries can produce a different state
[Figure: from state S, the next state is S'1, S'2, or S'3, with probabilities 20%, 30%, and 50%]

• A little bit tricky to replicate: probabilistic state machine
• Solution: share the seed of the random number
generator
• e.g. an identical seed for random(3) should be copied to each
replica
[Figure: Replica 1 calls initstate(seed) and then x = random(); Replica 2 calls initstate(seed') and then x' = random(). If the seeds differ, the results of the random number generator will differ]

• A little bit tricky to replicate: probabilistic state machine
• Solution: share the seed of the random number
generator
• e.g. an identical seed for random(3) should be copied to each
replica
[Figure: Replica 1 and Replica 2 both call initstate(seed) and then x = random(). The same seed will produce identical random numbers]
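The shared-seed idea can be sketched in Go with math/rand instead of random(3). The Replica type and its 20% / 30% / 50% transition are hypothetical; the only point is that each replica owns a private generator created from the same seed.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Replica is a hypothetical probabilistic state machine with its own
// private random number generator.
type Replica struct {
	rng   *rand.Rand
	state int
}

// NewReplica seeds the replica's generator. Every replica must be
// given the same seed.
func NewReplica(seed int64) *Replica {
	return &Replica{rng: rand.New(rand.NewSource(seed))}
}

// Step performs one probabilistic transition (20% / 30% / 50%).
func (r *Replica) Step() {
	switch p := r.rng.Intn(100); {
	case p < 20:
		r.state += 1
	case p < 50:
		r.state += 2
	default:
		r.state += 3
	}
}

func main() {
	a, b := NewReplica(42), NewReplica(42)
	for i := 0; i < 1000; i++ {
		a.Step()
		b.Step()
	}
	// Same seed and same number of steps: the states are identical.
	fmt.Println(a.state == b.state)
}
```

The generator must be used only by the state machine. If another goroutine drew numbers from it, the replicas would consume the sequence differently and could diverge again.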

• A little bit tricky to replicate: time triggered
state machine
• How to handle state transitions triggered by the passing
of time?
• e.g. TTL of a key-value store
[Figure: a store holds Key=value; after time passes, the key expires]

• A little bit tricky to replicate: time triggered state machine
• Clocks of OSes aren't replicated with Raft
• They can diverge
• The divergence can propagate to the state machines
[Figure: Replica 1 and Replica 2 each do something and then call gettimeofday(2). The clocks of the OSes aren't synchronized with Raft]

• A little bit tricky to replicate: time triggered state machine
• Solution: logical time progress can be initiated by a
leader node
[Figure: as physical time progresses, the leader initiates logical time progress; the leader and the followers then do something, so the transition can happen at the same logical time]

• A little bit tricky to replicate: time triggered state machine
• Solution: logical time progress can be initiated by a
leader node
• The solution introduces some subtle corner cases that
must be handled
• e.g. what happens if the leader is isolated from the other
nodes after initiating the progress?
• Anthony Romano taught me about these subtle points
(the case of etcd's lease management)
• https://github.com/coreos/etcd/issues/7320
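A minimal Go sketch of leader-initiated logical time follows. It is a toy, not etcd's lease implementation: the Entry and TTLStore types are hypothetical, and the corner cases mentioned above (such as an isolated leader) are deliberately ignored. The only idea shown is that time advances through log entries rather than through each replica's OS clock.

```go
package main

import "fmt"

// Entry is a hypothetical log entry: either a Put with a TTL, or a
// Tick proposed by the leader to advance logical time by one.
type Entry struct {
	Tick bool
	Key  string // used when Tick is false
	TTL  uint64 // lifetime in logical ticks
}

// TTLStore never reads the OS clock. Its notion of time is advanced
// only by Tick entries that arrive through the replicated log.
type TTLStore struct {
	now      uint64
	expireAt map[string]uint64
}

func NewTTLStore() *TTLStore { return &TTLStore{expireAt: make(map[string]uint64)} }

// Apply executes one log entry deterministically.
func (s *TTLStore) Apply(e Entry) {
	if !e.Tick {
		s.expireAt[e.Key] = s.now + e.TTL
		return
	}
	s.now++
	for k, t := range s.expireAt {
		if t <= s.now {
			delete(s.expireAt, k) // expire keys whose lifetime has passed
		}
	}
}

// Has reports whether the key is still alive.
func (s *TTLStore) Has(key string) bool {
	_, ok := s.expireAt[key]
	return ok
}

func main() {
	log := []Entry{{Key: "k", TTL: 2}, {Tick: true}, {Tick: true}}
	r1, r2 := NewTTLStore(), NewTTLStore()
	for _, e := range log {
		r1.Apply(e)
		r2.Apply(e)
	}
	// Both replicas expire the key at the same logical time.
	fmt.Println(r1.Has("k"), r2.Has("k"))
}
```

In this sketch the leader would translate the passing of physical time into Tick proposals, so followers never need to agree on wall-clock time.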

• Extremely difficult to replicate: non-deterministic
state machines
• If the state transition has non-determinism (e.g. coming
from a multithreaded implementation), the states of replicas
can diverge
• It means the state machines cannot exploit multicore
parallelism!
[Figure: starting from the same state S, Replica 1 and Replica 2 may each move to any of S'1, S'2, ..., S'n. Which state will be the next one?]

• Summary
• Providing idempotent operations requires thought
about the design of state machines
• In addition, if your state machine has
• Probabilistic behaviour
• Time triggered action
• please be careful when you replicate it with Raft
• If your state machine is non-deterministic, replicating it
with Raft will be quite challenging
• I'll discuss a detailed example of this problem
later

TIPS FOR ETCD MANAGEMENT
AND RAFT PACKAGE USAGE

• What is etcd?
• No need to explain here…
• A highly available and consistent KVS
• General purpose configuration store
• As an open source clone of Googleʼs Chubby [Burrows,
OSDI ʼ06]
• The most important and interesting user would be
Kubernetes
Why do performance aspects of etcd matter?

• Borg uses Chubby and the paxos store
• Kubernetes uses etcd as the alternative to them
• Do they need to be highly performant?
Chubby

• The paxos store needs to be highly performant
• Borg is a distributed operating system, and the paxos
store is a kind of runqueue for it
[Verma et al. EuroSys ’15]
[Figure: a case of a non-distributed OS (from Operating Systems: Three Easy Pieces, http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-intro.pdf) next to a case of Borg]

• Runqueue of OS must be implemented in an
efficient manner
• A case of a non-distributed OS: the runqueue can be accessed
frequently by interrupt and exception handlers (on the order of
milliseconds)
• Locking and partitioning are very important
• A case of a distributed OS: is the runqueue of a
distributed OS accessed so frequently?
• Do we execute `kubectl create -f something.yaml`
thousands of times per second?

• Runqueue of OS must be implemented in an efficient manner
• Even for a distributed OS
• As the number of nodes that belong to the cluster becomes
larger, lots of scheduling events can be produced
https://www.youtube.com/watch?v=OIsCwc7qfTU

• Runqueue of OS must be implemented in an
efficient manner
• In a distributed environment, the runqueue needs to be
distributed and persisted for handling machine failures
2000 Nodes and Beyond: How We Scaled Kubernetes to 60,000-Container Clusters and Where We're Going Next
Marek Grabowski, KubeCon ‘16: http://sched.co/8K8w

• So, performance aspects of etcd matter
• For large scale k8s clusters
• For other deployments whose configuration changes frequently
• Performance aspects of Raft itself matter
• e.g. Spanner clones that store large amounts of data with Raft
• CockroachDB: https://github.com/cockroachdb/cockroach
• TiDB: https://github.com/pingcap/tidb
[Figure: from Corbett et al. OSDI ‘12]

• Performance aspects of etcd and Raft matter
• For both cases of configuration store and database
• What kinds of difficulties will we see
in practice?
• I'll provide a few examples:
1. Resource consumption and performance impact from
compaction
• Related to etcd management and raft package usage
2. How to reduce time consumed in a state machine
• Related to raft package usage
3. An experiment about improving throughput
• Related to raft package usage and etcd improvement?

Simplified etcd overview
[Figure: components of a single etcd node]
• clientv3 application (e.g. etcdctl): talks to etcdserver over gRPC
• gRPC: unmarshal/marshal, send/recv requests and responses
• etcdserver: propose commands to raft and receive commit decisions; apply commands to the state machine and get results
• raft: make decisions about whether log entries can be applied or not
• wal: append committed commands and persist them to SSD
• mvcc, lease, auth, alarm: work as the state machine

Simplified etcd overview
[Figure: a clientv3 application (e.g. etcdctl) talks to one etcdserver over gRPC; the raft modules of three etcdserver nodes are connected by rafthttp]
Via rafthttp, raft modules talk with each other (e.g. AppendEntries())

Resource consumption and performance impact from
compaction

• What is compaction in the context of Raft?
• Raft manages the operations of its state machine in the
form of a log
• Newly arrived log entries are appended to the tail of the log
• A log that grows without limit will exhaust space resources
• The log needs to be compacted periodically
• In addition, a snapshot needs to be created
How does compaction affect performance and
resource consumption?
[Figure: etcdserver, raft (with raft.MemoryStorage), and wal persisting to SSD. When should the log be compacted?]

• Generally speaking, frequency of compaction
introduces tradeoff: throughput vs recovery
• How is etcd performance affected by this tradeoff?
• The frequency of compaction can be controlled with
--snapshot-count and --snapshot-size (WIP:
https://github.com/coreos/etcd/pull/7782)
[Figure: the two ends of the tradeoff: "Throughput: high, Recovery: slow" and "Throughput: low, Recovery: speedy"]
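To make the role of a snapshot count concrete, here is a toy Go sketch of a log that is compacted once more than a given number of entries have been applied since the last snapshot. It is loosely modeled on the idea behind --snapshot-count only; the Log type and MaybeCompact are hypothetical, and the real etcd and raft.MemoryStorage code is not reproduced here.

```go
package main

import "fmt"

// Log is a toy in-memory raft log. All names here are hypothetical.
type Log struct {
	snapshotIndex uint64   // last index covered by the latest snapshot
	entries       []string // entries at snapshotIndex+1, snapshotIndex+2, ...
}

// Append adds an entry and returns its index.
func (l *Log) Append(e string) uint64 {
	l.entries = append(l.entries, e)
	return l.snapshotIndex + uint64(len(l.entries))
}

// MaybeCompact takes a "snapshot" at applied and drops the covered
// entries once more than snapshotCount entries were applied since the
// previous snapshot. It reports whether compaction happened.
func (l *Log) MaybeCompact(applied, snapshotCount uint64) bool {
	if applied <= l.snapshotIndex || applied-l.snapshotIndex <= snapshotCount {
		return false
	}
	drop := applied - l.snapshotIndex
	if drop > uint64(len(l.entries)) {
		drop = uint64(len(l.entries))
	}
	// Copy the tail so the old backing array can be garbage collected.
	l.entries = append([]string(nil), l.entries[drop:]...)
	l.snapshotIndex += drop
	return true
}

func main() {
	l := &Log{}
	compactions := 0
	for i := 0; i < 10; i++ {
		applied := l.Append(fmt.Sprintf("put-%d", i))
		if l.MaybeCompact(applied, 3) {
			compactions++
		}
	}
	fmt.Println(compactions, l.snapshotIndex, len(l.entries))
}
```

A small snapshot count compacts often and keeps few entries alive in memory; a large one compacts rarely and lets the in-memory log grow.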

• A little experiment on GCE
• 4 VMs of n1-standard-4 (4 vCPUs, 15 GB of memory,
SSD)
• X axis represents the value of --snapshot-count
[Figure: IOPS (Y axis, 20000 to 26000) for --snapshot-count values of 100, 1000, 10000, 100000 (default), 1000000, and 10000000 (virtually no compaction)]
benchmark --target-leader --conns=1000 --clients=1000 put --total=1000000

Why does this happen?

• Profiled the leader node with go pprof
• Two interesting functions could be found:
runtime.mallocgc (allocation) and runtime.scanobject
(GC)
[Figure: IOPS (20000 to 26000) together with time spent in runtime.mallocgc and runtime.scanobject (milliseconds, 0 to 5000) for --snapshot-count values of 100, 1000, 10000, 100000 (default), 1000000, and 10000000 (no compaction)]

• How do these functions relate to compaction
and performance?
• The number of live in-memory objects (created in
raft.MemoryStorage and managed by the Go runtime)
increases with the interval between compactions
• The increased number of objects can slow down the mark
phase of the Go GC
• runtime.scanobject
• Also, infrequent reclamation of memory can slow down
allocation
• The miss ratio of the thread local cache can increase
• runtime.mallocgc
• The analysis isn't complete, but it is consistent with the
observed data

• How about recovery speed?
• etcd solves the problem by limiting DB size (2GB
by default, up to 8GB)
• https://github.com/coreos/etcd/blob/master/Documentation/faq.md#deployment
• Interesting discussion including an answer from Xiang Li
• https://groups.google.com/forum/#!topic/etcd-dev/vCeSLBKC_M8
• So far, I haven't observed bad recovery
performance either

• Observations
• Too frequent compaction is harmful for throughput
• Quite natural
• Too infrequent compaction is also harmful for
throughput
• Even though it consumes more memory!
• A little bit tricky
• Seeking the best parameters for your workload would
be helpful
• --snapshot-count and --snapshot-size
• Buying expensive hardware (e.g. CPU with lots of cores)
isn't so helpful for improving the throughput of Raft based
systems and etcd

How to reduce time consumed in a state machine

• etcd provides access control based on the
concept of users and roles since v2
A short history of the etcd auth functionality
[Figure: a user is granted role1, role2, and role3; the roles are granted range permission1 to range permission4]

• etcd clients (including etcdctl) can be
authenticated with HTTP basic authentication
[Figure: etcd v2: a client application (e.g. etcdctl) talks to etcdserver over http; etcdserver proposes commands to raft, which persists them via wal to SSD, then applies commands to storage and gets results. bcrypt based password checking is executed by etcdserver when the http request arrives]

• etcd v2 executed bcrypt based password
checking at the API layer
• Once the check passed, the authorized
commands were sent to raft
• Practically it wouldn't be problematic, but it can result
in a TOCTOU (Time Of Check to Time Of Use) problem
• Admins can update passwords concurrently with the
requests
• Requests can be processed even if the authorization is
obsolete
• To reduce the possibility of the problem, the auth of
etcd v3 changed its design

[Figure: etcd v3: a clientv3 application (e.g. etcdctl) sends an auth token to etcdserver over gRPC; etcdserver proposes commands to raft, which persists them via wal to SSD, then applies commands to mvcc, lease, auth, and alarm and gets results. Password checking is executed by the auth module, a part of the state machine. Auth metadata updates and authentication are serialized with raft]

• Now the TOCTOU problem wonʼt happen
• Happy ending?
• There was another problem: high cost of bcrypt
password checking
• https://godoc.org/golang.org/x/crypto/bcrypt
• It requires almost 100ms even on a modern CPU!
• 100ms of CPU consumption means etcd can authenticate only about 10
times per second
• How should we solve this?

Solution: version number validation
1. Check the password in the etcdserver layer and save the
version number of the auth store
2. Once the password is checked, an authenticate request is sent to
raft
3. The rest of authentication is executed in
the state machine. The response includes the
version number of the auth store.
4. Compare the saved version number and the
number in the response
[Figure: a clientv3 application (e.g. etcdctl) talks to etcdserver over gRPC; etcdserver proposes commands to raft, which persists them via wal, then applies commands to mvcc, lease, auth (versioned), and alarm and gets results]

• If the state machine has a version number, it can be
used for the version number validation of
OCC (Optimistic Concurrency Control)
• The original idea was provided by Anthony Romano
• Similar to multi-key transactions in database systems
• As in the case of idempotency, a versioned structure is helpful!
• It can reduce the precious time spent in the state machine layer
[Figure: etcdserver (bypassing raft) issues Read(k1) and gets val1, version1, then Read(k2) and gets val2, version2 from the state machine. It then validates version1 and version2 and gets an ack: "My data is consistent!"]
• Concurrent modification can be detected like this:
[Figure: etcdserver thread1 (bypassing raft) issues Read(k1) and gets val1, version1, then Read(k2) and gets val2, version2 from the state machine of raft. Meanwhile etcdserver thread2 (not bypassing raft) issues Write(k1, v1) and gets Ack, version1 -> version1'. When thread1 validates version1 and version2, it finds that version1 is updated: "My data is inconsistent!"]
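The version number validation above can be sketched in Go as follows. This is a toy under stated assumptions, not etcd's auth code: AuthStore, Login, and the error values are hypothetical, sha256 stands in for bcrypt so that the example needs only the standard library, and the concurrent admin update is simulated with a callback.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

var (
	ErrBadPassword = errors.New("bad password")
	ErrStale       = errors.New("auth store changed since the password check")
)

// AuthStore is a toy versioned auth store. Its revision grows on every change.
type AuthStore struct {
	revision uint64
	hashes   map[string][32]byte // sha256 stands in for bcrypt here
}

func NewAuthStore() *AuthStore { return &AuthStore{hashes: make(map[string][32]byte)} }

// SetPassword models an update applied through the log; it bumps the revision.
func (s *AuthStore) SetPassword(user, password string) {
	s.hashes[user] = sha256.Sum256([]byte(password))
	s.revision++
}

// CheckPassword is the expensive step, executed outside the state
// machine. It returns the revision the check was made against.
func (s *AuthStore) CheckPassword(user, password string) (uint64, bool) {
	h, ok := s.hashes[user]
	return s.revision, ok && h == sha256.Sum256([]byte(password))
}

// Authenticate is the cheap step executed inside the state machine.
// Its response carries the current revision of the auth store.
func (s *AuthStore) Authenticate(user string) uint64 { return s.revision }

// Login follows the four steps. The concurrent callback simulates an
// update that slips in between the check and the authenticate request.
func Login(s *AuthStore, user, password string, concurrent func()) error {
	saved, ok := s.CheckPassword(user, password) // step 1
	if !ok {
		return ErrBadPassword
	}
	if concurrent != nil {
		concurrent()
	}
	got := s.Authenticate(user) // steps 2 and 3
	if got != saved {           // step 4
		return ErrStale
	}
	return nil
}

func main() {
	s := NewAuthStore()
	s.SetPassword("alice", "old")
	fmt.Println(Login(s, "alice", "old", nil))
	fmt.Println(Login(s, "alice", "old", func() { s.SetPassword("alice", "new") }))
}
```

The expensive hash runs outside the state machine, and the revision comparison catches the case where the auth store changed in between, so a stale check is rejected instead of being silently accepted.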

An experiment about improving throughput

• Replicating state machines with non
deterministic transition is hard
• The non determinism introduces divergence in the
replicas
• Replicating state machines that exploit multicore
parallelism is hard
• Replicating state machines that exploit high bandwidth
of modern I/O devices is also hard
SMR and parallelism
[Figure: starting from the same state S, Replica 1 and Replica 2 may each move to any of S'1, S'2, ..., S'n]

• What kind of techniques are available?
• EVE [Kapritsos et al. OSDI ʻ12]
• Treats divergence of state as the result of a
Byzantine fault and fixes it in the agreement process
• Rex [Guo et al. EuroSys ʻ14]
• Speculation based replication for multicore scalable systems
• https://github.com/Microsoft/rDSN
• Crane [Cui et al. SOSP ʻ15]
• Replication based on deterministic scheduling (originally
established for debugging purposes)
• POSIX applications can be replicated without modification
• https://github.com/columbia/crane
• All of them are research prototypes
• Replicating multicore scalable state machines is a cutting
edge research topic!
• Very hopeful, but using them today would require
significant engineering cost
SMR and parallelism

• How about etcd specific optimization for the purpose?
• etcd's main functionality is a KVS that supports transactional
access
• The core storage functionality is implemented in the mvcc
package
• BoltDB based
• If keys are independent, update requests on them are
commutative
The case of etcd
[Figure: etcdserver applies commands (e.g. a single key put or a multiple keys transaction) to mvcc (based on BoltDB), which persists them to SSD. "Put k1, Put k2" and "Put k2, Put k1" are commutative]

• etcdserver applies the commands that are
supplied by raft
• Iteration: apply a single command, go to the next one…
The case of etcd
[Figure: etcdserver obtains a log entry from raft, applies the command (e.g. a single key put or a multiple keys transaction), persists it to SSD, and goes to the next one]

• Exploiting KVS specific semantics?
• A KVS has commutativity in its operations
• Individual commands (e.g. Put(key1) and Put(key2)) can be
grouped into a single large transaction
The case of etcd
[Figure: etcdserver grabs independent commands from raft, converts multiple commands into a single large txn, and persists it to SSD (issuing multiple puts at once). Can this be performed effectively?]
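One way to picture "grab independent commands" is the following Go sketch. It is an assumption made for illustration, not the code of the experimental branch: Put and GroupIndependent are hypothetical names, and the grouping rule (start a new batch when a key repeats) is just one simple policy that preserves per-key order.

```go
package main

import "fmt"

// Put is a hypothetical committed command.
type Put struct{ Key, Value string }

// GroupIndependent splits committed puts, in log order, into batches
// whose keys are all distinct. A put on a key that is already in the
// current batch starts a new batch, so per-key order is preserved.
func GroupIndependent(puts []Put) [][]Put {
	var batches [][]Put
	var cur []Put
	seen := make(map[string]bool)
	for _, p := range puts {
		if seen[p.Key] {
			batches = append(batches, cur)
			cur = nil
			seen = make(map[string]bool)
		}
		cur = append(cur, p)
		seen[p.Key] = true
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	puts := []Put{{"k1", "a"}, {"k2", "b"}, {"k3", "c"}, {"k1", "d"}, {"k2", "e"}}
	for i, b := range GroupIndependent(puts) {
		fmt.Println(i, len(b)) // each batch could be issued as one large txn
	}
}
```

With well distributed keys the batches are large; if every put hits the same key, each batch holds a single put and nothing is gained.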

The case of etcd
[Figure: a clientv3 application (e.g. etcdctl) talks to etcdserver over gRPC; the raft modules of two etcdserver nodes are connected by rafthttp]
• Via rafthttp, raft modules talk with each other (e.g. AppendEntries())
• etcd sends entries in a batched manner
• Raft itself is friendly to batching: the RPC is AppendEntries(),
not AppendEntry()
• In a case of 1000 concurrent clients, the peak
number of batched entries can be 1000

• Benchmarking mvcc individually
• Grouping multiple puts in a single transaction improves
total IOPS
• tools/benchmark: `benchmark mvcc put` can be used
for this purpose
The case of etcd
[Figure: IOPS (Y axis, 0 to 200000) of `benchmark mvcc put --total X --txn-ops Y --txn` (X * Y = 1000000) for 1 key/txn, 10 keys/txn, and 100 keys/txn. Without batching, every put is its own txn/commit persisted to SSD; with batching, one txn issues put, put, put… and commits once]

• Turning multiple puts into a single txn
• https://github.com/mitake/etcd/commits/batch-append-group-commit
• The performance improvement isn't so excellent (almost
10% higher IOPS)
• Keys need to be distributed
• Skewed access does not benefit from this strategy
The case of etcd
[Figure: IOPS (Y axis, 0 to 20000) of original etcd vs. group commit]
Benchmark command:
benchmark --target-leader --conns=1000 --clients=1000 put --total=1000000 --sequential-keys --key-space-size 1000000

• Is the idea worth further investment?
• I'm not sure
• There is some room for improvement:
1. Multicore scalable backend: the current mvcc allows only one
writer at a time
2. Pipelining rafthttp: exploit network bandwidth more
aggressively
• If we face throughput problems in the future, revisiting
it would be helpful
The case of etcd

CONCLUSION

• Raft is a solid foundation for highly available
and consistent distributed storage systems
• If you want your own system, etcdʼs raft package is a
good starting point for you
• However, it doesn't mean we can replicate
any state machine easily with it
• Probabilistic behaviour and time triggered action will
introduce some difficulties
• A version aware structure will be helpful
• Non-determinism will be a serious problem
• Not only replication methodologies, but also the state
machines themselves matter!
Conclusion

• Exploiting performance of modern hardware
by Raft based systems is not easy
• Especially exploiting parallelism of multicore and
bandwidth of I/O devices is difficult
• etcd also has room to evolve
• They are exciting technical challenges!
Conclusion

Thanks for listening! Questions?
Comments are welcome
email: mitake.hitoshi@lab.ntt.co.jp
github: @mitake
Twitter: @_3take

APPENDIX

• [Ongaro and Ousterhout, USENIX ATC 2014]
• https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
• https://raft.github.io/ has other important materials
• [Kapritsos et al. OSDI ʻ12]
• https://www.usenix.org/node/170851
• [Guo et al. EuroSys ʻ14]
• https://www.microsoft.com/en-us/research/publication/rex-replication-at-the-speed-of-multi-core/
• [Cui et al. SOSP ʻ15]
• http://i.cs.hku.hk/~heming/papers/crane-sosp15.pdf
• [Lamport TOCS ʻ98]
• http://lamport.azurewebsites.net/pubs/pubs.html#lamport-paxos
• [Verma et al. EuroSys ʼ15]
• https://research.google.com/pubs/pub43438.html
References

• Techniques for efficient SMR
• https://www.usenix.org/conference/atc13/technical-sessions/presentation/bessani
• https://fpaxos.github.io/
• https://www.usenix.org/legacy/events/nsdi11/tech/full_papers/Bolosky.pdf
• Chapters of the SRE book that include topics
related to Paxos
• https://landing.google.com/sre/book/chapters/managing-critical-state.html
• https://landing.google.com/sre/book/chapters/distributed-periodic-scheduling.html
• Comparison of etcd, zookeeper and consul
• https://coreos.com/blog/performance-of-etcd.html
Other interesting papers and articles
