7. What is etcd?
etcd is a strongly consistent, distributed key-value
store that provides a reliable way to store data that
needs to be accessed by a distributed system.
● Use Cases:
o Configuration management, Service discovery
o Distributed locks
o Kubernetes stores its configuration data in etcd for service
discovery and cluster management.
8. History
2013-06: Initial Commit
● CoreOS contribution
2014-06: etcd v0.2
● Kubernetes v0.4
● 10x community
2015-02: First Stable Release, etcd v2.0
● Raft consistency protocol
● 1,000 writes/second
2017-01: etcd v3.1
● New APIs
● Fast linearizable read
● gRPC proxy
2018-02: CNCF Incubation
● 30+ projects using etcd
● 400+ contribution groups
● 9 maintainers from 8 companies
2019-08: etcd v3.4
● Leaner member
● Fully concurrent read
● Performance enhancement
2020-11: CNCF Graduation
● Security audit
● Jepsen testing
● Testing and bug fixes
2021-06: etcd v3.5
● Performance enhancement
● Reduced memory usage
● Zap logger
9. Features
● Simple interface: HTTP API, gRPC
● Fast: Benchmarked 10,000 writes/sec
● Reliable: distributed consensus based on Raft algorithm
● SSL certificate authentication, Role-based ACL
● Watch for changes
● Transaction, Lease
● Distributed locks, Leader election
17. Etcd Cluster
[Diagram: a three-member cluster (follower, leader, follower); each member
serves its own clients through a gRPC server and stacks Raft on top of an
MVCC store persisted via WAL and snapshots.]
18. Etcd Components
[Diagram: layered view of one etcd server.]
● Client: clientv3/etcdctl, connecting to the gRPC server (or to the HTTP
API via the gRPC gateway)
● etcd server API layer: KVServer, Quota, Auth, Lease, Compactor, Metric
● Raft: leader election, log replication, memberships, read index
● Applier: applies committed entries to the MVCC store
● MVCC store: Tree Index + boltdb
● Storage: WAL and snapshots
19. Read Flow
[Diagram: linearizable read path through the components of slide 18.]
1. Client sends a read request to the gRPC server / KVServer
2. KVServer requests a ReadIndex from Raft
3. Raft confirms leadership and returns the read state via its Ready channel
4. Once the applied index catches up to the read index, the request is
served from the MVCC store
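The final step above can be sketched in a few lines. This is a minimal, illustrative model (the names `server`, `apply`, and `linearizableRead` are assumptions, not etcd's actual API): a linearizable read blocks until the local applied index has caught up to the read index obtained from the leader, and only then serves the read locally.

```go
package main

import (
	"fmt"
	"sync"
)

// server models only the two pieces of state the ReadIndex wait needs.
type server struct {
	mu           sync.Mutex
	cond         *sync.Cond
	appliedIndex uint64
}

func newServer() *server {
	s := &server{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// apply advances the applied index as committed entries are executed.
func (s *server) apply(index uint64) {
	s.mu.Lock()
	s.appliedIndex = index
	s.mu.Unlock()
	s.cond.Broadcast()
}

// linearizableRead blocks until appliedIndex >= readIndex; after that it is
// safe to answer the read from the local MVCC store.
func (s *server) linearizableRead(readIndex uint64) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.appliedIndex < readIndex {
		s.cond.Wait()
	}
	return s.appliedIndex
}

func main() {
	s := newServer()
	go s.apply(5) // entries up to index 5 get applied concurrently
	fmt.Println(s.linearizableRead(5) >= 5)
}
```

The wait is what makes the read linearizable: any write committed before the read started is already reflected in the local store by the time the read is served.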
20. Write Flow
[Diagram: write path through the components of slide 18.]
1. Client sends a write request to the gRPC server / KVServer (after a
quota check)
2. KVServer proposes the request to Raft
3. Raft hands the new entry back via its Ready channel
4. The entry is persisted to the WAL and replicated to the followers
5. Once a majority has the entry, the Applier applies it
6. The result is written to the MVCC store and returned to the client
21. MVCC
Multi-version concurrency control
[Diagram: Put(key, value) creates a new revision of the key (v1, v2, v3,
v4, ...).]
● Tree Index (btree): key -> revision
● Backend, BoltDB (B+ tree): revision -> (key, value)
● ReadTx and BatchTx each keep a buffer; buffered writes are merged and
written to BoltDB in batches
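The two-level layout above can be sketched as follows. This is a deliberately simplified model (real etcd uses a B-tree index and BoltDB, and revisions are <main, sub> pairs; the `mvcc` type here is an illustration): an in-memory index maps key -> list of revisions, and the backing store maps revision -> value, so old versions stay readable until compaction.

```go
package main

import "fmt"

// mvcc models etcd's two-level MVCC layout with plain maps.
type mvcc struct {
	rev   int64
	index map[string][]int64 // key -> revisions, ascending (the "tree index")
	store map[int64]string   // revision -> value (the "backend")
}

func newMVCC() *mvcc {
	return &mvcc{index: map[string][]int64{}, store: map[int64]string{}}
}

// Put records a new revision of key; earlier revisions remain readable.
func (m *mvcc) Put(key, value string) int64 {
	m.rev++
	m.index[key] = append(m.index[key], m.rev)
	m.store[m.rev] = value
	return m.rev
}

// Get returns the value of key as of atRev: the newest revision of the key
// that is <= atRev.
func (m *mvcc) Get(key string, atRev int64) (string, bool) {
	revs := m.index[key]
	for i := len(revs) - 1; i >= 0; i-- {
		if revs[i] <= atRev {
			return m.store[revs[i]], true
		}
	}
	return "", false
}

func main() {
	m := newMVCC()
	r1 := m.Put("k", "v1")
	r2 := m.Put("k", "v2")
	v1, _ := m.Get("k", r1)
	v2, _ := m.Get("k", r2)
	fmt.Println(v1, v2) // reads at old revisions still see old values
}
```

Keeping every revision is what lets etcd serve reads at a past revision and power watches, at the cost of periodic compaction.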
23. Raft Algorithm
● Raft (2013) is a distributed consensus algorithm
designed to be easy to understand and an alternative
to the Paxos (1998) algorithm.
● One leader, multiple followers.
● System makes progress if majority of servers are up.
● Failure model: delayed/lost messages, fail-stop (but not
Byzantine failures)
25. Raft Decomposition
● Leader election
○ Select one server to act as leader
○ Detect crashes, choose new leader
● Log replication (normal operation)
○ Leader accepts commands from clients, appends to its log
○ Leader replicates its log to other servers (overwrites
inconsistencies)
● Safety
○ Keep logs consistent
○ Only servers with up-to-date logs can become leader
27. Raft Server States
[Diagram: transitions between Follower, Candidate, and Leader.]
● Follower: passive; receives log entries and heartbeats; every server
starts up here
● Candidate (follower times out, starts election): issues RequestVote RPCs
to get elected as leader in the term; retries on timeout or split votes
● Leader (candidate wins election): issues AppendEntries RPCs to replicate
its log and heartbeats to maintain leadership; reverts to follower on
discovering a higher term
28. Raft Terms
• At most 1 leader per term
• Some terms have no leader (failed election, split vote)
• Each server maintains current term value
o Exchanged in every RPC
o Peer has higher term => Update term, revert to follower
o Incoming RPC has obsolete term => Reject, reply error
[Diagram: timeline divided into terms 1-5; each term starts with an
election followed by normal operation; a split vote leaves a term with no
leader.]
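The term rules above fit in one small function. This is an illustrative sketch (the `node` type and `onRPC` name are assumptions, not etcd's raft package): a higher incoming term forces the server to adopt it and revert to follower, while an RPC carrying an obsolete term is rejected.

```go
package main

import "fmt"

type state int

const (
	follower state = iota
	candidate
	leader
)

type node struct {
	term  uint64
	state state
}

// onRPC applies the term-exchange rules to the term carried by an incoming
// RPC and reports whether the RPC should be processed.
func (n *node) onRPC(peerTerm uint64) bool {
	switch {
	case peerTerm > n.term:
		// Peer has a higher term: update our term and revert to follower.
		n.term = peerTerm
		n.state = follower
		return true
	case peerTerm < n.term:
		// Obsolete term: reject and reply with an error.
		return false
	default:
		return true
	}
}

func main() {
	n := &node{term: 3, state: leader}
	fmt.Println(n.onRPC(2))                  // obsolete term rejected
	fmt.Println(n.onRPC(5))                  // higher term accepted
	fmt.Println(n.term, n.state == follower) // stepped down to follower
}
```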
29. Leader Election
[Diagram: election flow.]
● Follower: on heartbeat timeout, becomes a candidate
● Candidate: increments the current term, votes for itself, and sends
RequestVote RPCs to the other servers
● On votes from a majority: becomes leader and sends heartbeats
● On receiving a heartbeat from a leader: becomes a follower
● On election timeout: starts a new election
30. Election Correctness
• Safety: at most one winner per term
o Each server gives only one vote per term
o A majority is required to win the election
• Liveness: some candidate must eventually win
o Choose the election timeout randomly (e.g. 1000-2000 ms)
o One server usually times out and wins the election before the others
time out
31. Log Replication
1. Client sends a command to the leader
2. Leader appends the command to its log
3. Leader sends AppendEntries RPCs to all followers
4. The new entry becomes committed once the leader receives replies from a
majority of servers
5. Leader executes the command in its state machine and returns the result
to the client
6. Leader notifies followers of committed entries in subsequent
AppendEntries RPCs
7. Followers execute committed commands in their state machines
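Step 4 above can be sketched as a small calculation (a simplified illustration: real Raft additionally requires the committed entry's term to equal the leader's current term). If matchIndex[i] is the highest log index known to be replicated on server i (leader included), the commit index is the median: the largest index that a majority of servers hold.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the largest log index replicated on a majority of
// servers, given each server's highest replicated index.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// With n servers, the value at position (n-1)/2 of the sorted slice is
	// held by a majority: every server at or above that position has it.
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// 5 servers: leader at index 9, followers at 9, 7, 4, 3.
	// Three servers (a majority) hold index 7, so 7 is committed.
	fmt.Println(commitIndex([]uint64{9, 9, 7, 4, 3}))
}
```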
32. Log Structure
[Diagram: logs of a leader for term 3 and four followers, log indices 1-10;
each entry stores a term number and a command, e.g. <term 1, x = 3>.
Entries replicated on a majority of servers are marked as committed.]
• If a given entry is committed, all preceding entries are also committed
• Same term and index => same command
33. Log Inconsistence
Raft minimizes special code for repairing inconsistencies
• Leader assumes its log is correct
• Normal operation will repair all inconsistencies
[Diagram: logs of a leader for term 4 and four followers; some followers
are missing recent entries, others carry extra uncommitted entries from
earlier terms. Normal AppendEntries operation repairs both cases.]
34. Append Entries
• AppendEntries RPCs include <term, index> of entry preceding
new one(s)
• Follower must contain matching entry, otherwise it rejects request
o Leader retries with lower log index
• Implements an induction step, ensures Log Matching Property
[Diagram: leader log for indices 1-5 and a follower that is missing the
last entry; after a successful AppendEntries carrying the <term, index> of
the preceding entry, the follower's log matches the leader's.]
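The consistency check described above can be sketched as follows. This is an illustrative model, not etcd's raft package (the `entry` type and `appendEntries` signature are assumptions): the follower accepts the RPC only if its log contains an entry at prevIndex with term prevTerm; on success it truncates any conflicting tail and appends the leader's entries.

```go
package main

import "fmt"

// entry is one log slot: the term it was created in, plus the command.
type entry struct {
	Term uint64
	Cmd  string
}

// appendEntries applies the AppendEntries consistency check on a follower's
// log and returns the resulting log plus whether the RPC was accepted.
// Index 1 is the first log slot (log[0] corresponds to index 1); prevIndex
// of 0 means the new entries start at the beginning of the log.
func appendEntries(log []entry, prevIndex, prevTerm uint64, entries []entry) ([]entry, bool) {
	if prevIndex > 0 {
		if uint64(len(log)) < prevIndex || log[prevIndex-1].Term != prevTerm {
			// Mismatch: reject; the leader retries with a lower log index.
			return log, false
		}
	}
	// Truncate everything after prevIndex and append the leader's entries,
	// overwriting any inconsistencies.
	return append(log[:prevIndex], entries...), true
}

func main() {
	follower := []entry{{1, "x=3"}, {1, "y=4"}, {1, "x=9"}}
	// Leader sends the entry for index 3 with prev = <index 2, term 1>:
	// the stale term-1 entry at index 3 is overwritten.
	newLog, ok := appendEntries(follower, 2, 1, []entry{{2, "x=4"}})
	fmt.Println(ok, len(newLog), newLog[2].Term)
}
```

Because each RPC re-verifies the entry before the new ones, a successful append inductively guarantees the whole prefix matches: this is the Log Matching Property.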
35. Append Entries
[Diagram: leader log for indices 1-6 and a follower whose tail diverges
with stale term-1 entries. The first AppendEntries mismatches at the
preceding <term, index> and is rejected; the leader retries with a lower
log index until the check succeeds, and the follower's conflicting tail is
overwritten to match the leader.]
36. Safety: Leader Completeness
• Once a log entry is committed, all future leaders must store that entry.
• Servers with incomplete logs must not get elected.
o Candidates include the term and index of their last log entry in
RequestVote RPCs
o A voting node denies its vote if its own log is more up-to-date
o Logs are ranked by <last term, last index>
[Diagram: leader election for term 4.]
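The <last term, last index> ranking above is a two-line comparison. A minimal sketch (the function name is illustrative): a voter grants its vote only if the candidate's log ranks at least as high as its own.

```go
package main

import "fmt"

// candidateUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: compare last terms first, then last indices.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(candidateUpToDate(3, 5, 2, 9)) // higher last term wins
	fmt.Println(candidateUpToDate(3, 4, 3, 6)) // same term, shorter log loses
}
```

Combined with the majority vote rule, this guarantees Leader Completeness: any winning candidate's log ranks at least as high as a majority's, and every committed entry lives on a majority.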
38. Summary
● etcd is a strongly consistent, distributed, reliable key-
value store for the most critical data of a distributed
system
● Design and Architecture
○ Use gRPC to provide a simple and fast API
○ Use BoltDB as the storage backend
○ Use MVCC to provide concurrent read/write
○ Use Raft algorithm to achieve high availability
● Go and read more code 🧑💻 👩💻
41. Reference
● Etcd.io
● Raft (algorithm)
● https://raft.github.io/
● https://github.com/etcd-io/etcd
● Designing for Understandability: The Raft Consensus
Algorithm
● "Raft - The Understandable Distributed Protocol" by Ben
Johnson (2013)
● Getting Started with Kubernetes | etcd Performance
Optimization Practices