7. What is etcd?
etcd is a strongly consistent, distributed key-value
store that provides a reliable way to store data that
needs to be accessed by a distributed system.
● Use Cases:
o Configuration management, Service discovery
o Distributed locks
o Kubernetes stores its configuration data in etcd for service
discovery and cluster management.
8. History
2013-06: Initial Commit
● CoreOS contribution
2014-06: etcd v0.2
● Kubernetes v0.4
● 10x community
2015-02: First Stable Release, etcd v2.0
● Raft consistency protocol
● 1,000 writes/second
2017-01: etcd v3.1
● New APIs
● Fast linearizable read
● gRPC proxy
2018-02: CNCF Incubation
● 30+ projects using etcd
● 400+ contribution groups
● 9 maintainers from 8 companies
2019-08: etcd v3.4
● Leaner member
● Fully concurrent read
● Performance enhancement
2020-11: CNCF Graduation
● Security audit
● Jepsen testing
● Testing and bug fixes
2021-06: etcd v3.5
● Performance enhancement
● Reduced memory usage
● Zap logger
9. Features
● Simple interface: HTTP API, gRPC
● Fast: Benchmarked 10,000 writes/sec
● Reliable: distributed consensus based on Raft algorithm
● SSL certificate authentication, Role-based ACL
● Watch for changes
● Transaction, Lease
● Distributed locks, Leader election
17. Etcd Cluster
[Diagram: a three-member cluster (follower, leader, follower); each member
serves its own clients through a gRPC server and stacks Raft on top of an
MVCC store persisted via WAL and snapshots.]
18. Etcd Components
[Diagram: layered view of one etcd server.]
● Client: clientv3/etcdctl, connecting to the gRPC server (or to the HTTP
API via the gRPC gateway)
● etcd server API layer: KVServer, Quota, Auth, Lease, Compactor, Metric
● Raft: leader election, log replication, memberships, read index
● Applier: applies committed entries to the MVCC store
● MVCC store: Tree Index + boltdb
● Storage: WAL and snapshots
19. Read Flow
[Diagram: linearizable read path through the components of slide 18.]
1. Client sends a read request to the gRPC server / KVServer
2. KVServer requests a ReadIndex from Raft
3. Raft confirms leadership and returns the read state via its Ready channel
4. Once the applied index catches up to the read index, the request is
served from the MVCC store
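The final step above can be sketched in a few lines. This is a minimal, illustrative model (the names `server`, `apply`, and `linearizableRead` are assumptions, not etcd's actual API): a linearizable read blocks until the local applied index has caught up to the read index obtained from the leader, and only then serves the read locally.

```go
package main

import (
	"fmt"
	"sync"
)

// server models only the two pieces of state the ReadIndex wait needs.
type server struct {
	mu           sync.Mutex
	cond         *sync.Cond
	appliedIndex uint64
}

func newServer() *server {
	s := &server{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// apply advances the applied index as committed entries are executed.
func (s *server) apply(index uint64) {
	s.mu.Lock()
	s.appliedIndex = index
	s.mu.Unlock()
	s.cond.Broadcast()
}

// linearizableRead blocks until appliedIndex >= readIndex; after that it is
// safe to answer the read from the local MVCC store.
func (s *server) linearizableRead(readIndex uint64) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.appliedIndex < readIndex {
		s.cond.Wait()
	}
	return s.appliedIndex
}

func main() {
	s := newServer()
	go s.apply(5) // entries up to index 5 get applied concurrently
	fmt.Println(s.linearizableRead(5) >= 5)
}
```

The wait is what makes the read linearizable: any write committed before the read started is already reflected in the local store by the time the read is served.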
20. Write Flow
[Diagram: write path through the components of slide 18.]
1. Client sends a write request to the gRPC server / KVServer (after a
quota check)
2. KVServer proposes the request to Raft
3. Raft hands the new entry back via its Ready channel
4. The entry is persisted to the WAL and replicated to the followers
5. Once a majority has the entry, the Applier applies it
6. The result is written to the MVCC store and returned to the client
21. MVCC
Multi-version concurrency control
[Diagram: Put(key, value) creates a new revision of the key (v1, v2, v3,
v4, ...).]
● Tree Index (btree): key -> revision
● Backend, BoltDB (B+ tree): revision -> (key, value)
● ReadTx and BatchTx each keep a buffer; buffered writes are merged and
written to BoltDB in batches
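The two-level layout above can be sketched as follows. This is a deliberately simplified model (real etcd uses a B-tree index and BoltDB, and revisions are <main, sub> pairs; the `mvcc` type here is an illustration): an in-memory index maps key -> list of revisions, and the backing store maps revision -> value, so old versions stay readable until compaction.

```go
package main

import "fmt"

// mvcc models etcd's two-level MVCC layout with plain maps.
type mvcc struct {
	rev   int64
	index map[string][]int64 // key -> revisions, ascending (the "tree index")
	store map[int64]string   // revision -> value (the "backend")
}

func newMVCC() *mvcc {
	return &mvcc{index: map[string][]int64{}, store: map[int64]string{}}
}

// Put records a new revision of key; earlier revisions remain readable.
func (m *mvcc) Put(key, value string) int64 {
	m.rev++
	m.index[key] = append(m.index[key], m.rev)
	m.store[m.rev] = value
	return m.rev
}

// Get returns the value of key as of atRev: the newest revision of the key
// that is <= atRev.
func (m *mvcc) Get(key string, atRev int64) (string, bool) {
	revs := m.index[key]
	for i := len(revs) - 1; i >= 0; i-- {
		if revs[i] <= atRev {
			return m.store[revs[i]], true
		}
	}
	return "", false
}

func main() {
	m := newMVCC()
	r1 := m.Put("k", "v1")
	r2 := m.Put("k", "v2")
	v1, _ := m.Get("k", r1)
	v2, _ := m.Get("k", r2)
	fmt.Println(v1, v2) // reads at old revisions still see old values
}
```

Keeping every revision is what lets etcd serve reads at a past revision and power watches, at the cost of periodic compaction.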
23. Raft Algorithm
● Raft (2013) is a distributed consensus algorithm
designed to be easy to understand and an alternative
to the Paxos (1998) algorithm.
● One leader, multiple followers.
● System makes progress if majority of servers are up.
● Failure model: delayed/lost messages, fail-stop (but not
Byzantine failures)
25. Raft Decomposition
● Leader election
○ Select one server to act as leader
○ Detect crashes, choose new leader
● Log replication (normal operation)
○ Leader accepts commands from clients, appends to its log
○ Leader replicates its log to other servers (overwrites
inconsistencies)
● Safety
○ Keep logs consistent
○ Only servers with up-to-date logs can become leader
27. Raft Server States
[Diagram: transitions between Follower, Candidate, and Leader.]
● Follower: passive; receives log entries and heartbeats; every server
starts up here
● Candidate (follower times out, starts election): issues RequestVote RPCs
to get elected as leader in the term; retries on timeout or split votes
● Leader (candidate wins election): issues AppendEntries RPCs to replicate
its log and heartbeats to maintain leadership; reverts to follower on
discovering a higher term
28. Raft Terms
• At most 1 leader per term
• Some terms have no leader (failed election, split vote)
• Each server maintains current term value
o Exchanged in every RPC
o Peer has higher term => Update term, revert to follower
o Incoming RPC has obsolete term => Reject, reply error
[Diagram: timeline divided into terms 1-5; each term starts with an
election followed by normal operation; a split vote leaves a term with no
leader.]
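The term rules above fit in one small function. This is an illustrative sketch (the `node` type and `onRPC` name are assumptions, not etcd's raft package): a higher incoming term forces the server to adopt it and revert to follower, while an RPC carrying an obsolete term is rejected.

```go
package main

import "fmt"

type state int

const (
	follower state = iota
	candidate
	leader
)

type node struct {
	term  uint64
	state state
}

// onRPC applies the term-exchange rules to the term carried by an incoming
// RPC and reports whether the RPC should be processed.
func (n *node) onRPC(peerTerm uint64) bool {
	switch {
	case peerTerm > n.term:
		// Peer has a higher term: update our term and revert to follower.
		n.term = peerTerm
		n.state = follower
		return true
	case peerTerm < n.term:
		// Obsolete term: reject and reply with an error.
		return false
	default:
		return true
	}
}

func main() {
	n := &node{term: 3, state: leader}
	fmt.Println(n.onRPC(2))                  // obsolete term rejected
	fmt.Println(n.onRPC(5))                  // higher term accepted
	fmt.Println(n.term, n.state == follower) // stepped down to follower
}
```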
29. Leader Election
[Diagram: election flow.]
● Follower: on heartbeat timeout, becomes a candidate
● Candidate: increments the current term, votes for itself, and sends
RequestVote RPCs to the other servers
● On votes from a majority: becomes leader and sends heartbeats
● On receiving a heartbeat from a leader: becomes a follower
● On election timeout: starts a new election
30. Election Correctness
• Safety: at most one winner per term
o Each server gives only one vote per term
o A majority is required to win the election
• Liveness: some candidate must eventually win
o Choose the election timeout randomly (e.g. 1000-2000 ms)
o One server usually times out and wins the election before the others
time out
31. Log Replication
1. Client sends a command to the leader
2. Leader appends the command to its log
3. Leader sends AppendEntries RPCs to all followers
4. The new entry becomes committed once the leader receives replies from a
majority of servers
5. Leader executes the command in its state machine and returns the result
to the client
6. Leader notifies followers of committed entries in subsequent
AppendEntries RPCs
7. Followers execute committed commands in their state machines
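Step 4 above can be sketched as a small calculation (a simplified illustration: real Raft additionally requires the committed entry's term to equal the leader's current term). If matchIndex[i] is the highest log index known to be replicated on server i (leader included), the commit index is the median: the largest index that a majority of servers hold.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the largest log index replicated on a majority of
// servers, given each server's highest replicated index.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// With n servers, the value at position (n-1)/2 of the sorted slice is
	// held by a majority: every server at or above that position has it.
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// 5 servers: leader at index 9, followers at 9, 7, 4, 3.
	// Three servers (a majority) hold index 7, so 7 is committed.
	fmt.Println(commitIndex([]uint64{9, 9, 7, 4, 3}))
}
```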
32. Log Structure
[Diagram: logs of a leader for term 3 and four followers, log indices 1-10;
each entry stores a term number and a command, e.g. <term 1, x = 3>.
Entries replicated on a majority of servers are marked as committed.]
• If a given entry is committed, all preceding entries are also committed
• Same term and index => same command
33. Log Inconsistence
Raft minimizes special code for repairing inconsistencies
• Leader assumes its log is correct
• Normal operation will repair all inconsistencies
[Diagram: logs of a leader for term 4 and four followers; some followers
are missing recent entries, others carry extra uncommitted entries from
earlier terms. Normal AppendEntries operation repairs both cases.]
34. Append Entries
• AppendEntries RPCs include <term, index> of entry preceding
new one(s)
• Follower must contain matching entry, otherwise it rejects request
o Leader retries with lower log index
• Implements an induction step, ensures Log Matching Property
[Diagram: leader log for indices 1-5 and a follower that is missing the
last entry; after a successful AppendEntries carrying the <term, index> of
the preceding entry, the follower's log matches the leader's.]
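The consistency check described above can be sketched as follows. This is an illustrative model, not etcd's raft package (the `entry` type and `appendEntries` signature are assumptions): the follower accepts the RPC only if its log contains an entry at prevIndex with term prevTerm; on success it truncates any conflicting tail and appends the leader's entries.

```go
package main

import "fmt"

// entry is one log slot: the term it was created in, plus the command.
type entry struct {
	Term uint64
	Cmd  string
}

// appendEntries applies the AppendEntries consistency check on a follower's
// log and returns the resulting log plus whether the RPC was accepted.
// Index 1 is the first log slot (log[0] corresponds to index 1); prevIndex
// of 0 means the new entries start at the beginning of the log.
func appendEntries(log []entry, prevIndex, prevTerm uint64, entries []entry) ([]entry, bool) {
	if prevIndex > 0 {
		if uint64(len(log)) < prevIndex || log[prevIndex-1].Term != prevTerm {
			// Mismatch: reject; the leader retries with a lower log index.
			return log, false
		}
	}
	// Truncate everything after prevIndex and append the leader's entries,
	// overwriting any inconsistencies.
	return append(log[:prevIndex], entries...), true
}

func main() {
	follower := []entry{{1, "x=3"}, {1, "y=4"}, {1, "x=9"}}
	// Leader sends the entry for index 3 with prev = <index 2, term 1>:
	// the stale term-1 entry at index 3 is overwritten.
	newLog, ok := appendEntries(follower, 2, 1, []entry{{2, "x=4"}})
	fmt.Println(ok, len(newLog), newLog[2].Term)
}
```

Because each RPC re-verifies the entry before the new ones, a successful append inductively guarantees the whole prefix matches: this is the Log Matching Property.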
35. Append Entries
[Diagram: leader log for indices 1-6 and a follower whose tail diverges
with stale term-1 entries. The first AppendEntries mismatches at the
preceding <term, index> and is rejected; the leader retries with a lower
log index until the check succeeds, and the follower's conflicting tail is
overwritten to match the leader.]
36. Safety: Leader Completeness
• Once a log entry is committed, all future leaders must store that entry.
• Servers with incomplete logs must not get elected.
o Candidates include the term and index of their last log entry in
RequestVote RPCs
o A voting node denies its vote if its own log is more up-to-date
o Logs are ranked by <last term, last index>
[Diagram: leader election for term 4.]
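The <last term, last index> ranking above is a two-line comparison. A minimal sketch (the function name is illustrative): a voter grants its vote only if the candidate's log ranks at least as high as its own.

```go
package main

import "fmt"

// candidateUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: compare last terms first, then last indices.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(candidateUpToDate(3, 5, 2, 9)) // higher last term wins
	fmt.Println(candidateUpToDate(3, 4, 3, 6)) // same term, shorter log loses
}
```

Combined with the majority vote rule, this guarantees Leader Completeness: any winning candidate's log ranks at least as high as a majority's, and every committed entry lives on a majority.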
38. Summary
● etcd is a strongly consistent, distributed, reliable key-
value store for the most critical data of a distributed
system
● Design and Architecture
○ Use gRPC to provide a simple and fast API
○ Use BoltDB as the storage backend
○ Use MVCC to provide concurrent read/write
○ Use Raft algorithm to achieve high availability
● Go and read more code 🧑💻 👩💻
41. Reference
● Etcd.io
● Raft (algorithm)
● https://raft.github.io/
● https://github.com/etcd-io/etcd
● Designing for Understandability: The Raft Consensus
Algorithm
● "Raft - The Understandable Distributed Protocol" by Ben
Johnson (2013)
● Getting Started with Kubernetes | etcd Performance
Optimization Practices