Jack Yu @ COSCUP 2023
I’m 小傑 (Jack Yu)
● Software Engineer @ Synology
● Why this talk?
○ Interested in the implementation of etcd 😆
Outline
● What is etcd?
● Design and Architecture
● Raft Algorithm
● Summary
● Q&A
What is etcd?
etcd is pronounced /ˈɛtsiːdiː/ and
means “distributed etc directory.”
Original image credited to xkcd.com/2347,
alterations by Josh Berkus.
What is etcd?
etcd is a strongly consistent, distributed key-value
store that provides a reliable way to store data that
needs to be accessed by a distributed system.
● Use Cases:
o Configuration management, Service discovery
o Distributed locks
o Kubernetes stores its configuration data in etcd for service discovery and cluster management.
History
2013-06 Initial Commit
● CoreOS contribution
2014-06 etcd V0.2
● Kubernetes V0.4
● 10x community
2015-02 First Stable Release of etcd V2.0
● Raft consistency protocol
● 1,000 writes/second
2017-01 etcd V3.1
● New APIs
● Fast linearized read
● gRPC proxy
2018-02 CNCF Incubation
● 30+ projects using etcd
● 400+ contribution groups
● 9 maintainers from 8 companies
2019-08 etcd V3.4
● Learner member
● Fully concurrent read
● Performance enhancement
2020-11 CNCF Graduation
● Security audit
● Jepsen testing
● Testing and Bug fix
2021-06 etcd V3.5
● Performance enhancement
● Reduce memory usage
● Zap logger
Features
● Simple interface: HTTP API, gRPC
● Fast: Benchmarked 10,000 writes/sec
● Reliable: distributed consensus based on Raft algorithm
● SSL certificate authentication, Role-based ACL
● Watch for changes
● Transaction, Lease
● Distributed lock, Leader election
Setup Cluster
CLI: etcdctl
● Round-robin across endpoints
● Retries against another member on error
● Cluster members forward most requests to the leader (except serializable reads)
Get Values by Key Prefix
Get Revisions
Watch Key
Lease (TTL)
Design and Architecture
Etcd Cluster
[Figure: a three-member cluster (Follower, Leader, Follower). Each member serves a Client and is drawn as a stack of gRPC Server, Raft, WAL and Snapshot, and MVCC Store.]
Etcd Components
[Figure: component diagram. Labels on the slide:]
● Client: clientv3/etcdctl
● API: gRPC Server, gRPC Gateway, HTTP API
● Raft: Leader Election, Log Replication, Memberships, Read Index
● Etcd Server: KVServer, Quota, Auth, Lease, Compactor, Applier, MVCC Store
● Storage: Tree Index, boltdb, Snapshot, WAL
● Metric
Read Flow
[Figure: the Etcd Components diagram annotated with read steps 1 to 4. Arrow labels: ReadIndex, Ready State.]
Write Flow
[Figure: the Etcd Components diagram annotated with write steps 1 to 6. Arrow labels: Propose, Ready, Persist log.]
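MVCC
Multi-version concurrency control
● Tree Index (btree): key -> revision
● backend, BoltDB (B+ Tree): revision -> (key, value)
[Figure: Put (key, value) flowing through the tree index into the backend. Labels: ReadTx and BatchTx, each with a Buffer; Merge; Write; a key index pointing at versions v1 v2 v3 v4 in the store.]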
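The two mappings named on the MVCC slide, key -> revision and revision -> (key, value), can be sketched in a few lines. This is a toy model and not etcd's code: the class name `ToyMVCCStore`, the plain dicts standing in for the btree index and BoltDB, and the single integer revision are all assumptions made for illustration.

```python
class ToyMVCCStore:
    """Toy MVCC store: a key index plus a revision-keyed backend (illustrative only)."""

    def __init__(self):
        self.revision = 0    # global, monotonically increasing revision counter
        self.key_index = {}  # key -> list of revisions that wrote it (stand-in for the btree)
        self.backend = {}    # revision -> (key, value) (stand-in for BoltDB)

    def put(self, key, value):
        """Write a new version of key and return the revision assigned to it."""
        self.revision += 1
        self.key_index.setdefault(key, []).append(self.revision)
        self.backend[self.revision] = (key, value)
        return self.revision

    def get(self, key, rev=None):
        """Return the value of key as of revision rev (latest if rev is None)."""
        at = self.revision if rev is None else rev
        # Newest revision of this key that is not newer than the requested one.
        candidates = [r for r in self.key_index.get(key, []) if r <= at]
        if not candidates:
            return None
        return self.backend[candidates[-1]][1]


store = ToyMVCCStore()
store.put("foo", "v1")  # revision 1
store.put("bar", "x")   # revision 2
store.put("foo", "v2")  # revision 3
print(store.get("foo"))         # latest version of foo
print(store.get("foo", rev=2))  # foo as it was at revision 2
```

Because old versions are never overwritten, a read at an older revision needs no lock against later writes; the price is that old revisions accumulate until something compacts them.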
Raft Algorithm
● Raft (2013) is a distributed consensus algorithm designed to be easy to understand, as an alternative to the Paxos (1998) algorithm.
● One leader, multiple followers.
● The system makes progress as long as a majority of servers are up.
● Failure model: delayed/lost messages, fail-stop (not Byzantine)
Designing for Understandability: The Raft Consensus Algorithm, Diego Ongaro
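The "majority of servers" condition is simple arithmetic. A minimal sketch, with the helper names `quorum` and `tolerated_failures` chosen here for illustration:

```python
def quorum(n):
    """Smallest number of servers that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many servers can be down while a majority is still up."""
    return n - quorum(n)


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

A 4-member cluster tolerates no more failures than a 3-member one, which is why odd cluster sizes are the usual choice.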
Raft Decomposition
● Leader election
○ Select one server to act as leader
○ Detect crashes, choose new leader
● Log replication (normal operation)
○ Leader accepts commands from clients, appends to its log
○ Leader replicates its log to other servers (overwrites
inconsistencies)
● Safety
○ Keep logs consistent
○ Only servers with up-to-date logs can become leader
Demo
● https://raft.github.io/
Raft Server States
● Follower: passive; receives log entries and heartbeats
● Candidate: issues RequestVote RPCs to get elected as leader in its term
  • retries on timeout or split vote
● Leader: issues AppendEntries RPCs
  • Replicates its log
  • Sends heartbeats to maintain leadership
Transitions:
● starts up -> Follower
● Follower -> Candidate: times out, starts election
● Candidate -> Leader: wins election
● Candidate or Leader -> Follower: discovers higher term
Raft Terms
• At most 1 leader per term
• Some terms have no leader (failed election, split vote)
• Each server maintains current term value
o Exchanged in every RPC
o Peer has higher term => Update term, revert to follower
o Incoming RPC has obsolete term => Reject, reply error
[Figure: timeline of Term 1 to Term 5. A term starts with an election, followed by normal operations; a term whose election ends in a split vote has no leader.]
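The term rules in the Raft Terms bullets can be sketched as a single check applied to every incoming RPC. The class `TermTracker` and its method name are hypothetical, chosen for this sketch only.

```python
class TermTracker:
    """Applies the term rules to incoming RPCs (illustrative sketch)."""

    def __init__(self, current_term=0, role="follower"):
        self.current_term = current_term
        self.role = role

    def observe_term(self, rpc_term):
        """Return True if the RPC may be processed, False if it must be rejected."""
        if rpc_term > self.current_term:
            # Peer has a higher term: update our term and revert to follower.
            self.current_term = rpc_term
            self.role = "follower"
            return True
        if rpc_term < self.current_term:
            # Incoming RPC carries an obsolete term: reject it.
            return False
        return True


server = TermTracker(current_term=3, role="leader")
print(server.observe_term(5), server.current_term, server.role)
print(server.observe_term(4), server.current_term, server.role)
```

The first call demotes a stale leader as soon as it hears from a newer term; the second shows a late message from an older term being rejected.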
Leader Election
1. Follower: on heartbeat timeout, become Candidate
2. Become Candidate: current term + 1, vote for self
3. Send RequestVote RPCs to other servers
4. Then one of:
• get votes from majority => Become leader, send heartbeats
• Receive heartbeat from leader => Become follower
• election timeout => become Candidate again for a new term
Election Correctness
• Safety: allow at most one winner per term
o Each server gives only one vote per term
o Majority required to win election
• Liveness: some candidate must eventually win
o Choose election timeouts randomly (e.g. 1000-2000ms)
o One server usually times out and wins the election before the others time out
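Both halves of the argument fit in a short sketch: one vote per term gives safety, and a randomized timeout gives liveness. The names `Voter`, `election_timeout_ms`, and `wins` are hypothetical, and the log comparison a voter also performs is deliberately left out here.

```python
import random


class Voter:
    """Grants at most one vote per term (log comparison omitted in this sketch)."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term, candidate_id):
        if term < self.current_term:
            return False  # obsolete term
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None  # new term, the vote is available again
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False  # already voted for someone else in this term


def election_timeout_ms(rng, low=1000, high=2000):
    """Pick a random election timeout so that servers rarely time out together."""
    return rng.uniform(low, high)


def wins(votes, cluster_size):
    """A candidate needs a strict majority of the cluster."""
    return votes > cluster_size // 2


voters = [Voter() for _ in range(5)]
votes_a = sum(v.request_vote(1, "A") for v in voters[:3])  # A reaches three voters first
votes_b = sum(v.request_vote(1, "B") for v in voters)      # B then asks everyone
print(votes_a, wins(votes_a, 5))
print(votes_b, wins(votes_b, 5))

rng = random.Random(0)
timeouts = [election_timeout_ms(rng) for _ in range(5)]
print(min(timeouts) < max(timeouts))
```

Two majorities of the same cluster always share at least one server, and that server votes only once per term, so two candidates cannot both win the same term.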
Log Replication
1. Client sends command to leader
2. Leader appends command to its log
3. Leader sends AppendEntries RPCs to all followers
4. A new entry becomes committed once the leader has received replies from a majority of servers.
5. Leader executes the command in its state machine and returns the result to the client
6. Leader notifies followers of committed entries in subsequent
AppendEntries RPCs
7. Followers execute committed commands in their state machines
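Step 4 above can be sketched as a small calculation on the leader. The function name `commit_index` and the list-based bookkeeping are assumptions for illustration; the sketch also leaves out Raft's extra condition that a leader only commits entries from its own term by counting replicas.

```python
def commit_index(match_index):
    """Highest log index stored on a majority of servers.

    match_index holds, for every server including the leader, the highest
    log index known to be replicated on that server.
    """
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is present on at least `majority` servers.
    return ranked[majority - 1]


print(commit_index([5, 3, 4]))        # three servers
print(commit_index([5, 5, 4, 2, 1]))  # five servers
```

Slow or crashed followers do not hold the commit point back, as long as a majority keeps up.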
Log Structure
[Figure: the logs of five servers. Each cell is term:command. The slide labels one log as the leader for term 3 and the others as followers, and marks the committed entries.]
index     1      2      3      4      5      6      7      8      9      10
server 1  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  3:y=1  3:y=4  3:z=1  3:x=1  3:z=6
server 2  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  3:y=1  3:y=4
server 3  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  3:y=1  3:y=4  3:z=1  3:x=1  3:z=6
server 4  1:x=3  1:y=4
server 5  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  3:y=1  3:y=4  3:z=1
If a given entry is committed, all preceding entries are also committed.
• same term and index => same command
Log Inconsistency
Raft minimizes special code for repairing inconsistencies
• Leader assumes its log is correct
• Normal operation will repair all inconsistencies
[Figure: the logs of five servers. Each cell is term:command. The slide labels one log as the leader for term 4 and the others as followers.]
index     1      2      3      4      5      6      7      8      9      10
server 1  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  3:y=1  3:y=4  3:z=1
server 2  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  3:y=1  3:y=4
server 3  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  3:y=1  3:y=4  3:z=1  3:x=1
server 4  1:x=3  1:y=4
server 5  1:x=3  1:y=4  1:z=6  2:x=4  2:z=5  2:y=6  2:y=5  2:z=9  2:x=4
Append Entries
• AppendEntries RPCs include <term, index> of entry preceding
new one(s)
• Follower must contain matching entry, otherwise it rejects request
o Leader retries with lower log index
• Implements an induction step, ensures Log Matching Property
[Figure: each cell is term:command]
index             1      2      3      4      5
leader:           1:x=3  1:y=4  1:z=6  2:x=4  3:z=5
follower before:  1:x=3  1:y=4  1:z=6  2:x=4
follower after:   1:x=3  1:y=4  1:z=6  2:x=4  3:z=5
Append Entries
[Figure: each cell is term:command]
index             1      2      3      4      5      6
leader:           1:x=3  1:y=4  1:z=6  2:x=4  3:z=5
follower before:  1:x=3  1:y=4  1:z=6  1:x=3  1:x=9  1:z=5
follower after:   1:x=3  1:y=4  1:z=6  2:x=4  3:z=5
• mismatch -> reject: the follower's entry at index 4 has term 1, the leader's has term 2
• success: the leader retries with a lower log index, the preceding entry matches, and the follower's conflicting entries are replaced
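The consistency check from the Append Entries bullets, and the leader's retry with a lower log index, can be sketched as follows. The function names and the list-of-tuples log are assumptions for illustration; the example logs are the ones shown on this slide.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower side. log is a list of (term, command); indexes are 1-based.

    Returns True on success, False when the preceding entry does not match.
    """
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False  # no matching <term, index>: reject
    for offset, entry in enumerate(entries):
        i = prev_index + offset  # 0-based position of this entry
        if i < len(log) and log[i][0] != entry[0]:
            del log[i:]  # conflict: drop this entry and everything after it
        if i >= len(log):
            log.append(entry)
    return True


def replicate(leader_log, follower_log):
    """Leader side: lower the index until the follower accepts. Returns the reject count."""
    next_index = len(leader_log)  # 1-based index of the first entry to send
    rejects = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term, leader_log[prev_index:]):
            return rejects
        next_index -= 1
        rejects += 1


leader = [(1, "x=3"), (1, "y=4"), (1, "z=6"), (2, "x=4"), (3, "z=5")]
follower = [(1, "x=3"), (1, "y=4"), (1, "z=6"), (1, "x=3"), (1, "x=9"), (1, "z=5")]
rejects = replicate(leader, follower)
print(rejects, follower == leader)
```

The loop always terminates: once the index reaches the start of the log there is no preceding entry left to mismatch, so the follower accepts and ends up with the leader's log.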
Safety: Leader Completeness
• Once a log entry is committed, all future leaders must store that entry.
• Servers with incomplete logs must not get elected.
o Candidates include term and index of last log entry in RequestVote RPCs
o Voting node denies vote if its log is more up-to-date
o Logs ranked by <last term, last index>
[Figure: leader election for term 4]
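Ranking logs by <last term, last index> maps directly onto tuple comparison. A minimal sketch, with the function names `last_entry` and `grants_vote` invented for illustration:

```python
def last_entry(log):
    """Return (last term, last index) for a log of (term, command) entries."""
    if not log:
        return (0, 0)
    return (log[-1][0], len(log))


def grants_vote(voter_log, candidate_last_term, candidate_last_index):
    """The voter denies its vote if its own log is more up-to-date than the candidate's."""
    # Tuples compare by last term first, then by last index.
    return last_entry(voter_log) <= (candidate_last_term, candidate_last_index)


# A voter whose log ends at index 7 in term 3.
voter_log = [(1, "x=3"), (1, "y=4"), (1, "z=6"), (2, "x=4"),
             (2, "z=5"), (3, "y=1"), (3, "y=4")]
print(grants_vote(voter_log, 2, 9))  # longer log, but older last term
print(grants_vote(voter_log, 3, 7))  # equally up-to-date
print(grants_vote(voter_log, 3, 8))  # same last term, longer log
```

A longer log does not win on length alone: a candidate whose last entry comes from an older term is refused even if it holds more entries.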
Summary
● etcd is a strongly consistent, distributed, reliable key-value store for the most critical data of a distributed system
● Design and Architecture
○ Use gRPC to provide a simple and fast API
○ Use BoltDB as the storage backend
○ Use MVCC to provide concurrent read/write
○ Use the Raft algorithm to achieve high availability
● Go and read more code 🧑💻 👩💻
Q&A
● Any questions?
Thank You!
Reference
● Etcd.io
● Raft (algorithm)
● https://raft.github.io/
● https://github.com/etcd-io/etcd
● Designing for Understandability: The Raft Consensus Algorithm
● "Raft - The Understandable Distributed Protocol" by Ben Johnson (2013)
● Getting Started with Kubernetes | etcd Performance Optimization Practices

Unveiling etcd: Architecture and Source Code Deep Dive

  • 1. Jack Yu @ COSCUP 2023
  • 2. I’m 小傑 (Jack Yu) ● Software Engineer @ Synology ● Why this talk? ○ Interested in the implementation of etcd 😆
  • 3. Outline ● What is etcd? ● Design and Architecture ● Raft Algorithm ● Summary ● Q&A
  • 5. etcd is pronounced /ˈɛtsiːdiː/ and means “distributed etc directory.”
  • 6. Original image credited to xkcd.com/2347, alterations by Josh Berkus.
  • 7. What is etcd? etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system. ● Use Cases: o Configuration management, Service discovery o Distributed locks o Kubernetes stores its configuration data in etcd for service discovery and cluster management.
  • 8. History
      2013-06 Initial Commit ● CoreOS contribution
      2014-06 etcd V0.2 ● Kubernetes V0.4 ● 10x community
      2015-02 First Stable Release of etcd V2.0 ● Raft consistency protocol ● 1,000 writes/second
      2017-01 etcd V3.1 ● New APIs ● Fast linearized read ● gRPC proxy
      2018-02 CNCF Incubation ● 30+ projects using etcd ● 400+ contribution groups ● 9 maintainers from 8 companies
      2019-08 etcd V3.4 ● Learner member ● Fully concurrent read ● Performance enhancement
      2020-11 CNCF Graduation ● Security audit ● Jepsen testing ● Testing and Bug fix
      2021-06 etcd V3.5 ● Performance enhancement ● Reduce memory usage ● Zap logger
  • 9. Features ● Simple interface: HTTP API, gRPC ● Fast: Benchmarked 10,000 writes/sec ● Reliable: distributed consensus based on Raft algorithm ● SSL certificate authentication, Role-based ACL ● Watch for changes ● Transaction, Lease ● Distributed lock, Leader election
  • 11. CLI: etcdctl ● Round-robin across cluster members ● Retries on another member when a request fails ● Cluster members forward most requests to the leader (except serializable reads)
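The round-robin and retry behaviour on this slide can be illustrated with a small sketch. This is not etcd's client balancer: the class name `RoundRobinClient`, the `send` callback, and the use of `ConnectionError` to signal a failed member are all assumptions made for the example.

```python
from typing import Callable, List


class RoundRobinClient:
    """Hypothetical client that rotates over endpoints and fails over on error."""

    def __init__(self, endpoints: List[str], send: Callable[[str, str], str]) -> None:
        self.endpoints = endpoints
        self.send = send      # send(endpoint, payload) -> response, may raise ConnectionError
        self._next = 0        # index of the endpoint to try first on the next request

    def request(self, payload: str) -> str:
        last_error = None
        # Try each endpoint at most once, starting from the round-robin position.
        for attempt in range(len(self.endpoints)):
            position = (self._next + attempt) % len(self.endpoints)
            try:
                result = self.send(self.endpoints[position], payload)
            except ConnectionError as err:
                last_error = err   # remember the failure and retry on the next member
                continue
            self._next = (position + 1) % len(self.endpoints)
            return result
        raise ConnectionError("all endpoints failed") from last_error


def fake_send(endpoint: str, payload: str) -> str:
    # Stand-in transport for the example: member "a" is down.
    if endpoint == "a":
        raise ConnectionError("member a unreachable")
    return endpoint + ":" + payload


client = RoundRobinClient(["a", "b", "c"], fake_send)
responses = [client.request("get foo") for _ in range(3)]
```

With member `a` down, the first request fails over to `b`, the second goes to `c`, and the third wraps around, skips `a` again, and lands on `b`.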
  • 12. Get Values by Key Prefix
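A prefix query over a sorted key space can be expressed as a range scan over the half-open interval [prefix, range_end), where range_end is the prefix with its last incrementable byte bumped by one. The sketch below shows that idea on an in-memory dict; the function names and the use of `None` for "no upper bound" are assumptions of this example rather than etcd's API.

```python
from typing import Dict, List, Optional, Tuple


def prefix_range_end(prefix: bytes) -> Optional[bytes]:
    """Return the smallest key greater than every key that starts with `prefix`."""
    end = bytearray(prefix)
    # Walk backwards past trailing 0xff bytes, which cannot be incremented.
    for i in range(len(end) - 1, -1, -1):
        if end[i] < 0xFF:
            end[i] += 1
            return bytes(end[: i + 1])
    # Empty prefix or all 0xff bytes: the range has no finite upper bound.
    return None


def get_prefix(store: Dict[bytes, bytes], prefix: bytes) -> List[Tuple[bytes, bytes]]:
    """Scan the keys in [prefix, range_end) in sorted order."""
    end = prefix_range_end(prefix)
    return [
        (key, value)
        for key, value in sorted(store.items())
        if key >= prefix and (end is None or key < end)
    ]


store = {b"foo1": b"a", b"foo2": b"b", b"fop": b"c", b"bar": b"d"}
matches = get_prefix(store, b"foo")
```

Here `prefix_range_end(b"foo")` is `b"fop"`, so the scan returns `foo1` and `foo2` but not `fop` or `bar`.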
  • 17. Etcd Cluster (diagram): three members, Follower, Leader, Follower; each member serves a Client and consists of a gRPC Server, Raft, WAL, Snapshot, and MVCC Store
  • 18. Etcd Components (diagram labels): clientv3/etcdctl, Client API, gRPC Server, gRPC Gateway, HTTP API, Etcd Server, KVServer, Quota, Auth, Lease, Compactor, Applier, Metric, Raft, Leader Election, Log Replication, Memberships, Read Index, Storage, MVCC Store, Tree Index, boltdb, WAL, Snapshot
  • 19. Read Flow (diagram over the components of slide 18; steps numbered 1 to 4; edge labels: ReadIndex, Ready State)
  • 20. Write Flow (diagram over the components of slide 18; steps numbered 1 to 6; edge labels: Propose, Ready, Persist log)
  • 21. MVCC (Multi-version concurrency control) ● Put (key, value) ● Index: Tree Index (btree), maps key -> revision; one key holds several versions (v1 v2 v3 v4) ● Store: backend BoltDB (B+ Tree), maps revision -> (key, value) ● Backend transactions: ReadTx and BatchTx, each with a Buffer (diagram labels: Merge, Write)
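The two mappings on the MVCC slide (key -> revision in the index, revision -> (key, value) in the backend) can be sketched with two dictionaries. This is a simplified illustration, not etcd's implementation: the class `MVCCStore` is hypothetical, revisions are plain integers, and buffering, deletion, and compaction are left out.

```python
from typing import Dict, List, Optional, Tuple


class MVCCStore:
    """Hypothetical in-memory model of an index plus a revision-keyed backend."""

    def __init__(self) -> None:
        self.current_rev = 0
        self.index: Dict[str, List[int]] = {}           # key -> revisions, ascending
        self.backend: Dict[int, Tuple[str, str]] = {}   # revision -> (key, value)

    def put(self, key: str, value: str) -> int:
        # Every write gets a new store-wide revision; old versions are kept.
        self.current_rev += 1
        self.index.setdefault(key, []).append(self.current_rev)
        self.backend[self.current_rev] = (key, value)
        return self.current_rev

    def get(self, key: str, at_rev: Optional[int] = None) -> Optional[str]:
        # Step 1: the index resolves the key to its newest revision <= at_rev.
        revisions = [r for r in self.index.get(key, []) if at_rev is None or r <= at_rev]
        if not revisions:
            return None
        # Step 2: the backend resolves that revision to the stored value.
        return self.backend[revisions[-1]][1]


store = MVCCStore()
store.put("k", "v1")      # revision 1
store.put("other", "x")   # revision 2
store.put("k", "v2")      # revision 3
latest = store.get("k")
historical = store.get("k", at_rev=2)
```

Because old revisions stay in the backend, a read at revision 2 still sees `v1` while the latest read sees `v2`, which is what lets readers proceed without blocking writers.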
  • 23. Raft Algorithm ● Raft (2013) is a distributed consensus algorithm designed to be easy to understand, as an alternative to the Paxos (1998) algorithm. ● One leader, multiple followers. ● The system makes progress as long as a majority of servers are up. ● Failure model: delayed/lost messages and fail-stop, but not Byzantine failures
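The majority requirement translates into simple arithmetic on the cluster size. The helper names below are chosen for this sketch only.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


def fault_tolerance(cluster_size: int) -> int:
    """How many servers may fail while a majority is still available."""
    return cluster_size - quorum(cluster_size)


def can_make_progress(cluster_size: int, servers_up: int) -> bool:
    return servers_up >= quorum(cluster_size)


sizes = {n: (quorum(n), fault_tolerance(n)) for n in (3, 4, 5)}
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why odd cluster sizes are the usual choice.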
  • 24. Designing for Understandability: The Raft Consensus Algorithm, Diego Ongaro
  • 25. Raft Decomposition ● Leader election ○ Select one server to act as leader ○ Detect crashes, choose new leader ● Log replication (normal operation) ○ Leader accepts commands from clients, appends to its log ○ Leader replicates its log to other servers (overwrites inconsistencies) ● Safety ○ Keep logs consistent ○ Only servers with up-to-date logs can become leader
  • 27. Raft Server States ● Follower: passive; receives log entries and heartbeats ● Candidate: issues RequestVote RPCs to get elected as leader in a term; retries on timeout or split vote ● Leader: issues AppendEntries RPCs to replicate its log and to send heartbeats that maintain leadership ● Transitions: starts up -> Follower; times out, start election -> Candidate; win election -> Leader; discover higher term -> Follower
  • 28. Raft Terms • At most 1 leader per term • Some terms have no leader (failed election, split vote) • Each server maintains current term value o Exchanged in every RPC o Peer has higher term => Update term, revert to follower o Incoming RPC has obsolete term => Reject, reply error (Diagram: timeline of Term 1 to Term 5; each term starts with an election followed by normal operations, and a split vote yields a term with no leader.)
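The two term rules on this slide (a higher term from a peer forces a step-down, an obsolete term is rejected) fit in a few lines. The `Server` class and its `on_rpc` method are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Server:
    current_term: int = 0
    role: str = "follower"

    def on_rpc(self, rpc_term: int) -> bool:
        """Apply the term rules; return False if the RPC must be rejected."""
        if rpc_term < self.current_term:
            # Incoming RPC has an obsolete term: reject it.
            return False
        if rpc_term > self.current_term:
            # Peer has a higher term: adopt it and revert to follower.
            self.current_term = rpc_term
            self.role = "follower"
        return True


server = Server(current_term=3, role="leader")
stale_accepted = server.on_rpc(2)   # obsolete term, leader state is unchanged
newer_accepted = server.on_rpc(5)   # higher term, leader steps down
```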
  • 29. Leader Election ● Follower: on heartbeat timeout, become Candidate ● Become Candidate: current term + 1, vote for self ● Send RequestVote RPCs to other servers ● Get votes from majority -> become leader, send heartbeats ● Receive heartbeat from leader -> become follower ● Election timeout -> become Candidate again
  • 30. Election Correctness • Safety: allow at most one winner per term o Each server gives only one vote per term o A majority is required to win an election • Liveness: some candidate must eventually win o Choose election timeouts randomly (e.g. 1000-2000ms) o One server usually times out and wins the election before the others time out
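Both properties can be sketched briefly: one vote per server per term gives safety, and a randomized timeout makes it likely that a single candidate starts first. The `Voter` class and helper names are assumptions of this example, and the log up-to-date check from the Leader Completeness slide is deliberately omitted here.

```python
import random
from dataclasses import dataclass
from typing import Optional


def election_timeout_ms(rng: random.Random, low: float = 1000, high: float = 2000) -> float:
    """Pick a random timeout so that servers rarely time out together."""
    return rng.uniform(low, high)


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False                 # stale candidate
        if term > self.current_term:
            self.current_term = term     # new term: the vote becomes available again
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False                     # already voted for someone else this term


def wins(votes: int, cluster_size: int) -> bool:
    return votes > cluster_size // 2


voter = Voter()
first = voter.request_vote(1, "a")    # granted
second = voter.request_vote(1, "b")   # denied: one vote per term
third = voter.request_vote(2, "b")    # granted: new term
timeout = election_timeout_ms(random.Random(0))
```

Two candidates cannot both collect a majority in the same term, because any two majorities share at least one voter and that voter grants only one vote.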
  • 31. Log Replication 1. Client sends a command to the leader 2. Leader appends the command to its log 3. Leader sends AppendEntries RPCs to all followers 4. A new entry becomes committed once the leader has received replies from a majority of servers 5. Leader executes the command in its state machine and returns the result to the client 6. Leader notifies followers of committed entries in subsequent AppendEntries RPCs 7. Followers execute committed commands in their state machines
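Step 4 can be sketched as a computation over the highest log index each server is known to store. The function name and the `match_index` list are assumptions of this example; the list includes the leader's own last index. Raft's additional rule that a leader only commits entries from its current term by counting replicas is not modelled.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index that is stored on a majority of servers."""
    ranked = sorted(match_index, reverse=True)
    # The element at position n // 2 is held by at least n // 2 + 1 servers.
    return ranked[len(ranked) // 2]


five_servers = commit_index([10, 7, 10, 2, 8])
three_servers = commit_index([5, 5, 3])
lonely_leader = commit_index([4, 0, 0])
```

In the five-server case indexes up to 8 are stored on three servers, so 8 is committed even though one follower is still at index 2. A leader whose followers have acknowledged nothing cannot commit anything.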
  • 32. Log Structure • If a given entry is committed, all preceding entries are also committed. • same term and index => same command (Diagram: five server logs, the leader for term 3 and its followers, over log indexes 1 to 10, with the committed entries marked. Written as <term, command>, the longest log is <1, x = 3> <1, y = 4> <1, z = 6> <2, x = 4> <2, z = 5> <3, y = 1> <3, y = 4> <3, z = 1> <3, x = 1> <3, z = 6>; the other logs are prefixes of it with 7, 10, 2, and 8 entries.)
  • 33. Log Inconsistency Raft minimizes special code for repairing inconsistencies • Leader assumes its log is correct • Normal operation will repair all inconsistencies (Diagram: the leader for term 4 and its followers over log indexes 1 to 10. Logs may be missing entries or hold extra entries that differ: four logs share prefixes of <1, x = 3> <1, y = 4> <1, z = 6> <2, x = 4> <2, z = 5> <3, y = 1> <3, y = 4> <3, z = 1> <3, x = 1> with 8, 7, 9, and 2 entries, while one log continues after <2, z = 5> with <2, y = 6> <2, y = 5> <2, z = 9> <2, x = 4>.)
  • 34. Append Entries • AppendEntries RPCs include <term, index> of the entry preceding the new one(s) • Follower must contain a matching entry, otherwise it rejects the request o Leader retries with a lower log index • Implements an induction step, ensures the Log Matching Property (Diagram, log indexes 1 to 5, written as <term, command>. leader: <1, x = 3> <1, y = 4> <1, z = 6> <2, x = 4> <3, z = 5>; follower before: <1, x = 3> <1, y = 4> <1, z = 6> <2, x = 4>; follower after: <1, x = 3> <1, y = 4> <1, z = 6> <2, x = 4> <3, z = 5>)
  • 35. Append Entries (Diagram, log indexes 1 to 6, written as <term, command>. leader: <1, x = 3> <1, y = 4> <1, z = 6> <2, x = 4> <3, z = 5>; follower before: <1, x = 3> <1, y = 4> <1, z = 6> <1, x = 3> <1, x = 9> <1, z = 5>; mismatch -> reject; after success, follower after: <1, x = 3> <1, y = 4> <1, z = 6> <2, x = 4> <3, z = 5>)
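The consistency check and the leader's retry with a lower index can be sketched over plain Python lists. The function names, the `(term, command)` tuple layout, and the 1-based indexing convention are assumptions of this example; a real implementation sends RPCs and tracks a next index per follower.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side. Indexes are 1-based; prev_index 0 means 'start of log'."""
    if prev_index > len(log):
        return False                      # follower is missing the preceding entry
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False                      # preceding entry has a different term
    for offset, entry in enumerate(entries):
        slot = prev_index + offset        # 0-based position of this entry
        if slot < len(log) and log[slot][0] != entry[0]:
            del log[slot:]                # drop the conflicting suffix
        if slot >= len(log):
            log.append(entry)
    return True


def replicate(leader_log: List[Entry], follower_log: List[Entry]) -> None:
    """Leader side: walk the next index backwards until the follower accepts."""
    next_index = len(leader_log) + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term, leader_log[prev_index:]):
            return
        next_index -= 1                   # rejected: retry with a lower log index


leader = [(1, "x=3"), (1, "y=4"), (1, "z=6"), (2, "x=4"), (3, "z=5")]
follower = [(1, "x=3"), (1, "y=4"), (2, "x=9")]
replicate(leader, follower)
```

The follower rejects the attempts at indexes 5, 4, and 3, accepts once the preceding entry is index 2, drops its conflicting entry, and ends up with the leader's log. The loop always terminates because a request with `prev_index` 0 is accepted unconditionally.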
  • 36. Safety: Leader Completeness • Once a log entry is committed, all future leaders must store that entry. • Servers with incomplete logs must not get elected. o Candidates include the term and index of their last log entry in RequestVote RPCs o A voting node denies its vote if its own log is more up-to-date o Logs are ranked by <last term, last index> (Diagram: leader election for term 4.)
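Ranking logs by <last term, last index> is a lexicographic comparison, which maps directly onto tuple comparison. The function name and parameter names are chosen for this sketch.

```python
def candidate_is_up_to_date(cand_last_term: int, cand_last_index: int,
                            my_last_term: int, my_last_index: int) -> bool:
    """A voter grants its vote only if its own log is not more up-to-date."""
    # Tuples compare by last term first, then by last index.
    return (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)


higher_term_wins = candidate_is_up_to_date(3, 5, 2, 9)
longer_log_loses = candidate_is_up_to_date(2, 9, 3, 5)
```

A candidate whose last entry is from term 3 beats a voter whose log is longer but ends in term 2, because the last term is compared before the length.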
  • 38. Summary ● etcd is a strongly consistent, distributed, reliable key-value store for the most critical data of a distributed system ● Design and Architecture ○ Use gRPC to provide simple and fast API ○ Use BoltDB as storage backend ○ Use MVCC to provide concurrent read/write ○ Use Raft algorithm to achieve high availability ● Go and read more code 🧑💻 👩💻
  • 41. Reference ● Etcd.io ● Raft (algorithm) ● https://raft.github.io/ ● https://github.com/etcd-io/etcd ● Designing for Understandability: The Raft Consensus Algorithm ● "Raft - The Understandable Distributed Protocol" by Ben Johnson (2013) ● Getting Started with Kubernetes | etcd Performance Optimization Practices