Demystifying etcd failure scenarios for Kubernetes
By William Caban
@williamcaban
etcd 101
Kubernetes Control-Plane & etcd
[Figure: Kubernetes Architectures. Four layouts labeled A to D, built from S nodes (K8s Control Plane, Supervisor role) and W (worker) nodes: Multi Node Cluster, Compact Cluster, All-in-One K8s, and a second Multi Node Cluster. Components shown: kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager, container runtime, kubelet.]
Etcd Redundancy vs Performance
Members   Failure Tolerance   Required Active Quorum Size
5         2                   3
3         1                   2
2         0                   2
1         0                   1
Adding members raises redundancy and the required active quorum size, but lowers write performance.
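The quorum and failure-tolerance figures above follow from majority arithmetic. A minimal Python sketch, assuming the usual Raft majority rule (quorum = floor(n/2) + 1); the function names are illustrative and not part of etcd:

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority stays reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Print members, quorum size, and failure tolerance for common sizes.
    for n in (1, 2, 3, 5):
        print(n, quorum_size(n), failure_tolerance(n))
```

Note that going from 1 to 2 members raises the quorum without adding any failure tolerance, which is why odd cluster sizes are the common choice.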
The life of a write on etcd
1. No leader.
2. The election and vote.
3. The Leader coordinates the writes.
4. For “Set Foo=bar”, the Leader writes the entry into its log.
5. The Leader replicates “Foo=bar” to the follower nodes.
6. The Leader waits for a majority to write the entry before committing it.
7. The Leader notifies followers that the entry is committed.
8. The Leader sends regular role notifications (heartbeats) to followers.
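Steps 4 to 6 can be sketched as a toy model. This is a simplified illustration under stated assumptions, not etcd's implementation: `Member` and `leader_write` are hypothetical names, replication is a plain function call, and terms, retries, and log repair are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    reachable: bool = True
    log: list = field(default_factory=list)

    def append(self, entry: str) -> bool:
        """Write the entry to the local log; True stands for the acknowledgement."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


def leader_write(leader: Member, followers: list, entry: str) -> bool:
    """Return True if the entry is committed, i.e. written by a majority."""
    cluster_size = 1 + len(followers)
    quorum = cluster_size // 2 + 1
    acks = 1 if leader.append(entry) else 0   # step 4: leader writes its own log
    for follower in followers:                 # step 5: replicate to followers
        if follower.append(entry):
            acks += 1
    return acks >= quorum                      # step 6: commit needs a majority
```

With three members, the write still commits when one follower is unreachable, and does not commit when both are.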
Writing to etcd via a Leader
[Figure: the etcd client sends (write “foo=bar”) to C (Leader); A and B are Followers.]
● Leader to client: “Wait while I work…”
● Leader: write to my Raft log, send to Followers
● Each Follower: write to my Raft log, send acknowledgement
● Leader: wait for ack, then ack to client
● Leader: send acknowledgement to client and close session
Writing to etcd via a Follower
[Figure: the etcd client sends (write “foo=bar”) to a Follower; A and B are Followers, C is the Leader.]
● Follower: “I’m not the leader. Let me forward that to “C”.”
● Follower to Leader: (proxied write requests)
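The forwarding behaviour can be sketched in a few lines. This is a toy model with hypothetical names (`Node`, `write`); it only shows that a write arriving at a follower takes one extra hop to reach the leader, and it ignores replication entirely.

```python
class Node:
    """Toy cluster member; `leader` is None when this node is the leader."""

    def __init__(self, name):
        self.name = name
        self.leader = None
        self.store = {}

    def write(self, key, value, path=None):
        """Handle a write and return the list of node names that touched it."""
        path = [] if path is None else path
        path.append(self.name)
        if self.leader is not None:
            # "I'm not the leader. Let me forward that to the leader."
            return self.leader.write(key, value, path)
        self.store[key] = value  # only the leader applies the write
        return path
```

A client that writes through follower A sees the path `["A", "C"]`, while a client that writes directly to leader C sees `["C"]`, which is the extra client-to-leader latency the slide points at.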
Myths & Realities
Why the Critical ETCD Timers?
● Critical etcd timer settings:
  ○ HEARTBEAT_INTERVAL (100ms)
    ■ Frequency with which the Leader notifies Followers that it is still the Leader.
  ○ ELECTION_TIMEOUT (1000ms)
    ■ How long a Follower waits without hearing a heartbeat before attempting to become Leader itself.
Best Practices
Heartbeat Interval
❏ < max(RTT) between members
❏ Too low increases CPU and network usage
❏ Too high leads to a high election timeout, which makes the cluster slower to detect and recover from failures
Election Timeout
❏ 10 times the HEARTBEAT_INTERVAL
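The 10x rule of thumb can be turned into a small helper. This is a sketch, assuming the etcd command-line flags `--heartbeat-interval` and `--election-timeout` (both in milliseconds); the function name and the `factor` parameter are illustrative.

```python
def etcd_timer_flags(heartbeat_ms: int, factor: int = 10) -> list:
    """Build etcd timer flags with election timeout = factor * heartbeat."""
    if heartbeat_ms <= 0:
        raise ValueError("heartbeat interval must be positive")
    election_ms = heartbeat_ms * factor
    return [
        f"--heartbeat-interval={heartbeat_ms}",
        f"--election-timeout={election_ms}",
    ]


if __name__ == "__main__":
    # The defaults quoted on the slide: 100ms heartbeat, 1000ms election timeout.
    print(" ".join(etcd_timer_flags(100)))
```

Deriving the election timeout from the heartbeat keeps the two values in step when the heartbeat is raised for a higher-latency network.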
Why the Hardware Specifications?
             CPU             RAM            DISK
MINIMUM      2 to 4 cores    8 GB           < 30ms latency
PRODUCTION   8 to 16 cores   16GB to 64GB   < 10ms latency
Introducing the Magic Latency Formula for ETCD latency profiles…
Effective Latency = Disk Latency + Max(Jitter(Disk Latency)) + Network RTT + Max(Network Jitter)
Note: To maintain etcd stability at scale, the Effective Latency must be well below the Election Timeout.
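The formula is simple enough to check numerically. The sketch below is illustrative only: the sample numbers are made up, and `safety_fraction` (how far below the election timeout counts as "well below") is an assumption, since the slide gives no threshold.

```python
def effective_latency_ms(disk_ms, disk_jitter_ms, rtt_ms, net_jitter_ms):
    """Disk latency + max disk jitter + network RTT + max network jitter."""
    return disk_ms + max(disk_jitter_ms) + rtt_ms + max(net_jitter_ms)


def within_budget(latency_ms, election_timeout_ms=1000, safety_fraction=0.1):
    """True if latency is at most safety_fraction of the election timeout.

    safety_fraction is a hypothetical margin, not a value from the slides.
    """
    return latency_ms <= election_timeout_ms * safety_fraction


if __name__ == "__main__":
    # Made-up measurements in milliseconds.
    latency = effective_latency_ms(10, [2, 5], 3, [1, 4])
    print(latency, within_budget(latency))
```

Using the maximum observed jitter rather than the average reflects the worst case, which is what decides whether followers miss heartbeats.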
Myth Collection 1
Myth: We can use a stretched control plane for Kubernetes:
● without impact on performance
● for high availability
● as a highly available Kubernetes design
What happens with failures?
❏ High Network Latency
❏ High Disk Latency
❏ Client to Leader Latency
❏ Cross-site Disconnection
❏ Kube-apiserver transaction rate?
❏ Memory utilization due to etcd fragmentation?
Myth Collection 2
Myth: We can use backups of etcd to:
● Restore Kubernetes in case of disaster recovery
● Rollback Kubernetes
● Recover the applications running in the cluster
What happens with failures?
❏ Cluster identity?
❏ Certificates?
❏ ETCD peer certificates?
❏ ETCD identity?
❏ Persistent storage?
❏ API Schema Version?
[Figure: Kubernetes Application Stack (Pods, Manifests, Storage mappings, etc.), shown as layers: Manifest and other K8s objects, Container image, PersistentVolumeClaim, PersistentVolume, CSI-enabled storage backend. The slide sets this stack against the etcd backup (“VS.”).]
ETCD Failure Modes
https://etcd.io/docs/v3.5/op-guide/failures/
Leader failure
Follower failure
Majority failure
Network Partition
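For the network partition case, what matters is which side, if any, still holds a majority. A minimal sketch under the Raft majority rule; `sides_with_quorum` is an illustrative name, and members that are down altogether are simply left out of `partition_sizes`.

```python
def sides_with_quorum(cluster_size: int, partition_sizes: list) -> list:
    """Return, per partition side, whether it still holds a majority."""
    if sum(partition_sizes) > cluster_size:
        raise ValueError("partition sides exceed the cluster size")
    quorum = cluster_size // 2 + 1
    return [size >= quorum for size in partition_sizes]


if __name__ == "__main__":
    print(sides_with_quorum(5, [3, 2]))  # one side keeps a majority
    print(sides_with_quorum(4, [2, 2]))  # an even split leaves no majority
```

At most one side can hold a majority, so a partitioned cluster never ends up with two sides that both accept writes.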
What to Remember about etcd?
Enjoy the rest of the event!
Image by https://www.opsramp.com/guides/why-kubernetes/who-made-kubernetes/

[KCD GT 2023] Demystifying etcd failure scenarios for Kubernetes.pdf
