SlideShare a Scribd company logo
1 of 53
Download to read offline
Presentation title
ANDREY FALKO - APRIL 2019
Bulletproof Apache
Kafka®
with Fault
Tree Analysis
Agenda
The Challenge: Quantify Kafka Reliability
Introduction to Fault Tree Analysis
Kafka Fault Trees
Availability
Data Durability
Conclusion
The Challenge: Quantify Kafka Reliability
Kafka is a “reliability tool”
Move data without lossiness
High stakes usage
What are we trying to do?
Observability Data
Event Streaming
DB
Change Data Capture
DB
DB
DB DB
Why Quantify?
Determine probability of success
Find opportunities to trim cost
Defining SLOs
Need to define Service Level Objectives
Availability
Durability
Latency
Quantifying SLOs
Availability
What is the probability that writes or reads fail?
How long do we tolerate downtime?
Quantifying SLOs
Durability
What is the probability that we’ll lose data?
How much will we lose?
Quantifying SLOs
Latency
How long are transactions allowed to take?
Introduction to Fault Tree Analysis
What is Fault Tree Analysis?
Deductive Failure Analysis
Invented in 1962 for Minuteman I ICBM Launch Control System
Industry wide adoption
Aerospace
Military
Petrochemical
Et al.
Fault Tree Analysis: Event Symbols
Basic Intermediate
Transfer
Fault Tree Analysis: Gate Symbols
ANDOR
Fault Tree Analysis: OR Example
Fault Tree Analysis: OR Example
4% probability of failure
Fault Tree Analysis: OR Example
4% probability of failure
annualized
Fault Tree Analysis: OR Example
p(A “or” B) =
p(A) + p(B) - p(A) * p(B)
Almost always small
Fault Tree Analysis: OR Example
12% chance of failure
annualized
Fault Tree Analysis: AND Example
Fault Tree Analysis: AND Example
p(A and B) = p(A) * p(B)
Fault Tree Analysis: AND Example
99.9936% chance of
success annualized
Fault Tree Analysis: AND Example
99.9936% chance of success
annualized if we don’t
remediate
Fault Tree Analysis: AND Example
99.999999996% chance of
success if we remediate within
3 days
Fault Tree Analysis: AND Example
99.99999% chance of
success if we remediate
within 3 days
Fault Tree Analysis: AND Example
p(A and B) =
p(A) * p(B) =
(1 - e-p(A)*t
) * (1 - e-p(B)*t
)
Where t = time to remediate
If p(A) and p(B) < .1, approximate to
p(A)*p(B)*t2
Kafka Fault Trees
Can we write or read to a Kafka cluster?
Service Level Objective (SLO):
99.99% success rate per year
Availability
4% 2% 1% 1%
?
Availability
Availability
4% 2% 1% 1%.8% 2% 1% 1%
4.8% failure chance 8% failure chance
12.8% failure chance
Availability - Two brokers, single ZK
Availability - Collapse Host Faults
.8%
2% 1% 1%
4%
98.4% success chance
Availability - Multiple ZKs
99.4% success chance
Availability - Three Brokers
99.95% success chance
Availability - Summary
Success Probability Cost Per Nine*
Standalone 87.2% n/a
Two brokers, ISR=1, One ZK 98.36% 2
Two brokers, ISR=1, Three ZKs 99.36% 2
Two brokers, ISR=1, Five ZKs 99.36% 3
Three brokers, ISR=1, Three ZKs 99.95% 1.5
Three brokers, ISR=2, Three ZKs 99.36% 2.25
* Cost is computed in “disk units” / “number of nines”:
Kafka Broker Rotational Disk = .5
Zookeeper SSD Disk = 1
Lower is better
Availability - Broker SSD
Success Probability Cost Per Nine*
Standalone 90.4% 1
Two brokers, ISR=1, One ZK 99.08% 1.5
Two brokers, ISR=1, Three ZKs 99.77% 2.5
Two brokers, ISR=1, Five ZKs 99.77% 3.5
Three brokers, ISR=1, Three ZKs 99.99% 1.5
Three brokers, ISR=2, Three ZKs 99.77% 3
* SSD Disk = 1
Availability - Broker EBS
Success Probability Cost Per Nine*
Standalone 91.6% 1.5
Two brokers, ISR=1, One ZK 99.29% 2.25
Two brokers, ISR=1, Three ZKs 99.82% 3.75
Two brokers, ISR=1, Five ZKs 99.82% 5.25
Three brokers, ISR=1, Three ZKs 99.99% 4.5
Three brokers, ISR=2, Three ZKs 99.82% 2.25
* EBS disk units:
EBS SSD Disk = 1.5
Assumption that EBS fails at .2%
What are the chances of losing data?
Service Level Objective (SLO):
99.99999% durability per year
Durability
We lose data when all hosts with replicas go down
Assumptions:
6TB per broker (2TB per disk w/ RAID)
70MB/s replication rate
~24 hours to replicate full broker
We replace bad hosts almost immediately
Durability
Durability - Two brokers - One 6TB Disk
1 - e-p(A)*t
* 1 -
e-p(B)*t
99.999999% durability
Durability - Add Raid0
99.9999% durability
Durability - Three brokers
99.99999999% durability
Assumption:
48TB per broker
70MB/s replication rate
~8 days to replicate full broker
We replace bad hosts almost immediately
Durability
Durability - Three brokers 24 disks
99.999% completeness
24 disks
Durability - Summary
Data completeness Cost Per Nine*
Standalone 99.99% .125
Two brokers 99.999999% .5
Two brokers RAID0 99.9999% .22
Three brokers RAID0 99.99999999% .15
Three brokers RAID0 - 48TB 99.999% 1
* Cost is computed in “disk units” / “number of nines”:
Single non-raid disk = .5
Raid0 = .167
Zookeeper SSD Disk = 1
Lower is better
Latency
FTA focused on failures
Latency is not an inherent failure
Experiment with worst-case scenarios
Conclusion
Fault Tree Models: github.com/afalko/fta-kafka
OSS tool to draw and compute models: github.com/rakhimov/scram
How Not to Go Boom: Lessons for SREs from Oil Refineries by Emil Stolarsky
Fault Tree Analysis - A History by Clifton A. Ericson II
Fault Tree Handbook with Aerospace Applications by Dr. Michael Stamatelatos and Mr. José Caraballo
Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso
Solving Data Loss in Massive Storage Systems by Jason Resch
Failures at Scale and How to Ride Through Them by James Hamilton
Tools and References
FTA can be applied to:
Kafka Availability and Durability SLOs
Find cost savings
Uncover decisions that reduce reliability
Takeaways
Kafka on Kubernetes analysis
Improve scram-pra
Better FTA inputs via Distributed Tracing
Further Work
github.com/afalko/fta-kafka
Andrey Falko <afalko@lyft.com>
Thank You
Thank you!

More Related Content

What's hot

ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
Paul Brebner
 

What's hot (20)

From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacre
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Jitney, Kafka at Airbnb
Jitney, Kafka at AirbnbJitney, Kafka at Airbnb
Jitney, Kafka at Airbnb
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
 
101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
 
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
 
Real time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreamingReal time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreaming
 
netflix-real-time-data-strata-talk
netflix-real-time-data-strata-talknetflix-real-time-data-strata-talk
netflix-real-time-data-strata-talk
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Production
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
 

Similar to Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit NYC 2019

Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
HostedbyConfluent
 
AQM performance for VOIP
AQM performance for VOIPAQM performance for VOIP
AQM performance for VOIP
Makkawy khair
 

Similar to Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit NYC 2019 (20)

Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
 
Does it still make sense to do Big Data with Small Nodes?
Does it still make sense to do Big Data with Small Nodes? Does it still make sense to do Big Data with Small Nodes?
Does it still make sense to do Big Data with Small Nodes?
 
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
 
Redis Reliability, Performance & Innovation
Redis Reliability, Performance & InnovationRedis Reliability, Performance & Innovation
Redis Reliability, Performance & Innovation
 
Save Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud CostsSave Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud Costs
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档
 
Breda Development Meetup 2016-06-08 - High Availability
Breda Development Meetup 2016-06-08 - High AvailabilityBreda Development Meetup 2016-06-08 - High Availability
Breda Development Meetup 2016-06-08 - High Availability
 
Scaling with MongoDB
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDB
 
Troubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesTroubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issues
 
Java zone 2015 How to make life with kafka easier.
Java zone 2015 How to make life with kafka easier.Java zone 2015 How to make life with kafka easier.
Java zone 2015 How to make life with kafka easier.
 
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Pattern
 
AQM performance for VOIP
AQM performance for VOIPAQM performance for VOIP
AQM performance for VOIP
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
Day 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfDay 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConf
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)
 
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
 
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
 
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
 

More from confluent

More from confluent (20)

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Recently uploaded (20)

Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 

Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit NYC 2019

  • 1. Presentation title ANDREY FALKO - APRIL 2019 Bulletproof Apache Kafka® with Fault Tree Analysis
  • 2. Agenda The Challenge: Quantify Kafka Reliability Introduction to Fault Tree Analysis Kafka Fault Trees Availability Data Durability Conclusion
  • 3. The Challenge: Quantify Kafka Reliability
  • 4. Kafka is a “reliability tool” Move data without lossiness High stakes usage What are we trying to do?
  • 8. Why Quantify? Determine probability of success Find opportunities to trim cost
  • 9. Defining SLOs Need to define Service Level Objectives Availability Durability Latency
  • 10. Quantifying SLOs Availability What is the probability that writes or reads fail? How long do we tolerate downtime?
  • 11. Quantifying SLOs Durability What is the probability that we’ll lose data? How much will we lose?
  • 12. Quantifying SLOs Latency How long are transactions allowed to take?
  • 13. Introduction to Fault Tree Analysis
  • 14. What is Fault Tree Analysis? Deductive Failure Analysis Invented in 1962 for Minuteman I ICBM Launch Control System Industry wide adoption Aerospace Military Petrochemical Et al.
  • 15. Fault Tree Analysis: Event Symbols Basic Intermediate Transfer
  • 16. Fault Tree Analysis: Gate Symbols ANDOR
  • 17. Fault Tree Analysis: OR Example
  • 18. Fault Tree Analysis: OR Example 4% probability of failure
  • 19. Fault Tree Analysis: OR Example 4% probability of failure annualized
  • 20. Fault Tree Analysis: OR Example p(A “or” B) = p(A) + p(B) - p(A) * p(B) Almost always small
  • 21. Fault Tree Analysis: OR Example 12% chance of failure annualized
  • 22. Fault Tree Analysis: AND Example
  • 23. Fault Tree Analysis: AND Example p(A and B) = p(A) * p(B)
  • 24. Fault Tree Analysis: AND Example 99.9936% chance of success annualized
  • 25. Fault Tree Analysis: AND Example 99.9936% chance of success annualized if we don’t remediate
  • 26. Fault Tree Analysis: AND Example 99.999999996% chance of success if we remediate within 3 days
  • 27. Fault Tree Analysis: AND Example 99.99999% chance of success if we remediate within 3 days
  • 28. Fault Tree Analysis: AND Example p(A and B) = p(A) * p(B) = (1 - e-p(A)*t ) * (1 - e-p(B)*t ) Where t = time to remediate If p(A) and p(B) < .1, approximate to p(A)*p(B)*t2
  • 30. Can we write or read to a Kafka cluster? Service Level Objective (SLO): 99.99% success rate per year Availability
  • 31. 4% 2% 1% 1% ? Availability
  • 32. Availability 4% 2% 1% 1%.8% 2% 1% 1% 4.8% failure chance 8% failure chance 12.8% failure chance
  • 33. Availability - Two brokers, single ZK
  • 34. Availability - Collapse Host Faults .8% 2% 1% 1% 4% 98.4% success chance
  • 35. Availability - Multiple ZKs 99.4% success chance
  • 36. Availability - Three Brokers 99.95% success chance
  • 37. Availability - Summary Success Probability Cost Per Nine* Standalone 87.2% n/a Two brokers, ISR=1, One ZK 98.36% 2 Two brokers, ISR=1, Three ZKs 99.36% 2 Two brokers, ISR=1, Five ZKs 99.36% 3 Three brokers, ISR=1, Three ZKs 99.95% 1.5 Three brokers, ISR=2, Three ZKs 99.36% 2.25 * Cost is computed in “disk units” / “number of nines”: Kafka Broker Rotational Disk = .5 Zookeeper SSD Disk = 1 Lower is better
  • 38. Availability - Broker SSD Success Probability Cost Per Nine* Standalone 90.4% 1 Two brokers, ISR=1, One ZK 99.08% 1.5 Two brokers, ISR=1, Three ZKs 99.77% 2.5 Two brokers, ISR=1, Five ZKs 99.77% 3.5 Three brokers, ISR=1, Three ZKs 99.99% 1.5 Three brokers, ISR=2, Three ZKs 99.77% 3 * SSD Disk = 1
  • 39. Availability - Broker EBS Success Probability Cost Per Nine* Standalone 91.6% 1.5 Two brokers, ISR=1, One ZK 99.29% 2.25 Two brokers, ISR=1, Three ZKs 99.82% 3.75 Two brokers, ISR=1, Five ZKs 99.82% 5.25 Three brokers, ISR=1, Three ZKs 99.99% 4.5 Three brokers, ISR=2, Three ZKs 99.82% 2.25 * EBS disk units: EBS SSD Disk = 1.5 Assumption that EBS fails at .2%
  • 40. What are the chances of losing data? Service Level Objective (SLO): 99.99999% durability per year Durability
  • 41. We lose data when all hosts with replicas go down Assumptions: 6TB per broker (2TB per disk w/ RAID) 70MB/s replication rate ~24 hours to replicate full broker We replace bad hosts almost immediately Durability
  • 42. Durability - Two brokers - One 6TB Disk 1 - e-p(A)*t * 1 - e-p(B)*t 99.999999% durability
  • 43. Durability - Add Raid0 99.9999% durability
  • 44. Durability - Three brokers 99.99999999% durability
  • 45. Assumption: 48TB per broker 70MB/s replication rate ~8 days to replicate full broker We replace bad hosts almost immediately Durability
  • 46. Durability - Three brokers 24 disks 99.999% completeness 24 disks
  • 47. Durability - Summary Data completeness Cost Per Nine* Standalone 99.99% .125 Two brokers 99.999999% .5 Two brokers RAID0 99.9999% .22 Three brokers RAID0 99.99999999% .15 Three brokers RAID0 - 48TB 99.999% 1 * Cost is computed in “disk units” / “number of nines”: Single non-raid disk = .5 Raid0 = .167 Zookeeper SSD Disk = 1 Lower is better
  • 48. Latency FTA focused on failures Latency is not an inherent failure Experiment with worst-case scenarios
  • 50. Fault Tree Models: github.com/afalko/fta-kafka OSS tool to draw and compute models: github.com/rakhimov/scram How Not to Go Boom: Lessons for SREs from Oil Refineries by Emil Stolarsky Fault Tree Analysis - A History by Clifton A. Ericson II Fault Tree Handbook with Aerospace Applications by Dr. Michael Stamatelatos and Mr. José Caraballo Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso Solving Data Loss in Massive Storage Systems by Jason Resch Failures at Scale and How to Ride Through Them by James Hamilton Tools and References
  • 51. FTA can be applied to: Kafka Availability and Durability SLOs Find cost savings Uncover decisions that reduce reliability Takeaways
  • 52. Kafka on Kubernetes analysis Improve scram-pra Better FTA inputs via Distributed Tracing Further Work