What’s Up with Availability in Kafka?
Justine Olshan
Imagine this scenario…
I wasn’t able to talk to Apache Kafka® for 30 minutes!!
What do you mean? The servers were all up and running.
Well, I know that my application was down! So something was wrong!
How do we define expectations?
● Service Level Indicator (SLI)
○ A measurement on a service
● Service Level Objective (SLO)
○ A goal for how we want our service to behave
● Service Level Agreement (SLA)
○ An understanding about expectations for the service
SRE fundamentals 2021: SLIs vs SLAs vs SLOs
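To make the three terms concrete, here is a minimal sketch that treats the SLI as the fraction of successful probes and the SLO as a target for that fraction. The `ProbeWindow` type, the one-probe-per-minute cadence, and the 99.9% and 99.95% targets are assumptions chosen for illustration; they are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class ProbeWindow:
    """Counts of probe results over a measurement window (hypothetical type)."""
    successes: int
    failures: int


def availability_sli(window: ProbeWindow) -> float:
    """SLI: fraction of probes that succeeded in the window."""
    total = window.successes + window.failures
    if total == 0:
        # No data. Treating this as available is a policy assumption.
        return 1.0
    return window.successes / total


def meets_slo(window: ProbeWindow, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the target?"""
    return availability_sli(window) >= slo_target


# One probe per minute over a 30-day window, with the 30-minute
# outage from the opening scenario counted as 30 failed probes.
minutes_in_window = 30 * 24 * 60  # 43200
window = ProbeWindow(successes=minutes_in_window - 30, failures=30)

print(round(availability_sli(window), 6))  # 0.999306
print(meets_slo(window, 0.999))            # True
print(meets_slo(window, 0.9995))           # False
```

Under these assumptions, the same 30-minute outage meets a 99.9% objective but misses a 99.95% one, which is why the objective has to be agreed on before anyone can say whether the service was "available".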
How Kafka Prevents Downtime
[Diagram: Broker 1 is Leader (1); Broker 2 is Follower (2); Broker 3 is Follower (3). Partition state: Leader: 1, ISR: 1,2,3]
acks = all
replication.factor = 3
min.insync.replicas = 2
Data Plane: Replication Protocol
Configuring Durability, Availability, and Ordering Guarantees
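The settings on this slide can be read as a simple rule: with acks = all, a write is accepted only while the in-sync replica set (ISR) has at least min.insync.replicas members. The sketch below models only that rule. The function names are hypothetical, and it is a toy model, not broker code.

```python
from __future__ import annotations


def can_accept_write(isr: set[int], min_insync_replicas: int) -> bool:
    """With acks=all, the leader rejects writes once the ISR shrinks
    below min.insync.replicas."""
    return len(isr) >= min_insync_replicas


def tolerated_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can drop out of the ISR before acks=all writes
    start failing."""
    return replication_factor - min_insync_replicas


# Settings from the slide: replication.factor = 3, min.insync.replicas = 2.
isr = {1, 2, 3}
print(tolerated_failures(3, 2))   # 1

isr.discard(1)                    # one broker leaves the ISR
print(can_accept_write(isr, 2))   # True: ISR {2, 3} still has 2 members

isr.discard(2)                    # a second broker leaves the ISR
print(can_accept_write(isr, 2))   # False: ISR {3} is below the minimum
```

With these values the partition keeps accepting acks = all writes through the loss of one replica, and stops accepting them after the loss of a second.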
How Kafka Prevents Downtime
[Diagram: leadership moves from Broker 1 to Broker 2. Partition state changes from Leader: 1, ISR: 1,2,3 to Leader: 2, ISR: 1,2,3]
acks = all
replication.factor = 3
min.insync.replicas = 2
Comparing Shutdowns…
Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper
5 Common Pitfalls When Using Apache Kafka
Gaps in Kafka’s Availability Story
● External network connectivity issues
○ Load balancers failing
○ Cloud provider outage
● Storage stuck on leader
● Intermittent issues
● High latency
What is up with Kafka?
● Metrics that truly measure availability
○ Can users interact with their data?
● Can we produce/consume? With a cluster or an individual partition?
● Can connections be made?
● Can we replicate?
● From there: define SLI, SLO, SLA
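The question of measuring at the cluster level or the partition level matters because an aggregate number can hide a partition that is completely unreachable. The following sketch shows the difference, assuming probe results arrive as hypothetical `(topic, partition, ok)` tuples from some produce/consume canary that is not shown here. The topic name `canary` is also an assumption.

```python
from collections import defaultdict


def partition_availability(results):
    """Per-partition SLI: fraction of successful probes for each
    (topic, partition) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for topic, partition, ok in results:
        entry = counts[(topic, partition)]
        if ok:
            entry[0] += 1
        entry[1] += 1
    return {tp: good / total for tp, (good, total) in counts.items()}


def cluster_availability(results):
    """Cluster-wide SLI: fraction of successful probes overall."""
    results = list(results)
    if not results:
        return 1.0  # no data; treating this as available is an assumption
    return sum(1 for _, _, ok in results if ok) / len(results)


# Ten partitions, two probes each; partition 0 fails both probes.
probes = [("canary", p, p != 0) for p in range(10) for _ in range(2)]

print(cluster_availability(probes))                   # 0.9
print(partition_availability(probes)[("canary", 0)])  # 0.0
```

Here the cluster looks 90% available while one partition is at 0%, so any client pinned to that partition sees a full outage.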
How can we mitigate unavailability?
● Detect misbehaving brokers and take action!
○ Transfer leadership – bin/kafka-reassign-partitions.sh
○ Restart or replace
[Diagram: leadership moves between brokers; Leader (1) becomes Follower (1), and Follower (2) becomes Leader (2)]
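As a sketch of the leadership-transfer step, the snippet below builds the JSON plan that bin/kafka-reassign-partitions.sh reads. The first broker in a partition's replica list is its preferred leader, so moving the healthy broker to the front is one way to steer leadership away from a misbehaving one. The topic name `orders`, the broker IDs, and the helper name are assumptions for illustration.

```python
import json


def prefer_leader(topic, partition, replicas, new_leader):
    """Build a reassignment plan that keeps the same replica set but
    lists `new_leader` first, making it the preferred leader."""
    if new_leader not in replicas:
        raise ValueError("new_leader must already be a replica")
    reordered = [new_leader] + [b for b in replicas if b != new_leader]
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": reordered}
        ],
    }


# Hypothetical partition led by broker 1; prefer broker 2 instead.
plan = prefer_leader("orders", 0, [1, 2, 3], new_leader=2)
print(json.dumps(plan, indent=2))
```

The plan would be saved to a file and passed to the tool with `--reassignment-json-file` and `--execute`. Reordering replicas changes only the preferred leader; the actual switch happens on a preferred leader election, for example through bin/kafka-leader-election.sh with `--election-type preferred` or the broker's automatic leader rebalancing. Check the exact flags against your Kafka version.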
Confluent has cool tools in cloud!
● Broker Leadership Priority APIs
● Automatic External Network Mitigation
● Automatic Stuck Storage Mitigation
Note: Confluent Cloud is the only place to take advantage of all these availability features!
Broker Leadership Priority API
● Demote API
● Promote API
[Diagram: Leader (1) becomes Follower (1) and is marked DEMOTED!; Follower (2) becomes Leader (2)]
Automatic External Network Mitigation
Symptoms:
● External (user) connections and traffic lost
● Internal (replication, ZooKeeper) connections and traffic remain
Mitigation:
● Monitor external traffic and send explicit pings to detect connectivity loss
● Automatically demote when external traffic lost
● Automatically promote when external traffic returns
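The demote-then-promote loop above can be sketched as a small state machine. This is not Confluent's implementation, which the slides do not describe. The class name, the consecutive-observation thresholds, and the string actions are all assumptions, and a real system would call the demote and promote APIs where this sketch returns a label.

```python
class ExternalTrafficMonitor:
    """Hypothetical monitor: demote a broker after several consecutive
    failed external checks, promote it after several consecutive good ones."""

    def __init__(self, demote_after=3, promote_after=3):
        self.demote_after = demote_after
        self.promote_after = promote_after
        self.demoted = False
        self._bad = 0   # consecutive failed checks
        self._good = 0  # consecutive successful checks

    def observe(self, external_ok: bool) -> str:
        """Record one check (traffic seen or ping answered) and return
        the action to take: "demote", "promote", or "none"."""
        if external_ok:
            self._good += 1
            self._bad = 0
        else:
            self._bad += 1
            self._good = 0

        if not self.demoted and self._bad >= self.demote_after:
            self.demoted = True
            return "demote"
        if self.demoted and self._good >= self.promote_after:
            self.demoted = False
            return "promote"
        return "none"


monitor = ExternalTrafficMonitor()
checks = [False, False, False, True, True, True]
print([monitor.observe(ok) for ok in checks])
# ['none', 'none', 'demote', 'none', 'none', 'promote']
```

Requiring several consecutive observations in each direction is a design choice that keeps intermittent blips from moving leadership back and forth.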
Automatic Stuck Storage Mitigation
Symptoms:
● Storage threads on a leader get stuck, leader can’t replicate
● Followers fall out of ISR
● Leader crashes resulting in offline partitions
Mitigation:
● Detect when threads get stuck
● Automatically restart the broker, leaders move
● Leadership won’t return unless the broker comes up healthy
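One way to picture "detect when threads get stuck" is a watchdog that compares each storage thread's last progress timestamp with the current time. The sketch below is a guess at the general shape and does not reflect Confluent's detector. The thread names, the 30-second timeout, and the function name are hypothetical.

```python
from __future__ import annotations


def find_stuck_threads(last_progress: dict[str, float],
                       now: float,
                       timeout_s: float) -> list[str]:
    """Return the names of threads that have not reported progress
    within `timeout_s` seconds, sorted for stable output."""
    return sorted(
        name for name, ts in last_progress.items() if now - ts > timeout_s
    )


# Hypothetical heartbeat timestamps, in seconds.
heartbeats = {"log-flusher": 100.0, "request-handler-0": 158.0}

stuck = find_stuck_threads(heartbeats, now=160.0, timeout_s=30.0)
print(stuck)  # ['log-flusher']

if stuck:
    # A real system would restart the broker here so that leaders move.
    print("restart broker")
```

Restarting on detection, rather than waiting for the leader to crash, is what avoids the offline partitions listed under Symptoms.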
Reimagine this scenario…
Our monitoring noticed external connectivity loss to part of Kafka. We limited the unavailability by moving your data to an available part of the system. Hopefully this caused minimal downtime for your clients.
Got it. Thanks for keeping my cluster available and meeting the SLA!
Get started with Confluent Cloud to take advantage of the availability features mentioned today!
https://developer.confluent.io
Thank you!
Special thanks: Manikumar, Keshav, Gopi, Pablo, Drumil, Lewis, Adithya

What’s up With Availability in Kafka? With Justine Olshan | Current 2022
