Our Multi-Year
Performance Journey in
Confluent Cloud
Shriram Sridharan - Sr. Manager, Kafka Data Infrastructure
Marc Selwan - Sr. Product Manager, Kafka Data Infrastructure
Who are we?
Database background working on
storage and indexing.
Building faster and cheaper Kafka in
Confluent. Built relational databases.
We are simply representing a team of incredibly talented and hard-working engineers.
It’s not just Kafka in the cloud, in reality…
There’s a ton that goes into running our cloud service
[Diagram: Confluent Cloud architecture. Labels: customers; multi-cloud networking and routing tier; network; compute cells across three AZs; object storage; metadata; durability audits; data balancing; health checks; metrics and observability; real-time feedback data; other Confluent Cloud services (Connect, Processing, Governance); global control plane.]
Agenda
● Abstractions for a unified multi-cloud experience
● Latency SLO
● Monitoring
● Steady state latency
○ Workload patterns
○ Kora optimizations
○ Results
● Degraded state latency
● Takeaways
Cloud deployments can
be notoriously complex
Each arrow represents an
interaction that has an
associated cost, performance,
and throughput limit.
[Diagram: producers and consumers, the multi-cloud networking and routing tier, the network, brokers with local SSD storage across three AZs, and object storage, connected by arrows.]
These aspects—cost,
performance, and throughput
limits—don’t always change
proportionally for different
hardware options.
[Chart: available bandwidth (Gbps) by instance name]
Pricing models and instance performance vary across cloud providers.
Many operators punt this complexity to customers.
Abstractions for a
unified multi-cloud
experience
Logical Kafka Cluster (LKC) as
the unit of access control
Confluent Kafka Unit (CKU) as
the unit of cluster capacity, expressed
in customer-visible metrics, e.g.
50 MB/s ingress and 150 MB/s egress
bandwidth per CKU
Cluster load exposes how
loaded a cluster is and provides
a signal when customers need
to scale up/down
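As a rough illustration of how the CKU abstraction answers "How many CKUs do I need?", the sketch below sizes a cluster from the per-CKU bandwidth figures quoted above (50 MB/s ingress, 150 MB/s egress). The helper name `ckus_needed` and the choice to size on bandwidth alone are assumptions made for illustration; real sizing may involve other dimensions that the slides do not list.

```python
import math

# Per-CKU bandwidth limits as quoted on the slide; other dimensions are ignored.
INGRESS_MB_PER_CKU = 50
EGRESS_MB_PER_CKU = 150


def ckus_needed(ingress_mb_s: float, egress_mb_s: float) -> int:
    """Return the smallest CKU count that covers both bandwidth dimensions.

    Hypothetical helper: sizes on ingress/egress bandwidth only.
    """
    if ingress_mb_s < 0 or egress_mb_s < 0:
        raise ValueError("bandwidth must be non-negative")
    by_ingress = math.ceil(ingress_mb_s / INGRESS_MB_PER_CKU)
    by_egress = math.ceil(egress_mb_s / EGRESS_MB_PER_CKU)
    # The tighter dimension decides; assume a cluster has at least one CKU.
    return max(1, by_ingress, by_egress)


if __name__ == "__main__":
    # 120 MB/s in needs 3 CKUs, 200 MB/s out needs 2: ingress is the bottleneck.
    print(ckus_needed(120, 200))
```

The point of the abstraction is that the customer reasons only in these visible units, not in instance types or disk settings.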
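[Diagram: the full deployment (producers, consumers, networking and routing, network, brokers with local SSD storage across three AZs, object storage) shown next to the customer's view: producers and consumers connected to an LKC across three AZs.]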
1. Broker instance type?
2. Number of brokers?
3. Block storage type?
4. Block storage throughput?
5. Block storage IOPS?
6. Associated kernel, filesystem, and Kafka knobs?
7. Which resource is bottlenecked?
VS
1. How many CKUs do I need?
2. Is my cluster overloaded with respect to my latency requirements?
Ensure consistent performance behind the abstractions while…
• Managing 30K+ clusters
• Adapting to and accommodating various workload profiles
• Adding new features
• Running the auxiliary software needed to run our services
• Handling cloud provider variability
• Being operated by machines, not people
The Challenge
[Diagram: producers, consumers, the multi-cloud networking and routing tier, the network, brokers with local SSD storage across three AZs, and object storage, together with metadata, durability audits, data balancing, health checks, and real-time feedback data. Marked span: end-to-end latency.]
Defining a Latency SLO - Challenges
Factors that determine the latency SLO for Kafka
● Aggregate at the broker or cluster level?
● Aggregate per week/day/hour/min?
● E2E or produce latencies?
● p50/avg/p95/p99/p9999?
Challenges with running a cloud service
● No client visibility (KIP-714)
● Each customer has their own usage pattern and expectations of the service
Defining a Latency SLO - First Attempt
What doesn’t get measured doesn’t get improved
● External health check probes every broker
● Measure E2E (Produce + Consume)
● Aggregate max per broker per min
● Monitor p99 over a week per cluster
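The first-attempt metric can be sketched as two aggregation steps: take the maximum probe latency per broker per minute, then take the p99 of those values over the week for the cluster. The `Probe` record, the nearest-rank percentile definition, and pooling all brokers' per-minute maxima into one cluster-level population are assumptions for illustration; the slides do not specify them.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Probe(NamedTuple):
    broker: str
    minute: int     # minutes since the start of the week
    e2e_ms: float   # produce + consume latency of one health check


def nearest_rank(values: list, pct: int) -> float:
    """Nearest-rank percentile (assumed definition)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct% of n)
    return ordered[rank - 1]


def weekly_p99(probes: Iterable[Probe]) -> float:
    """Max per broker per minute, then p99 over the cluster's samples."""
    per_broker_minute = defaultdict(float)
    for p in probes:
        key = (p.broker, p.minute)
        per_broker_minute[key] = max(per_broker_minute[key], p.e2e_ms)
    return nearest_rank(list(per_broker_minute.values()), 99)
```

Taking the per-minute maximum first means a single slow probe marks the whole minute as slow, which makes the later percentile stricter than a percentile over raw probes.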
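Defining a Latency SLO - Issues
● Did NOT capture latency anomalies during degradations
● Up to 100 minutes of degraded latency in a week while the cluster was still within the SLO: a week has 10,080 minutes, and a weekly p99 ignores the worst 1% of them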
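The roughly 100-minute blind spot follows from the arithmetic of a weekly p99: a week has 7 × 24 × 60 = 10,080 minutes, and the worst 1% of them (about 100) do not move the percentile. A small check of that reasoning, assuming one aggregated sample per minute and a nearest-rank percentile; the latency values are made up for illustration:

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def weekly_p99_ms(bad_minutes: int, good_ms: float = 20.0, bad_ms: float = 2000.0) -> float:
    """p99 (nearest rank) over one sample per minute for a week.

    good_ms and bad_ms are illustrative values, not measurements.
    """
    samples = [bad_ms] * bad_minutes + [good_ms] * (MINUTES_PER_WEEK - bad_minutes)
    samples.sort()
    rank = (99 * len(samples) + 99) // 100  # integer ceil(0.99 * n)
    return samples[rank - 1]


if __name__ == "__main__":
    # 100 degraded minutes: the weekly p99 still reports the healthy latency.
    print(weekly_p99_ms(100))
    # One more degraded minute and the weekly p99 finally moves.
    print(weekly_p99_ms(101))
```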
Defining a Latency SLO - Current State
Added a new SLO
● External health check probes every broker
● Measure E2E (Produce + Consume)
● Aggregate max per broker per min
● Monitor p99 per window (in mins) per cluster
Latency SLOs
● Steady State Latency SLO
● Degraded State Latency SLO
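A sketch of why a per-window p99 catches what a weekly p99 misses. The window length (60 minutes), the use of non-overlapping windows, and the latency values are assumptions for illustration; the slides say only that the window is measured in minutes.

```python
def p99(samples: list) -> float:
    """Nearest-rank p99 (assumed percentile definition)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (99 * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def windowed_p99(per_minute_ms: list, window_mins: int) -> list:
    """p99 of each consecutive, non-overlapping window of per-minute samples."""
    if window_mins <= 0:
        raise ValueError("window must be positive")
    return [
        p99(per_minute_ms[i:i + window_mins])
        for i in range(0, len(per_minute_ms), window_mins)
    ]


if __name__ == "__main__":
    week = [20.0] * 10080
    week[600:660] = [2000.0] * 60       # one degraded hour (illustrative)
    print(p99(week))                    # weekly p99 hides the degraded hour
    print(max(windowed_p99(week, 60)))  # the hourly window exposes it
```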
Monitoring Infrastructure
● Per-cluster monitoring
● Alerts
● Operated by machines
● Nightly regression tests
[Diagram: a health check producer and a health check consumer, the multi-cloud networking and routing tier, the network, brokers with local SSD storage across three AZs, and object storage. Marked spans: end-to-end latency and internal latency.]
The health check (HC) agent produces and then consumes to measure E2E latency.
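A minimal sketch of such a probe: time the round trip of one marker message from produce to consume. The `Transport` interface and `LoopbackTransport` are hypothetical stand-ins so the example runs without a cluster; they are not the health check agent's actual API.

```python
import time
import uuid
from typing import Callable, Protocol


class Transport(Protocol):
    """Hypothetical stand-in for a Kafka client bound to one broker."""

    def produce(self, payload: bytes) -> None: ...

    def consume(self) -> bytes: ...


def probe_e2e_ms(transport: Transport, clock: Callable[[], float] = time.monotonic) -> float:
    """Produce one marker message, consume it back, return elapsed milliseconds."""
    marker = uuid.uuid4().bytes
    start = clock()
    transport.produce(marker)
    received = transport.consume()
    elapsed_ms = (clock() - start) * 1000.0
    if received != marker:
        raise RuntimeError("health check consumed an unexpected message")
    return elapsed_ms


class LoopbackTransport:
    """In-memory fake, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self._queue = []

    def produce(self, payload: bytes) -> None:
        self._queue.append(payload)

    def consume(self) -> bytes:
        return self._queue.pop(0)


if __name__ == "__main__":
    print(probe_e2e_ms(LoopbackTransport()))
```

A monotonic clock is used so the measurement is not affected by wall-clock adjustments.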
Steady State Latency - Workload Patterns
Partitions: 100s to 100s of thousands
Fanout: 1:1 to 1:30
Throughput: 10 MB to 20 GB
Clients: 10s to 10s of thousands
Additional variables: connection rate, requests per second, keyed vs. non-keyed
Steady State Latency - Peeling the Onion
[Diagram labels: Workloads, Proof of Concept, Benchmark, Tracing]
What we got right
● Built distributed tracing
● Encoded workloads into the OpenMessaging Benchmark
● Failed fast with quick-and-dirty proofs of concept
What took us some time to figure out
● We were hyper-focused on Confluent Kafka
● Kora services had a significant impact
Split the investigation into
● Confluent Kafka runtime running on bare EC2
● Kora services
Kora Optimizations - Kafka Specific Optimizations
Disclaimer: YMMV depending on hardware/workload/configs
Replication Optimizations
Observation
● The replication layer had a lot of CPU overhead and inefficient allocation patterns
● Predominantly visible in workloads with a lot of partitions
Improvement
● Kora has a completely rewritten, more efficient replication protocol, shipping in the next few weeks, aimed at minimizing CPU usage and allocations.
Network Optimizations
Observation
● E2E latency was much higher than broker-side latencies.
● Predominantly visible with a small number of clients.
Improvement
● Kora has increased parallelism in Kafka.
● This increases CPU consumption on the broker side but provides better overall E2E latency.
Storage Optimizations
Observation: Background operations interfered with foreground real-time operations.
Improvement
● Tiered Storage (compute-storage separation)
● Catch-up consumption happens from object storage instead of local storage
● Heavily tuned filesystem and page cache parameters
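The catch-up read path can be pictured as a routing decision on the fetch offset: recent data is served from local disk, while older data is served from object storage so that lagging consumers do not compete with real-time traffic for local I/O. This is a conceptual sketch only; `PartitionLog`, its field names, and `tier_for_fetch` are hypothetical and do not describe Kora's internals.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    OBJECT_STORE = "object_store"


@dataclass(frozen=True)
class PartitionLog:
    # Hypothetical view of one partition's log; field names are illustrative.
    log_start_offset: int     # oldest offset retained anywhere
    local_start_offset: int   # oldest offset still held on local disk
    log_end_offset: int       # next offset to be written


def tier_for_fetch(log: PartitionLog, fetch_offset: int) -> Tier:
    """Choose where a fetch at fetch_offset should be served from."""
    if not log.log_start_offset <= fetch_offset <= log.log_end_offset:
        raise ValueError("offset out of range")
    if fetch_offset >= log.local_start_offset:
        return Tier.LOCAL          # real-time consumers near the log end
    return Tier.OBJECT_STORE       # catch-up consumers reading older data


if __name__ == "__main__":
    log = PartitionLog(log_start_offset=0, local_start_offset=900, log_end_offset=1000)
    print(tier_for_fetch(log, 950), tier_for_fetch(log, 100))
```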
Incremental Improvements
An infinite number of infinitesimal improvements
Observation: Death by a thousand cuts
Improvement
● Minimize work per request
● Move work out of the critical path
● Tuning (GC, kernel parameters)
Kora Optimizations - Kora Specific Optimizations
Improvements to other Kora Services
Observation
● Some of these services are bin-packed with Kafka and therefore use the same hardware resources.
Improvement
● Impact minimized by either
○ Complete re-architecture
○ Enforced QoS
● Example: the monitoring agent
Learning: Build with performance as a first-class citizen
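The slides do not say how QoS is enforced for co-located services. One common technique for capping the work an auxiliary agent may do is a token bucket, sketched below purely as an illustration; the class, its parameters, and the injected clock are assumptions, not Kora's mechanism.

```python
import time
from typing import Callable


class TokenBucket:
    """Caps an auxiliary task at rate_per_s operations, with a bounded burst."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True if the caller may proceed now, False if it must back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=2.0, burst=2.0)
    print([bucket.try_acquire() for _ in range(3)])
```

A rejected call costs the agent a retry rather than costing the broker CPU or I/O, which is the trade-off a QoS limit makes in favor of the foreground workload.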
Up to 10X Faster
Results - Confluent Cloud vs Apache Kafka
*A blog post with benchmark details and numbers is expected in the next few weeks. Also, as more improvements come in, these numbers will change.
Degraded State Latency
Recap of the Degraded State Latency SLO
● p99 E2E latency per window (in mins) per cluster
Investigations broadly reveal the following issues
● Degraded cloud hardware/services
● Workload-induced degradation (imbalance in load distribution)
● Multi-cloud vagaries
Degraded cloud hardware/services
Observation: Degradation in the cloud is real!
Over a recent one-week interval, we observed
● A few incidents with complete block storage outages
● 10s of incidents with external connectivity loss
● 100s of incidents of storage and network degradation
Improvement
● Built proprietary APIs to transfer leadership to a non-degraded broker (or AZ)
● No compromise on durability and availability guarantees for predictable performance
Degraded cloud hardware/services - Automation
[Diagram labels: Monitor, Aggregate, Mitigate]
Improvement: Proprietary APIs enable automated mitigations
● Monitor, aggregate, and mitigate pipeline
● Has triggered more than 500 times in the last 30 days!
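The monitor, aggregate, mitigate loop can be sketched as follows. The leadership-transfer API is proprietary, so it appears here only as a caller-supplied callback; the `Signal` record, the 0.5 threshold, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple


class Signal(NamedTuple):
    broker: str
    degraded: bool   # one monitor's verdict for one interval


def aggregate(signals: Iterable[Signal]) -> Dict[str, float]:
    """Fraction of degraded verdicts per broker."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for s in signals:
        total[s.broker] += 1
        bad[s.broker] += int(s.degraded)
    return {b: bad[b] / total[b] for b in total}


def mitigate(ratios: Dict[str, float],
             transfer_leadership: Callable[[str, List[str]], None],
             threshold: float = 0.5) -> List[str]:
    """Move leadership off brokers at or above the threshold.

    transfer_leadership(src, healthy_targets) is a hypothetical callback
    standing in for the proprietary API mentioned in the slides.
    """
    healthy = sorted(b for b, r in ratios.items() if r < threshold)
    moved = []
    for broker, ratio in sorted(ratios.items()):
        if ratio >= threshold and healthy:
            transfer_leadership(broker, healthy)
            moved.append(broker)
    return moved
```

Aggregating before acting avoids moving leadership on a single noisy reading, and skipping mitigation when no healthy target exists avoids making things worse.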
Degraded cloud
hardware/services - An
example
Workload Induced Degradation
Observation: Workload changes cause imbalance in the distribution of load and hence degrade latency.
Improvement:
● Kora includes a component called Self Balancing Cluster (SBC), which continuously rebalances the cluster.
● Rebalancing used to be heavyweight and slow: it computed all required changes up front and then applied them.
● Re-architected to be more real-time
One customer saw a ~25% reduction in their load when rebalancing was enabled.
Another customer saw a significant improvement in latency with rebalancing.
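To illustrate the difference between planning every move up front and rebalancing incrementally, the sketch below performs one small move at a time: it shifts a single partition from the most loaded broker to the least loaded one, and only if that narrows the gap between them. This is a generic greedy heuristic chosen for illustration, not SBC's algorithm; the data layout and function name are assumptions.

```python
from typing import Dict, Optional, Tuple


def rebalance_step(assignment: Dict[str, Dict[str, float]]) -> Optional[Tuple[str, str, str]]:
    """Move at most one partition in place; return (partition, src, dst) or None.

    assignment maps broker -> {partition: load}; load units are arbitrary.
    """
    loads = {b: sum(parts.values()) for b, parts in assignment.items()}
    src = max(loads, key=loads.get)
    dst = min(loads, key=loads.get)
    gap = loads[src] - loads[dst]
    # Only moves with 0 < load < gap strictly shrink the src/dst imbalance.
    candidates = {p: l for p, l in assignment[src].items() if 0 < l < gap}
    if not candidates:
        return None
    # Prefer the partition whose load is closest to half the gap.
    part = min(candidates, key=lambda p: abs(candidates[p] - gap / 2))
    assignment[dst][part] = assignment[src].pop(part)
    return part, src, dst


if __name__ == "__main__":
    cluster = {
        "b1": {"p0": 28.0, "p1": 20.0, "p2": 12.0},
        "b2": {"p3": 10.0},
        "b3": {"p4": 20.0},
    }
    while (move := rebalance_step(cluster)) is not None:
        print(move)
```

Because each step is cheap and re-reads the current load, the loop can react to workload changes between steps instead of executing a stale plan.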
Multi-Cloud Vagaries
Observation
● The same instance type had different CPU generations
● Throughput/IOPS scaled differently, e.g. GP2 vs. GP3
Improvement
● Abstractions enable us to continuously optimize
latencies as new hardware becomes available
● Tuning of IOPS/throughput per cloud provider
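As a concrete instance of "scaled differently", the sketch below compares baseline IOPS for AWS EBS gp2 and gp3 volumes. The constants reflect AWS's published figures as best understood here (gp2: 3 IOPS per GiB with a 100 floor and a 16,000 ceiling; gp3: a flat 3,000 baseline independent of size) and should be checked against current AWS documentation; the function names are illustrative.

```python
# Assumed constants from AWS EBS documentation; verify before relying on them.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP3_BASELINE_IOPS = 3_000


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline IOPS grow with volume size, within a floor and a ceiling."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp3_baseline_iops(size_gib: int) -> int:
    """gp3 baseline IOPS do not depend on volume size."""
    return GP3_BASELINE_IOPS


if __name__ == "__main__":
    for size in (10, 500, 2_000, 10_000):
        print(size, gp2_baseline_iops(size), gp3_baseline_iops(size))
```

With gp2, a small volume is also a slow volume; with gp3, performance and capacity are provisioned separately, so the same tuning does not carry over between the two.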
Abstractions enabled
latency improvements
while “flying the
plane”
Abstractions enabled continuous optimizations
● Switched from GP2 to GP3
● Moved 20,000+ instances to Graviton
● Moved between memory- and compute-optimized instance types seamlessly
Customer Example
A customer filed a support ticket asking why their “cluster load” had decreased significantly. They were able to downsize their cluster to save money!
Our Learnings through this multi-year journey
● Primitives vary widely across cloud providers - abstractions are required to provide a unified multi-cloud experience
● You cannot improve what you cannot observe
● Build with performance as a first-class citizen across the stack
● Steady state latency optimizations are necessary but not sufficient
● Cloud hardware/services degradation is real and frequent - resiliency needs to be built in for the cloud
Thank you! Questions?

More Related Content

Similar to Our Multi-Year Journey to a 10x Faster Confluent Cloud

EVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDBEVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
Scott Mansfield
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
Nicolas Brousse
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
Alexander Penev
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
LarryZaman
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
QAware GmbH
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Josef Adersberger
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
Swiss Data Forum Swiss Data Forum
 
Accelerated SDN in Azure
Accelerated SDN in AzureAccelerated SDN in Azure
Accelerated SDN in Azure
Open Networking Summit
 
Benchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutionsBenchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutions
Zhidong Yu
 
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph Community
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Continuent
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
Codemotion
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Monal Daxini
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Monal Daxini
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
sabnees
 
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick ParkerDevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
R3
 

Similar to Our Multi-Year Journey to a 10x Faster Confluent Cloud (20)

EVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDBEVCache: Lowering Costs for a Low Latency Cache with RocksDB
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
 
Accelerated SDN in Azure
Accelerated SDN in AzureAccelerated SDN in Azure
Accelerated SDN in Azure
 
Benchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutionsBenchmarking sahara based big data as a service solutions
Benchmarking sahara based big data as a service solutions
 
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera ClusterWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #2: Galera Cluster
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
 
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick ParkerDevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
 

More from HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Our Multi-Year Journey to a 10x Faster Confluent Cloud

  • 1. Our Multi-Year Performance Journey in Confluent Cloud Shriram Sridharan - Sr. Manager, Kafka Data Infrastructure Marc Selwan - Sr. Product Manager, Kafka Data Infrastructure
  • 2. Who are we? Database background working on storage and indexing. Building faster and cheaper Kafka in Confluent. Built relational databases. We are simply representing a team of incredibly talented and hard working engineers
  • 3. It’s not just Kafka in the cloud, in reality…
  • 4. There’s a ton that goes into running our cloud service NETWORK COMPUTE AZ AZ AZ Cells Cells Cells OBJECT STORAGE CUSTOMERS Multi-Cloud Networking & Routing Tier Metadata Durability Audits METRICS & OBSERVABILITY CONNECT PROCESSING GOVERNANCE Data Balancing Health Checks Real-time feedback data Other Confluent Cloud Services GLOBAL CONTROL PLANE
  • 5. Agenda ● Abstractions for a unified multi-cloud experience ● Latency SLO ● Monitoring ● Steady state latency ○ Workload patterns ○ Kora optimizations ○ Results ● Degraded state latency ● Takeaways
  • 6. Cloud deployments can be notoriously complex Each arrow represents an interaction that has an associated cost, performance, and throughput limit. NETWORK BROKERS AZ AZ AZ OBJECT STORAGE Multi-Cloud Networking & Routing Tier PRODUCERS SSDs SSDs SSDs CONSUMERS LOCAL STORAGE
  • 7. Cloud deployments can be notoriously complex Each arrow represents an interaction that has an associated cost, performance, and throughput limit. These aspects—cost, performance, and throughput limits—don’t always change proportionally for different hardware options. Available bw (Gbps) Instance name
  • 8. Cloud deployments can be notoriously complex Each arrow represents an interaction that has an associated cost, performance, and throughput limit. These aspects—cost, performance, and throughput limits—don’t always change proportionally for different hardware options. Pricing model, instance performance varies across cloud providers Many operators punt this complexity to customers.
  • 9. Abstractions for a unified multi-cloud experience Logical Kafka Cluster (LKC) as the unit of access control Confluent Kafka Unit (CKU) as unit of cluster capacity in terms of customer-visible metrics e.g. 50 MB/s ingress, 150MB/s egress bandwidth per CKU Cluster load exposes how loaded a cluster is and provides a signal when customers need to scale up/down NETWORK BROKERS AZ AZ AZ OBJECT STORAGE Networking & Routing PRODUCERS SSDs SSDs SSDs CONSUMERS LOCAL STORAGE LKC AZ AZ AZ PRODUCERS CONSUMERS 1. Broker instance type? 2. Number of brokers? 3. Block storage type? 4. Block storage throughput? 5. Block storage IOPS? 6. Associated kernel, filesystem, and Kafka knobs? 7. Which resource is bottlenecked? 1. How many CKUs do I need? 2. Is my cluster overloaded wrt my latency requirements? VS
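The CKU question on slide 9 ("How many CKUs do I need?") can be illustrated with a small sizing sketch. It assumes the example per-CKU figures quoted on the slide (50 MB/s ingress, 150 MB/s egress) and considers only those two dimensions; real CKU limits cover more dimensions and may differ, and the function name `ckus_needed` is hypothetical.

```python
import math

# Example per-CKU limits quoted on the slide; actual limits may differ.
INGRESS_MBPS_PER_CKU = 50
EGRESS_MBPS_PER_CKU = 150


def ckus_needed(ingress_mbps: float, egress_mbps: float) -> int:
    """Return the smallest CKU count that covers both bandwidth dimensions."""
    by_ingress = math.ceil(ingress_mbps / INGRESS_MBPS_PER_CKU)
    by_egress = math.ceil(egress_mbps / EGRESS_MBPS_PER_CKU)
    # A cluster needs at least one CKU, and the tighter dimension wins.
    return max(1, by_ingress, by_egress)


# 120 MB/s ingress needs 3 CKUs; 300 MB/s egress needs 2; ingress is the bottleneck.
print(ckus_needed(120, 300))
```

The point of the abstraction is that the customer reasons about one number, while instance type, broker count, and storage settings stay behind it.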
  • 10. • Managing 30K+ clusters • Adapt and accommodate various workload profiles • Adding new features • Run auxiliary software needed to run our services • Handle cloud provider variability • Operated by machines not people Ensure consistent performance behind the abstractions while… NETWORK BROKERS AZ AZ AZ OBJECT STORAGE Multi-Cloud Networking & Routing Tier Metadata Durability Audits Data Balancing Health Checks Real-time feedback data PRODUCERS SSDs SSDs SSDs CONSUMERS LOCAL STORAGE End to End Latency The Challenge
  • 11. Agenda ● Abstractions for a unified multi-cloud experience ● Latency SLO ● Monitoring ● Steady state latency ○ Workload patterns ○ Kora optimizations ○ Results ● Degraded state latency ● Takeaways
  • 12. Factors that determine the Latency SLO for Kafka ● Aggregate Broker/Cluster level ? ● Aggregate per week/day/hour/min ? ● E2E or Produce latencies ? ● p50/avg/p95/p99/p9999 ? Challenges with running a cloud service ● No client visibility (KIP-714) ● Each customer has their own usage pattern and expectation from the service Defining a Latency SLO - Challenges What doesn’t get measured, doesn’t get improved
  • 13. ● External health check probes every broker ● Measure E2E (Produce + Consume) ● Aggregate max per broker per min ● Monitor p99 over a week per cluster Defining a Latency SLO - First Attempt
  • 14. ● Did NOT capture latency anomalies during degradations ● Up to 100 minutes of degraded latency (about 1% of the 10,080 minutes in a week) could occur while the cluster still met the weekly p99 SLO Defining a Latency SLO - Issues
  • 15. Added a new SLO ● External health check probes every broker ● Measure E2E (Produce + Consume) ● Aggregate max per broker per min ● Monitor p99 per window (in mins) per cluster Latency SLOs ● Steady State Latency SLO ● Degraded State Latency SLO Defining a Latency SLO - Current State
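A minimal sketch of why the weekly p99 hid degradations while a per-window p99 catches them. It assumes one sample per minute (the per-broker max described above), a nearest-rank percentile, and a 60-minute window; the window length and the latency values are illustrative assumptions, not figures from the talk.

```python
import math


def p99(values):
    """Nearest-rank 99th percentile of a list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


MINUTES_PER_WEEK = 7 * 24 * 60  # 10080
WINDOW_MINUTES = 60             # assumed window length
SLO_MS = 100                    # assumed latency target

# One value per minute: max E2E latency (ms) reported by the health check.
samples = [20] * MINUTES_PER_WEEK
for minute in range(3000, 3100):  # 100 consecutive degraded minutes
    samples[minute] = 500

weekly_p99 = p99(samples)

windows = [samples[i:i + WINDOW_MINUTES]
           for i in range(0, MINUTES_PER_WEEK, WINDOW_MINUTES)]
window_p99s = [p99(w) for w in windows]
worst_window_p99 = max(window_p99s)
breaching_windows = sum(1 for v in window_p99s if v > SLO_MS)

print(weekly_p99, worst_window_p99, breaching_windows)
```

With these inputs the weekly p99 stays at the healthy value because the 100 degraded minutes fit inside the top 1% of the week, while the windows overlapping the incident breach the target.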
  • 16. Agenda ● Abstractions for a unified multi-cloud experience ● Latency SLO ● Monitoring ● Steady state latency ○ Workload patterns ○ Kora optimizations ○ Results ● Degraded state latency ● Takeaways
  • 17. Monitoring Infrastructure ● Per cluster monitoring ● Alerts ● Operated by machines ● Nightly regression tests NETWORK BROKERS AZ AZ AZ OBJECT STORAGE Multi-Cloud Networking & Routing Tier Health check producer SSDs SSDs SSDs Health check consumer LOCAL STORAGE End to End Latency Internal Latency HC agent produce to consume measuring E2E latency
  • 18. Agenda ● Abstractions for a unified multi-cloud experience ● Latency SLO ● Monitoring ● Steady state latency ○ Workload patterns ○ Kora optimizations ○ Results ● Degraded state latency ● Takeaways
  • 19. Steady State Latency - Workload Patterns Partitions 100s - 100s of thousands Fanout 1:1 - 1:30 Throughput 10MB - 20GB Clients 10s - 10s of thousands Additional Variables Connection Rate, Requests per sec, Keyed vs non-keyed
  • 20. Workloads Proof of Concept Benchmark Tracing What we got right ● Built distributed tracing ● Encode workloads into Open Messaging Benchmark ● Fail fast with dirty proof of concepts Steady State Latency - Peeling the Onion
  • 21. What took us some time to figure out ● Hyper-focused on Confluent Kafka ● Kora services had significant impact Split the investigation into ● Confluent Kafka runtime running on bare EC2 ● Kora Services Workloads Proof of Concept Benchmark Tracing Steady State Latency - Peeling the Onion
  • 22. Agenda ● Abstractions for a unified multi-cloud experience ● Latency SLO ● Monitoring ● Steady state latency ○ Workload patterns ○ Kora optimizations ○ Results ● Degraded state latency ○ Cloud Infrastructure degradation ○ Workload degradation ● Takeaways
  • 23. Kora Optimizations - Kafka Specific Optimizations Disclaimer: YMMV depending on hardware/workload/configs
  • 24. Replication Optimizations Observation ● Replication layer had a lot of CPU overhead and inefficient allocation patterns ● Predominantly visible in workloads with a lot of partitions Improvement ● Kora has a completely rewritten, more efficient replication protocol, shipping in the next few weeks, aimed at minimizing CPU usage and allocations.
  • 25. Network Optimizations Observation ● E2E latency much higher than broker-side latencies. ● Predominantly visible with fewer clients. Improvement ● Kora has increased parallelism in Kafka. ● Increases CPU consumption on the broker side but provides better overall E2E latency
  • 26. Storage Optimizations Observation : Background operations interfering with foreground real-time operations. Improvement ● Tiered Storage (Compute Storage Separation) ● Catchup consumption happens from the object storage instead of local storage ● Heavily tuned filesystem and page cache parameters
  • 27. Incremental Improvements An infinite number of infinitesimal improvements Observation: Death by a thousand cuts Improvement ● Minimize work per request ● Move work out of the critical path ● Tuning (GC, kernel parameters)
  • 28. Kora Optimizations - Kora Specific Optimizations
  • 29. Improvements to other Kora Services Observation ● Some of these services are bin-packed with Kafka and thus share the same hardware resources. Improvement ● Impact minimized by either ○ Complete re-architecture ○ Enforced QoS ● Monitoring agent - Example Learning: Build with performance as a first-class citizen
  • 30. Agenda ● Abstractions for a unified multi-cloud experience ● Latency SLO ● Monitoring ● Steady state latency ○ Workload patterns ○ Kora optimizations ○ Results ● Degraded state latency ○ Cloud Infrastructure degradation ○ Workload degradation ○ Multi-cloud vagaries ● Takeaways
  • 31. Up to 10X Faster Results - Confluent Cloud vs Apache Kafka *Blog post with benchmark details/numbers expected in the next few weeks. Also, as more improvements come in, these numbers will change
  • 32. Up to 10X Faster Results - Confluent Cloud vs Apache Kafka *Blog post with benchmark details/numbers expected in the next few weeks. Also, as more improvements come in, these numbers will change
  • 33. Agenda ● Abstractions for a unified multi-cloud experience ● Latency SLO ● Monitoring ● Steady state latency ○ Workload patterns ○ Kora optimizations ○ Results ● Degraded state latency ● Takeaways
  • 34. Degraded State Latency Recap of the Degraded State Latency SLO ● p99 E2E latency per window (in mins) per cluster Investigations broadly reveal the following issues ● Degraded cloud hardware/services ● Workload induced degradation (imbalance in distribution) ● Multi-cloud vagaries
  • 35. Degraded cloud hardware/services Observation: Degradation in the cloud is real! Over a recent one-week interval we observed ● A few incidents with complete block storage outages ● 10s of incidents with external connectivity loss ● 100s of incidents of storage and network degradation Improvement ● Built proprietary APIs to transfer leadership to a non-degraded broker (or AZ) ● No compromise on durability and availability guarantees for predictable performance
  • 36. Degraded cloud hardware/services - Automation Monitor Aggregate Mitigate Improvement : Proprietary APIs enable automated mitigations ● Monitor, Aggregate and Mitigation pipeline ● Has triggered > 500 times in the last 30 days!!
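The monitor, aggregate, and mitigate stages on slide 36 can be sketched as a small loop. This is only an illustration of the pipeline shape: the real mitigation uses proprietary APIs that the slides do not describe, so the `Mitigator` class, its thresholds, and the leader-reassignment rule are all hypothetical. A real system would also move leadership only to an in-sync replica.

```python
from collections import defaultdict, deque

WINDOW = 5            # hypothetical: recent samples kept per broker
THRESHOLD_MS = 100.0  # hypothetical: latency above this counts as degraded
MIN_BAD = 3           # hypothetical: bad samples required before acting


class Mitigator:
    def __init__(self):
        self.samples = defaultdict(lambda: deque(maxlen=WINDOW))

    def monitor(self, broker, latency_ms):
        """Record one health-check latency sample for a broker."""
        self.samples[broker].append(latency_ms)

    def aggregate(self):
        """Return the set of brokers with enough bad recent samples."""
        return {
            broker
            for broker, recent in self.samples.items()
            if sum(1 for x in recent if x > THRESHOLD_MS) >= MIN_BAD
        }

    def mitigate(self, leaders):
        """Return a partition -> leader map with leadership moved off degraded brokers."""
        degraded = self.aggregate()
        healthy = sorted(b for b in self.samples if b not in degraded)
        moved = dict(leaders)
        if not healthy:
            return moved  # nowhere to move leadership; leave the map unchanged
        for i, (partition, broker) in enumerate(sorted(leaders.items())):
            if broker in degraded:
                moved[partition] = healthy[i % len(healthy)]
        return moved


m = Mitigator()
for latency in (20, 250, 300, 400):
    m.monitor("broker-1", latency)
for latency in (15, 18, 22, 17):
    m.monitor("broker-2", latency)

new_leaders = m.mitigate({"orders-0": "broker-1", "orders-1": "broker-2"})
print(new_leaders)
```

Requiring several bad samples before acting is one way to avoid moving leadership on a single noisy measurement.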
  • 38. Workload Induced Degradation Observation: Workload changes cause an imbalance in the distribution of load, which degrades latency. Improvement: ● Kora includes a component called Self Balancing Cluster (SBC), which continuously rebalances the cluster. ● Rebalancing used to be heavyweight and slow because it computed all required changes up front before applying them. ● Re-architected to be more real time
  • 39. One customer saw ~25% reduction in their load when rebalancing was enabled. Another customer saw significant improvement in latency with rebalancing. Workload Induced Degradation
  • 40. Multi-Cloud Vagaries Observation ● The same instance type had different CPU generations ● Throughput/IOPS scaled differently, e.g. GP2 vs GP3. Improvement ● Abstractions enable us to continuously optimize latencies as new hardware becomes available ● Tuning of IOPS/throughput per cloud provider
  • 41. Abstractions enabled latency improvements while “flying the plane” Abstractions enabled continuous optimizations ● Switched from GP2 to GP3 ● Moved 20000+ instances to Graviton ● Moved between memory and compute optimized instance types seamlessly A customer filed a support ticket asking why their “cluster load” had decreased significantly. They were able to downsize their cluster to save money! Customer Example
  • 42. Our Learnings through this multi-year journey ● Primitives vary widely across cloud providers - Abstractions are required to provide a unified multi-cloud experience ● You cannot improve what you cannot observe ● Build with performance as a first-class citizen across the stack ● Steady state latency optimizations are necessary but not sufficient ● Cloud hardware/services degradation is real & frequent - Resiliency needs to be built in for the cloud