Our Multi-Year Journey to a 10x Faster Confluent Cloud

Our Multi-Year Performance Journey in Confluent Cloud
Shriram Sridharan - Sr. Manager, Kafka Data Infrastructure
Marc Selwan - Sr. Product Manager, Kafka Data Infrastructure

Who are we?
Database background working on
storage and indexing.
Building faster and cheaper Kafka in
Confluent. Built relational databases.
We are simply representing a team of incredibly talented and hard-working engineers.

It’s not just Kafka in the cloud, in reality…

There’s a ton that goes into running our cloud service
[Diagram labels: Customers; Multi-Cloud Networking & Routing Tier; Network; Compute (cells across three AZs); Object Storage; Metadata; Durability Audits; Data Balancing; Health Checks; Metrics & Observability; real-time feedback data; Connect; Processing; Governance; Other Confluent Cloud Services; Global Control Plane]

Agenda
● Abstractions for a unified multi-cloud experience
● Latency SLO
● Monitoring
● Steady state latency
○ Workload patterns
○ Kora optimizations
○ Results
● Degraded state latency
● Takeaways

Cloud deployments can be notoriously complex
Each arrow represents an interaction that has an associated cost, performance, and throughput limit.
[Diagram labels: Producers; Consumers; Network; Brokers across three AZs; Local Storage (SSDs); Object Storage]

These aspects (cost, performance, and throughput limits) don’t always change proportionally for different hardware options.
[Chart: available bandwidth (Gbps) by instance name]

Pricing models and instance performance vary across cloud providers.
Many operators punt this complexity to customers.

Abstractions for a unified multi-cloud experience
● Logical Kafka Cluster (LKC) as the unit of access control
● Confluent Kafka Unit (CKU) as the unit of cluster capacity in terms of customer-visible metrics, e.g. 50 MB/s ingress and 150 MB/s egress bandwidth per CKU
● Cluster load exposes how loaded a cluster is and provides a signal when customers need to scale up or down
[Diagram labels: Producers; Consumers; Networking & Routing; Network; Brokers across three AZs; Local Storage (SSDs); Object Storage; LKC]
Questions without the abstractions
1. Broker instance type?
2. Number of brokers?
3. Block storage type?
4. Block storage throughput?
5. Block storage IOPS?
6. Associated kernel, filesystem, and Kafka knobs?
7. Which resource is bottlenecked?
VS
Questions with the abstractions
1. How many CKUs do I need?
2. Is my cluster overloaded with respect to my latency requirements?
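The first CKU question reduces to simple arithmetic once capacity is expressed in customer-visible units. A minimal sketch, assuming only the two per-CKU bandwidth figures quoted on the slide; the function name is hypothetical, and a real sizing exercise would also consider other dimensions such as partitions and connections:

```python
import math

# Per-CKU bandwidth figures quoted on the slide. Treat them as
# illustrative; actual limits depend on the cluster type.
INGRESS_MBPS_PER_CKU = 50
EGRESS_MBPS_PER_CKU = 150


def ckus_needed(ingress_mbps: float, egress_mbps: float) -> int:
    """Smallest CKU count that covers both bandwidth dimensions."""
    by_ingress = math.ceil(ingress_mbps / INGRESS_MBPS_PER_CKU)
    by_egress = math.ceil(egress_mbps / EGRESS_MBPS_PER_CKU)
    # A cluster always has at least one CKU.
    return max(1, by_ingress, by_egress)


print(ckus_needed(120, 200))  # ingress-bound workload
print(ckus_needed(40, 400))   # egress-bound workload
```

Whichever dimension needs more CKUs decides the size, which is why a high-fanout workload can be egress-bound even with modest ingress.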

The Challenge
Ensure consistent performance behind the abstractions while…
• Managing 30K+ clusters
• Adapting to and accommodating various workload profiles
• Adding new features
• Running auxiliary software needed to run our services
• Handling cloud provider variability
• Being operated by machines, not people
[Diagram labels: Producers; Consumers; End to End Latency; Network; Brokers across three AZs; Local Storage (SSDs); Object Storage; Metadata; Durability Audits; Data Balancing; Health Checks; real-time feedback data]

Defining a Latency SLO - Challenges
What doesn’t get measured, doesn’t get improved
Factors that determine the Latency SLO for Kafka
● Aggregate at broker or cluster level?
● Aggregate per week/day/hour/min?
● E2E or Produce latencies?
● p50/avg/p95/p99/p9999?
Challenges with running a cloud service
● No client visibility (KIP-714)
● Each customer has their own usage pattern and expectation from the service

Defining a Latency SLO - First Attempt
● External health check probes every broker
● Measure E2E (Produce + Consume)
● Aggregate max per broker per min
● Monitor p99 over a week per cluster

Defining a Latency SLO - Issues
● Did NOT capture latency anomalies during degradations
● Up to 100 mins of degraded latency could occur while the cluster stayed under the SLO
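The "100 mins" figure follows from the percentile itself: a week holds 10,080 one-minute samples, and a p99 ignores the worst 1% of them. A small sketch of that arithmetic (the helper name is illustrative, not from the talk):

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 one-minute samples per cluster


def minutes_hidden_by_percentile(percentile: float, window_minutes: int) -> float:
    """Minutes in the window that can exceed the target without moving the percentile."""
    return window_minutes * (100 - percentile) / 100


hidden = minutes_hidden_by_percentile(99, MINUTES_PER_WEEK)
print(f"{hidden:.1f} minutes per week")  # roughly 100 minutes
```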

Defining a Latency SLO - Current State
Added a new SLO
● External health check probes every broker
● Measure E2E (Produce + Consume)
● Aggregate max per broker per min
● Monitor p99 per window (in mins) per cluster
Latency SLOs
● Steady State Latency SLO
● Degraded State Latency SLO
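The difference between a weekly p99 and a windowed p99 can be sketched with synthetic probe data. Everything below is an assumption for illustration: the 60-minute window, the latency values, the broker names, and the nearest-rank percentile are not taken from the talk, which only says the window is measured in minutes.

```python
import math
from typing import Dict, List


def per_minute_max(probes: Dict[str, List[float]]) -> List[float]:
    """probes maps broker id -> one E2E latency (ms) per minute; keep the worst broker per minute."""
    return [max(values) for values in zip(*probes.values())]


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def windowed_p99(series: List[float], window_minutes: int) -> List[float]:
    """p99 of each consecutive, non-overlapping window."""
    return [p99(series[i:i + window_minutes])
            for i in range(0, len(series), window_minutes)]


# One synthetic week: two brokers at 20 ms, one of them degraded to
# 500 ms for a single hour (minutes 1200-1259).
healthy = [20.0] * (7 * 24 * 60)
degraded = list(healthy)
for minute in range(1200, 1260):
    degraded[minute] = 500.0

series = per_minute_max({"broker-0": healthy, "broker-1": degraded})
weekly = p99(series)
windows = windowed_p99(series, 60)
breaches = [i for i, value in enumerate(windows) if value > 100.0]

print("weekly p99 (ms):", weekly)
print("windows over 100 ms:", breaches)
```

In this synthetic case the weekly p99 stays at the healthy baseline while exactly one hourly window reports the degradation, which is the gap the second SLO is meant to close.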

Monitoring Infrastructure
● Per cluster monitoring
● Alerts
● Operated by machines
● Nightly regression tests
[Diagram labels: Health check producer; Health check consumer; End to End Latency; Internal Latency; Network; Brokers across three AZs; Local Storage (SSDs); Object Storage]
Health check (HC) agents produce and consume, measuring E2E latency

Steady State Latency - Workload Patterns
Partitions            100s to 100s of thousands
Fanout                1:1 to 1:30
Throughput            10MB to 20GB
Clients               10s to 10s of thousands
Additional variables  Connection rate, requests per sec, keyed vs non-keyed

Steady State Latency - Peeling the Onion
[Diagram labels: Workloads; Proof of Concept; Benchmark; Tracing]
What we got right
● Built distributed tracing
● Encoded workloads into the OpenMessaging Benchmark
● Failed fast with quick-and-dirty proofs of concept

Steady State Latency - Peeling the Onion
[Diagram labels: Workloads; Proof of Concept; Benchmark; Tracing]
What took us some time to figure out
● We were hyper-focused on Confluent Kafka
● Kora services had a significant impact
Split the investigation into
● Confluent Kafka runtime running on bare EC2
● Kora Services

Agenda
● Latency SLO
● Monitoring
○ Results
○ Cloud Infrastructure degradation
○ Workload degradation
● Takeaways

Kora Optimizations - Kafka Specific Optimizations
Disclaimer: YMMV depending on hardware/workload/configs

Replication Optimizations
Observation
● Replication layer had a lot of CPU overhead and inefficient allocation patterns
● Predominantly visible in workloads with a lot of partitions
Improvement
● Kora has a completely rewritten, efficient replication protocol shipping in the next few weeks, aimed at minimizing CPU usage and allocations.

Network Optimizations
Observation
● E2E latency was much higher than broker-side latencies.
● Predominantly visible with a small number of clients.
Improvement
● Kora has increased parallelism in Kafka.
● This increases CPU consumption on the broker side but provides better overall E2E latency

Storage Optimizations
Observation: Background operations interfering with foreground real-time operations.
Improvement
● Tiered Storage (Compute Storage Separation)
● Catchup consumption happens from the object storage instead of local storage
● Heavily tuned filesystem and page cache parameters
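One way to picture how catch-up reads stay out of the way of real-time traffic is to route each fetch by offset. This is only a sketch under assumed names (`PartitionLog`, `read_source`, and the offset fields are hypothetical), not Kora's implementation:

```python
from dataclasses import dataclass


@dataclass
class PartitionLog:
    log_start_offset: int    # oldest offset retained anywhere
    local_start_offset: int  # oldest offset still on local disk
    log_end_offset: int      # next offset to be written


def read_source(log: PartitionLog, fetch_offset: int) -> str:
    """Decide which tier serves a fetch starting at fetch_offset."""
    if not log.log_start_offset <= fetch_offset <= log.log_end_offset:
        raise ValueError("offset out of range")
    # Recent data is served locally; older data comes from object storage,
    # so catch-up consumers do not compete for local disk and page cache.
    if fetch_offset >= log.local_start_offset:
        return "local"
    return "object_storage"


log = PartitionLog(log_start_offset=0, local_start_offset=900, log_end_offset=1000)
print(read_source(log, 990))  # tail consumer
print(read_source(log, 100))  # catch-up consumer
```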

Incremental Improvements
An infinite number of infinitesimal improvements
Observation: Death by a thousand cuts
Improvement
● Minimize work per request
● Move work out of the critical path
● Tuning (GC, Kernel parameters)

Kora Optimizations - Kora Specific Optimizations

Improvements to other Kora Services
Observation
● Some of these services are bin-packed with Kafka and therefore use the same hardware resources.
Improvement
● Impact minimized by either
○ Complete re-architecture
○ Enforced QoS
● Example: monitoring agent
Learning: Build with performance as a first-class citizen

Agenda
● Latency SLO
● Monitoring
○ Results
○ Cloud Infrastructure degradation
○ Workload degradation
○ Multi-cloud vagaries
● Takeaways

Results - Confluent Cloud vs Apache Kafka
Up to 10X Faster
*Blog post with benchmark details and numbers expected in the next few weeks. Also, as more improvements come in, these numbers will change.

Degraded State Latency
Recap of the Degraded State Latency SLO
● p99 E2E latency per window (in mins) per cluster
Investigations broadly reveal the following issues
● Degraded cloud hardware/services
● Workload induced degradation (imbalance in distribution)
● Multi-cloud vagaries

Degraded cloud hardware/services
Observation: Degradation in the cloud is real!
Over a recent one-week interval we observed
● A few incidents with complete block storage outages
● 10s of incidents with external connectivity loss
● 100s of incidents of storage and network degradation
Improvement
● Built proprietary APIs to transfer leadership to a non-degraded broker (or AZ)
● No compromise on durability and availability guarantees for predictable performance

Degraded cloud hardware/services - Automation
[Diagram labels: Monitor; Aggregate; Mitigate]
Improvement: Proprietary APIs enable automated mitigations
● Monitor, Aggregate, and Mitigate pipeline
● Has triggered more than 500 times in the last 30 days
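The three pipeline stages can be sketched in a few lines. This is a simplified illustration, not the proprietary APIs mentioned above: the median aggregation, the fixed threshold, and the function names are all assumptions.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, float]  # (broker id, E2E latency in ms) from monitoring


def aggregate(samples: Iterable[Sample]) -> Dict[str, float]:
    """Median latency per broker over the monitoring interval."""
    by_broker: Dict[str, List[float]] = defaultdict(list)
    for broker, latency_ms in samples:
        by_broker[broker].append(latency_ms)
    return {broker: median(values) for broker, values in by_broker.items()}


def plan_mitigation(per_broker: Dict[str, float], threshold_ms: float) -> Dict[str, str]:
    """Map each degraded broker to the healthiest broker that should take over leadership."""
    healthy = {b: ms for b, ms in per_broker.items() if ms <= threshold_ms}
    if not healthy:
        return {}  # nowhere safe to move leadership; leave placement alone
    target = min(healthy, key=healthy.get)
    return {b: target for b, ms in per_broker.items() if ms > threshold_ms}


samples = [("b0", 12.0), ("b0", 14.0), ("b1", 300.0),
           ("b1", 280.0), ("b2", 10.0), ("b2", 11.0)]
per_broker = aggregate(samples)
plan = plan_mitigation(per_broker, threshold_ms=100.0)
print(plan)
```

Aggregating before acting keeps a single slow probe from triggering a leadership move; a production pipeline would also need hysteresis and a path for moving leadership back once the broker recovers.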

Degraded cloud hardware/services - An example

Workload Induced Degradation
Observation: Workload changes cause imbalance in the distribution of load and hence degrade latency.
Improvement:
● Kora includes a component called Self Balancing Cluster (SBC) which continuously rebalances the cluster.
● Rebalancing was heavyweight and slow because it computed all required changes up front before making them.
● Re-architected to be more real time
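The contrast between an up-front plan and a more real-time approach can be illustrated with a greedy loop that picks one partition move at a time from current load. This is a toy sketch and not how SBC works internally; the load numbers, names, and tolerance are made up.

```python
from typing import Dict, List, Optional, Tuple

Move = Tuple[str, str, str]  # (partition, from broker, to broker)
Load = Dict[str, Dict[str, float]]  # broker -> partition -> load


def next_move(load: Load, tolerance: float) -> Optional[Move]:
    """Pick one partition move that narrows the gap between the busiest and idlest broker."""
    totals = {broker: sum(parts.values()) for broker, parts in load.items()}
    hot = max(totals, key=totals.get)
    cold = min(totals, key=totals.get)
    gap = totals[hot] - totals[cold]
    if gap <= tolerance:
        return None
    # Only a partition lighter than the gap reduces the imbalance when moved.
    candidates = {p: l for p, l in load[hot].items() if 0 < l < gap}
    if not candidates:
        return None
    # Moving a partition of about half the gap evens the two brokers out best.
    partition = min(candidates, key=lambda p: abs(candidates[p] - gap / 2))
    return partition, hot, cold


def rebalance(load: Load, tolerance: float, max_moves: int = 100) -> List[Move]:
    """Apply moves one at a time, re-reading the load after each move."""
    moves: List[Move] = []
    while len(moves) < max_moves:
        move = next_move(load, tolerance)
        if move is None:
            break
        partition, hot, cold = move
        load[cold][partition] = load[hot].pop(partition)
        moves.append(move)
    return moves


load: Load = {
    "b0": {"t-0": 40.0, "t-1": 30.0, "t-2": 10.0},
    "b1": {"t-3": 10.0},
    "b2": {"t-4": 20.0},
}
moves = rebalance(load, tolerance=10.0)
totals = {broker: sum(parts.values()) for broker, parts in load.items()}
print(moves)
print(totals)
```

Because each step only looks at the current state, the loop can stop or change direction as soon as the workload shifts, instead of finishing a plan computed against stale load.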

Workload Induced Degradation
One customer saw ~25% reduction in their load when rebalancing was enabled.
Another customer saw significant improvement in latency with rebalancing.

Multi-Cloud Vagaries
Observation
● The same instance type had different CPU generations
● Throughput and IOPS scaled differently, e.g. GP2 vs GP3.
Improvement
● Abstractions enable us to continuously optimize latencies as new hardware becomes available
● Tuning of IOPS/throughput per cloud provider

Abstractions enabled latency improvements while “flying the plane”
Abstractions enabled continuous optimizations
● Switched from GP2 to GP3
● Moved 20000+ instances to Graviton
● Moved between memory and compute optimized instance types seamlessly
Customer Example
A customer filed a support ticket asking why their “cluster load” had decreased significantly. They were able to downsize their cluster to save money!

Our Learnings through this multi-year journey
● Primitives vary widely across cloud providers - Abstractions are required to provide a unified multi-cloud experience
● You cannot improve what you cannot observe
● Build with performance as a first-class citizen across the stack
● Steady state latency optimizations are necessary but not sufficient
● Cloud hardware/services degradation is real & frequent - Resiliency needs to be built in for the cloud
