This document discusses autoscaling Confluent Cloud and provides guidance on implementing it. Key points:
- Autoscaling can lower costs, improve resource utilization, and reduce operations overhead; however, it does not replace capacity planning.
- When autoscaling Kafka, the entire data pipeline needs to be considered, including brokers, connectors, streams applications, and clients. Bottlenecks can occur in many parts of the system.
- Different use cases have different autoscaling needs, such as predictable spikes for major/minor events vs. unpredictable spikes. Guidelines are provided for each.
- A complete capacity and maintenance plan is needed, involving benchmarking of all components, monitoring, and automated scaling scripts. Autoscaling alone is not sufficient.
3. capacity: the ability to hold or contain

[Diagram: a node's CPU, memory, and storage resources hosting apps]

● scaling up: adding additional resources to provisioned nodes
● scaling out: adding additional nodes with provisioned capacity
4. auto scaling

[Diagram: a container stack (hardware, host OS, container engine, bins & libs, apps) grouped into an autoscaling group]

● increase or decrease resources on demand
● policy based on load across a fleet of servers
● cool down period: time after a scaling event when further scaling is on hold
● CSPs have native autoscaling approaches (e.g., autoscaling groups)
7. [Diagram: the end-to-end pipeline, consisting of a Kafka cluster (brokers), Kafka Connect (workers), ksqlDB (servers), and producer/consumer apps]
8. … but I'm running on Confluent Cloud

[Diagram: Kora, the engine behind Confluent Cloud: a multi-cloud networking & routing tier in front of network, compute (cells spread across AZs), and object storage layers; a global control plane handling metadata, data balancing, durability audits, and health checks with real-time feedback data; plus metrics & observability and other Confluent Cloud services (Connect, processing, governance)]

Learn more about Kora
10. Cluster Capacity is often a Ceiling, not a Floor

[Diagram: a Kafka cluster running at 100/300 MBps throughput with 6 ms p99 latency; the Kafka Connect, consumer app, and ksqlDB stages downstream still see decreased throughput and poor query performance]
11. Bottleneck 1: Number of Partitions

LIMITING FACTOR
throughput is constrained by the number of partitions on your topic, regardless of cluster size

BENCHMARK IT
measure the throughput on a single production partition

SIZE IT
partitions = max(t/p, t/c), where:
t = target throughput
p = measured producer throughput on 1 partition
c = measured consumer throughput on 1 partition

METRICS
received_bytes
sent_bytes

[AUTO]SCALE IT?
generally, no: provision based on the higher-scale workload
beware: ordering guarantees, no downsizing

Your partitioning strategy matters too!
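A minimal sketch of that sizing rule in Python (the function name and example numbers are illustrative, not from the deck):

```python
import math

def required_partitions(target_mbps: float,
                        producer_mbps_per_partition: float,
                        consumer_mbps_per_partition: float) -> int:
    # Partitions needed so neither the produce nor the consume side
    # bottlenecks: max(t/p, t/c), rounded up to a whole partition.
    return math.ceil(max(target_mbps / producer_mbps_per_partition,
                         target_mbps / consumer_mbps_per_partition))

# e.g. a 150 MBps target with one partition benchmarked at
# 15 MBps produce and 20 MBps consume -> 10 partitions
print(required_partitions(150, 15, 20))  # 10
```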
12. Bottleneck 2: Producer/Consumer Configs

LIMITING FACTOR
client configurations

BENCHMARK IT
baseline with the Kafka perf tools (kafka-producer-perf-test, kafka-consumer-perf-test)
baseline your app's throughput

REMEDIATE IT
use benchmarks to guide the remediation
alter configs based on service goals
re-test

METRICS
depends on service goal (throughput, lag, etc.)

[AUTO]SCALE IT?
if clients are tuned correctly and the cluster is overutilized, a scaling event could be required

Configs should be aligned to your service goals!
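As an illustration, a throughput-oriented producer config sketch using the confluent-kafka Python client; the values are assumptions to validate against your own benchmarks, not recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",  # placeholder
    "linger.ms": 20,            # wait briefly to fill larger batches
    "batch.size": 131072,       # 128 KiB batches for better throughput
    "compression.type": "lz4",  # trade a little CPU for network savings
    "acks": "all",              # durability-first; relax only if goals allow
})
```

A latency-oriented service goal would push these the other way (e.g., `linger.ms` near 0), which is why the benchmark, tune, re-test loop matters.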
13. Bottleneck 3: Client Connections

LIMITING FACTOR
CPU [connection attempts] & memory [requests]
denied connection attempts
added latency

BENCHMARK IT
measure total connections, connection attempts, and requests

REMEDIATION
longer-lived connections for fewer attempts
evaluate your app architecture & configs
audit logs for rogue clients

METRICS
active_connection_count
request_count

[AUTO]SCALE IT?
scaling might not solve your problem: address client configs and/or logic first

Beware rogue clients: an unsuccessful connection attempt still costs the cluster a connection attempt. Clients open a connection to the leader partition after authentication.
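A sketch of pulling those metrics from the Confluent Cloud Metrics API (the endpoint and metric name come from that API; the cluster ID, interval, and credentials are placeholders):

```python
import requests

resp = requests.post(
    "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query",
    auth=("<API_KEY>", "<API_SECRET>"),
    json={
        "aggregations": [
            {"metric": "io.confluent.kafka.server/active_connection_count"}
        ],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": "lkc-xxxxx"},
        "granularity": "PT1M",
        "intervals": ["2024-01-01T00:00:00Z/PT1H"],
    },
    timeout=30,
)
resp.raise_for_status()
for point in resp.json()["data"]:  # one datapoint per minute
    print(point["timestamp"], point["value"])
```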
14. Additional Bottlenecks to Explore

● partition imbalance: uneven load distribution across brokers
● consumer parallelism: scale instances up to the # of partitions
● connector throughput: limited by the # of tasks; use capacity-based scaling
● stream processing throughput: relies on topology optimization & parallelism
16. Scaling Needs are Driven by Use Case

Predictable Spikes for Major Events
These scaling activities are based on predictable changes in capacity requirements for your application.
• Black Friday
• Super Bowl Sunday
• Campaign-driven
• Product release
These predictable spikes could also be generated by ML models monitoring activity and predicting an increase in demand.

Predictable Spikes for Minor Events
These scaling activities are based on predictable changes in capacity that happen on a regular basis.
• Nightly batch jobs
• Daily demand spikes
These predictable spikes may or may not require additional capacity. Their predictable and frequent nature makes these use cases a good fit for automated scaling OR overprovisioning.

Unpredictable Spikes
These scaling activities are based on unforeseen changes in capacity requirements.
• Viral social media trend
• Weather event
• Unknown supply chain issue
These unpredictable spikes can cause teams to scramble to avoid downtime.
17. Predictable Spikes for Major Events

Predictable, infrequent changes in capacity requirements

Guidelines
• Benchmark your cluster & resources: connectors, clients, and stream processing apps. Scale test prior to the event.
• Ensure proper client configurations
• Scale out proactively, based on expected throughputs, client connections, etc.
• Set up additional monitoring during the flex period

[Diagram: retail example: store POS systems, an inventory database, and a distribution center feeding web & mobile and cloud apps (including kStreams apps), with downstream systems such as Snowflake, Google BigQuery, and Google Cloud Storage]
18. Predictable Spikes for Minor Events

Predictable, frequent changes in capacity requirements

Guidelines
• Benchmark your applications, cluster & resources to understand the amount of traffic they can handle
• Provision your Kafka cluster to meet the highest demand point (app instances can be scaled in place)
• Choose the # of partitions based on max expected throughput
19. Unpredictable Spikes

Unforeseen changes in capacity requirements

Guidelines
• Autoscaling should be a backup plan, not a one-stop shop
• Monitor, alert, benchmark (alert on & remediate proactively)
• Alert before you hit your autoscaling threshold (which should be higher)
• Thresholds depend on how tolerant you are to app connectivity failures & consumer lag
• Natural bounds and long cool down periods

[Diagram, sketched below: in the happy state, load > 70% fires an alert, an operator determines via RCA whether scaling is required, then kicks off scaling scripts and/or remediates the cause; in the last-resort state, the 70% alert goes unanswered and load > 80% triggers the autoscaling process]
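A minimal sketch of that two-tier flow (the 70%/80% thresholds are from the slide; the code itself is illustrative):

```python
ALERT_THRESHOLD = 0.70      # page a human first
AUTOSCALE_THRESHOLD = 0.80  # automated last resort

def on_load_sample(load: float, alert_acknowledged: bool) -> str:
    """Alert early; autoscale only if nobody has responded."""
    if load > AUTOSCALE_THRESHOLD and not alert_acknowledged:
        return "autoscale"  # last-resort state: kick off the autoscaling process
    if load > ALERT_THRESHOLD:
        return "alert"      # happy state: operator runs RCA, scales or remediates
    return "ok"
```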
20. A Tailored Capacity Plan

Service Goals
Are you optimizing for throughput, latency, durability, availability, or something else?
What type of degradations can you handle (and for how long)?
Do you have internal or external SLOs?

End-to-End Architecture
Capacity planning on cluster capacity alone produces bottlenecks.
Examine each part of your pipeline & create a scaling plan.
Lean on tools that support scalability for each part of the stack.
22. Create a Scaling Plan

Per use case:
● Determine service goals & objectives
● Define capacity requirements

Per component:
● Benchmark components at baseline (& test for bottlenecks)
● Create a capacity plan based on benchmarks & runbooks for different scaling scenarios
● Automate where possible

Scaling is an end-to-end problem and needs a complete solution. You cannot effectively autoscale until you benchmark.

[Diagram: the full pipeline: producer/consumer applications, the Kafka / CC cluster (brokers), kStreams apps, Kafka Connect (workers), and ksqlDB (servers)]
23. Create a Maintenance Plan

● Monitor your pipeline (client and server side) and set up alerting based on benchmarked thresholds
● Scale test before known scaling events
● Benchmark based on configs aligned to your service goals; tune where necessary
● Set up client quotas
● Protect against bad actors by defining an access management strategy & monitoring your audit logs

Self-managed components can be hosted on k8s for elastic scaling. If you're running Confluent Platform, utilize Self-Balancing Clusters & Tiered Storage (the default for Confluent Cloud).

[Diagram: the full pipeline: producer/consumer applications, the Kafka / CC cluster, kStreams apps, Kafka Connect, and ksqlDB]
24. Our Data Pipeline

Deployment Setup
● CC Enterprise cluster w/ max throughput of 250/750 MBps
● A single connector benchmarked at 15 MBps per task
● A kStreams application with 1 app instance
● A single consumer app

Service Goals
● Expected avg throughput of 10 MBps write, 3x fanout
● Peak throughputs of 50/150 MBps
● Low tolerance for degradation at peak
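The back-of-the-envelope math for this setup, as a sketch (numbers are from the slide; it assumes the connector carries the full peak ingress):

```python
import math

peak_write_mbps = 50   # peak ingress
fanout = 3             # 3x read fanout -> 150 MBps peak egress
task_mbps = 15         # benchmarked connector throughput per task

# Connector tasks needed at peak: ceil(50 / 15) = 4
print(math.ceil(peak_write_mbps / task_mbps))

# Cluster headroom: 50/150 MBps peak against a 250/750 MBps ceiling,
# i.e. 20% utilization, so the cluster itself never needs to scale.
print(peak_write_mbps / 250, peak_write_mbps * fanout / 750)  # 0.2 0.2
```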
25. Our Capacity Plan

Scaling Plan
● Cluster: no need to scale; make sure capacity stays below limits
● Connect: automate capacity-based task scaling
● kStreams: automate scaling of app instances
● Consumer: automate scaling of instances up to the # of partitions

Maintenance Plan
● Monitor all resources & set up alerting
● Scale test our use case @ peak throughputs
● Benchmark our consumer
● Examine kStreams topology
● Monitor audit logs
26. A Note for Platform Teams

● Platform teams with internal teams consuming Kafka should provide:
○ Client quotas
○ Best practices / templated approaches to clients
○ Proactive monitoring & auditing
○ A set of automated scripts for scaling the cluster, Connect, etc. (and/or automated scaling for self-hosted Connect)
○ A set of remediation runbooks / troubleshooting guides
27. If You Want to Autoscale Confluent Cloud

Signals to watch:
● HIGH CLUSTER LOAD (alert at >70%, act at >80%)
● HIGH CONSUMER LAG
● HIGH PRODUCER BUFFER
● HONORABLE MENTION: CONNECTIONS/REQUESTS
28. If your cluster is Dedicated & your Ducks are in a Row

Given a Dedicated cluster (here, 2 CKUs), the rule of thumb:

if cluster load > 80% AND [high or rapidly growing consumer lag | high producer buffer]:
    get the current # of CKUs
    add a CKU

Don't forget a cooldown period. Scale up aggressively, scale down conservatively.
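A minimal sketch of that loop (the 80% threshold and the scale-up/scale-down asymmetry come from the slide; the bounds, cooldown length, and resize hand-off are illustrative assumptions):

```python
import time

COOLDOWN_SECONDS = 1800  # long cooldown so scaling events can settle
MAX_CKUS = 4             # natural upper bound for this workload

last_scale_event = 0.0

def should_add_cku(cluster_load: float, lag_growing: bool, buffer_high: bool) -> bool:
    # Slide's trigger: load > 80% AND (growing consumer lag OR high producer buffer)
    return cluster_load > 0.80 and (lag_growing or buffer_high)

def maybe_scale(current_ckus: int, cluster_load: float,
                lag_growing: bool, buffer_high: bool) -> int:
    global last_scale_event
    if time.time() - last_scale_event < COOLDOWN_SECONDS or current_ckus >= MAX_CKUS:
        return current_ckus  # hold during cooldown or at the bound
    if should_add_cku(cluster_load, lag_growing, buffer_high):
        last_scale_event = time.time()
        return current_ckus + 1  # hand the new CKU count to your resize script/API
    return current_ckus
```

Scaling down would use the same loop with far more conservative conditions (for example, sustained low load across the entire cooldown window).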
31. Upcoming for Kafka & Confluent

Improvements landing across all three use cases (predictable spikes for major events, predictable spikes for minor events, and unpredictable spikes):
● Enterprise Tier¹
● Client-Side Metrics Update [KIP 714]
● Multi-CKU Shrink
● Fast Scaling

¹ GA Today