This document discusses autoscaling Confluent Cloud and provides guidance on implementing it. Key points:
- Autoscaling can lower costs, improve resource utilization, and reduce operations overhead; however, it does not replace capacity planning.
- When autoscaling Kafka, the entire data pipeline needs to be considered, including brokers, connectors, streams applications, and clients. Bottlenecks can occur in many parts of the system.
- Different use cases have different autoscaling needs, such as predictable spikes for major/minor events vs. unpredictable spikes. Guidelines are provided for each.
- A complete capacity and maintenance plan is needed, involving benchmarking of all components, monitoring, and automated scaling scripts. Autoscaling alone is not sufficient.
3. capacity: the ability to hold or contain

[Diagram: a node's CPU, memory, and storage resources hosting apps]

● scaling up: adding additional resources to provisioned nodes
● scaling out: adding additional nodes with provisioned capacity
4. auto scaling

[Diagram: a container stack (hardware, host OS, container engine, bins & libs, apps) grouped into an autoscaling group]

● increase or decrease resources on demand
● policy based on load across a fleet of servers
● cool down period: time after a scaling event when further scaling is on hold
● CSPs have native autoscaling approaches (e.g., autoscaling groups)
7. [Diagram: the end-to-end pipeline, consisting of a Kafka cluster (brokers), Kafka Connect (workers), ksqlDB (servers), and producer/consumer apps]
8. … but I'm running on Confluent Cloud

[Diagram: Kora, the engine behind Confluent Cloud: a multi-cloud networking & routing tier in front of network, compute (cells spread across AZs), and object storage layers; a global control plane handling metadata, data balancing, durability audits, and health checks with real-time feedback data; plus metrics & observability and other Confluent Cloud services (Connect, processing, governance)]

Learn more about Kora
10. Cluster Capacity is often a Ceiling, not a Floor

[Diagram: a Kafka cluster running at 100/300 MBps throughput with 6 ms p99 latency; the Kafka Connect, consumer app, and ksqlDB stages downstream still see decreased throughput and poor query performance]
11. Bottleneck 1: Number of Partitions

LIMITING FACTOR
throughput is constrained by the number of partitions on your topic, regardless of cluster size

BENCHMARK IT
measure the throughput on a single production partition

SIZE IT
partitions = max(t/p, t/c), where:
t = target throughput
p = measured producer throughput on 1 partition
c = measured consumer throughput on 1 partition

METRICS
received_bytes
sent_bytes

[AUTO]SCALE IT?
generally, no: provision based on the higher-scale workload
beware: ordering guarantees, no downsizing

Your partitioning strategy matters too!
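A minimal sketch of that sizing rule in Python (the function name and example numbers are illustrative, not from the deck):

```python
import math

def required_partitions(target_mbps: float,
                        producer_mbps_per_partition: float,
                        consumer_mbps_per_partition: float) -> int:
    # Partitions needed so neither the produce nor the consume side
    # bottlenecks: max(t/p, t/c), rounded up to a whole partition.
    return math.ceil(max(target_mbps / producer_mbps_per_partition,
                         target_mbps / consumer_mbps_per_partition))

# e.g. a 150 MBps target with one partition benchmarked at
# 15 MBps produce and 20 MBps consume -> 10 partitions
print(required_partitions(150, 15, 20))  # 10
```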
12. Bottleneck 2: Producer/Consumer Configs

LIMITING FACTOR
client configurations

BENCHMARK IT
baseline with the Kafka perf tools (kafka-producer-perf-test, kafka-consumer-perf-test)
baseline your app's throughput

REMEDIATE IT
use benchmarks to guide the remediation
alter configs based on service goals
re-test

METRICS
depends on service goal (throughput, lag, etc.)

[AUTO]SCALE IT?
if clients are tuned correctly and the cluster is overutilized, a scaling event could be required

Configs should be aligned to your service goals!
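As an illustration, a throughput-oriented producer config sketch using the confluent-kafka Python client; the values are assumptions to validate against your own benchmarks, not recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",  # placeholder
    "linger.ms": 20,            # wait briefly to fill larger batches
    "batch.size": 131072,       # 128 KiB batches for better throughput
    "compression.type": "lz4",  # trade a little CPU for network savings
    "acks": "all",              # durability-first; relax only if goals allow
})
```

A latency-oriented service goal would push these the other way (e.g., `linger.ms` near 0), which is why the benchmark, tune, re-test loop matters.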
13. Bottleneck 3: Client Connections

LIMITING FACTOR
CPU [connection attempts] & memory [requests]
denied connection attempts
added latency

BENCHMARK IT
measure total connections, connection attempts, and requests

REMEDIATION
longer-lived connections for fewer attempts
evaluate your app architecture & configs
audit logs for rogue clients

METRICS
active_connection_count
request_count

[AUTO]SCALE IT?
scaling might not solve your problem: address client configs and/or logic first

Beware rogue clients: an unsuccessful connection attempt still costs the cluster a connection attempt. Clients open a connection to the leader partition after authentication.
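A sketch of pulling those metrics from the Confluent Cloud Metrics API (the endpoint and metric name come from that API; the cluster ID, interval, and credentials are placeholders):

```python
import requests

resp = requests.post(
    "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query",
    auth=("<API_KEY>", "<API_SECRET>"),
    json={
        "aggregations": [
            {"metric": "io.confluent.kafka.server/active_connection_count"}
        ],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": "lkc-xxxxx"},
        "granularity": "PT1M",
        "intervals": ["2024-01-01T00:00:00Z/PT1H"],
    },
    timeout=30,
)
resp.raise_for_status()
for point in resp.json()["data"]:  # one datapoint per minute
    print(point["timestamp"], point["value"])
```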
14. Additional Bottlenecks to Explore

● partition imbalance: uneven load distribution across brokers
● consumer parallelism: scale instances up to the # of partitions
● connector throughput: limited by the # of tasks; use capacity-based scaling
● stream processing throughput: relies on topology optimization & parallelism
16. Scaling Needs are Driven by Use Case

Predictable Spikes for Major Events
These scaling activities are based on predictable changes in capacity requirements for your application.
• Black Friday
• Super Bowl Sunday
• Campaign-driven
• Product release
These predictable spikes could also be generated by ML models monitoring activity and predicting an increase in demand.

Predictable Spikes for Minor Events
These scaling activities are based on predictable changes in capacity that happen on a regular basis.
• Nightly batch jobs
• Daily demand spikes
These predictable spikes may or may not require additional capacity. Their predictable and frequent nature makes these use cases a good fit for automated scaling OR overprovisioning.

Unpredictable Spikes
These scaling activities are based on unforeseen changes in capacity requirements.
• Viral social media trend
• Weather event
• Unknown supply chain issue
These unpredictable spikes can cause teams to scramble to avoid downtime.
17. Predictable Spikes for Major Events

Predictable, infrequent changes in capacity requirements

Guidelines
• Benchmark your cluster & resources: connectors, clients, and stream processing apps. Scale test prior to the event.
• Ensure proper client configurations
• Scale out proactively, based on expected throughputs, client connections, etc.
• Set up additional monitoring during the flex period

[Diagram: retail example: store POS systems, an inventory database, and a distribution center feeding web & mobile and cloud apps (including kStreams apps), with downstream systems such as Snowflake, Google BigQuery, and Google Cloud Storage]
18. Predictable Spikes for Minor Events

Predictable, frequent changes in capacity requirements

Guidelines
• Benchmark your applications, cluster & resources to understand the amount of traffic they can handle
• Provision your Kafka cluster to meet the highest demand point (app instances can be scaled in place)
• Choose the # of partitions based on max expected throughput
19. Unpredictable Spikes

Unforeseen changes in capacity requirements

Guidelines
• Autoscaling should be a backup plan, not a one-stop shop
• Monitor, alert, benchmark (alert on & remediate proactively)
• Alert before you hit your autoscaling threshold (which should be higher)
• Thresholds depend on how tolerant you are to app connectivity failures & consumer lag
• Natural bounds and long cool down periods

[Diagram, sketched below: in the happy state, load > 70% fires an alert, an operator determines via RCA whether scaling is required, then kicks off scaling scripts and/or remediates the cause; in the last-resort state, the 70% alert goes unanswered and load > 80% triggers the autoscaling process]
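A minimal sketch of that two-tier flow (the 70%/80% thresholds are from the slide; the code itself is illustrative):

```python
ALERT_THRESHOLD = 0.70      # page a human first
AUTOSCALE_THRESHOLD = 0.80  # automated last resort

def on_load_sample(load: float, alert_acknowledged: bool) -> str:
    """Alert early; autoscale only if nobody has responded."""
    if load > AUTOSCALE_THRESHOLD and not alert_acknowledged:
        return "autoscale"  # last-resort state: kick off the autoscaling process
    if load > ALERT_THRESHOLD:
        return "alert"      # happy state: operator runs RCA, scales or remediates
    return "ok"
```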
20. A Tailored Capacity Plan

Service Goals
Are you optimizing for throughput, latency, durability, availability, or something else?
What type of degradations can you handle (and for how long)?
Do you have internal or external SLOs?

End-to-End Architecture
Capacity planning on cluster capacity alone produces bottlenecks.
Examine each part of your pipeline & create a scaling plan.
Lean on tools that support scalability for each part of the stack.
22. Create a Scaling Plan

Per use case:
● Determine service goals & objectives
● Define capacity requirements

Per component:
● Benchmark components at baseline (& test for bottlenecks)
● Create a capacity plan based on benchmarks & runbooks for different scaling scenarios
● Automate where possible

Scaling is an end-to-end problem and needs a complete solution. You cannot effectively autoscale until you benchmark.

[Diagram: the full pipeline: producer/consumer applications, the Kafka / CC cluster (brokers), kStreams apps, Kafka Connect (workers), and ksqlDB (servers)]
23. Create a Maintenance Plan

● Monitor your pipeline (client and server side) and set up alerting based on benchmarked thresholds
● Scale test before known scaling events
● Benchmark based on configs aligned to your service goals; tune where necessary
● Set up client quotas
● Protect against bad actors by defining an access management strategy & monitoring your audit logs

Self-managed components can be hosted on k8s for elastic scaling. If you're running Confluent Platform, utilize Self-Balancing Clusters & Tiered Storage (the default for Confluent Cloud).

[Diagram: the full pipeline: producer/consumer applications, the Kafka / CC cluster, kStreams apps, Kafka Connect, and ksqlDB]
24. Our Data Pipeline

Deployment Setup
● CC Enterprise cluster w/ max throughput of 250/750 MBps
● A single connector benchmarked at 15 MBps per task
● A kStreams application with 1 app instance
● A single consumer app

Service Goals
● Expected avg throughput of 10 MBps write, 3x fanout
● Peak throughputs of 50/150 MBps
● Low tolerance for degradation at peak
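The back-of-the-envelope math for this setup, as a sketch (numbers are from the slide; it assumes the connector carries the full peak ingress):

```python
import math

peak_write_mbps = 50   # peak ingress
fanout = 3             # 3x read fanout -> 150 MBps peak egress
task_mbps = 15         # benchmarked connector throughput per task

# Connector tasks needed at peak: ceil(50 / 15) = 4
print(math.ceil(peak_write_mbps / task_mbps))

# Cluster headroom: 50/150 MBps peak against a 250/750 MBps ceiling,
# i.e. 20% utilization, so the cluster itself never needs to scale.
print(peak_write_mbps / 250, peak_write_mbps * fanout / 750)  # 0.2 0.2
```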
25. Our Capacity Plan

Scaling Plan
● Cluster: no need to scale; make sure capacity stays below limits
● Connect: automate capacity-based task scaling
● kStreams: automate scaling of app instances
● Consumer: automate scaling of instances up to the # of partitions

Maintenance Plan
● Monitor all resources & set up alerting
● Scale test our use case @ peak throughputs
● Benchmark our consumer
● Examine kStreams topology
● Monitor audit logs
26. A Note for Platform Teams

● Platform teams with internal teams consuming Kafka should provide:
○ Client quotas
○ Best practices / templated approaches to clients
○ Proactive monitoring & auditing
○ A set of automated scripts for scaling the cluster, Connect, etc. (and/or automated scaling for self-hosted Connect)
○ A set of remediation runbooks / troubleshooting guides
27. If You Want to Autoscale Confluent Cloud

Signals to watch:
● HIGH CLUSTER LOAD (alert at >70%, act at >80%)
● HIGH CONSUMER LAG
● HIGH PRODUCER BUFFER
● HONORABLE MENTION: CONNECTIONS/REQUESTS
28. If your cluster is Dedicated & your Ducks are in a Row

Given a Dedicated cluster (here, 2 CKUs), the rule of thumb:

if cluster load > 80% AND [high or rapidly growing consumer lag | high producer buffer]:
    get the current # of CKUs
    add a CKU

Don't forget a cooldown period. Scale up aggressively, scale down conservatively.
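A minimal sketch of that loop (the 80% threshold and the scale-up/scale-down asymmetry come from the slide; the bounds, cooldown length, and resize hand-off are illustrative assumptions):

```python
import time

COOLDOWN_SECONDS = 1800  # long cooldown so scaling events can settle
MAX_CKUS = 4             # natural upper bound for this workload

last_scale_event = 0.0

def should_add_cku(cluster_load: float, lag_growing: bool, buffer_high: bool) -> bool:
    # Slide's trigger: load > 80% AND (growing consumer lag OR high producer buffer)
    return cluster_load > 0.80 and (lag_growing or buffer_high)

def maybe_scale(current_ckus: int, cluster_load: float,
                lag_growing: bool, buffer_high: bool) -> int:
    global last_scale_event
    if time.time() - last_scale_event < COOLDOWN_SECONDS or current_ckus >= MAX_CKUS:
        return current_ckus  # hold during cooldown or at the bound
    if should_add_cku(cluster_load, lag_growing, buffer_high):
        last_scale_event = time.time()
        return current_ckus + 1  # hand the new CKU count to your resize script/API
    return current_ckus
```

Scaling down would use the same loop with far more conservative conditions (for example, sustained low load across the entire cooldown window).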
31. Upcoming for Kafka & Confluent

Improvements landing across all three use cases (predictable spikes for major events, predictable spikes for minor events, and unpredictable spikes):
● Enterprise Tier¹
● Client-Side Metrics Update [KIP 714]
● Multi-CKU Shrink
● Fast Scaling

¹ GA Today