Autoscaling Confluent Cloud: Should We? How Would We?
Julie Price, Senior Product Manager
Amanda Gilbert, Staff Solutions Engineer
Autoscaling
● Lower Cost
● Just-In-Time Provisioning
● Improved Resource Utilization
● Elasticity
● Fault Tolerance
● Reduced Operations
… is really just a capacity planning tactic
capacity: the ability to hold or contain
[Diagram: a node's CPU, MEMORY, and STORAGE hosting APPS]
scaling up: adding additional resources to provisioned nodes
scaling out: adding additional nodes w/ provisioned capacity
auto scaling
[Diagram: Containers (APP + BINS & LIBS) on a CONTAINER ENGINE, HOST OS, and HARDWARE]
● increase or decrease resources on demand
● policy based on load across fleet of servers
● cool down period - time after scaling event when
scaling is on hold
● CSPs have native autoscaling approaches
AUTOSCALING GROUP
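The policy and cool down bullets above can be sketched as a small decision function. This is a minimal illustration, not any cloud provider's autoscaling API: the `ScalingPolicy` fields, the thresholds, and the 300-second cool down are all assumed values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ScalingPolicy:
    # All values below are illustrative assumptions, not recommendations.
    scale_out_above: float = 0.80    # fleet-average load that adds a node
    scale_in_below: float = 0.30     # fleet-average load that removes a node
    cooldown_seconds: float = 300.0  # hold time after any scaling event
    min_nodes: int = 1
    max_nodes: int = 10


def decide(policy: ScalingPolicy, loads: Sequence[float], nodes: int,
           now: float, last_scaled_at: Optional[float]) -> int:
    """Return the node count the group should have after this evaluation."""
    # During the cool down period scaling is on hold, whatever the load says.
    if last_scaled_at is not None and now - last_scaled_at < policy.cooldown_seconds:
        return nodes
    # The policy looks at load across the whole fleet, not a single server.
    avg = sum(loads) / len(loads)
    if avg > policy.scale_out_above:
        return min(nodes + 1, policy.max_nodes)
    if avg < policy.scale_in_below:
        return max(nodes - 1, policy.min_nodes)
    return nodes
```

The cool down check comes first on purpose: it keeps the group from reacting to load readings taken while the previous scaling event is still settling.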
Scaling Kafka
[Diagram: every component runs on nodes with their own Storage, Memory, and CPU. Kafka Cluster: BROKERs and ZK/KRaft nodes. Kafka Connect: WORKERs ([Storage] optional). ksqlDB: SERVERs. Producer/Consumer Apps: APP NODEs.]
[Diagram: Kafka Cluster (B), Kafka Connect (W), ksqlDB (S), Producer/Consumer Apps (A)]
[Diagram labels: NETWORK, COMPUTE (Cells in three AZs), OBJECT STORAGE, CUSTOMERS, Multi-Cloud Networking & Routing Tier, Metadata, Durability Audits, METRICS & OBSERVABILITY, CONNECT, PROCESSING, GOVERNANCE, Data Balancing, Health Checks, Real-time feedback data, Other Confluent Cloud Services, GLOBAL CONTROL PLANE]
… but I’m running on Confluent Cloud
Learn more about Kora
wouldn’t it be great if adding capacity meant more capacity?
Cluster Capacity is often a Ceiling, not a Floor
Kafka Cluster: Throughput - 100/300 MBps, p99 Latency - 6 ms
Kafka Connect (C): Decreased Throughput
ksqlDB (Q): Poor Query Performance
Consumer App (A): Decreased Throughput
Bottleneck 1: Number of Partitions
LIMITING FACTOR
throughput is constrained by the number of partitions on your topic regardless of cluster size
BENCHMARK IT
measure the throughput on a single production partition
SIZE IT
max(t/p, t/c) where:
t = target throughput
p = measured producer throughput on 1 partition
c = measured consumer throughput on 1 partition
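The sizing rule can be worked through in a few lines. This sketch assumes `c` means the measured consumption throughput on one partition and rounds up, since partitions come in whole numbers. The function name and the sample figures are made up for illustration.

```python
import math


def partitions_needed(target_mbps: float,
                      producer_mbps_per_partition: float,
                      consumer_mbps_per_partition: float) -> int:
    """Apply max(t/p, t/c), rounding each ratio up to a whole partition."""
    if min(target_mbps, producer_mbps_per_partition, consumer_mbps_per_partition) <= 0:
        raise ValueError("all throughputs must be positive")
    for_producers = math.ceil(target_mbps / producer_mbps_per_partition)
    for_consumers = math.ceil(target_mbps / consumer_mbps_per_partition)
    # The slower side of the pipeline dictates the partition count.
    return max(for_producers, for_consumers)


# Hypothetical benchmark: 150 MBps target, 10 MBps produced and 20 MBps
# consumed per partition. The producer side is the constraint here.
print(partitions_needed(150, 10, 20))
```

Because partitions cannot be reduced later, the target passed in should be the higher-scale workload, not today's average.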
[AUTO]SCALE IT?
generally, no
provision based on higher scale workload
beware: order guarantees, no downsizing
METRICS
received_bytes
sent_bytes
Your partitioning strategy matters too!
[Diagram: brokers (B) and partitions P1, P2, P3]
Bottleneck 2: Producer/Consumer Configs
LIMITING FACTOR
client configurations
BENCHMARK IT
baseline with the Kafka perf tools
baseline your app’s throughput
REMEDIATE IT
use benchmarks to guide the remediation
alter configs based on service goals
re-test
[AUTO]SCALE IT?
if clients are tuned correctly and cluster is overutilized, a scaling event could be required
METRICS
depends on service goal (throughput, lag, etc)
Configs should be aligned to your service goals
[Diagram: Producer Application, brokers (B), Consumer Application, Consumer Group]
Bottleneck 3: Client Connections
LIMITING FACTOR
CPU [attempts] & Memory [requests]
Denied connection attempts
Added latency
BENCHMARK IT
measure total connections, connection attempts, and requests
REMEDIATION
longer-lived connections for fewer attempts
evaluate your app architecture & configs
audit logs for rogue clients
[AUTO]SCALE IT?
scaling might not solve your problem
address client configs and/or logic
METRICS
active_connection_count
request_count
Rogue clients: an unsuccessful connection attempt still counts as a connection attempt
Clients open a connection to the partition leader after authentication
[Diagram: Producer Application, Consumer Application, brokers (B)]
Additional Bottlenecks to Explore
partition imbalance: uneven load distribution across brokers
consumer parallelism: scale instances up to # of partitions
connector throughput: limited by the # of tasks, use capacity-based scaling
stream processing throughput: relies on topology optimization & parallelism
[Diagram: Consumer App, Kafka Cluster, Kafka Connect (C), ksqlDB (Q)]
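The consumer parallelism note (scale instances up to the number of partitions) can be sketched as a clamp. Within one consumer group, instances beyond the partition count receive no partitions, so the partition count is the natural upper bound. The lag-per-instance target is an assumed tuning knob for this example, not something the talk specifies.

```python
import math


def desired_consumers(total_lag: int, lag_per_instance_target: int,
                      partitions: int, min_instances: int = 1) -> int:
    """Pick a consumer instance count from lag, capped at the partition count."""
    if total_lag > 0:
        wanted = math.ceil(total_lag / lag_per_instance_target)
    else:
        wanted = min_instances
    # Extra instances beyond the partition count would sit idle.
    return max(min_instances, min(wanted, partitions))


print(desired_consumers(50_000, 10_000, partitions=12))   # lag-driven
print(desired_consumers(500_000, 10_000, partitions=12))  # capped by partitions
```

When the second case keeps happening, adding instances no longer helps, and the partition count itself has become the bottleneck.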
One Size Fits All?
Scaling Needs are Driven by Use Case
Predictable Spikes for Major Events
These scaling activities are based on predictable changes in capacity requirements for your application.
• Black Friday
• Super Bowl Sunday
• Campaign-Driven
• Product Release
These spikes could also be predicted by ML models that monitor activity and forecast an increase in demand.
Predictable Spikes for Minor Events
These scaling activities are based on predictable changes in capacity that happen on a regular basis.
• Nightly batch jobs
• Daily demand spikes
These predictable spikes may or may not require additional capacity.
The predictable and frequent nature makes these use cases a good fit for automated scaling OR overprovisioning.
Unpredictable Spikes
These scaling activities are based on unforeseen changes in capacity requirements.
• Viral Social Media Trend
• Weather Event
• Unknown Supply Chain Issue
These unpredictable spikes can cause teams to scramble to avoid downtime.
Predictable Spikes for Major Events
Predictable, infrequent changes in capacity requirements
Guidelines
• Benchmark your cluster & resources: connectors, clients and stream processing apps. Scale test prior to event.
• Ensure proper client configurations
• Scale out proactively, based on expected throughputs, client connections, etc
• Set up additional monitoring during the flex period
[Diagram labels: POS System, Inventory Database, Distribution Center, Cloud DB/apps, Web & Mobile, Store, Cloud Apps (including kStreams apps). Downstream systems: Snowflake, Google BigQuery, Google Cloud Storage]
Predictable Spikes for Minor Events
Predictable, frequent changes in capacity requirements
Guidelines
• Benchmark your applications, cluster & resources to understand the amount of traffic they can handle
• Provision your Kafka cluster to meet highest demand point (app instances can be scaled in place)
• Choose # of partitions based on max expected throughput
Happy State: Load > 70% → Alert → Determine if scaling is required via RCA → Kick off scaling scripts and/or remediate cause
Last Resort State: Load > 70% → Alert (zZzZ) → Load > 80% → Autoscaling process
Unpredictable Spikes
Unforeseen changes in capacity requirements
Guidelines
• Autoscaling should be a backup plan, not a one-stop shop
• Monitor, alert, benchmark (alert on & remediate proactively)
• Alert before you hit your auto scaling threshold (which should be higher)
• Thresholds depend on how tolerant you are to app connectivity failures & consumer lag
• Natural bounds and long cool down periods
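The guideline to alert before the autoscaling threshold can be sketched as a two-tier check. The 70% and 80% values mirror the Happy State / Last Resort State flow. The capacity bounds and the function name are assumptions for illustration.

```python
from typing import Tuple

ALERT_AT = 0.70      # humans get paged first and run an RCA
AUTOSCALE_AT = 0.80  # last resort, deliberately higher than the alert threshold


def evaluate(load: float, units: int,
             min_units: int = 1, max_units: int = 4) -> Tuple[str, int]:
    """Return (action, capacity units) for one load reading."""
    # min_units is kept for symmetry; this sketch only scales upward.
    assert ALERT_AT < AUTOSCALE_AT, "alerting must fire before autoscaling"
    if load > AUTOSCALE_AT:
        if units < max_units:
            return "autoscale", units + 1
        # Natural bound reached: automation stops and a human takes over.
        return "alert", units
    if load > ALERT_AT:
        return "alert", units
    return "ok", units


for reading in (0.65, 0.75, 0.85):
    print(reading, evaluate(reading, units=2))
```

Keeping the gap between the two thresholds wide gives the on-call team time to remediate a misbehaving client before capacity is added to mask it.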
A Tailored Capacity Plan
Service Goals
● Are you optimizing for throughput, latency, durability, availability, or something else?
● What type of degradations can you handle? (for how long?)
● Do you have internal or external SLOs?
End-to-End Architecture
● Capacity planning on cluster capacity alone produces bottlenecks
● Examine each part of your pipeline & create a scaling plan
● Lean on tools that support scalability for each part of the stack.
Implementing a Capacity Plan
Create a Scaling Plan
Per Use Case
● Determine service goals & objectives
● Define capacity requirements
● Create a capacity plan based on benchmarks & runbooks for different scaling scenarios
● Automate where possible
Per Component
● Scaling is an end-to-end problem and needs a complete solution
● Benchmark components at baseline (& test for bottlenecks)
You cannot effectively autoscale until you benchmark
[Diagram: Kafka / CC Cluster (B), Producer / Consumer Applications, kStreams Apps, Kafka Connect (W), ksqlDB (S)]
Create a Maintenance Plan
● Monitor your pipeline (client and server side) and set up alerting based on benchmarked thresholds
● Scale test before known scaling events
● Benchmark based on configs aligned to your service goals, tune where necessary
● Set up client quotas
● Protect against bad actors by defining an access management strategy & monitoring your audit logs
Self-managed components can be hosted on k8s for elastic scaling
If you’re running CP, utilize SBC & tiered storage (default for CC)
[Diagram: Kafka / CC Cluster (B), Producer / Consumer Applications, kStreams Apps, Kafka Connect (W), ksqlDB (S)]
Our Data Pipeline
Deployment Setup
● CC Enterprise Cluster w/ max throughput of 250/750 MBps
● A single connector benchmarked at 15 MBps per task
● A kStreams application with 1 app instance
● A single consumer app
Service Goals
● Expected avg throughput of 10 MBps write, 3x fanout
● Peak throughputs of 50/150 MBps
● Low tolerance for degradation at peak
[Diagram: Kafka Cluster, Kafka Connect (C), kStreams (A), Consumer App (A)]
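The figures on this slide can be turned into a quick headroom check. Two assumptions are made here: the cluster limit is read as 250/750 MBps write/read (megabytes per second, matching the other figures), and the single connector is assumed to carry the full write throughput. The slide does not say whether it is a source or a sink.

```python
import math

# Figures from the slide.
CLUSTER_MAX_WRITE_MBPS = 250
CLUSTER_MAX_READ_MBPS = 750
CONNECTOR_MBPS_PER_TASK = 15
AVG_WRITE_MBPS = 10
FANOUT = 3
PEAK_WRITE_MBPS = 50
PEAK_READ_MBPS = 150


def utilization(used_mbps: float, limit_mbps: float) -> float:
    """Fraction of the cluster limit consumed by a workload."""
    return used_mbps / limit_mbps


def tasks_for(throughput_mbps: float) -> int:
    """Connector tasks needed at the benchmarked per-task rate."""
    return math.ceil(throughput_mbps / CONNECTOR_MBPS_PER_TASK)


avg_read_mbps = AVG_WRITE_MBPS * FANOUT
print("avg read MBps:", avg_read_mbps)
print("peak write utilization:", utilization(PEAK_WRITE_MBPS, CLUSTER_MAX_WRITE_MBPS))
print("peak read utilization:", utilization(PEAK_READ_MBPS, CLUSTER_MAX_READ_MBPS))
print("connector tasks at avg:", tasks_for(AVG_WRITE_MBPS))
print("connector tasks at peak:", tasks_for(PEAK_WRITE_MBPS))
```

Under these assumptions the cluster sits well below its limits even at peak, while the connector needs several times its baseline task count. The pressure is on the components around the cluster.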
Our Capacity Plan
Scaling Plan
● Cluster: no need to scale, make sure capacity stays below limits
● Connect: automate capacity-based task scaling
● kStreams: automate scaling of app instances
● Consumer: automate scaling of instances up to # of partitions
Maintenance Plan
● Monitor all resources & set up alerting
● Scale test our use case @ peak throughputs
● Benchmark our consumer
● Examine kStreams topology
● Monitor audit logs
[Diagram: Kafka Cluster, Kafka Connect (C), kStreams (A), Consumer App (A)]
A Note for Platform Teams
● Platform teams with internal teams consuming Kafka
○ Client quotas
○ Best practices / templated approaches to clients
○ Proactive monitoring & auditing
○ Set of automated scripts for scaling cluster, connect, etc (and/or automated scaling for self-hosted connect)
○ Set of remediation runbooks / troubleshooting guides
If You Want to Autoscale Confluent Cloud
HIGH CLUSTER LOAD
HIGH CONSUMER LAG
HIGH PRODUCER BUFFER
HONORABLE MENTION: CONNECTIONS/REQUESTS
>70% >80%
If your cluster is Dedicated & your Ducks are in a Row
Kafka Cluster: Dedicated Cluster, 2 CKUs
cluster load > 80% & [ high or rapidly growing consumer lag | high producer buffer ]
get # of CKUs
add a CKU
don’t forget a cooldown period
scale up aggressively, scale down conservatively
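The flow on this slide can be sketched as a pure decision function. It reads the trigger as cluster load above 80% AND (growing consumer lag OR high producer buffer), which is an interpretation of the diagram. The cool down lengths, the CKU bounds, and the scale-down threshold are assumed values. Fetching the current CKU count and applying the change are left to the caller, so no Confluent API is called here.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    cluster_load: float         # 0.0 to 1.0
    consumer_lag_growing: bool  # high or rapidly growing consumer lag
    producer_buffer_high: bool  # high producer buffer


def should_scale_up(s: Signals, load_threshold: float = 0.80) -> bool:
    # High load alone is not enough: a client-side symptom must confirm it.
    return s.cluster_load > load_threshold and (
        s.consumer_lag_growing or s.producer_buffer_high)


def next_cku_count(current_ckus: int, s: Signals, seconds_since_last_change: float,
                   *, min_ckus: int = 2, max_ckus: int = 8,
                   up_cooldown: float = 600.0, down_cooldown: float = 3600.0,
                   scale_down_below: float = 0.40) -> int:
    """Return the CKU count to request: one step at a time, with cool downs."""
    if should_scale_up(s):
        # Scale up aggressively: short cool down.
        if seconds_since_last_change >= up_cooldown:
            return min(current_ckus + 1, max_ckus)
        return current_ckus
    quiet = (s.cluster_load < scale_down_below
             and not s.consumer_lag_growing and not s.producer_buffer_high)
    # Scale down conservatively: long cool down and every signal quiet.
    if quiet and seconds_since_last_change >= down_cooldown:
        return max(current_ckus - 1, min_ckus)
    return current_ckus
```

The asymmetric cool downs encode the last line of the slide: adding a CKU too early costs money, while removing one too early costs availability.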
Tools to Consider
CC METRICS API
CLIENT JMX METRICS
CC ADMIN API
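As one illustration of the first tool, a Metrics API query body for `received_bytes` might be assembled as below. The endpoint URL, the `io.confluent.kafka.server/` metric prefix, and the field names are recalled from Confluent's Metrics API documentation and should be verified against the current reference. `lkc-XXXXX` is a placeholder cluster ID. The sketch only builds the request body and sends nothing.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed endpoint; check the current Metrics API reference before use.
METRICS_QUERY_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"


def build_query(cluster_id: str, metric: str, end: datetime, minutes: int = 5) -> dict:
    """Build a query body covering the `minutes` before `end` at 1-minute granularity."""
    start = end - timedelta(minutes=minutes)
    return {
        "aggregations": [{"metric": metric}],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": cluster_id},
        "granularity": "PT1M",
        "intervals": [f"{start.isoformat()}/{end.isoformat()}"],
    }


body = build_query(
    "lkc-XXXXX",  # placeholder cluster ID
    "io.confluent.kafka.server/received_bytes",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(body, indent=2))
```

A scaling script would POST this body with an API key and compare the returned values against the benchmarked thresholds rather than fixed guesses.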
What’s coming?
Upcoming for Kafka & Confluent
Predictable Spikes for Major Events
Predictable Spikes for Minor Events
Unpredictable Spikes
Enterprise Tier¹
Client-Side Metrics Update [KIP 714]
Multi-CKU Shrink
Fast Scaling
¹ GA Today
Questions?
Autoscaling Kafka: Strategies for Confluent Cloud and Beyond

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 

Autoscaling Kafka: Strategies for Confluent Cloud and Beyond

  • 1. Autoscaling Confluent Cloud: Should We? How Would We? Julie Price, Senior Product Manager; Amanda Gilbert, Staff Solutions Engineer
  • 2. Autoscaling (lower cost, just-in-time provisioning, improved resource utilization, elasticity, fault tolerance, reduced operations) … is really just a capacity planning tactic
  • 3. Capacity: the ability to hold or contain (CPU, memory, storage). Scaling up: adding additional resources to provisioned nodes. Scaling out: adding additional nodes with provisioned capacity.
  • 4. Auto scaling: increase or decrease resources on demand; policy based on load across a fleet of servers; cool-down period: the time after a scaling event when scaling is on hold; CSPs have native autoscaling approaches. [Diagram labels: hardware, host OS, container engine, containers (app with bins & libs), autoscaling group]
  • 7. [Diagram: Kafka cluster (brokers), Kafka Connect (workers), ksqlDB (servers), producer/consumer apps (app nodes)]
  • 8. … but I’m running on Confluent Cloud. Learn more about Kora. [Diagram labels: customers; multi-cloud networking & routing tier; cells across three AZs (network, compute, object storage); metadata; global control plane; metrics & observability; real-time feedback data; data balancing; health checks; durability audits; other Confluent Cloud services (Connect, processing, governance)]
  • 9. wouldn’t it be great if adding capacity meant more capacity?
  • 10. Cluster Capacity is often a Ceiling, not a Floor. Kafka cluster: throughput 100/300 MBps, p99 latency 6 ms. Around the cluster: Kafka Connect (decreased throughput), ksqlDB (poor query performance), consumer app (decreased throughput).
  • 11. Bottleneck 1: Number of Partitions. Limiting factor: throughput is constrained by the number of partitions on your topic, regardless of cluster size. Benchmark it: measure the throughput on a single production partition. Size it: max(t/p, t/c), where t = target throughput, p = measured throughput on 1 partition, c = consumption. [Auto]scale it? Generally, no: provision based on the higher-scale workload; beware of ordering guarantees and no downsizing. Metrics: received_bytes, sent_bytes. Your partitioning strategy matters too!
  • 12. Bottleneck 2: Producer/Consumer Configs. Limiting factor: client configurations. Benchmark it: baseline with the Kafka perf tools; baseline your app’s throughput. Remediate it: use benchmarks to guide the remediation; alter configs based on service goals; re-test. [Auto]scale it? If clients are tuned correctly and the cluster is overutilized, a scaling event could be required. Metrics: depend on the service goal (throughput, lag, etc.). Configs should be aligned to your service goals.
  • 13. Bottleneck 3: Client Connections. Limiting factor: CPU [attempts] & memory [requests]; denied connection attempts; added latency. Benchmark it: measure total connections, connection attempts, and requests. Remediation: longer-lived connections for fewer attempts; evaluate your app architecture & configs; audit logs for rogue clients. [Auto]scale it? Scaling might not solve your problem; address client configs and/or logic. Metrics: active_connection_count, request_count. Rogue clients: an unsuccessful connection attempt still counts as a connection attempt. Clients open a connection to the partition leader after authentication.
  • 14. Additional Bottlenecks to Explore. Partition imbalance: uneven load distribution across brokers. Consumer parallelism: scale instances up to the # of partitions. Connector throughput: limited by the # of tasks; use capacity-based scaling. Stream processing throughput: relies on topology optimization & parallelism.
  • 15. One Size Fits All?
  • 16. Scaling Needs are Driven by Use Case. Predictable Spikes for Major Events: these scaling activities are based on predictable changes in capacity requirements for your application (Black Friday, Super Bowl Sunday, campaign-driven, product release). These spikes could also be predicted by ML models monitoring activity and anticipating an increase in demand. Predictable Spikes for Minor Events: these scaling activities are based on predictable changes in capacity that happen on a regular basis (nightly batch jobs, daily demand spikes). These spikes may or may not require additional capacity; their predictable and frequent nature makes these use cases a good fit for automated scaling OR overprovisioning. Unpredictable Spikes: these scaling activities are based on unforeseen changes in capacity requirements (viral social media trend, weather event, unknown supply chain issue). These spikes can cause teams to scramble to avoid downtime.
  • 17. Predictable Spikes for Major Events: predictable, infrequent changes in capacity requirements. Guidelines: benchmark your cluster & resources (connectors, clients, and stream processing apps) and scale test prior to the event; ensure proper client configurations; scale out proactively, based on expected throughputs, client connections, etc.; set up additional monitoring during the flex period. [Diagram labels: store, POS system, inventory database, distribution center, cloud DB/apps, web & mobile, Snowflake, Google BigQuery, Google Cloud Storage, cloud apps (including kStreams apps), downstream systems]
  • 18. Predictable Spikes for Minor Events: predictable, frequent changes in capacity requirements. Guidelines: benchmark your applications, cluster & resources to understand the amount of traffic they can handle; provision your Kafka cluster to meet the highest demand point (app instances can be scaled in place); choose the # of partitions based on max expected throughput.
  • 19. Unpredictable Spikes: unforeseen changes in capacity requirements. Happy state: load > 70% raises an alert; RCA determines if scaling is required; kick off scaling scripts and/or remediate the cause. Last resort state: load > 70% raises an alert that goes unanswered (zZzZ); load > 80% starts the autoscaling process. Guidelines: autoscaling should be a backup plan, not a one-stop shop; monitor, alert, benchmark (alert on & remediate proactively); alert before you hit your autoscaling threshold (which should be higher); thresholds depend on how tolerant you are to app connectivity failures & consumer lag; use natural bounds and long cool-down periods.
  • 20. A Tailored Capacity Plan. Service goals: are you optimizing for throughput, latency, durability, availability, or something else? What type of degradations can you handle (and for how long)? Do you have internal or external SLOs? End-to-end architecture: capacity planning on cluster capacity alone produces bottlenecks; examine each part of your pipeline & create a scaling plan; lean on tools that support scalability for each part of the stack.
  • 22. Create a Scaling Plan. Scaling is an end-to-end problem and needs a complete solution. Steps (per use case, per component): determine service goals & objectives; define capacity requirements; create a capacity plan based on benchmarks & runbooks for different scaling scenarios; automate where possible; benchmark components at baseline (& test for bottlenecks). You cannot effectively autoscale until you benchmark. [Diagram: Kafka / CC cluster (brokers), producer/consumer applications, kStreams apps, Kafka Connect (workers), ksqlDB (servers)]
  • 23. Create a Maintenance Plan. Monitor your pipeline (client and server side) and set up alerting based on benchmarked thresholds. Scale test before known scaling events. Benchmark based on configs aligned to your service goals; tune where necessary. Set up client quotas. Protect against bad actors by defining an access management strategy & monitoring your audit logs. Self-managed components can be hosted on k8s for elastic scaling. If you’re running CP (Confluent Platform), utilize SBC (Self-Balancing Clusters) & tiered storage (default for CC). [Diagram: Kafka / CC cluster (brokers), producer/consumer applications, kStreams apps, Kafka Connect (workers), ksqlDB (servers)]
  • 24. Our Data Pipeline. Deployment setup: CC Enterprise cluster w/ max throughput of 250/750 MBps; a single connector benchmarked at 15 MBps per task; a kStreams application with 1 app instance; a single consumer app. Service goals: expected avg throughput of 10 MBps write, 3x fanout; peak throughputs of 50/150 MBps; low tolerance for degradation at peak.
  • 25. Our Capacity Plan. Scaling plan: cluster: no need to scale, make sure capacity stays below limits; Connect: automate capacity-based task scaling; kStreams: automate scaling of app instances; consumer: automate scaling of instances up to the # of partitions. Maintenance plan: monitor all resources & set up alerting; scale test our use case at peak throughputs; benchmark our consumer; examine the kStreams topology; monitor audit logs.
  • 26. A Note for Platform Teams. For platform teams with internal teams consuming Kafka: client quotas; best practices / templated approaches to clients; proactive monitoring & auditing; a set of automated scripts for scaling cluster, Connect, etc. (and/or automated scaling for self-hosted Connect); a set of remediation runbooks / troubleshooting guides.
  • 27. If You Want to Autoscale Confluent Cloud. Signals: high cluster load (> 70%, > 80%), high consumer lag, high producer buffer. Honorable mention: connections/requests.
  • 28. If your cluster is Dedicated & your Ducks are in a Row. Dedicated cluster with 2 CKUs: when cluster load > 80% & [high or rapidly growing consumer lag | high producer buffer], get the # of CKUs and add a CKU. Don’t forget a cooldown period. Scale up aggressively, scale down conservatively.
  • 29. Tools to Consider: CC Metrics API, client JMX metrics, CC Admin API.
  • 31. Upcoming for Kafka & Confluent (for predictable spikes for major events, predictable spikes for minor events, and unpredictable spikes): Enterprise Tier (GA today), Client-Side Metrics Update [KIP-714], Multi-CKU Shrink, Fast Scaling.
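
The sizing rule on slide 11, max(t/p, t/c), can be sketched in a few lines of Python. This is a minimal illustration rather than a tool from the talk: the function name and arguments are hypothetical, c is read here as the measured consumption throughput on one partition, and rounding up to a whole partition is an added assumption.

```python
import math


def partitions_needed(target: float, produce_per_partition: float,
                      consume_per_partition: float) -> int:
    """Partition count from max(t/p, t/c), rounded up to a whole partition.

    target                -- t, target throughput (e.g. MBps)
    produce_per_partition -- p, measured throughput on one partition
    consume_per_partition -- c, read here as consumption throughput on one partition
    All three must use the same unit.
    """
    if min(target, produce_per_partition, consume_per_partition) <= 0:
        raise ValueError("throughputs must be positive")
    return math.ceil(max(target / produce_per_partition,
                         target / consume_per_partition))


if __name__ == "__main__":
    # Illustrative numbers only: the slower side (here the consumer) sets the count.
    print(partitions_needed(target=150, produce_per_partition=10,
                            consume_per_partition=5))
```

Because the partition count cannot be downsized later, it seems reasonable to feed this function the peak target throughput rather than the average.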
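
The guidance on slides 19, 27, and 28 (alert at 70% load, autoscale only above 80% load combined with high consumer lag or a high producer buffer, respect bounds and a cooldown, scale up aggressively and down conservatively) can be sketched as a pure decision function. Everything beyond the 70% and 80% thresholds is an assumption made for illustration: the scale-down threshold, the CKU bounds, the cooldown lengths, and all names are hypothetical, and the sketch deliberately makes no calls to any Confluent API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Policy:
    alert_load: float = 0.70             # from the slides: alert first
    scale_up_load: float = 0.80          # from the slides: autoscaling is the last resort
    scale_down_load: float = 0.30        # assumption: the slides give no scale-down threshold
    min_ckus: int = 2                    # assumption: natural lower bound
    max_ckus: int = 6                    # assumption: natural upper bound
    up_cooldown_s: float = 1800.0        # assumption
    down_cooldown_s: float = 6 * 3600.0  # assumption: longer, to scale down conservatively


def decide(policy: Policy, load: float, lag_high: bool, buffer_high: bool,
           ckus: int, now_s: float, last_scale_s: Optional[float]) -> Tuple[str, int]:
    """Return (action, target_ckus); action is scale_up, scale_down, alert, or hold."""
    since = float("inf") if last_scale_s is None else now_s - last_scale_s
    # Scale up only when load is high AND a client-side symptom confirms it.
    if load > policy.scale_up_load and (lag_high or buffer_high):
        if ckus < policy.max_ckus and since >= policy.up_cooldown_s:
            return "scale_up", ckus + 1
        return "alert", ckus  # at the upper bound or still cooling down
    if load > policy.alert_load:
        return "alert", ckus  # humans investigate before automation acts
    if (load < policy.scale_down_load and ckus > policy.min_ckus
            and since >= policy.down_cooldown_s):
        return "scale_down", ckus - 1
    return "hold", ckus


if __name__ == "__main__":
    print(decide(Policy(), load=0.85, lag_high=True, buffer_high=False,
                 ckus=2, now_s=10_000.0, last_scale_s=None))
```

Keeping the decision separate from the code that reads metrics and resizes the cluster makes the thresholds easy to test against benchmarked values before any automation is wired up.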
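
The example pipeline on slides 24 and 25 can be checked with simple arithmetic. The sketch below uses the numbers from those slides (250/750 MBps cluster limits, 50/150 MBps peak, 15 MBps per connector task); it assumes, hypothetically, that the connector has to carry the full peak write throughput, and the helper names and the partition count in the usage example are illustrative.

```python
import math


def utilization(peak_mbps: float, limit_mbps: float) -> float:
    """Fraction of a cluster throughput limit used at peak."""
    return peak_mbps / limit_mbps


def connect_tasks_needed(peak_mbps: float, per_task_mbps: float) -> int:
    """Tasks required if each task sustains its benchmarked throughput."""
    return math.ceil(peak_mbps / per_task_mbps)


def consumer_instances(desired: int, partitions: int) -> int:
    """Consumer instances are only useful up to the number of partitions."""
    return min(desired, partitions)


if __name__ == "__main__":
    ingress = utilization(50, 250)   # peak write vs. cluster write limit
    egress = utilization(150, 750)   # peak read (3x fanout) vs. cluster read limit
    tasks = connect_tasks_needed(50, 15)
    print(f"ingress {ingress:.0%}, egress {egress:.0%}, connector tasks {tasks}")
    # Hypothetical topic with 12 partitions: asking for 20 instances yields 12.
    print(consumer_instances(desired=20, partitions=12))
```

Under these assumptions the cluster stays well below its limits at peak, which is consistent with the plan's "no need to scale" for the cluster and its focus on scaling Connect tasks and app instances instead.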