"So, you have built/inherited/discovered one of your many Kafka clusters. How now do you know that it is good enough to sustain and grow your applications? Do you stress test it as a data store, a messaging system, as middleware, or like a REST API? Or are you in production and worried about the next unprecedented surge? Find out from those who have asked and answered before.
Repeatable, and recordable stress testing for Kafka is a challenge for novices and some legends. Real supplies like storage, compute, network, threads etc. do not naturally map to demands of messages, bytes, and milliseconds. In the session, we will cover ways to:
* Define parameters and variables before beginning
* Accommodate for changing conditions - brokers, applications, config, network
* Overlap infrastructure, test design, latency, and throughput
* Meet cost, service level agreements, and multi-tenancy needs while testing
* Do it all without entirely relying on estimation, and extrapolation
We will also discuss common and innovative practices observed in the industry to meet this challenge. At the end of the session, you would walk away with the knowledge needed to set up a repeatable stress test suite without stress."
1. Chill, Distill, No Overkill: Best Practices to Stress Test Kafka
Siva Kunapuli
2. About me
Teacher, Programmer, Engineer, Architect
• Started early, still at it
• 15+ years
• Services, Product, Consulting, Technical Account Management
Customers
• Financial services, strategic
• “Customer with a problem”
Kafkaesque
• Fail often, and learn
• Challenging to operationalize, but useful
• Around the world
3. Stress testing Kafka: the challenge
Paradigm driven
• Protocol interactions a.k.a. real-time vs. batch
• Data storage vs. distribution
• Distributed system – components, scale, changing conditions
Resources
• Simple, commodity, cloud
• Provisioning demands vs. reality
Structure
• Parameters – test design
• Recordability, repeatability
• Costs and SLAs
8. Stress testing primer
Robustness of setup
• Can your system handle stress gracefully?
• Does it fail where and when you’re not looking?
• What is usual, and what is unusual?
Spanning stress (i.e., not selective)
• Component/framework models – Connect, Streams, Core
• Resources rather than use case(s) – storage, IO throughput, memory, CPU
• Concurrency, data access – many clients, simulated network conditions
Mission critical
• No part is open to failure
• Use case is the driver
13. Pre-requisites
Concern | Ready | Bonus points
Environment | Identified scaling procedures – adding brokers, storage | Automation for scaling
| Identified components – connect, streams, core | Good component diagram
| At least at production scale | Tear down after done
| Identified network setup | Ping, and packet roundtrip
Benchmarks | Active (normal) benchmarks published | Repeating at regular intervals
| Identified SLA | Negotiated, and signed off
Multi-tenancy | Quotas set |
Observability | Full cluster metrics captured, and visualized | Application metrics
17. Pre-requisites continued
Benchmarking
• Other sessions, lightning talks
• OpenMessaging benchmark framework
• Simulate production load – multiple applications, clients, connectors, change data capture (CDC) etc.
Clean container environments
• No massively parallel, multi-function, single-purpose, all-encompassing clusters
Observability
• APM tools, or DIY
• Must-have – production, consumption, topic level, throughput metrics
Multi-tenancy
• Stop, and do not move forward without quotas (a minimal sketch follows this slide)
• Can pose challenges in separation even with quotas – cluster downtime
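Since the deck stops hard at quotas, here is a minimal sketch of setting client quotas programmatically with the Kafka Admin API before a stress run. The bootstrap address, client id, and byte rates are placeholder assumptions; the quota config keys (producer_byte_rate, consumer_byte_rate) are standard Kafka quota names.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.quota.ClientQuotaAlteration;
    import org.apache.kafka.common.quota.ClientQuotaEntity;

    public class StressQuotas {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical address
        try (Admin admin = Admin.create(props)) {
          // Quota entity keyed by client.id; every stress client should set this id.
          ClientQuotaEntity entity = new ClientQuotaEntity(
              Map.of(ClientQuotaEntity.CLIENT_ID, "stress-test-client"));
          // Cap produce and fetch throughput so a runaway test cannot starve other tenants.
          ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, List.of(
              new ClientQuotaAlteration.Op("producer_byte_rate", 50.0 * 1024 * 1024),
              new ClientQuotaAlteration.Op("consumer_byte_rate", 50.0 * 1024 * 1024)));
          admin.alterClientQuotas(List.of(alteration)).all().get();
        }
      }
    }

Setting the quota through the Admin API (rather than by hand) also makes it part of the recorded, repeatable test setup.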
18. A good stress test for Kafka
Stick to the paradigm
• Request/response, topic semantics
• Push data, and consume
• Application is less important
Include all parameters (see the parameter sketch after this slide)
• Component tests
• High concurrency tests – race conditions
• Resource tests – network, IO, CPU
• Specific use cases
• Break something, and recover
• Change conditions
• Memory leaks
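One way to make "include all parameters" concrete is to enumerate the whole test matrix before any run starts. A minimal sketch with illustrative values only; the dimension names and ranges are assumptions, not a prescription:

    import java.util.List;

    public class TestMatrix {
      // One row per stress-test dimension; values below are illustrative.
      record Params(int messageBytes, int targetMsgsPerSec, int producerThreads, String acks) {}

      public static void main(String[] args) {
        List<Integer> sizes = List.of(128, 1024, 10_240);
        List<Integer> rates = List.of(10_000, 50_000, 100_000);
        List<Integer> threads = List.of(1, 8, 32);
        List<String> acks = List.of("1", "all");
        // Enumerate the full cross product so every combination is named before any run.
        for (int s : sizes)
          for (int r : rates)
            for (int t : threads)
              for (String a : acks)
                System.out.println(new Params(s, r, t, a)); // feed each row to the harness
      }
    }

Each printed row can become the recorded name of a run, which pays off later under "Recordability, repeatability".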
24. Kafka internals continued
Consumer Group protocol
• Partition assignment, and subscription
• Group coordinator
• Rebalance triggers
• Offset management
Control plane
• Controller with and without KRaft
• Topic metadata
• Replication
Topics
• Compaction
• Message keys, and partitions
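Rebalance triggers are easy to observe during a stress run with a plain consumer and a ConsumerRebalanceListener. A minimal sketch; the broker address, group id, and topic name are assumptions:

    import java.time.Duration;
    import java.util.Collection;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class RebalanceWatcher {
      public static void main(String[] args) {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "stress-watch");
        p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> c = new KafkaConsumer<>(p)) {
          c.subscribe(List.of("stress-topic"), new ConsumerRebalanceListener() {
            @Override public void onPartitionsRevoked(Collection<TopicPartition> parts) {
              System.out.println(System.currentTimeMillis() + " revoked: " + parts);
            }
            @Override public void onPartitionsAssigned(Collection<TopicPartition> parts) {
              System.out.println(System.currentTimeMillis() + " assigned: " + parts);
            }
          });
          while (true) c.poll(Duration.ofSeconds(1)); // counting rebalances, not records
        }
      }
    }

Timestamped revoke/assign pairs correlate nicely with latency dips in the topic-level metrics.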
25. Component tests
Connect
• Change Data Capture (CDC) has row, timestamp dependency
• Database/data store reads/writes
• Protocol shifts – Kafka to HTTP and back
Streams
• Stateless vs. stateful
• Test against real topology
• Focus on changelog topics, and state stores
• Streams application reset tool
Others
• Non-Java clients may have different concurrency
• ZooKeeper/KRaft
• Avoid Admin API tests especially for topic metadata, and partition changes
• Geo-replication
• Multi-tenancy
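During Connect-focused stress, it helps to watch connector and task state continuously through the Connect REST API; GET /connectors/{name}/status is a standard endpoint. A minimal sketch; the Connect host, port, and connector name are assumptions:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ConnectStatusPoll {
      public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Kafka Connect REST endpoint; host and connector name are hypothetical.
        URI status = URI.create("http://connect-1:8083/connectors/cdc-orders/status");
        while (true) {
          HttpResponse<String> r = http.send(
              HttpRequest.newBuilder(status).GET().build(),
              HttpResponse.BodyHandlers.ofString());
          // A FAILED task under load is a test finding, not noise; record it with a timestamp.
          System.out.println(System.currentTimeMillis() + " " + r.statusCode() + " " + r.body());
          Thread.sleep(5_000);
        }
      }
    }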
29. High concurrency
Multiple producer, consumer application instances are better
• Can help establish keying/partitioning issues
• Containerization can help, but don’t go overboard
• Different network fragments i.e., different data centers, or availability zones are better
Small, numerous messages are better
• Large messages break concurrency tests and are not normal for Kafka
• Increasing, and decreasing number of messages could be part of test
Transactions/EOS
• If using Transactions API, several additional considerations including Transaction Coordinator are in play
• Exactly Once Semantics (EOS) influences stress
Race conditions
• Not enough partitions on consumer group topic(s)
• Rebalances
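A minimal sketch of the "multiple instances, small messages" advice: one producer instance per thread, small payloads, varied keys to exercise partitioning, and an error counter. Broker address, topic, and counts are placeholder assumptions:

    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    public class ConcurrentProducers {
      public static void main(String[] args) throws Exception {
        int instances = 16;                 // many small clients, per the slide
        byte[] payload = new byte[200];     // small messages stress coordination, not bandwidth
        AtomicLong errors = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        for (int i = 0; i < instances; i++) {
          final String clientId = "stress-producer-" + i;
          pool.submit(() -> {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
            p.put(ProducerConfig.CLIENT_ID_CONFIG, clientId);
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(p)) {
              for (long n = 0; n < 1_000_000; n++) {
                byte[] key = Long.toString(n).getBytes(); // vary keys to exercise partitioning
                producer.send(new ProducerRecord<>("stress-topic", key, payload),
                    (meta, e) -> { if (e != null) errors.incrementAndGet(); });
              }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, java.util.concurrent.TimeUnit.HOURS);
        System.out.println("send errors: " + errors.get());
      }
    }

Running the same class on hosts in different data centers or availability zones covers the "different network fragments" point without changing the code.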
34. Resource tests
Resource | Before test | Observe
IO throughput | Identify limits of storage class – device hardware, or virtualization | Throughput should hit or exceed limit
| Stage multiple devices, storage directories | Usage of all log dirs should increase
| Allow for ongoing snapshots/backups | Effects of snapshots/backups, and their failure
Network | Identify provisioned capacity, ping, and packet roundtrip | Message bytes for replication + produce/consume should match
| Network partitions known | Hit all possible network fragments, and observe differences
CPU | Benchmarks for various compression types known | CPU utilization should remain low
| If security protocol is mTLS | Test with real certificates and right algorithms
Memory | Must include if using Streams, or Connect | JVM, and RocksDB metrics
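Most of the "Observe" column maps to broker JMX metrics. A minimal sketch that samples kafka.server BytesInPerSec remotely; it assumes the broker was started with remote JMX enabled, and the host and port are hypothetical:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerBytesIn {
      public static void main(String[] args) throws Exception {
        // Hypothetical: broker-1 exposes remote JMX on port 9999.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
          MBeanServerConnection conn = jmx.getMBeanServerConnection();
          ObjectName bytesIn = new ObjectName(
              "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec");
          while (true) {
            // OneMinuteRate smooths spikes; compare it against the provisioned device limit.
            Object rate = conn.getAttribute(bytesIn, "OneMinuteRate");
            System.out.println(System.currentTimeMillis() + " BytesInPerSec=" + rate);
            Thread.sleep(10_000);
          }
        }
      }
    }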
39. Use case driven
Not every use case can cause stress
• Use case needs to be able to push structural boundaries of Kafka i.e., paradigms, components, or resources
• Criticality ↔ Latency ↔ Throughput ↔ Cost
Run full use case with end-to-end latency metrics
• Introduce application-specific metrics, simple JMX will do
• End-to-end latency for critical use cases must be designed upfront, and included in the SLA
• Data availability, and system boundaries must be accounted for
Production-critical use cases with low latency need good infrastructure
• Purpose of stress testing is to establish system limits, not necessarily to provide insights outside of resilience
• Repeated test cycles are not substitutes for good infrastructure
• Network is usually the bottleneck
Substantial parallelism requires specific tuning
• Thousands of parallel connections, while supported, may create unknown system states
• Port/socket level limits, TCP and other buffers
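A minimal end-to-end latency probe: with the default CreateTime timestamps, a record's timestamp is the producer's send time, so a consumer can approximate end-to-end latency and report the 95th percentile. Clock skew across hosts applies; the broker address and topic are assumptions:

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class E2ELatency {
      public static void main(String[] args) {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "latency-probe");
        p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        List<Long> samples = new ArrayList<>();
        try (KafkaConsumer<byte[], byte[]> c = new KafkaConsumer<>(p)) {
          c.subscribe(List.of("stress-topic"));
          while (samples.size() < 100_000) {
            ConsumerRecords<byte[], byte[]> records = c.poll(Duration.ofSeconds(1));
            long now = System.currentTimeMillis();
            // record.timestamp() is the producer CreateTime by default, so the
            // difference approximates end-to-end latency (clock skew caveats apply).
            for (ConsumerRecord<byte[], byte[]> r : records) samples.add(now - r.timestamp());
          }
        }
        Collections.sort(samples);
        System.out.println("p95 ms: " + samples.get((int) (samples.size() * 0.95)));
      }
    }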
40. Chaos can be fun – staying calm, and breaking Kafka
• Stop brokers, network devices, and storage devices
• Pull the plug, cord, or anything that can be pulled
• Remove certs, change firewall rules, and necessary software components
Increase number of client instances, number of messages
• Continuous increase will start to hit message level latency, and throughput
• Topic level metrics like bytes in will start to dip
• Focus on 95th percentile for stress tests
For critical use cases, identify and introduce breaking points
• Increase number of database rows, or remove Hadoop partitions
• Delete state stores, backup state stores and observe
• Where load balancers are in play, test them for real scenarios
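After each chaos action, verify that the cluster heals rather than assuming it. A minimal sketch that sweeps all topics for under-replicated or leaderless partitions with the Admin API; the bootstrap address is an assumption, and kafka-clients 3.x is assumed for allTopicNames():

    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class PostChaosCheck {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        try (Admin admin = Admin.create(props)) {
          for (String name : admin.listTopics().names().get()) {
            TopicDescription d = admin.describeTopics(java.util.List.of(name))
                .allTopicNames().get().get(name);
            for (TopicPartitionInfo pi : d.partitions()) {
              // After a broker kill, ISR should shrink and then heal; flag what has not.
              if (pi.isr().size() < pi.replicas().size())
                System.out.println(name + "-" + pi.partition() + " under-replicated, isr=" + pi.isr());
              if (pi.leader() == null)
                System.out.println(name + "-" + pi.partition() + " has no leader");
            }
          }
        }
      }
    }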
44. Memory and leaks
Cannot say where
• Memory leaks can occur in all components
• JVM tuning is not generally required unless setting up for a specific environment or use case
• Tuning is likely to go overboard – think GC
Use profiler, and have runbook
• Get familiar with the usage of a JVM profiler, and have the ability to attach to Kafka components
• Can help with application debugging also
May be more likely for REST, and other interactions
• The Kafka protocol itself doesn't rely too much on memory
• Therefore, understand and test with the angle of where data is moving and why
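Before attaching a full profiler, a cheap heap-trend logger run inside (or alongside) the component under stress can tell you whether there is anything to chase. A minimal sketch using only the standard JVM management beans:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapWatch {
      public static void main(String[] args) throws InterruptedException {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        while (true) {
          MemoryUsage heap = mem.getHeapMemoryUsage();
          // A sawtooth that trends upward across GC cycles is the leak signature
          // worth chasing with a real profiler attached to the suspect component.
          System.out.printf("%d used=%dMB committed=%dMB%n",
              System.currentTimeMillis(), heap.getUsed() >> 20, heap.getCommitted() >> 20);
          Thread.sleep(10_000);
        }
      }
    }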
45. Recording results, and recovering
Results should be metrics
• Drop or change in metrics under specific conditions should be captured
• Functional testing i.e., application changes are interesting observations but not necessarily tied to stress testing (except critical use cases)
Brokers should be up, retention is your friend
• Bring up any lost brokers, and recovery should be straightforward
• Topic retention will help remove any large volumes of messages; set it low when stress testing (see the sketch after this slide)
Break glass
• Procedures should be in place for any stress testing. For Kafka, this may include the ability to drop and create a new cluster.
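Lowering retention for the duration of a test is one incrementalAlterConfigs call. A minimal sketch; the topic name, broker address, and ten-minute value are assumptions, and the original value should be restored in teardown:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class LowRetention {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        try (Admin admin = Admin.create(props)) {
          ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "stress-topic");
          // Ten minutes of retention keeps disk usage bounded while the test floods the
          // topic; restore the normal value as part of the teardown script.
          AlterConfigOp op = new AlterConfigOp(
              new ConfigEntry("retention.ms", "600000"), AlterConfigOp.OpType.SET);
          admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
      }
    }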
49. Repeatability
Use scripts, and chaos testing
• All tests, including deploying multiple instances, can be scripted
• Tie into any popular testing framework
• Think ahead, and get automation and continuous deployment ready during cluster build
Inheritance framework for Kafka clusters
• Top secret project which no one (including me) works on :)
• Start with metrics, benchmarking, and move to stress testing
• Critical use cases cannot be built on poorly understood systems
Cloud vs. on-prem
• On-prem systems are excellent candidates for stress testing because expansion and bug fixing take longer
• Patching of cloud instances is also a good opportunity to repeat
• Automation will help in both cases
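Scripting does not need a framework to start: even wrapping Kafka's stock kafka-producer-perf-test tool (it ships in the distribution's bin/ directory) in a loop gives identical, repeatable invocations. A minimal sketch; the topic, record counts, and broker address are assumptions:

    import java.util.List;

    public class PerfRunner {
      public static void main(String[] args) throws Exception {
        // Drive the stock perf-test tool over a range of record sizes so every
        // run is launched, named, and logged the same way.
        for (int size : List.of(128, 1024, 10_240)) {
          Process p = new ProcessBuilder(
              "kafka-producer-perf-test.sh",
              "--topic", "stress-topic",
              "--num-records", "1000000",
              "--record-size", String.valueOf(size),
              "--throughput", "-1",                     // -1 = no client-side throttle
              "--producer-props", "bootstrap.servers=broker-1:9092") // hypothetical
              .inheritIO()
              .start();
          if (p.waitFor() != 0) throw new IllegalStateException("run failed at size " + size);
        }
      }
    }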
51. Some real(ish) scenarios – design a good stress test
Sensor data from multiple devices
• Thousands of devices sending data to cluster
• Some are real-time requiring immediate response, others send large batches
• Messages need to be analyzed almost instantaneously
CDC from Oracle, streams, high volume
• Continuously increasing transaction volume
• Streams processing with joins/other aggregates
• High volume (>5000 messages/sec)
Geo-distributed, ultra low latency
• Cluster serves multiple geographies
• Requires ultra-low latency for messages (<10 milliseconds)
• Volume is low, but will increase as cluster adoption increases