Chill, Distill, No Overkill: Best Practices to Stress Test Kafka
Siva Kunapuli
About me
Teacher, Programmer, Engineer, Architect
• Started early, still at it
• 15+ years
• Services, Product, Consulting, Technical Account Management
Customers
• Financial services, strategic
• “Customer with a problem”
Kafkaesque
• Fail often, and learn
• Challenging to operationalize, but useful
• Around the world
Stress testing Kafka: The challenge
Paradigm driven
• Protocol interactions a.k.a. real-time vs. batch
• Data storage vs. distribution
• Distributed system – components, scale, changing conditions
Resources
• Simple, commodity, cloud
• Provisioning demands vs. reality
Structure
• Parameters – test design
• Recordability, repeatability
• Costs and SLAs
01 Chill
02 Distill
03 No Overkill
04 = Stress free stress testing
Stress testing primer
Robustness of setup
• Can your system handle stress gracefully?
• Does it fail where and when you’re not looking?
• What is usual, and what is unusual?
Spanning stress (i.e., not selective)
• Component/framework models – Connect, Streams, Core
• Resources rather than use case(s) – storage, IO throughput, memory, CPU
• Concurrency, data access – many clients, simulated network conditions
Mission critical
• No part is open to failure
• Use case is the driver
Kafka – an introduction
Pre-requisites
Concern | Ready | Bonus points
Environment | Identified scaling procedures – adding brokers, storage | Automation for scaling
 | Identified components – Connect, Streams, Core | Good component diagram
 | At least at production scale | Tear down after done
 | Identified network setup | Ping, and packet roundtrip
Benchmarks | Active (normal) benchmarks published | Repeating at regular intervals
 | Identified SLA | Negotiated, and signed off
Multi-tenancy | Quotas set |
Observability | Full cluster metrics captured, and visualized | Application metrics
Pre-requisites continued
Benchmarking
• Other sessions, lightning talks
• OpenMessaging benchmark framework
• Simulate production load – multiple applications, clients, connectors, change data (CDC) etc.
Clean container environments
• No massively parallel multi-function, single purpose, all encompassing clusters
Observability
• APM tools, or DIY
• Must have – production, consumption, topic level, throughput metrics
Multi-tenancy
• Stop, and do not move forward without quotas
• Can pose challenges in separation even with quotas – cluster downtime
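The quota advice above can be made concrete with a little arithmetic. The sketch below is a minimal illustration, not a Kafka API: it splits an assumed per-broker byte-rate budget across tenants by weight and holds back some headroom. The function name, the tenant names, and the 30% headroom figure are all hypothetical.

```python
def split_quota(broker_bytes_per_sec, tenant_weights, headroom=0.3):
    """Divide a broker's byte-rate budget among tenants by weight.

    headroom is the fraction held back (assumed here to cover
    replication and recovery traffic); the rest is shared by weight.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    total_weight = sum(tenant_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    usable = broker_bytes_per_sec * (1 - headroom)
    return {
        tenant: int(usable * weight / total_weight)
        for tenant, weight in tenant_weights.items()
    }


if __name__ == "__main__":
    # Hypothetical numbers: 100 MB/s per broker, three tenants.
    quotas = split_quota(100_000_000, {"payments": 2, "audit": 1, "analytics": 1})
    for tenant, rate in quotas.items():
        print(f"{tenant}: {rate} bytes/sec")
```

Rates derived this way could feed Kafka's client quota settings (producer_byte_rate and consumer_byte_rate), though the right split depends on the tenants actually sharing the cluster.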
A good stress test for Kafka
Stick to the paradigm
• Request/response, topic semantics
• Push data, and consume
• Application is less important
Include all parameters
• Component tests
• High concurrency tests – race conditions
• Resource tests – network, IO, CPU
Specific use cases
• Break something, and recover
• Change conditions
• Memory leaks
Kafka internals
Kafka internals continued
Consumer Group protocol
• Partition assignment, and subscription
• Group coordinator
• Rebalance triggers
• Offset management
Control plane
• Controller with and without KRaft
• Topic metadata
• Replication
Topics
• Compaction
• Message keys, and partitions
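To see why partition assignment matters for a stress test, it helps to model it. The sketch below is a simplified illustration in the spirit of Kafka's range assignment strategy for a single topic, not the client's actual implementation; the function name and consumer IDs are hypothetical. It shows that once consumers outnumber partitions, the extra consumers receive nothing.

```python
def range_assign(num_partitions, consumers):
    """Assign partitions 0..num_partitions-1 of one topic to consumers.

    Consumers are sorted, each gets a contiguous block, and the first
    few consumers absorb any remainder.
    """
    members = sorted(consumers)
    if not members:
        return {}
    per_consumer, extra = divmod(num_partitions, len(members))
    assignment = {}
    start = 0
    for index, member in enumerate(members):
        count = per_consumer + (1 if index < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment


if __name__ == "__main__":
    # Three consumers on a two-partition topic: one consumer sits idle.
    print(range_assign(2, ["c1", "c2", "c3"]))
```

Adding or removing a consumer changes the whole mapping, which is one reason rebalances show up so prominently under stress.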
Component tests
Connect
• Change Data Capture (CDC) has row, timestamp dependency
• Database/data store reads/writes
• Protocol shifts – Kafka to HTTP and back
Streams
• Stateless vs. stateful
• Test against real topology
• Focus on changelog topics, and state stores
• Streams application reset tool
Others
• Non-Java clients may have different concurrency
• ZooKeeper/KRaft
• Avoid Admin API tests especially for topic metadata, and partition changes
• Geo-replication
• Multi-tenancy
High concurrency
Multiple producer, consumer application instances are better
• Can help establish keying/partitioning issues
• Containerization can help, but don’t go overboard
• Different network fragments i.e., different data centers, or availability zones are better
Small, numerous messages are better
• Large messages break concurrency tests and are not normal for Kafka
• Increasing, and decreasing number of messages could be part of test
Transactions/EOS
• If using Transactions API, several additional considerations including Transaction Coordinator are in play
• Exactly Once Semantics (EOS) influences stress
Race conditions
• Not enough partitions on consumer group topic(s)
• Rebalances
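Keying and partitioning issues can be checked before any load is generated. The sketch below is a minimal illustration that counts how a set of message keys would spread over partitions and reports a skew ratio (busiest partition divided by the mean). It uses CRC32 as a stand-in hash; Kafka's Java client hashes keyed records with murmur2, so real placements will differ. The function names are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    # Stand-in hash (CRC32), chosen only because it ships with Python.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_skew(keys, num_partitions):
    """Return (messages per partition, busiest partition / mean load)."""
    keys = list(keys)
    if not keys:
        raise ValueError("need at least one key")
    counts = Counter(partition_for(key, num_partitions) for key in keys)
    per_partition = [counts.get(p, 0) for p in range(num_partitions)]
    mean = len(keys) / num_partitions
    return per_partition, max(per_partition) / mean


if __name__ == "__main__":
    # Hypothetical workload: 90% of messages share a single hot key.
    keys = ["hot-device"] * 900 + [f"device-{i}" for i in range(100)]
    loads, skew = partition_skew(keys, 6)
    print(loads, round(skew, 2))
```

A skew ratio near 1 means even spread; a ratio close to the partition count means one partition is carrying nearly everything, and adding consumers will not help.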
Resource tests
Resource | Before test | Observe
IO throughput | Identify limits of storage class – device hardware, or virtualization | Throughput should hit or exceed limit
 | Stage multiple devices, storage directories | Usage of all log dirs, should increase
 | Allow for ongoing snapshots/backups | Effects of snapshots/backups, and their failure
Network | Identify provisioned capacity, ping, and packet roundtrip | Message bytes for replication + produce/consume should match
 | Network partitions known | Hit all possible network fragments, and observe differences
CPU | Benchmarks for various compression types known | CPU utilization should continue to be low
 | If security protocol is mTLS | Test with real certificates and right algorithms
Memory | Must include if using Streams, or Connect | JVM, and RocksDB metrics
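The network row says observed bytes should match replication plus produce/consume traffic. A rough expectation can be worked out ahead of the test. The sketch below is a simplified model under stated assumptions: every consumer group reads all produced data, and protocol overhead and compression differences are ignored. The function and parameter names are hypothetical.

```python
def expected_cluster_traffic(produce_bytes_per_sec, replication_factor, consumer_groups):
    """Estimate cluster-wide broker traffic in bytes/sec.

    Inbound  = producer writes + follower copies received.
    Outbound = follower copies served by leaders + consumer reads.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    replication = produce_bytes_per_sec * (replication_factor - 1)
    consume = produce_bytes_per_sec * consumer_groups
    return {
        "inbound": produce_bytes_per_sec + replication,
        "outbound": replication + consume,
    }


if __name__ == "__main__":
    # Hypothetical load: 50 MB/s produced, RF 3, two consumer groups.
    print(expected_cluster_traffic(50_000_000, 3, 2))
```

If measured bytes in/out across brokers drift far from an estimate like this during the test, something other than the intended load (retries, lagging replicas, an unexpected consumer) is probably using the network.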
Use case driven
Not every use case can cause stress
• Use case needs to be able to push structural boundaries of Kafka i.e., paradigms, components, or resources
• Criticality <> Latency <> Throughput <> Cost
Run full use case with end-to-end latency metrics
• Introduce application specific metrics, simple JMX will do
• End to end latency for critical use cases must be designed upfront, and included in SLA
• Data availability, and system boundaries must be accounted for
Production critical use cases with low latency need good infrastructure
• Purpose of stress testing is to establish system limits, not necessarily to provide insights outside of resilience
• Repeated test cycles are not substitutes for good infrastructure
• Network is usually the bottleneck
Substantial parallelism requires specific tuning
• Thousands of parallel connections while supported may create unknown system states
• Port/socket level limits, TCP and other buffers
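One of the port/socket level limits is easy to check up front: the per-process cap on open file descriptors, which every client socket counts against. The sketch below is a minimal pre-flight check for a load-generator host, assuming a Unix-like system; the function names and the reserve of 256 descriptors are hypothetical choices.

```python
def current_fd_limit():
    """Soft limit on open file descriptors for this process (Unix only)."""
    import resource  # standard library, not available on Windows

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return float("inf") if soft == resource.RLIM_INFINITY else soft


def connections_fit(planned_connections, soft_limit, reserve=256):
    """True if planned sockets plus a reserve for files and logs fit."""
    return planned_connections + reserve <= soft_limit


if __name__ == "__main__":
    limit = current_fd_limit()
    planned = 5000  # hypothetical number of parallel client connections
    print(f"fd limit {limit}, fits: {connections_fit(planned, limit)}")
```

The same kind of check applies on the broker side, where TCP buffer sizes and the operating system's own limits also come into play.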
Staying calm, and breaking Kafka
Chaos can be fun
• Stop brokers, network devices, and storage devices
• Pull the plug, cord, or anything that can be pulled
• Remove certs, change firewall rules, and necessary software components
Increase number of client instances, number of messages
• Continuous increase will start to hit message level latency, and throughput
• Topic level metrics like bytes in will start to dip
• Focus on 95th percentile for stress tests
For critical use cases, identify and introduce breaking points
• Increase number of database rows, or remove Hadoop partitions
• Delete state stores, backup state stores and observe
• Where load balancers are in play, test them for real scenarios
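The 95th percentile mentioned above can be computed directly from recorded latencies. The sketch below is a minimal illustration using the nearest-rank definition of a percentile; other definitions interpolate and give slightly different values. The function name and sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds, with a slow tail.
    latencies_ms = [4, 5, 5, 6, 6, 7, 7, 8, 9, 250]
    print("mean:", sum(latencies_ms) / len(latencies_ms))
    print("p95:", percentile(latencies_ms, 95))
```

Under stress the tail moves before the median does, which is why a percentile is a better signal here than an average.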
Memory and leaks
Cannot say where
• Memory leaks can occur in all components
• JVM tuning is not generally required unless setting up for specific environment or use case
• Tuning likely to go overboard – think GC
Use profiler, and have runbook
• Get familiar with the usage of JVM profiler, and have ability to attach to Kafka components
• Can help with application debugging also
May be more likely for REST, and other interactions
• Kafka protocol itself doesn’t rely too much on memory
• Therefore, understand and test with the angle of where data is moving and why
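Before reaching for a profiler, a trend check on memory samples can flag a suspect component. The sketch below is a minimal illustration: it fits a least-squares line to a series of heap-after-GC readings and treats steady growth as a possible leak. The function names, the sample values, and the 1 MB per sample threshold are hypothetical, and a positive result is only a hint to investigate further.

```python
def slope_per_sample(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(heap_after_gc_mb, min_growth_mb_per_sample=1.0):
    """True if heap usage after GC keeps growing faster than the threshold."""
    return slope_per_sample(heap_after_gc_mb) >= min_growth_mb_per_sample


if __name__ == "__main__":
    steady = [512, 515, 510, 513, 511, 514]
    growing = [512, 530, 551, 569, 590, 612]
    print(looks_like_leak(steady), looks_like_leak(growing))
```

Readings like these would typically come from the JVM metrics already being collected for the test.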
Recording results, and recovering
Results should be metrics
• Drop or change in metrics under specific conditions should be captured
• Functional testing i.e., application changes are interesting observations but not necessarily tied to stress testing (except critical use cases)
Brokers should be up, retention is your friend
• Bring up any lost brokers, and recovery should be straightforward
• Topic retention will help remove any large volumes of messages, set it to low when stress testing
Break glass
• Procedures should be in place for any stress testing. For Kafka, this may include ability to drop and create new cluster.
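Capturing a drop in a metric under specific conditions can be as simple as comparing each sample with a pre-test baseline. The sketch below is a minimal illustration; the function name, the sample series, and the 25% threshold are hypothetical, and the baseline is assumed to come from the published normal benchmarks.

```python
def find_drops(baseline, samples, threshold=0.25):
    """Return (index, value) pairs where a sample fell more than
    `threshold` (as a fraction) below the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    floor = baseline * (1 - threshold)
    return [(i, value) for i, value in enumerate(samples) if value < floor]


if __name__ == "__main__":
    # Hypothetical bytes-in readings (MB/s) taken while a broker was stopped.
    bytes_in_mb = [100, 95, 70, 85, 40]
    for index, value in find_drops(100, bytes_in_mb):
        print(f"sample {index}: {value} MB/s is below the allowed floor")
```

Recording the sample index alongside the condition being applied at that moment (broker stopped, network cut) keeps the result tied to its cause.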
Repeatability
Use scripts, and chaos testing
• All tests including for deploying multiple instances can be scripted
• Tie into any popular testing framework
• Think and get ready with automation, and continuous deployment during cluster build
Inheritance framework for Kafka clusters
• Top secret project which no one (including me) works on :)
• Start with metrics, benchmarking, and move to stress testing
• Critical use cases cannot be built on poorly understood systems
Cloud vs. on-prem
• On-prem systems are excellent candidates for stress testing because expansion and bug fixing take longer
• Patching of cloud instances is also a good opportunity to repeat
• Automation will help in both cases
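Scripted tests are easier to repeat when the full set of runs is generated rather than typed by hand. The sketch below is a minimal illustration that expands a dictionary of test parameters into an ordered list of run configurations with stable run IDs. The function name and the parameter names (producers, message_bytes) are hypothetical.

```python
import itertools


def build_test_matrix(parameters):
    """Expand {name: [values]} into a list of run configs in stable order."""
    names = sorted(parameters)
    combos = itertools.product(*(parameters[name] for name in names))
    runs = []
    for number, combo in enumerate(combos, start=1):
        config = dict(zip(names, combo))
        config["run_id"] = f"run-{number:03d}"
        runs.append(config)
    return runs


if __name__ == "__main__":
    matrix = build_test_matrix({
        "producers": [1, 10, 100],
        "message_bytes": [100, 1000],
    })
    for run in matrix:
        print(run)
```

Because the order is deterministic, the same run ID refers to the same configuration on every repeat, which makes results comparable across test cycles.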
Scenario illustrations, and exercise
Some real(ish) scenarios – design a good stress test
Sensor data from multiple devices
• Thousands of devices sending data to cluster
• Some are real-time requiring immediate response, others send large batches
• Messages need to be analyzed almost instantaneously
CDC from Oracle, streams, high volume
• Continuously increasing transaction volume
• Streams processing with joins/other aggregates
• High volume (>5000 messages/sec)
Geo-distributed, ultra low latency
• Cluster serves multiple geographies
• Requires ultra-low latency for messages (<10 milliseconds)
• Volume is low, but will increase as cluster adoption increases
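As a starting point for the exercise, a scenario with continuously increasing volume can be turned into a stepped load plan. The sketch below is a minimal illustration that builds a linear ramp of target message rates; the function name and the chosen numbers (1000 to 6000 messages/sec in six stages of 60 seconds) are hypothetical, picked only so that the ramp passes the 5000 messages/sec mark from the CDC scenario.

```python
def ramp_schedule(start_rate, target_rate, steps, hold_seconds):
    """Linear ramp from start_rate to target_rate in `steps` stages.

    Returns a list of (messages_per_sec, hold_seconds) tuples.
    """
    if steps < 2:
        raise ValueError("need at least two steps")
    increment = (target_rate - start_rate) / (steps - 1)
    return [
        (round(start_rate + i * increment), hold_seconds)
        for i in range(steps)
    ]


if __name__ == "__main__":
    for rate, hold in ramp_schedule(1000, 6000, 6, 60):
        print(f"hold {rate} msgs/sec for {hold}s")
```

Holding each stage for a fixed time gives metrics a chance to settle before the next increase, so the stage at which latency or throughput degrades can be identified.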
Questions
Thank you!
Siva Kunapuli

More Related Content

What's hot

The Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingThe Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingKai Wähner
 
An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache KafkaAmir Sedighi
 
Streaming all over the world Real life use cases with Kafka Streams
Streaming all over the world  Real life use cases with Kafka StreamsStreaming all over the world  Real life use cases with Kafka Streams
Streaming all over the world Real life use cases with Kafka Streamsconfluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S...
 Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S... Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S...
Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S...HostedbyConfluent
 
Common Patterns of Multi Data-Center Architectures with Apache Kafka
Common Patterns of Multi Data-Center Architectures with Apache KafkaCommon Patterns of Multi Data-Center Architectures with Apache Kafka
Common Patterns of Multi Data-Center Architectures with Apache Kafkaconfluent
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaGuido Schmutz
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Araf Karsh Hamid
 
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c AdaptorsReal-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c AdaptorsMichael Rainey
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward
 
Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformJean-Paul Azar
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controllerconfluent
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 

Chill, Distill, No Overkill: Best Practices to Stress Test Kafka with Siva Kunapuli

  • 1. Chill, Distill, No Overkill: Best Practices to Stress Test Kafka Siva Kunapuli
  • 2. About me
    Teacher, Programmer, Engineer, Architect
      • Started early, still at it
      • 15+ years
      • Services, Product, Consulting, Technical Account Management
    Customers
      • Financial services, strategic
      • “Customer with a problem”
    Kafkaesque
      • Fail often, and learn
      • Challenging to operationalize, but useful
      • Around the world
  • 3. Stress testing Kafka: The challenge
    Paradigm driven
        Protocol interactions a.k.a. real-time vs. batch
        Data storage vs. distribution
        Distributed system - components, scale, changing conditions
    Resources
        Simple, commodity, cloud
        Provisioning demands vs. reality
    Structure
        Parameters – test design
        Recordability, repeatability
        Costs and SLAs
  • 8. Stress testing primer
    Robustness of setup
      • Can your system handle stress gracefully?
      • Does it fail where and when you’re not looking?
      • What is usual, and what is unusual?
    Spanning stress (i.e., not selective)
      • Component/framework models – Connect, Streams, Core
      • Resources rather than use case(s) – storage, IO throughput, memory, CPU
      • Concurrency, data access – many clients, simulated network conditions
    Mission critical
      • No part is open to failure
      • Use case is the driver
  • 12. Kafka – an introduction
  • 13. Pre-requisites (Concern: Ready | Bonus points). Environment: identified scaling procedures (adding brokers, storage) | automation for scaling; identified components (Connect, Streams, Core) | good component diagram; at least at production scale | tear down after done; identified network setup | ping, and packet round trip. Benchmarks: active (normal) benchmarks published | repeating at regular intervals; identified SLA | negotiated, and signed off. Multi-tenancy: quotas set. Observability: full cluster metrics captured, and visualized | application metrics.
  • 17. Pre-requisites continued: Benchmarking • Other sessions, lightning talks • OpenMessaging benchmark framework • Simulate production load – multiple applications, clients, connectors, change data capture (CDC) etc. Clean container environments • No massively parallel, multi-function, single-purpose, all-encompassing clusters. Observability • APM tools, or DIY • Must have – production, consumption, topic level, throughput metrics. Multi-tenancy • Stop, and do not move forward without quotas • Can pose challenges in separation even with quotas – cluster downtime
  • 18. A good stress test for Kafka. Stick to the paradigm: request/response, topic semantics; push data, and consume; application is less important. Include all parameters: component tests; high concurrency tests (race conditions); resource tests (network, IO, CPU). Specific use cases: break something, and recover; change conditions; memory leaks.
  • 24. Kafka internals continued: Consumer Group protocol • Partition assignment, and subscription • Group coordinator • Rebalance triggers • Offset management. Control plane • Controller with and without KRaft • Topic metadata • Replication. Topics • Compaction • Message keys, and partitions
  • 25. Component tests: Connect • Change Data Capture (CDC) has row, timestamp dependency • Database/data store reads/writes • Protocol shifts – Kafka to HTTP and back. Streams • Stateless vs. stateful • Test against real topology • Focus on changelog topics, and state stores • Streams application reset tool. Others • Non-Java clients may have different concurrency • ZooKeeper/KRaft • Avoid Admin API tests, especially for topic metadata and partition changes • Geo-replication • Multi-tenancy
  • 29. High concurrency: Multiple producer, consumer application instances are better • Can help establish keying/partitioning issues • Containerization can help, but don’t go overboard • Different network fragments, i.e., different data centers or availability zones, are better. Small, numerous messages are better • Large messages break concurrency tests and are not normal for Kafka • Increasing, and decreasing number of messages could be part of test. Transactions/EOS • If using Transactions API, several additional considerations including Transaction Coordinator are in play • Exactly Once Semantics (EOS) influences stress. Race conditions • Not enough partitions on consumer group topic(s) • Rebalances
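The keying/partitioning point on this slide can be checked offline before any load is generated. The following is a minimal sketch, not Kafka's actual partitioner: it uses CRC32 as a stand-in hash (an assumption made for illustration; Kafka's default partitioner uses a different hash), and the key names and partition count are hypothetical. It estimates how unevenly a set of message keys would spread across partitions.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only: CRC32, NOT Kafka's default partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys, num_partitions):
    """Number of messages that would land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(counts):
    """Busiest partition's load divided by the mean load (1.0 means perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# Hypothetical workloads: many distinct keys vs. one dominant "hot" key.
even_keys = [f"device-{i}" for i in range(10_000)]
hot_keys = ["device-0"] * 8_000 + [f"device-{i}" for i in range(2_000)]

even = partition_counts(even_keys, 12)
hot = partition_counts(hot_keys, 12)
print("even skew:", round(skew_ratio(even), 2))
print("hot  skew:", round(skew_ratio(hot), 2))
```

A skew ratio well above 1.0 suggests that adding producer or consumer instances will not spread the load, because one partition carries most of the traffic.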
  • 34. Resource tests (Resource: Before test | Observe). IO throughput: identify limits of storage class (device hardware, or virtualization) | throughput should hit or exceed limit; stage multiple devices, storage directories | usage of all log dirs should increase; allow for ongoing snapshots/backups | effects of snapshots/backups, and their failure. Network: identify provisioned capacity, ping, and packet round trip | message bytes for replication + produce/consume should match; network partitions known | hit all possible network fragments, and observe differences. CPU: benchmarks for various compression types known | CPU utilization should continue to be low; if security protocol is mTLS | test with real certificates and right algorithms. Memory: must include if using Streams, or Connect | JVM, and RocksDB metrics.
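The network row ("message bytes for replication + produce/consume should match") amounts to a small piece of arithmetic. The sketch below is one way to state that expectation, under simplifying assumptions that are not from the slides: every consumer group reads the full stream and keeps up, replication is in steady state, and protocol overhead is ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterTraffic:
    produce_mb_s: float       # aggregate producer traffic into the cluster
    replication_factor: int   # replicas per partition, leader included
    consumer_groups: int      # groups that each read the full stream


def expected_traffic(t: ClusterTraffic) -> dict:
    """Cluster-wide broker traffic implied by the produce rate (assumed model)."""
    # Each produced byte is copied to (replication_factor - 1) followers.
    replication = t.produce_mb_s * (t.replication_factor - 1)
    consume = t.produce_mb_s * t.consumer_groups
    return {
        "bytes_in_mb_s": t.produce_mb_s + replication,  # from producers + received by followers
        "bytes_out_mb_s": replication + consume,        # sent to followers + sent to consumers
    }


def within_tolerance(observed: float, expected: float, tolerance: float = 0.10) -> bool:
    """True if the observed value is within the given fraction of the expected value."""
    return abs(observed - expected) <= tolerance * expected


expected = expected_traffic(ClusterTraffic(produce_mb_s=100, replication_factor=3, consumer_groups=2))
print(expected)
```

If the observed network totals fall well short of the expected figures while producers are still pushing, the gap usually points at lagging consumers, under-replicated partitions, or a saturated link.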
  • 39. Use case driven: Not every use case can cause stress • Use case needs to be able to push structural boundaries of Kafka, i.e., paradigms, components, or resources • Criticality <> Latency <> Throughput <> Cost. Run full use case with end-to-end latency metrics • Introduce application specific metrics, simple JMX will do • End-to-end latency for critical use cases must be designed upfront, and included in SLA • Data availability, and system boundaries must be accounted for. Production critical use cases with low latency need good infrastructure • Purpose of stress testing is to establish system limits, not necessarily to provide insights outside of resilience • Repeated test cycles are not substitutes for good infrastructure • Network is usually the bottleneck. Substantial parallelism requires specific tuning • Thousands of parallel connections, while supported, may create unknown system states • Port/socket level limits, TCP and other buffers
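End-to-end latency has to be designed into the messages themselves. A minimal sketch of one common approach follows: the producer embeds its send time in the payload and the consumer subtracts it on receipt. The JSON envelope and the field name `sent_ms` are assumptions made for illustration, and the result is only meaningful if producer and consumer clocks are synchronized.

```python
import json
import time


def stamp(payload: dict, now_ms=None) -> bytes:
    """Producer side: wrap the payload with the send time in milliseconds."""
    sent = int(time.time() * 1000) if now_ms is None else now_ms
    return json.dumps({"sent_ms": sent, "data": payload}).encode("utf-8")


def latency_ms(message: bytes, now_ms=None) -> int:
    """Consumer side: end-to-end latency for one message (assumes synced clocks)."""
    received = int(time.time() * 1000) if now_ms is None else now_ms
    return received - json.loads(message)["sent_ms"]


def sla_breaches(latencies, sla_ms):
    """Count of messages whose latency exceeded the agreed SLA."""
    return sum(1 for value in latencies if value > sla_ms)


message = stamp({"reading": 21.5})
print(latency_ms(message), "ms")
```

The per-message values can then be exported as an application metric (the slide suggests simple JMX) and compared with the figure written into the SLA.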
  • 40. Staying calm, and breaking Kafka: Chaos can be fun • Stop brokers, network devices, and storage devices • Pull the plug, cord, or anything that can be pulled • Remove certs, change firewall rules, and remove necessary software components. Increase number of client instances, number of messages • Continuous increase will start to hit message level latency, and throughput • Topic level metrics like bytes in will start to dip • Focus on 95th percentile for stress tests. For critical use cases, identify and introduce breaking points • Increase number of database rows, or remove Hadoop partitions • Delete state stores, back up state stores, and observe • Where load balancers are in play, test them for real scenarios
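The ramp-up advice (keep adding clients, watch the 95th percentile) can be turned into a simple analysis step. This sketch assumes latency samples have already been collected at each ramp step; the numbers and the 25 ms SLA are hypothetical, and the percentile uses the nearest-rank method, one of several common definitions.

```python
def p95(samples):
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def find_breaking_point(latencies_by_clients, sla_ms):
    """First client count whose p95 latency exceeds the SLA, or None if none does."""
    for clients in sorted(latencies_by_clients):
        if p95(latencies_by_clients[clients]) > sla_ms:
            return clients
    return None


# Hypothetical latency samples (ms), keyed by number of client instances.
measurements = {
    10: [4, 5, 5, 6, 7, 5, 4, 6, 5, 8],
    50: [6, 7, 9, 8, 7, 10, 9, 8, 7, 12],
    100: [9, 15, 22, 30, 18, 45, 27, 60, 33, 80],
}
print("breaking point:", find_breaking_point(measurements, sla_ms=25))
```

Using the 95th percentile rather than the mean keeps a handful of slow requests visible without letting a single outlier decide the result.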
  • 44. Memory and leaks: Cannot say where • Memory leaks can occur in all components • JVM tuning is not generally required unless setting up for specific environment or use case • Tuning likely to go overboard – think GC. Use profiler, and have runbook • Get familiar with the usage of JVM profiler, and have ability to attach to Kafka components • Can help with application debugging also. May be more likely for REST, and other interactions • Kafka protocol itself doesn’t rely too much on memory • Therefore, understand and test with the angle of where data is moving and why
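Before attaching a profiler, a trend check on existing metrics can indicate whether a component is worth profiling. The sketch below assumes heap-used-after-GC samples (in MB) are already being collected at a fixed interval; the threshold and sample values are hypothetical. It fits a least-squares line and flags sustained growth, which is a hint rather than proof of a leak.

```python
def slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(heap_after_gc_mb, threshold_mb_per_sample=1.0):
    """True when post-GC heap usage grows faster than the threshold per sample."""
    return slope(heap_after_gc_mb) > threshold_mb_per_sample


# Hypothetical post-GC heap readings taken during a stress run.
steady = [500, 503, 498, 501, 499]
growing = [500, 510, 521, 530, 542]
print(looks_like_leak(steady), looks_like_leak(growing))
```

Post-GC readings are used because raw heap usage naturally saw-tooths between collections and would hide the trend.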
  • 45. Recording results, and recovering. Results should be metrics: a drop or change in metrics under specific conditions should be captured. Functional testing, i.e., application changes, yields interesting observations but is not necessarily tied to stress testing (except for critical use cases). Brokers should be up, retention is your friend: bring up any lost brokers, and recovery should be straightforward. Topic retention will help remove any large volumes of messages; set it low when stress testing. Break glass: procedures should be in place for any stress testing. For Kafka, this may include the ability to drop and create a new cluster.
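Capturing "a drop or change in metrics under specific conditions" is easier to repeat when each run is stored as a small structured record. The following is a sketch under assumptions: the metric names, the condition label, and the JSON layout are all hypothetical, and the baseline is taken to be a prior benchmark run.

```python
import json


def metric_changes(baseline: dict, stressed: dict) -> dict:
    """Percent change for every metric present in both runs."""
    changes = {}
    for name, base in baseline.items():
        if name in stressed and base != 0:
            changes[name] = round((stressed[name] - base) / base * 100, 1)
    return changes


def result_record(test_name: str, condition: str, baseline: dict, stressed: dict) -> str:
    """One JSON line describing what changed under the given condition."""
    return json.dumps(
        {
            "test": test_name,
            "condition": condition,
            "change_pct": metric_changes(baseline, stressed),
        },
        sort_keys=True,
    )


# Hypothetical readings from a benchmark run and a stress run.
baseline = {"bytes_in_mb_s": 100.0, "p95_latency_ms": 20.0}
stressed = {"bytes_in_mb_s": 80.0, "p95_latency_ms": 50.0}
print(result_record("ramp-clients", "one broker stopped", baseline, stressed))
```

Appending one such line per test to a file gives a history that later runs can be compared against.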
  • 49. Repeatability: Use scripts, and chaos testing • All tests, including those for deploying multiple instances, can be scripted • Tie into any popular testing framework • Think about and get ready with automation, and continuous deployment during cluster build. Inheritance framework for Kafka clusters • Top secret project which no one (including me) works on :) • Start with metrics, benchmarking, and move to stress testing • Critical use cases cannot be built on poorly understood systems. Cloud vs. on-prem • On-prem systems are excellent candidates for stress testing because expansion and bug fixing take longer • Patching of cloud instances is also a good opportunity to repeat • Automation will help in both cases
  • 51. Some real(ish) scenarios – design a good stress test: Sensor data from multiple devices • Thousands of devices sending data to cluster • Some are real-time requiring immediate response, others send large batches • Messages need to be analyzed almost instantaneously. CDC from Oracle, streams, high volume • Continuously increasing transaction volume • Streams processing with joins/other aggregates • High volume (>5000 messages/sec). Geo-distributed, ultra low latency • Cluster serves multiple geographies • Requires ultra-low latency for messages (<10 milliseconds) • Volume is low, but will increase as cluster adoption increases