The Lord of The Streams
Preparing Your Kafka Streams Applications for Production and Beyond
Rohan Desai
Co-Founder, Responsive
You’ve Built Your App. Now What?
How It Started How It’s Going
You’ve Built Your App. Now What?
1. Stabilizing
2. Sizing
3. Monitoring
Stabilizing and Sizing Is Hard
Expectation Reality
Stabilizing → Sizing → Monitoring
Stabilizing and Sizing Is Hard: Why?
It’s a Process, Not a Magic Formula
Sizing is an experimental process of trial and error as you navigate towards a good configuration
1. Try running it with the smallest possible cluster
2. Make sure your application is stable
a. If it’s not, then debug, tune, and go back to step 1
3. Make sure your application is keeping up
a. If it’s not, then debug, tune, or scale, and go back to step 1
4. Celebrate!
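The loop above can be sketched in a few lines of Python. This is only an illustration: `is_stable`, `is_keeping_up`, and `tune` are hypothetical callables standing in for your own checks and changes, and `max_rounds` is an assumed safety limit.

```python
from typing import Callable


def size_cluster(
    is_stable: Callable[[], bool],
    is_keeping_up: Callable[[], bool],
    tune: Callable[[str], None],
    max_rounds: int = 10,
) -> int:
    """Repeat the check/tune loop; return how many tuning rounds were needed."""
    for round_no in range(max_rounds):
        if not is_stable():
            tune("unstable")   # debug, tune, then start over
            continue
        if not is_keeping_up():
            tune("lagging")    # debug, tune or scale, then start over
            continue
        return round_no        # stable and keeping up: celebrate
    raise RuntimeError("no good configuration found within max_rounds")
```

Each failed check sends you back to the top, so a change made for throughput is always re-checked for stability.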
Your Starting Cluster
Start with a relatively small cluster
- 1-2 cores
- 4-8 GB memory per core, 100s of GB disk
- If stateful, run 2 nodes
Stability Checklist: Check Committed Offsets
$ kafka-consumer-groups --bootstrap-server my.bootstrap.server:9092 --describe --group responsive --offsets
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
...
responsive input 21 100 120 20 responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-2-consumer-09647258-64a1-4517-96fd-6232ba9e3078 /1.2.3.4 responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-1-consumer
...
Good: Check committed offsets periodically & make sure they’re advancing
Better: Export it as a metric if you can
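One way to automate the periodic check is to capture the `--describe --offsets` output twice and compare the snapshots. The sketch below assumes the whitespace-separated column layout shown above. The two snapshots are made-up sample data, and "stalled" is defined here (as an assumption) as a partition that still has lag but whose committed offset did not move.

```python
def parse_offsets(describe_output):
    """Map (topic, partition) -> (committed offset, lag) from the CLI text."""
    rows = {}
    for line in describe_output.splitlines():
        f = line.split()
        # Data rows: GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
        # Header rows and rows without a numeric offset or lag are skipped.
        if len(f) >= 6 and f[2].isdigit() and f[3].isdigit() and f[5].isdigit():
            rows[(f[1], int(f[2]))] = (int(f[3]), int(f[5]))
    return rows


def stalled_partitions(before, after):
    """Partitions with remaining lag whose committed offset did not advance."""
    old, new = parse_offsets(before), parse_offsets(after)
    return sorted(
        key for key, (offset, lag) in new.items()
        if key in old and offset <= old[key][0] and lag > 0
    )


# Hypothetical snapshots taken a few minutes apart.
SNAPSHOT_1 = """GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
responsive input 20 500 500 0
responsive input 21 100 120 20
"""
SNAPSHOT_2 = """GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
responsive input 20 560 565 5
responsive input 21 100 150 50
"""

print(stalled_partitions(SNAPSHOT_1, SNAPSHOT_2))  # [('input', 21)]
```

Partition 20 advanced from 500 to 560, so it is fine. Partition 21 stayed at 100 while its lag grew, so it is flagged.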
Stability Checklist: Make Sure No Rebalances
last-rebalance-seconds-ago
mbean:kafka.consumer:type=consumer-coordinator-metrics,client-id={clientId}
Stability Checklist: Make Sure Clients RUNNING
state
mbean:kafka.streams:type=stream-metrics,client-id={clientId}
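The last two checks can be combined into one alert condition once both metrics are exported. A minimal sketch, assuming you already have the two values in hand: the 10-minute quiet period is an arbitrary assumption, and a negative `last-rebalance-seconds-ago` is treated here as "no data yet" rather than as stable.

```python
def is_stable(client_state, last_rebalance_seconds_ago, quiet_period_s=600.0):
    """True if the client is RUNNING and has not rebalanced recently.

    quiet_period_s is an assumed threshold; pick one that fits your app.
    """
    if client_state != "RUNNING":
        return False
    if last_rebalance_seconds_ago < 0:
        return False  # assumption: negative means no rebalance recorded yet
    return last_rebalance_seconds_ago >= quiet_period_s


print(is_stable("RUNNING", 3600.0))      # True
print(is_stable("REBALANCING", 3600.0))  # False
print(is_stable("RUNNING", 30.0))        # False
```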
Stability Checklist: Making Sure You’re Bounding Memory Usage
https://kafka.apache.org/37/documentation/streams/developer-guide/memory-mgmt
Debugging an Unstable Application
Debugging Rebalances: https://www.responsive.dev/blog/guide-to-kafka-streams-rebalancing
Debugging State: https://www.responsive.dev/blog/guide-to-kafka-streams-state
Symptom: Encountered the following exception during processing and the registered exception handler opted to SHUTDOWN_KAFKA_STREAMS_CLIENT
Remediation: Debug the reported exception
Symptom: removing member <member_id> on heartbeat expiration
Remediation: Tune session.timeout.ms
Symptom: consumer poll timeout has expired
Remediation: Tune max.poll.interval.ms or max.poll.records
Symptom: Streams app crashes with out-of-storage
Remediation: Make sure you’re setting state.dir correctly
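The symptom list lends itself to a simple log scan. The sketch below matches the message fragments quoted above, case-insensitively, against log lines and returns the matching remediations. The sample log lines are illustrative, not captured output, and the out-of-storage case is left out because it is not tied to one message.

```python
# Message fragments and advice copied from the symptom list above.
REMEDIATIONS = [
    ("opted to shutdown_kafka_streams_client", "Debug the reported exception"),
    ("on heartbeat expiration", "Tune session.timeout.ms"),
    ("consumer poll timeout has expired",
     "Tune max.poll.interval.ms or max.poll.records"),
]


def suggest(log_lines):
    """Return remediations for known symptoms, in order of first appearance."""
    suggestions = []
    for line in log_lines:
        lowered = line.lower()
        for fragment, advice in REMEDIATIONS:
            if fragment in lowered and advice not in suggestions:
                suggestions.append(advice)
    return suggestions


# Hypothetical log excerpt.
sample_log = [
    "INFO processed 1000 records",
    "WARN consumer poll timeout has expired. This means the time between ...",
    "INFO removing member abc-123 on heartbeat expiration",
]
print(suggest(sample_log))
```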
Sizing/Tuning
Sizing/Tuning: When To Scale or Tune
records-lag
mbean:kafka.consumer:type=consumer-fetch-manager-metrics,partition={partition},topic={topic},client-id={clientId}
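A single `records-lag` reading says little; the trend is what tells you whether the app is keeping up. One possible approach, not something prescribed here, is to fit a least-squares slope over recent samples and act only when lag grows steadily. The `min_slope` threshold below is an assumed value.

```python
def lag_slope(samples):
    """Least-squares slope of lag, in records per sampling interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def falling_behind(samples, min_slope=1.0):
    """True if lag grows by more than min_slope records per interval."""
    return lag_slope(samples) > min_slope


print(falling_behind([10, 20, 30, 40]))  # True: lag grows by 10 per interval
print(falling_behind([5, 7, 5, 7, 5]))   # False: lag oscillates around 6
```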
Is Kafka Streams the Problem?
Gold standard: wallclock profile w/ async profiler
Check “total time spent” metrics
io-wait-time-ns-total
mbean:kafka.consumer:type=consumer-metrics,client-id={clientid}
blocked-time-ns-total
mbean:kafka.streams:type=stream-thread-metrics,thread-id={threadid}
Check external bottlenecks
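The "total time" metrics are cumulative counters, so they only become useful as a delta over a known window. A sketch of that arithmetic for `blocked-time-ns-total`, with made-up readings. The interpretation in the comments is a rule of thumb, not a hard threshold.

```python
def blocked_fraction(blocked_ns_start, blocked_ns_end, wall_ns_elapsed):
    """Share of a wall-clock window that a stream thread spent blocked."""
    if wall_ns_elapsed <= 0:
        raise ValueError("window must be positive")
    return (blocked_ns_end - blocked_ns_start) / wall_ns_elapsed


# Hypothetical readings taken 10 seconds apart for one stream thread.
fraction = blocked_fraction(2_000_000_000, 8_000_000_000, 10_000_000_000)
print(fraction)  # 0.6

# A high fraction suggests the thread mostly waits on Kafka clients, so
# look at brokers, network, or simply a lack of input. A low fraction
# suggests the time goes into processing or state store access.
```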
Do I Need More Memory For Reads?
It’s challenging to measure hit rate directly. There are metrics, but they’re all at DEBUG level
Kafka Streams Cache:
hit-ratio-avg
mbean:kafka.streams:type=stream-record-cache-metrics,client-id={clientId},thread-id={threadid},task-id={taskid},record-cache-id={storeid}
RocksDB Cache:
block-cache-data-hit-ratio, block-cache-index-hit-ratio, block-cache-filter-hit-ratio
mbean:kafka.streams:type=stream-state-metrics,client-id={clientid},thread-id={threadid},task-id={taskid},state-id={storeid}
Usually good enough to look at total cached memory and iostat
Example from application running on k8s with mem limit 4Gi:
# free -mh
total used free shared buff/cache available
Mem: 15Gi 3.3Gi 8.9Gi 2.0Mi 3.0Gi 11Gi
# iostat -kdx 10
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz ...
nvme1n1 3550.80 70363.20 0.00 0.00 0.88 19.82 ...
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz ...
nvme1n1 2177.80 27339.20 0.00 0.00 0.70 12.55 ...
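To compare read load before and after a memory change, it helps to pull `r/s` and `rkB/s` out of the iostat text. The sketch below assumes the header line starts with `Device` as in the example (some sysstat versions print it differently) and reuses the two samples from the slide.

```python
def read_rates(iostat_output, device):
    """Return [(r/s, rkB/s), ...] for one device, one tuple per report."""
    samples, columns = [], None
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device":
            columns = fields  # remember column order from the header
        elif fields[0] == device and columns:
            row = dict(zip(columns, fields))
            samples.append((float(row["r/s"]), float(row["rkB/s"])))
    return samples


IOSTAT_SAMPLE = """Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz
nvme1n1 3550.80 70363.20 0.00 0.00 0.88 19.82
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz
nvme1n1 2177.80 27339.20 0.00 0.00 0.70 12.55
"""

print(read_rates(IOSTAT_SAMPLE, "nvme1n1"))
# [(3550.8, 70363.2), (2177.8, 27339.2)]
```

If read IOPS stay this high after giving the application more memory for reads, the extra cache is not absorbing the working set.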
Do I Need More Memory For Writes?
Do I Need More Disk Capacity (IOPS)?
- You tried read or write memory sizing changes, but didn’t notice much change to IOPS
- At this point you probably just need more IOPS
Tuning Thread Count
● In most cases, you’re going to be fine with ~2 threads per core
● If you have long blocking calls or slow disks, consider tuning up
● Remember, you can only add threads up to the number of tasks
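As a starting point, the rule of thumb above reduces to a one-liner. The function name and the default of 2 threads per core are just this slide's guidance written as code. The result would go into `num.stream.threads`, and `tasks` here means the tasks available to the instance.

```python
def suggested_threads(cores, tasks, threads_per_core=2):
    """Start at ~2 threads per core, capped at the number of tasks."""
    return max(1, min(cores * threads_per_core, tasks))


print(suggested_threads(cores=2, tasks=12))  # 4
print(suggested_threads(cores=4, tasks=3))   # 3: extra threads would sit idle
```

Raise `threads_per_core` for workloads with long blocking calls or slow disks, then re-check lag and utilization.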
Scaling Options
Vertical Horizontal
Monitoring A Scale Up
Monitoring
Monitoring SLOs: Lag/Expected Latency
Monitoring SLOs: Utilization
Resources
- https://www.responsive.dev/blog/a-size-for-every-stream
- https://www.responsive.dev/blog/guide-to-kafka-streams-rebalancing
- https://www.responsive.dev/blog/guide-to-kafka-streams-state
- Async Profiler: https://github.com/async-profiler/async-profiler
- https://kafka.apache.org/10/documentation/streams/developer-guide/memory-mgmt
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Preparing Your Kafka Streams Application For Production and Beyond

  • 1. The Lord of The Streams: Preparing Your Kafka Streams Applications for Production and Beyond
  • 2. The Lord of The Streams: Preparing Your Kafka Streams Application For Production and Beyond
       Rohan Desai, Co-Founder, Responsive
  • 3. You’ve Built Your App. Now What?
       How It Started / How It’s Going
  • 4. You’ve Built Your App. Now What?
       1. Stabilizing
       2. Sizing
       3. Monitoring
  • 5. Stabilizing and Sizing Is Hard
       Expectation / Reality
  • 6. Stabilizing and Sizing Is Hard: Why?
  • 7. It’s a Process, Not a Magic Formula
       Sizing is an experimental process of trial and error as you navigate towards a good configuration.
       1. Try running it with the smallest possible cluster
       2. Make sure your application is stable
          a. If it’s not, then debug, tune, and go back to step 1
       3. Make sure your application is keeping up
          a. If it’s not, then debug, tune or scale, and go back to step 1
       4. Celebrate!
  • 8. Your Starting Cluster
       Start with a relatively small cluster:
       - 1-2 cores
       - 4-8 GB memory per core, 100s of GB disk
       - If stateful, run 2 nodes
  • 9. Stability Checklist: Check Committed Offsets
       $ kafka-consumer-groups --bootstrap-server my.bootstrap.server:9092 --describe --group responsive --offsets
       GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
       ...
       responsive input 21 100 120 20 responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-2-consumer-09647258-64a1-4517-96fd-6232ba9e3078 /1.2.3.4 responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-1-consumer
       ...
       Good: Check committed offsets periodically & make sure they’re advancing
       Better: Export it as a metric if you can
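As a rough illustration of the "check that committed offsets are advancing" step, the sketch below compares two snapshots of the `kafka-consumer-groups --describe` output. The function names are hypothetical, and the parser assumes the column order shown on the slide (GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG). A partition is only reported when it still has lag, since a fully caught-up partition has nothing left to commit.

```python
from typing import Dict, List, Tuple

TopicPartition = Tuple[str, int]


def parse_committed_offsets(describe_output: str) -> Dict[TopicPartition, Tuple[int, int]]:
    """Map (topic, partition) -> (CURRENT-OFFSET, LAG) from --describe output."""
    offsets: Dict[TopicPartition, Tuple[int, int]] = {}
    for line in describe_output.splitlines():
        fields = line.split()
        # Skip the header row, "..." filler lines, and anything too short.
        if len(fields) < 6 or fields[0] == "GROUP":
            continue
        topic, partition, current, lag = fields[1], fields[2], fields[3], fields[5]
        # Rows without a committed offset show "-" and are skipped here.
        if partition.isdigit() and current.isdigit() and lag.isdigit():
            offsets[(topic, int(partition))] = (int(current), int(lag))
    return offsets


def stalled_partitions(
    earlier: Dict[TopicPartition, Tuple[int, int]],
    later: Dict[TopicPartition, Tuple[int, int]],
) -> List[TopicPartition]:
    """Partitions that still have lag but whose committed offset did not move."""
    stalled = []
    for tp, (offset, lag) in later.items():
        if tp in earlier and lag > 0 and offset <= earlier[tp][0]:
            stalled.append(tp)
    return sorted(stalled)
```

Taking the two snapshots a few commit intervals apart and alerting on a non-empty result is one possible way to turn the manual check into an exported signal.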
  • 10. Stability Checklist: Make Sure No Rebalances
        last-rebalance-seconds-ago
        mbean: kafka.consumer:type=consumer-coordinator-metrics,client-id={clientId}
  • 11. Stability Checklist: Make Sure Clients RUNNING
        state
        mbean: kafka.streams:type=streams-metrics,client-id={clientId}
  • 12. Stability Checklist: Making Sure You’re Bounding Memory Usage
        https://kafka.apache.org/37/documentation/streams/developer-guide/memory-mgmt
  • 13. Debugging an Unstable Application
        Debugging Rebalances: https://www.responsive.dev/blog/guide-to-kafka-streams-rebalancing
        Debugging State: https://www.responsive.dev/blog/guide-to-kafka-streams-state
        Symptom → Remediation
        - "Encountered the following exception during processing and the registered exception handler opted to SHUTDOWN_KAFKA_STREAMS_CLIENT" → Debug the reported exception
        - "removing member <member_id> on heartbeat expiration" → Tune session.timeout.ms
        - "consumer poll timeout has expired" → Tune max.poll.interval.ms or max.poll.records
        - Streams app crashes with out-of-storage → Make sure you’re setting state.dir correctly
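The symptom table above lends itself to a small log scanner. The following is a minimal sketch that matches the three log messages quoted on the slide and returns the matching remediation; the function name and the choice of case-insensitive matching are assumptions, and the out-of-storage row is left out because the slide gives no log text for it.

```python
import re
from typing import List

# Log fragments and remediations taken from the symptom table on the slide.
REMEDIATIONS = [
    (re.compile(r"opted to SHUTDOWN_KAFKA_STREAMS_CLIENT", re.IGNORECASE),
     "Debug the reported exception"),
    (re.compile(r"removing member .* on heartbeat expiration", re.IGNORECASE),
     "Tune session.timeout.ms"),
    (re.compile(r"consumer poll timeout has expired", re.IGNORECASE),
     "Tune max.poll.interval.ms or max.poll.records"),
]


def suggest_remediations(log_text: str) -> List[str]:
    """Return each remediation once, in the order its symptom first appears."""
    suggestions: List[str] = []
    for line in log_text.splitlines():
        for pattern, advice in REMEDIATIONS:
            if pattern.search(line) and advice not in suggestions:
                suggestions.append(advice)
    return suggestions
```

Running this over application and broker logs after an unstable run gives a starting list of settings to look at; it does not replace reading the reported exception itself.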
  • 15. Sizing/Tuning: When To Scale or Tune
        records-lag
        mbean: kafka.consumer:type=consumer-fetch-manager-metrics,partition={partition},topic={topic},client-id={clientId}
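One way to read records-lag is to look at its trend rather than its absolute value: an application that keeps up has flat or shrinking lag. The sketch below fits a least-squares slope to (timestamp in seconds, lag) samples. The function names and the `tolerance` parameter are hypothetical, and how the samples are collected (JMX, a metrics exporter) is left open.

```python
from typing import List, Tuple


def lag_slope(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of (seconds, records_lag) samples, in records per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    num = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    return num / denom


def is_falling_behind(samples: List[Tuple[float, float]], tolerance: float = 0.0) -> bool:
    """True when lag grows faster than `tolerance` records per second."""
    return lag_slope(samples) > tolerance
```

A sustained positive slope suggests it is time to tune or scale; a slope near zero with high absolute lag more likely points at a past backlog that is not being worked off.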
  • 16. Is Kafka Streams the Problem?
        - Gold standard: wallclock profile w/ async profiler
        - Check “total time spent” metrics
          io-wait-time-ns-total
          mbean: kafka.consumer:type=consumer-metrics,client-id={clientId}
          blocked-time-ns-total
          mbean: kafka.streams:type=stream-thread-metrics,thread-id={threadid}
        - Check external bottlenecks
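Because io-wait-time-ns-total and blocked-time-ns-total are cumulative counters, they are easiest to interpret as a fraction of wallclock time between two readings. A minimal sketch, assuming you sample the counter for a single client or thread twice and know the elapsed seconds (the function name is hypothetical):

```python
def time_fraction(total_ns_start: int, total_ns_end: int, wall_seconds: float) -> float:
    """Fraction of the wallclock interval accounted for by a *-time-ns-total counter."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    delta_ns = total_ns_end - total_ns_start
    if delta_ns < 0:
        # A cumulative counter only goes backwards if the client restarted.
        raise ValueError("counter went backwards; was the client restarted?")
    return delta_ns / (wall_seconds * 1e9)
```

For example, a thread whose blocked-time-ns-total grew by 30 seconds' worth of nanoseconds over a 60 second window spent about half of that window blocked, which hints that the time is going to waiting rather than to processing inside the application.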
  • 17. Do I Need More Memory For Reads?
  • 18. Do I Need More Memory For Reads?
        It’s challenging to measure hit rate directly. There are metrics, but they’re all at DEBUG level.
        Kafka Streams Cache: hit-ratio-avg
        mbean: kafka.consumer:type=streams-record-cache-metrics,client-id={clientId},thread-id={threadid},task-id={taskid},record-cache-id={storeid}
        RocksDB Cache: block-cache-data-hit-ratio, block-cache-index-hit-ratio, block-cache-filter-hit-ratio
        mbean: kafka.streams:type=stream-state-metrics,client-id={clientId},thread-id={threadid},task-id={taskid},state-id={storeid}
  • 19. Do I Need More Memory For Reads?
        Usually good enough to look at total cached memory and iostat.
        Example from application running on k8s with mem limit 4Gi:
        # free -mh
                 total  used   free   shared  buff/cache  available
        Mem:     15Gi   3.3Gi  8.9Gi  2.0Mi   3.0Gi       11Gi
        # iostat -kdx 10
        Device   r/s      rkB/s     rrqm/s  %rrqm  r_await  rareq-sz ...
        nvme1n1  3550.80  70363.20  0.00    0.00   0.88     19.82 ...
        Device   r/s      rkB/s     rrqm/s  %rrqm  r_await  rareq-sz ...
        nvme1n1  2177.80  27339.20  0.00    0.00   0.70     12.55 ...
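To track read pressure over time instead of eyeballing it, the `iostat -kdx` output can be reduced to (r/s, rkB/s) pairs per report. The sketch below is a hypothetical helper that relies only on the header row naming the columns, as in the sample above. Note that, unless `-y` is passed, the first iostat report covers the time since boot rather than the last interval.

```python
from typing import List, Tuple


def parse_iostat_reads(output: str, device: str) -> List[Tuple[float, float]]:
    """Extract (r/s, rkB/s) for `device` from each report in `iostat -kdx` output."""
    header = None
    samples: List[Tuple[float, float]] = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("Device", "Device:"):
            # Remember the column names of the current report.
            header = fields
        elif header is not None and fields[0] == device:
            row = dict(zip(header[1:], fields[1:]))
            samples.append((float(row["r/s"]), float(row["rkB/s"])))
    return samples
```

Comparing these numbers before and after a memory change is one way to see whether the extra cache actually reduced disk reads.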
  • 20. Do I Need More Memory For Writes?
  • 21. Do I Need More Disk Capacity (IOPS)?
        - You tried read or write memory sizing changes, but didn’t notice much change to IOPS
        - At this point you probably just need more IOPS
  • 22. Tuning Thread Count
        ● In most cases, you’re going to be fine with ~2 threads per core
        ● If you have long blocking calls or slow disks, consider tuning up
        ● Remember, you can only add threads up to the number of tasks
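The thread-count guidance above can be written down as a starting-point calculation for `num.stream.threads`. This is only a sketch of the rule of thumb: the function name is hypothetical, `threads_per_core` defaults to the ~2 from the slide, and `task_count` is assumed to be the number of tasks this instance could be assigned.

```python
def suggested_stream_threads(cores: int, task_count: int, threads_per_core: int = 2) -> int:
    """Starting thread count: ~2 per core, capped at the number of tasks, at least 1."""
    if cores < 1 or task_count < 0 or threads_per_core < 1:
        raise ValueError("cores and threads_per_core must be >= 1, task_count >= 0")
    # Threads beyond the task count would sit idle, so cap there.
    return max(1, min(cores * threads_per_core, task_count))
```

With long blocking calls or slow disks, passing a larger `threads_per_core` models the "consider tuning up" advice while keeping the cap at the task count.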
  • 24. Monitoring A Scale Up
  • 26. Monitoring SLOs: Lag/Expected Latency
  • 27. Monitoring SLOs: Utilization
  • 28.
  • 29. Resources
        - https://www.responsive.dev/blog/a-size-for-every-stream
        - https://www.responsive.dev/blog/guide-to-kafka-streams-rebalancing
        - https://www.responsive.dev/blog/guide-to-kafka-streams-state
        - Async Profiler: https://github.com/async-profiler/async-profiler
        - https://kafka.apache.org/10/documentation/streams/developer-guide/memory-mgmt