SlideShare a Scribd company logo
Unleashing your Kafka Streams
Application Metrics
Neil Buesing
Kinetic Edge
Current 2023
So Many Metrics
• Kafka Broker
• Network
• Machine / Pod
• JVM
• Kafka Client
• Producer
• Consumer
• Admin
• Kafka Streams
Stress Free Monitoring
Katniss
(best pair programmer)
Kafka Streams - GitHub Inquiry
(path:**/pom.xml OR path://**/build.gradle)
AND kafka-streams
AND NOT (org:confluentinc OR org:apache)
September 2023 = 8.9K
(23K kafka-clients)
Kafka Streams Architecture
processor
task
thread
JVM
task allocation to threads -
streams-consumer-group
topic
partition
t1
p0
t1
p1
t2
p1
t3
p1
t2
p0
t3
p0
• Instance (JVM)
• Client (Topology)
• Thread (Worker)
• Task
• sub-topology
• partition
• Processor
• State Stores
Kafka Streams Architecture
state store
Con
fi
guration
• metrics.recording.level — INFO, DEBUG, TRACE
• metrics.sample.window.ms — 30_000
• metrics.num.samples — 2
Exporting
• JMX Reporter + Prometheus Agent
• Dashboards Demonstrated with this configuration
• No renaming of metrics in extraction rules
• Micrometer
• renames some metrics — makes dashboards incompatible.
• does not export info metrics
Exporting
• JMX Reporter + Prometheus Agent
• Dashboards Demonstrated with this configuration
• No renaming of metrics in extraction rules
• Micrometer
• renames some metrics — makes dashboards incompatible.
• does not export info metrics
private String meterName(MetricName metricName) {
String name = METRIC_NAME_PREFIX + metricName.group() + "." + metricName.name();
return name.replaceAll("-metrics", "").replaceAll("-", ".");
}
currentMetrics.forEach((name, metric) -> {
// Filter out non-numeric values
// Filter out metrics from groups that include metadata
if (!(metric.metricValue() instanceof Number) || APP_INFO.equals(name.group()) || METRICS_COUNT.equals(name.group()))
return;
1.11.2
Kafka Stream Metrics
state store
Client Metrics
• version
• state
• num threads
• application_id
• job (application)
• instance (JVM)
• client (topology)
• thread (worker)
processor
task
thread
JVM
state store
Thread Metrics
• operation
• commit
• poll
• process
• punctuate
• ratio
• metric
• total
• rate
• latency-{avg|max}
• job (application)
• instance (JVM)
• client (topology)
• thread (worker)
processor
task
thread
JVM
computing rate within grafana from a
total metric
rate(metric_total{}[$__rate_interval])
state store
Task Metrics
processor
task
thread
JVM
• operation
• commit
• process
• punctuate
• metric
• total
• rate
• job (application)
• instance (JVM)
• client (topology)
• thread (worker)
• task
• sub-topology
• partition
KAFKA-9441
(2.6)
state store
Processor Metrics
processor
task
thread
JVM
• operation
• process
• total/rate
• suppression-emit
• total/rate
• record-e2e-latency
• min/max/avg
• job (application)
• instance (JVM)
• client (topology)
• thread (worker)
• task
• sub-topology
• partition
• process node
state store
Topic Metrics
• operation
• consumed
• produced
• metric
• records total
• bytes total
• job (application)
• instance (JVM)
• client (topology)
• thread (worker)
• task
• process node
• topic
processor
task
thread
JVM
consumed
metrics
produced
metrics
source & sink processors only
topic=“topic_name” attribute included
state store
Record Cache Metrics
processor
task
thread
JVM
• metric
• hit-ratio-min
• hit-ratio-max
• hit-ratio-avg
• job (application)
• instance (JVM)
• thread (worker)
• task
• subtopology
• partition
• store name
state store
State Store Metrics
processor
task
thread
JVM
• operation
• all
• delete
• fetch
• get
• pre
fi
x-scan *
• put, put-all, put-if-absent
• range
• remove
• restore
•
fl
ush
• computed
• record-e2e **
• metric
• rate
• latency-avg (ns)
• latency-max (ns)
• job (application)
• instance (JVM)
• thread (worker)
• task
• subtopology
• partition
• store type
• store name
• other
• suppression-bu
ff
er-size
• suppression-bu
ff
er-count
• metric
• avg
• max
* KIP 614 / KAFKA-10648 - Pre
fi
x Scan support to State Stores - 2.8
** KIP 613 / KAFKA-10054 - e2e latency metrics
2.7
TRACE
RocksDB Metrics
• ~40 Metrics
• Property Metrics
• INFO
• Statistical Metrics
• DEBUG*
• KIPs
• 471 (2.4)
• 607 (2.7)
• KAFKA-8941 (3.2) **
background-errors
block-cache-capacity
block-cache-data-hit-ratio
block-cache-filter-hit-ratio
block-cache-index-hit-ratio
block-cache-pinned-usage
block-cache-usage
bytes-read
bytes-read-compaction
bytes-written
bytes-written-compaction
compaction-pending
compaction-time **
cur-size-active-mem-table
cur-size-all-mem-tables
estimate-num-keys
estimate-pending-compaction-bytes
estimate-table-readers-mem
live-sst-files-size
memtable-bytes-flushed
mem-table-flush-pending
memtable-flush-time **
memtable-hit-ratio
number-file-errors
number-open-files
num-deletes-active-mem-table
num-deletes-imm-mem-tables
num-entries-active-mem-table
num-entries-imm-mem-tables
num-immutable-mem-table
num-live-versions
num-running-compactions
num-running-flushes
size-all-mem-tables
total-sst-files-size
write-stall-duration
* emit as 0 in INFO
https://docs.con
fl
uent.io/platform/current/streams/monitoring.html#rocksdb-metrics
https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h - “struct Properties”
The Demo Applications
Application
rekey
(user)
attach
user
price aggregate
orders
fl
atmap
(line-
item)
rekey
(sku)
pickup
attach
store
rekey
(order)
pickup peek
users
ktable
stores
global
ktable products ktable
fi
lter
RocksDB
sub-topology
indicator
Analytic Applications
(Tumbling Window)
line-item product-stats
aggregate toStream mapValue
windowBy
TimeWindows
.ofSize(30sec)
RocksDB
line-item product-stats
aggregate toStream mapValue
windowBy
TimeWindows
.ofSize(30sec)
.advanceBy(15sec)
RocksDB
Analytic Applications
(Hopping Window)
line-item product-stats
aggregate toStream mapValue
windowBy
SlidingWindows
.ofSize(30sec)
suppress
RocksDB
InMemory
Analytic Applications
(Sliding Window)
line-item product-stats
aggregate toStream mapValue
windowBy
SessionWindows
.ofInactivityGap(30sec)
RocksDB
Analytic Applications
(Session Window)
line-item product-stats
aggregate toStream processor mapValue
punctuate
schedule(15s, wall-clock)
Analytic Applications
(No Window)
Dashboards
Sub-topology Listing
Sub-topology Listing
name your processors
Sub-topology Listing
prometheus — label_replace
label_replace(
kafka_streams_stream_processor_node_metrics_process_total{} *
on (instance) group_left(application_id)
kafka_streams_info{application_id=~"${application}"},
"subtopology", "$1", "task_id", "(.*)_.*"
)
prometheus — on / group_left
name your processors
Task Assignment
Task Assignment
Task Assignment
10 Threads
Task Assignment
11 Threads
Task Assignment
19 Threads
Task Assignment
21 Threads
Threads
Topics (Produced Consumed)
End to End Latency
State Stores
Demo
Recommendations
• Thread ⬌ Task Assignment
• Provide Topology and Task Assignment Dashboards
• End-To-End Latency Metrics
• Consumed and Produced Metrics
• State Store Metrics
• DSL could change implementation
• ∴ always check metrics after upgrade
• Don’t collected too frequently
• these demo collections ⥇ what production collections
• Verify Documentation
Recommendations
Resources
• Kinetic Edge Blog https://www.kineticedge.io/insights
• GitHub
• https://github.com/kineticedge/kafka-streams-dashboards
• Sample Applications, dashboards, kraft cluster, zk cluster, “load-balanced” cluster
• docker - tra
ffi
c control, single image to run application (with libraries built in — fast!)
• KIPs
• KIP-444: Augment metrics for Kafka Streams - 2.4 (& 2.5, 2.6)
• KIP-471: Expose RocksDB Metrics in Kafka Streams - 2.4 (& 3.2)
• KIP-607: Add Metrics to Kafka Streams to Report Properties of RocksDB - 2.7
• KIP-613: Add end-to-end latency metrics to Streams - 2.6 (& 2.7)
• KIP-846: Source/sink node metrics for Consumed/Produced throughput in Streams - 3.3
• KIP-869: Improve Streams State Restoration Visibility - 3.5 (They keep on coming!)
• StateUpdated also needs to be enabled (experimental?)
Resources
restart < 3 seconds
Thank you
Questions?

More Related Content

Similar to Unleashing your Kafka Streams Application Metrics!

On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
Chin Huang
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
Prakash Chockalingam
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
Stephan Ewen
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureDataStax Academy
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profiler
Ihor Bobak
 
Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...
Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...
Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...
HostedbyConfluent
 
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKBigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
nagachika t
 
Eranea's solution and technology for mainframe migration / transformation : d...
Eranea's solution and technology for mainframe migration / transformation : d...Eranea's solution and technology for mainframe migration / transformation : d...
Eranea's solution and technology for mainframe migration / transformation : d...
Eranea
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Apex
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Yahoo Developer Network
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
Qureshi Tehmina
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
Apache Apex
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld
 

Similar to Unleashing your Kafka Streams Application Metrics! (20)

On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
High Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & AzureHigh Throughput Analytics with Cassandra & Azure
High Throughput Analytics with Cassandra & Azure
 
Hadoop cluster performance profiler
Hadoop cluster performance profilerHadoop cluster performance profiler
Hadoop cluster performance profiler
 
Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...
Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...
Building a Scalable Real-Time Fleet Management IoT Data Tracker with Kafka St...
 
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKBigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
 
Eranea's solution and technology for mainframe migration / transformation : d...
Eranea's solution and technology for mainframe migration / transformation : d...Eranea's solution and technology for mainframe migration / transformation : d...
Eranea's solution and technology for mainframe migration / transformation : d...
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
 

More from HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Recently uploaded

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 

Recently uploaded (20)

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 

Unleashing your Kafka Streams Application Metrics!

  • 1. Unleashing your Kafka Streams Application Metrics Neil Buesing Kinetic Edge Current 2023
  • 2.
  • 3. So Many Metrics • Kafka Broker • Network • Machine / Pod • JVM • Kafka Client • Producer • Consumer • Admin • Kafka Streams
  • 5. Kafka Streams - GitHub Inquiry (path:**/pom.xml OR path://**/build.gradle) AND kafka-streams AND NOT (org:confluentinc OR org:apache) September 2023 = 8.9K (23K kafka-clients)
  • 7. processor task thread JVM task allocation to threads - streams-consumer-group topic partition t1 p0 t1 p1 t2 p1 t3 p1 t2 p0 t3 p0 • Instance (JVM) • Client (Topology) • Thread (Worker) • Task • sub-topology • partition • Processor • State Stores Kafka Streams Architecture state store
  • 8. Con fi guration • metrics.recording.level — INFO, DEBUG, TRACE • metrics.sample.window.ms — 30_000 • metrics.num.samples — 2
  • 9. Exporting • JMX Reporter + Prometheus Agent • Dashboards Demonstrated with this configuration • No renaming of metrics in extraction rules • Micrometer • renames some metrics — makes dashboards incompatible. • does not export info metrics
  • 10. Exporting • JMX Reporter + Prometheus Agent • Dashboards Demonstrated with this configuration • No renaming of metrics in extraction rules • Micrometer • renames some metrics — makes dashboards incompatible. • does not export info metrics private String meterName(MetricName metricName) { String name = METRIC_NAME_PREFIX + metricName.group() + "." + metricName.name(); return name.replaceAll("-metrics", "").replaceAll("-", "."); } currentMetrics.forEach((name, metric) -> { // Filter out non-numeric values // Filter out metrics from groups that include metadata if (!(metric.metricValue() instanceof Number) || APP_INFO.equals(name.group()) || METRICS_COUNT.equals(name.group())) return; 1.11.2
  • 12. state store Client Metrics • version • state • num threads • application_id • job (application) • instance (JVM) • client (topology) • thread (worker) processor task thread JVM
  • 13. state store Thread Metrics • operation • commit • poll • process • punctuate • ratio • metric • total • rate • latency-{avg|max} • job (application) • instance (JVM) • client (topology) • thread (worker) processor task thread JVM computing rate within grafana from a total metric rate(metric_total{}[$__rate_interval])
  • 14. state store Task Metrics processor task thread JVM • operation • commit • process • punctuate • metric • total • rate • job (application) • instance (JVM) • client (topology) • thread (worker) • task • sub-topology • partition KAFKA-9441 (2.6)
  • 15. state store Processor Metrics processor task thread JVM • operation • process • total/rate • suppression-emit • total/rate • record-e2e-latency • min/max/avg • job (application) • instance (JVM) • client (topology) • thread (worker) • task • sub-topology • partition • process node
  • 16. state store Topic Metrics • operation • consumed • produced • metric • records total • bytes total • job (application) • instance (JVM) • client (topology) • thread (worker) • task • process node • topic processor task thread JVM consumed metrics produced metrics source & sink processors only topic=“topic_name” attribute included
  • 17. state store Record Cache Metrics processor task thread JVM • metric • hit-ratio-min • hit-ratio-max • hit-ratio-avg • job (application) • instance (JVM) • thread (worker) • task • subtopology • partition • store name
  • 18. state store State Store Metrics processor task thread JVM • operation • all • delete • fetch • get • pre fi x-scan * • put, put-all, put-if-absent • range • remove • restore • fl ush • computed • record-e2e ** • metric • rate • latency-avg (ns) • latency-max (ns) • job (application) • instance (JVM) • thread (worker) • task • subtopology • partition • store type • store name • other • suppression-bu ff er-size • suppression-bu ff er-count • metric • avg • max * KIP 614 / KAFKA-10648 - Pre fi x Scan support to State Stores - 2.8 ** KIP 613 / KAFKA-10054 - e2e latency metrics 2.7 TRACE
  • 19. RocksDB Metrics • ~40 Metrics • Property Metrics • INFO • Statistical Metrics • DEBUG* • KIPs • 471 (2.4) • 607 (2.7) • KAFKA-8941 (3.2) ** background-errors block-cache-capacity block-cache-data-hit-ratio block-cache-filter-hit-ratio block-cache-index-hit-ratio block-cache-pinned-usage block-cache-usage bytes-read bytes-read-compaction bytes-written bytes-written-compaction compaction-pending compaction-time ** cur-size-active-mem-table cur-size-all-mem-tables estimate-num-keys estimate-pending-compaction-bytes estimate-table-readers-mem live-sst-files-size memtable-bytes-flushed mem-table-flush-pending memtable-flush-time ** memtable-hit-ratio number-file-errors number-open-files num-deletes-active-mem-table num-deletes-imm-mem-tables num-entries-active-mem-table num-entries-imm-mem-tables num-immutable-mem-table num-live-versions num-running-compactions num-running-flushes size-all-mem-tables total-sst-files-size write-stall-duration * emit as 0 in INFO https://docs.con fl uent.io/platform/current/streams/monitoring.html#rocksdb-metrics https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h - “struct Properties”
  • 22. Analytic Applications (Tumbling Window) line-item product-stats aggregate toStream mapValue windowBy TimeWindows .ofSize(30sec) RocksDB
  • 23. line-item product-stats aggregate toStream mapValue windowBy TimeWindows .ofSize(30sec) .advanceBy(15sec) RocksDB Analytic Applications (Hopping Window)
  • 24. line-item product-stats aggregate toStream mapValue windowBy SlidingWindows .ofSize(30sec) suppress RocksDB InMemory Analytic Applications (Sliding Window)
  • 25. line-item product-stats aggregate toStream mapValue windowBy SessionWindows .ofInactivityGap(30sec) RocksDB Analytic Applications (Session Window)
  • 26. line-item product-stats aggregate toStream processor mapValue punctuate schedule(15s, wall-clock) Analytic Applications (No Window)
  • 30. Sub-topology Listing prometheus — label_replace label_replace( kafka_streams_stream_processor_node_metrics_process_total{} * on (instance) group_left(application_id) kafka_streams_info{application_id=~"${application}"}, "subtopology", "$1", "task_id", "(.*)_.*" ) prometheus — on / group_left name your processors
  • 39. End to End Latency
  • 41. Demo
  • 43. • Thread ⬌ Task Assignment • Provide Topology and Task Assignment Dashboards • End-To-End Latency Metrics • Consumed and Produced Metrics • State Store Metrics • DSL could change implementation • ∴ always check metrics after upgrade • Don’t collected too frequently • these demo collections ⥇ what production collections • Verify Documentation Recommendations
  • 45. • Kinetic Edge Blog https://www.kineticedge.io/insights • GitHub • https://github.com/kineticedge/kafka-streams-dashboards • Sample Applications, dashboards, kraft cluster, zk cluster, “load-balanced” cluster • docker - tra ffi c control, single image to run application (with libraries built in — fast!) • KIPs • KIP-444: Augment metrics for Kafka Streams - 2.4 (& 2.5, 2.6) • KIP-471: Expose RocksDB Metrics in Kafka Streams - 2.4 (& 3.2) • KIP-607: Add Metrics to Kafka Streams to Report Properties of RocksDB - 2.7 • KIP-613: Add end-to-end latency metrics to Streams - 2.6 (& 2.7) • KIP-846: Source/sink node metrics for Consumed/Produced throughput in Streams - 3.3 • KIP-869: Improve Streams State Restoration Visibility - 3.5 (They keep on coming!) • StateUpdated also needs to be enabled (experimental?) Resources restart < 3 seconds