You cannot operate what you cannot measure. In this talk, I present the built-in metrics framework of Kafka Streams, which supports monitoring Kafka Streams applications. You will learn how to set up monitoring of metrics for your Kafka Streams applications, and you will hear about recent improvements to the metrics framework. KIP-444 simplifies and extends the built-in metrics framework. The RocksDB metrics introduced in KIP-471 and KIP-607 allow you to look directly into the built-in persistent state stores of your Kafka Streams applications. Finally, KIP-613 specifies metrics that measure end-to-end latencies in your applications. This talk will help you collect information about the behavior of your Kafka Streams applications and reason about their deployment. In the end, you will be able to better understand your applications and run them in a more robust manner.
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna, Confluent
1. Mind the App
How to monitor your Kafka Streams applications
Bruno Cadonna, Kafka Summit 2021 Europe
2. Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
About me
Bruno Cadonna
Contributor to Apache Kafka &
Software Developer at Confluent
Content
• Basics about metrics in Kafka
• Metrics in Kafka Streams
• KIP-444: Improving Kafka Streams’ metrics
• KIP-471 and KIP-607: RocksDB metrics
• KIP-613: End-to-end latency metrics
• Takeaways
A metric in Kafka
• consists of a name, a value, and a configuration
• a metric name is composed of
  • name
  • group
  • tags
  • description
• a metric value inherits from the Object class, e.g. integral number, decimal number, string, …
• metric config contains the recording level which can be INFO, DEBUG, TRACE
• example:
  • name: process-rate
  • group: stream-thread-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1
  • description: The average number of processed records per second
  • value: 123456.78
  • recording level: INFO
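To make the parts of a metric name concrete, the following minimal sketch maps the example above to the JMX MBean name under which the default JMX reporter is expected to publish it. It assumes the usual layout kafka.streams:type=<group>,<tag>=<value>, with the metric name (process-rate) as an attribute of that MBean; the class and method names are hypothetical and only the JDK is used. Verify the exact MBean names of your application in a JMX client before relying on them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of Kafka.
final class MetricMBeanNames {

    // Builds the MBean name for a metric group and its tags, assuming the
    // layout kafka.streams:type=<group>,<tag>=<value>,...
    static ObjectName mbeanNameFor(String group, Map<String, String> tags) {
        StringBuilder name = new StringBuilder("kafka.streams:type=").append(group);
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            name.append(',').append(tag.getKey()).append('=').append(tag.getValue());
        }
        try {
            return new ObjectName(name.toString());
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid MBean name: " + name, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("thread-id", "myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1");
        ObjectName mbean = mbeanNameFor("stream-thread-metrics", tags);
        // The metric name is the attribute to read on this MBean.
        System.out.println(mbean + " attribute=process-rate");
    }
}
```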
A sensor in Kafka
• maintains a sequence of recorded values
• maintains a set of metrics
• each metric specifies an aggregation on the recorded values
• each time a value is recorded all metrics in a sensor are updated
• example:
  • process-rate and process-total are recorded by the same sensor
  • process-rate computes the number of processed records over time
  • process-total computes the total number of processed records
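The relationship between a sensor and its metrics can be illustrated with a toy model. This is not Kafka's Sensor class: ToySensor, Stat, Total, and Rate are hypothetical names, and the rate here is a plain average since creation, whereas Kafka computes rates over sampled time windows. The sketch only shows that one recorded value updates every attached aggregation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a sensor; hypothetical, not Kafka's implementation.
final class ToySensor {

    // An aggregation over the values recorded by the sensor.
    interface Stat {
        void record(double value, long nowMs);
        double measure(long nowMs);
    }

    // Sums all recorded values, in the spirit of process-total.
    static final class Total implements Stat {
        private double total;
        public void record(double value, long nowMs) { total += value; }
        public double measure(long nowMs) { return total; }
    }

    // Recorded values per second since startMs, in the spirit of process-rate.
    static final class Rate implements Stat {
        private final long startMs;
        private double total;
        Rate(long startMs) { this.startMs = startMs; }
        public void record(double value, long nowMs) { total += value; }
        public double measure(long nowMs) {
            double elapsedSeconds = Math.max(nowMs - startMs, 1) / 1000.0;
            return total / elapsedSeconds;
        }
    }

    private final List<Stat> stats = new ArrayList<>();

    void add(Stat stat) { stats.add(stat); }

    // Recording one value updates every metric attached to the sensor.
    void record(double value, long nowMs) {
        for (Stat stat : stats) {
            stat.record(value, nowMs);
        }
    }

    public static void main(String[] args) {
        ToySensor sensor = new ToySensor();
        Total total = new Total();
        Rate rate = new Rate(0L);
        sensor.add(total);
        sensor.add(rate);
        // One processed record per second for three seconds.
        for (long t = 1000; t <= 3000; t += 1000) {
            sensor.record(1.0, t);
        }
        System.out.println("total=" + total.measure(3000) + " rate=" + rate.measure(3000));
    }
}
```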
Anatomy of a Kafka Streams application
[Diagram: a Kafka Streams client contains stream thread 1 and stream thread 2; tasks 1 to 5 run on the stream threads; a task contains processor nodes, state stores, and caches.]
How does Kafka Streams report metrics?
• the Kafka Streams client exposes metrics(), which returns a read-only map of metrics
• metrics reporters implement MetricsReporter
  • JMX reporter: registered by default, no need to set
  • my reporter: set with the Kafka Streams config metric.reporters
interface MetricsReporter {
    // excerpt; init() and close() are omitted
    // called when a metric is added or updated
    void metricChange(KafkaMetric metric);
    // called when a metric is removed
    void metricRemoval(KafkaMetric metric);
}
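As a minimal sketch of the configuration side, the following builds the properties that register a custom reporter and raise the recording level. The reporter class name com.example.MyReporter, the application id, and the broker address are placeholders chosen for illustration; metric.reporters and metrics.recording.level are the config keys. The sketch uses only java.util.Properties, so it does not start a Kafka Streams client.

```java
import java.util.Properties;

// Hypothetical helper that assembles metrics-related Kafka Streams configs.
final class StreamsMetricsConfig {

    static Properties metricsConfig(String reporterClassName) {
        Properties props = new Properties();
        props.put("application.id", "myapp");              // placeholder
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        // Comma-separated list of classes that implement MetricsReporter.
        props.put("metric.reporters", reporterClassName);
        // INFO is the default; DEBUG also records the lower-level metrics.
        props.put("metrics.recording.level", "DEBUG");
        return props;
    }

    public static void main(String[] args) {
        // com.example.MyReporter is a hypothetical reporter class.
        Properties props = metricsConfig("com.example.MyReporter");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting Properties object would be passed to the KafkaStreams constructor together with the topology.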
jconsole
[Screenshot: jconsole showing a Kafka Streams MBean, annotated with metric group, tag thread-id, metric name, metric description, and metric value.]
Datadog
[Screenshot: Datadog metrics view, annotated with metric group, tags, and metric name.]
What metrics does Kafka Streams expose?
• Kafka Streams client level:
  • name: state
  • group: stream-metrics
  • tags: client-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003
• stream thread level:
  • name: process-rate
  • group: stream-thread-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1
• task level:
  • name: process-latency-avg
  • group: stream-task-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1,
    task-id = 0_1
…some more metrics
• processor node level
  • name: process-rate
  • group: stream-processor-node-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1,
    task-id = 0_1,
    processor-node-id = KSTREAM-SINK-0000000004
• state store level
  • name: put-rate
  • group: stream-state-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1,
    task-id = 0_1,
    rocksdb-state-id = count-items
• cache level
  • name: hit-ratio-avg
  • group: stream-record-cache-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1,
    task-id = 0_1,
    record-cache-id = 0_1-count-items
… and finally
• all metrics of the embedded consumers, producers, and admin client
  • name: last-rebalance-seconds-ago
  • group: consumer-coordinator-metrics
  • tags: client-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1-consumer
New metrics
• introduces client-level metrics
  • version,
  • commit-id,
  • application-id,
  • topology-description,
  • state,
  • alive-stream-threads
• introduces new task level metrics
  • active-process-ratio,
  • standby-process-ratio (not yet implemented),
  • dropped-records
Refactorings
• renames some metric names and some metric tags
• reports client-level and stream-thread-level metrics on INFO and most metrics on lower levels on DEBUG
• removes all parent metrics except one and lets users do the roll-up themselves
• removes overlapping metrics
  • dropped-records (task-level, INFO) replaces
    • late-records-drop (processor node, INFO),
    • skipped-records (processor node, INFO),
    • expired-window-record-drop (state store, DEBUG)
Improving custom metrics
• Sensor addLatencyRateTotalSensor(final String scopeName,
                                   final String entityName,
                                   final String operationName,
                                   final Sensor.RecordingLevel recordingLevel,
                                   final String... tags);
• Sensor addRateTotalSensor(final String scopeName,
                            final String entityName,
                            final String operationName,
                            final Sensor.RecordingLevel recordingLevel,
                            final String... tags);
• only available where you have access to the ProcessorContext
• you can add additional metrics to the sensor with Sensor#add()
Example of custom metrics
import java.util.Locale;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountProcessor implements Processor<String, String, String, String> {
    private KeyValueStore<String, Integer> kvStore;
    private Sensor countEmptyRecords;

    @Override
    public void init(final ProcessorContext<String, String> context) {
        // custom sensor that maintains a rate and a total metric for empty records
        countEmptyRecords = context.metrics().addRateTotalSensor(
            "word-counter",
            "word-counter" + context.taskId(),
            "count-empty-messages",
            Sensor.RecordingLevel.INFO
        );
        kvStore = context.getStateStore("Counts");
    }

    @Override
    public void process(final Record<String, String> record) {
        final String value = record.value();
        // "".split(" ") yields one empty string, so check the value itself
        if (value == null || value.trim().isEmpty()) {
            countEmptyRecords.record();
            return;
        }
        for (final String word : value.toLowerCase(Locale.getDefault()).split(" ")) {
            final Integer oldValue = kvStore.get(word);
            kvStore.put(word, oldValue == null ? 1 : oldValue + 1);
        }
    }
}
RocksDB metrics
• RocksDB is the default state store in Kafka Streams
• statistics-based metrics (KIP-471, AK 2.4): cumulative measurements over time collected by RocksDB
  • name: bytes-written-rate
  • group: stream-state-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1,
    task-id = 0_1,
    rocksdb-state-id = count-items
• properties-based metrics (KIP-607, AK 2.7): properties exposed by RocksDB providing current measurements
  • name: block-cache-usage
  • group: stream-state-metrics
  • tags: thread-id = myapp-2d0b492c-87f1-11eb-8dcd-0242ac130003-StreamThread-1,
    task-id = 0_1,
    rocksdb-state-id = count-items
Recording RocksDB metrics
• statistics-based metrics
  • collecting statistics-based metrics may have an impact on performance
  • recording metrics during state store operations might be costly
  • instead each state store has a metric recorder
  • all metric recorders are triggered once per minute by one dedicated thread that is started at Kafka Streams client start-up
• properties-based metrics
  • all properties-based metrics are gauges
  • a gauge executes some given code each time the metric is queried
  • properties-based metrics query RocksDB properties
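The difference between the two recording styles can be sketched with a toy model. ToyStore, Recorder, and cacheUsageGauge are hypothetical stand-ins, not Kafka or RocksDB classes: the recorder copies a cumulative statistic only when it is triggered (in Kafka Streams by the dedicated thread, here by a manual call), while the gauge evaluates its code on every query.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Toy model contrasting a triggered recorder with a gauge; hypothetical names.
final class GaugeVsRecorder {

    // Stand-in for a state store with one cumulative statistic and one property.
    static final class ToyStore {
        final AtomicLong bytesWritten = new AtomicLong(); // cumulative statistic
        final AtomicLong cacheUsage = new AtomicLong();   // current property
    }

    // Properties-based style: the gauge runs its code each time it is queried.
    static LongSupplier cacheUsageGauge(ToyStore store) {
        return store.cacheUsage::get;
    }

    // Statistics-based style: the value changes only when the recorder is triggered.
    static final class Recorder {
        private final ToyStore store;
        private long lastRecorded;

        Recorder(ToyStore store) { this.store = store; }

        void trigger() { lastRecorded = store.bytesWritten.get(); }

        long lastRecorded() { return lastRecorded; }
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        Recorder recorder = new Recorder(store);
        LongSupplier gauge = cacheUsageGauge(store);

        store.bytesWritten.addAndGet(100);
        store.cacheUsage.set(42);

        System.out.println("gauge now: " + gauge.getAsLong());
        System.out.println("recorder before trigger: " + recorder.lastRecorded());
        recorder.trigger();
        System.out.println("recorder after trigger: " + recorder.lastRecorded());
    }
}
```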
When to look at RocksDB metrics?
• high memory usage (properties-based metrics)
  • size-all-mem-tables
  • block-cache-usage
  • block-cache-pinned-usage
  • estimate-table-readers-mem
• high disk usage (properties-based metrics)
  • total-sst-files-size
• high disk I/O and write stalls (statistics-based metrics)
  • memtable-bytes-flushed-[rate | total]
  • bytes-[read | written]-compaction-rate
  • write-stall-duration-[avg | total]
  • memtable-hit-ratio
  • block-cache-[data | index | filter]-hit-ratio
• too many open files (statistics-based metrics)
  • number-open-files
for more details, check out the blog post:
How to Tune RocksDB for Your Kafka Streams Application
https://www.confluent.io/blog/how-to-tune-rocksdb-kafka-streams-state-stores-performance/
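One way to look at these metrics without a dashboard is to query JMX directly. The sketch below assumes the state store metrics are published as MBeans matching kafka.streams:type=stream-state-metrics,* with the metric names as attributes; RocksDbMetricsReader is a hypothetical helper that uses only the JDK. It reads from the MBean server of the current JVM, so it only finds something when run inside the Kafka Streams application (a remote JMX connection would need a JMXConnector instead).

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper that reads one state store metric from all matching MBeans.
final class RocksDbMetricsReader {

    // Pattern for all state store MBeans (assumed naming layout).
    static ObjectName stateStorePattern() {
        try {
            return new ObjectName("kafka.streams:type=stream-state-metrics,*");
        } catch (MalformedObjectNameException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns MBean name -> attribute value for every MBean exposing the metric.
    static Map<String, Object> read(MBeanServer server, String metricName) {
        Map<String, Object> values = new TreeMap<>();
        for (ObjectName mbean : server.queryNames(stateStorePattern(), null)) {
            try {
                values.put(mbean.getCanonicalName(), server.getAttribute(mbean, metricName));
            } catch (JMException e) {
                // This MBean does not expose the metric or has disappeared; skip it.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        read(server, "block-cache-usage")
            .forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```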
End-to-end-latency metrics
[Diagram: a topology with a source node, a filter, an aggregation, and a sink node; each latency below spans from event time to processing time.]
• consumption latency (INFO)
  • name: record-e2e-latency-[min | max | avg]
  • group: stream-processor-node-metrics
  • tags: thread-id = myapp-…,
    task-id = 0_1,
    processor-node-id = KSTREAM-SOURCE-0000000004
• begin-to-state latency (TRACE)
  • name: record-e2e-latency-[min | max | avg]
  • group: stream-state-metrics
  • tags: thread-id = myapp-…,
    task-id = 0_1,
    rocksdb-state-id = count-items
• full end-to-end latency (INFO)
  • name: record-e2e-latency-[min | max | avg]
  • group: stream-processor-node-metrics
  • tags: thread-id = myapp-…,
    task-id = 0_1,
    processor-node-id = KSTREAM-SINK-0000000004
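The latency behind these metrics is the difference between the processing time at the measuring node and the event time of the record. The toy tracker below illustrates the min, max, and avg aggregations over that difference; E2eLatencyTracker is a hypothetical class, and it aggregates over all recorded values, whereas the built-in metrics are maintained by Kafka's own sensors.

```java
// Toy aggregation of record end-to-end latencies; hypothetical, not Kafka code.
final class E2eLatencyTracker {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private long sum;
    private long count;

    // Latency of one record: processing time at the node minus the record's event time.
    void record(long eventTimeMs, long processingTimeMs) {
        long latency = processingTimeMs - eventTimeMs;
        min = Math.min(min, latency);
        max = Math.max(max, latency);
        sum += latency;
        count++;
    }

    long min() { return min; }

    long max() { return max; }

    // Average latency, or 0 if nothing has been recorded yet.
    double avg() { return count == 0 ? 0.0 : (double) sum / count; }

    public static void main(String[] args) {
        E2eLatencyTracker tracker = new E2eLatencyTracker();
        long now = System.currentTimeMillis();
        // Two records whose event times lie 100 ms and 300 ms in the past.
        tracker.record(now - 100, now);
        tracker.record(now - 300, now);
        System.out.println("min=" + tracker.min() + " max=" + tracker.max() + " avg=" + tracker.avg());
    }
}
```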
End-to-end-latency metrics (advanced)
[Diagram: two tasks (task 1 and task 2), each consisting of a source node, a filter, an aggregation, and a sink node, with event time and processing time marked for each task; one span is labeled "processing delay of task 2".]
Takeaways
• Kafka Streams exposes various metrics on different levels
• metrics were recently consolidated with KIP-444
• RocksDB metrics let you gain insight into state stores
• Kafka Streams allows monitoring record end-to-end latencies