"So you've built an awesome event-driven application using Kafka Streams and are ready to unleash it in production. But there are a few really important problems you'll have to work out to make sure your rollout goes off without a hitch, and so that you'll know if it doesn't:
- How do you size your resources to run your application? How much cpu/memory/disk will you need?
- What kind of SLOs should you be setting for a Kafka Streams application?
- What should you be measuring/monitoring to know if your application is healthy?
- What should you be measuring/monitoring to know if your application needs more or fewer resources?
- What should you be measuring/monitoring to track operations like rolls and scaling?
In this talk, we'll walk you through everything you need to know to prepare for running your Kafka Streams application against a real workload, and to set yourself up for success once it's up and running."
7. It’s a Process, Not a Magic Formula
Sizing is an experimental process of trial and error as you navigate towards a good configuration:
1. Try running it with the smallest possible cluster
2. Make sure your application is stable
a. If it’s not, then debug, tune, and go back to step 1
3. Make sure your application is keeping up
a. If it’s not, then debug, tune, or scale, and go back to step 1
4. Celebrate!
Stabilizing → Sizing → Monitoring
8. Your Starting Cluster
Start with a relatively small cluster
- 1-2 cores
- 4-8 GB memory per core, 100s of GB disk
- If stateful, run 2 nodes
9. Stability Checklist: Check Committed Offsets
$ kafka-consumer-groups --bootstrap-server my.bootstrap.server:9092 --describe --group responsive --offsets
GROUP  TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
...
responsive  input  21  100  120  20  responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-2-consumer-09647258-64a1-4517-96fd-6232ba9e3078  /1.2.3.4  responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-1-consumer
...
Good: Check committed offsets periodically & make sure they’re advancing
Better: Export it as a metric if you can
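One rough way to turn this check into a metric is to sample the command output twice and flag partitions whose committed offset has not moved while lag remains. The sketch below assumes the column order shown in the output above; `parse_describe` and `stalled_partitions` are hypothetical helper names, not part of Kafka.

```python
from typing import Dict, List, Tuple

# (topic, partition) -> (committed offset, lag)
Offsets = Dict[Tuple[str, int], Tuple[int, int]]


def parse_describe(output: str, group: str) -> Offsets:
    """Parse data rows of `kafka-consumer-groups --describe --offsets` for one group."""
    offsets: Offsets = {}
    for line in output.splitlines():
        cols = line.split()
        # Data rows start with the group name, followed by
        # TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
        if len(cols) < 6 or cols[0] != group:
            continue
        topic, partition, current, _log_end, lag = cols[1:6]
        if not (partition.isdigit() and current.isdigit() and lag.isdigit()):
            continue  # e.g. "-" when nothing has been committed yet
        offsets[(topic, int(partition))] = (int(current), int(lag))
    return offsets


def stalled_partitions(before: Offsets, after: Offsets) -> List[Tuple[str, int]]:
    """Partitions that still have lag but whose committed offset did not advance."""
    stalled = []
    for key, (committed, lag) in after.items():
        previous = before.get(key)
        if previous is not None and committed <= previous[0] and lag > 0:
            stalled.append(key)
    return sorted(stalled)
```

The two samples should be taken at least a few commit intervals apart. The length of `stalled_partitions(...)` can then be exported as a gauge by whatever metrics agent you already run.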
10. Stability Checklist: Make Sure No Rebalances
last-rebalance-seconds-ago
mbean:kafka.consumer:type=consumer-coordinator-metrics,client-id={clientId}
11. Stability Checklist: Make Sure Clients RUNNING
state
mbean:kafka.streams:type=stream-metrics,client-id={clientId}
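The rebalance and client-state checks can be combined into one small health function once the metric values have been scraped. This is a minimal sketch under several assumptions: the values arrive as plain dictionaries keyed by client id (how you collect them is up to your metrics pipeline), `state` is reported as the state name, a negative `last-rebalance-seconds-ago` means no rebalance has been recorded yet, and the 10-minute quiet period is an arbitrary illustrative threshold. `stability_problems` is a hypothetical name.

```python
from typing import Dict, List


def stability_problems(
    streams_state: Dict[str, str],
    rebalance_seconds_ago: Dict[str, float],
    quiet_period_s: float = 600.0,
) -> List[str]:
    """Return human-readable findings; an empty list means the checks passed.

    streams_state: Streams client id -> value of the `state` metric.
    rebalance_seconds_ago: consumer client id -> `last-rebalance-seconds-ago`.
    """
    problems = []
    for client_id, state in sorted(streams_state.items()):
        if state != "RUNNING":
            problems.append(f"{client_id}: state is {state}")
    for client_id, seconds in sorted(rebalance_seconds_ago.items()):
        # Assumption: negative values mean "no rebalance recorded yet".
        if 0 <= seconds < quiet_period_s:
            problems.append(f"{client_id}: rebalanced {seconds:.0f}s ago")
    return problems
```

Note that the two metrics are keyed differently: `state` is reported once per Streams client, while `last-rebalance-seconds-ago` is reported per consumer, so the function takes them as separate inputs.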
13. Debugging an Unstable Application
Debugging Rebalances: https://www.responsive.dev/blog/guide-to-kafka-streams-rebalancing
Debugging State: https://www.responsive.dev/blog/guide-to-kafka-streams-state
Symptom | Remediation
Encountered the following exception during processing and the registered exception handler opted to SHUTDOWN_KAFKA_STREAMS_CLIENT | Debug the reported exception
removing member <member_id> on heartbeat expiration | Tune session.timeout.ms
consumer poll timeout has expired | Tune max.poll.interval.ms or max.poll.records
Streams app crashes with out-of-storage | Make sure you’re setting state.dir correctly
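The symptom table lends itself to a simple log scan. The sketch below matches the log fragments listed above and returns the corresponding remediation. The fragments are copied from the table and the exact wording in your logs may differ by version, so treat them as assumptions. The out-of-storage row is left out because the table gives no log text for it. `SYMPTOMS` and `remediations` are hypothetical names.

```python
from typing import Iterable, List, Tuple

# (log fragment, remediation) pairs taken from the symptom table.
SYMPTOMS: List[Tuple[str, str]] = [
    ("opted to SHUTDOWN_KAFKA_STREAMS_CLIENT", "Debug the reported exception"),
    ("on heartbeat expiration", "Tune session.timeout.ms"),
    ("consumer poll timeout has expired",
     "Tune max.poll.interval.ms or max.poll.records"),
]


def remediations(log_lines: Iterable[str]) -> List[str]:
    """Return each matching remediation once, in order of first appearance."""
    found: List[str] = []
    for line in log_lines:
        for fragment, advice in SYMPTOMS:
            if fragment in line and advice not in found:
                found.append(advice)
    return found
```

Running this over broker logs (for the heartbeat message) and application logs (for the other two) gives a quick first pointer before digging into the linked guides.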
17. Do I Need More Memory For Reads?
18. Do I Need More Memory For Reads?
It’s challenging to measure hit rate directly. There are metrics, but they’re all recorded at DEBUG level.
Kafka Streams Cache:
hit-ratio-avg
mbean:kafka.streams:type=stream-record-cache-metrics,client-id={clientId},thread-id={threadId},task-id={taskId},record-cache-id={storeId}
RocksDB Cache:
block-cache-data-hit-ratio, block-cache-index-hit-ratio, block-cache-filter-hit-ratio
mbean:kafka.streams:type=stream-state-metrics,client-id={clientId},thread-id={threadId},task-id={taskId},state-id={storeId}
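Because these metrics are reported per task, it can help to roll them up per store before reading them. The following sketch assumes the application runs with `metrics.recording.level` set to `DEBUG` and that you already have a mapping from MBean name to hit ratio from your metrics pipeline. `parse_mbean` and `lowest_hit_ratio_by_store` are hypothetical helpers, and the tag used to identify the store is a parameter because it differs between the two metric groups.

```python
from typing import Dict, Tuple


def parse_mbean(name: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:key=value,key=value' into the domain and a tag dict."""
    domain, _, props = name.partition(":")
    tags = dict(p.split("=", 1) for p in props.split(",") if "=" in p)
    return domain, tags


def lowest_hit_ratio_by_store(
    readings: Dict[str, float], store_tag: str = "record-cache-id"
) -> Dict[str, float]:
    """For each store, keep the lowest hit ratio seen across its tasks."""
    lowest: Dict[str, float] = {}
    for name, ratio in readings.items():
        _, tags = parse_mbean(name)
        store = tags.get(store_tag)
        if store is None:
            continue  # not a per-store cache metric
        lowest[store] = ratio if store not in lowest else min(lowest[store], ratio)
    return lowest
```

Taking the minimum rather than the mean is a deliberate choice here: a single task with a poor hit ratio is usually the one worth investigating.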
19. Do I Need More Memory For Reads?
Usually it’s good enough to look at total cached memory and iostat.
Example from an application running on k8s with a memory limit of 4Gi:
# free -mh
       total  used   free   shared  buff/cache  available
Mem:   15Gi   3.3Gi  8.9Gi  2.0Mi   3.0Gi       11Gi
# iostat -kdx 10
Device   r/s      rkB/s     rrqm/s  %rrqm  r_await  rareq-sz  ...
nvme1n1  3550.80  70363.20  0.00    0.00   0.88     19.82     ...
Device   r/s      rkB/s     rrqm/s  %rrqm  r_await  rareq-sz  ...
nvme1n1  2177.80  27339.20  0.00    0.00   0.70     12.55     ...
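To compare read load before and after a memory change, the iostat samples can be reduced to a few numbers. This is a minimal sketch that assumes the extended device report layout shown above (a `Device` header row followed by one row per device); `parse_iostat` and `mean_read_iops` are hypothetical helper names.

```python
from typing import Dict, List, Sequence


def parse_iostat(
    output: str, columns: Sequence[str] = ("r/s", "rkB/s", "r_await")
) -> List[Dict[str, object]]:
    """Extract selected read columns from each device row of `iostat -kdx`."""
    samples: List[Dict[str, object]] = []
    header = None
    for line in output.splitlines():
        cols = line.split()
        if not cols:
            header = None  # a blank line ends the current report
            continue
        if cols[0].rstrip(":") == "Device":
            header = cols
            continue
        if header is None:
            continue
        row = dict(zip(header[1:], cols[1:]))
        if not all(name in row for name in columns):
            continue
        try:
            sample: Dict[str, object] = {"device": cols[0]}
            for name in columns:
                sample[name] = float(row[name])
        except ValueError:
            continue  # skip rows with non-numeric values
        samples.append(sample)
    return samples


def mean_read_iops(samples: List[Dict[str, object]], device: str) -> float:
    """Average r/s across all samples for one device (0.0 if none)."""
    values = [s["r/s"] for s in samples if s["device"] == device]
    return sum(values) / len(values) if values else 0.0
```

If read IOPS stays high while buff/cache is pinned near the container's memory limit, that is a hint that reads are missing the page cache, which is the situation the example output illustrates.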
20. Do I Need More Memory For Writes?
21. Do I Need More Disk Capacity (IOPS)?
- You tried read or write memory sizing changes, but didn’t notice much change to IOPS
- At this point you probably just need more IOPS
22. Tuning Thread Count
● In most cases, you’re going to be fine with ~2 threads per core
● If you have long blocking calls or slow disks, consider tuning up
● Remember, you can only add threads up to the number of tasks
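The rule of thumb above can be written down as a small calculation. This sketch assumes tasks are spread evenly across instances and uses the ~2 threads per core starting point from the slide. `suggested_stream_threads` is a hypothetical helper whose result would feed the `num.stream.threads` setting.

```python
import math


def suggested_stream_threads(
    cores: int, task_count: int, instances: int = 1, threads_per_core: int = 2
) -> int:
    """Starting point for num.stream.threads on one instance.

    Threads beyond the number of tasks an instance can own would sit idle,
    so the per-instance task share caps the result.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    tasks_per_instance = math.ceil(task_count / instances)
    return max(1, min(cores * threads_per_core, tasks_per_instance))
```

For example, `suggested_stream_threads(cores=2, task_count=32, instances=2)` gives 4, while `suggested_stream_threads(cores=4, task_count=6, instances=2)` gives 3 because each instance can own at most 3 tasks. Raise `threads_per_core` if the topology makes long blocking calls or sits on slow disks.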