Preparing Your Kafka Streams Application For Production and Beyond

The Lord of The Streams:
Preparing Your Kafka Streams Application For Production and Beyond
Rohan Desai
Co-Founder, Responsive

You’ve Built Your App. Now What?
How It Started How It’s Going

You’ve Built Your App. Now What?
1. Stabilizing
2. Sizing
3. Monitoring

Stabilizing and Sizing Is Hard
Expectation Reality
Stabilizing → Sizing → Monitoring

Stabilizing and Sizing Is Hard: Why?

It’s a Process, Not a Magic Formula
Sizing is an experimental process of trial and error as you navigate towards a good configuration:
1. Try running it with the smallest possible cluster
2. Make sure your application is stable
   a. If it’s not, then debug, tune, and go back to step 1
3. Make sure your application is keeping up
   a. If it’s not, then debug, tune, or scale, and go back to step 1
4. Celebrate!

Your Starting Cluster
Start with a relatively small cluster
- 1-2 cores
- 4-8 GB memory per core
- 100s of GB disk
- If stateful, run 2 nodes

Stability Checklist: Check Committed Offsets
$ kafka-consumer-groups --bootstrap-server my.bootstrap.server:9092 --describe --group responsive --offsets
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
...
responsive input 21 100 120 20 responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-2-consumer-09647258-64a1-4517-96fd-6232ba9e3078 /1.2.3.4 responsive-36e36732-6b13-4423-a039-3c1d650c1d1f-StreamThread-1-consumer
...
Good: Check committed offsets periodically & make sure they’re advancing
Better: Export it as a metric if you can
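
One way to automate the "make sure they're advancing" check is to compare two snapshots of committed offsets taken some time apart. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes you have already collected the snapshots yourself (for example by parsing the kafka-consumer-groups output above) into maps keyed by a "topic-partition" string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Kafka: flags partitions that still have lag
// but whose committed offset did not move between two snapshots.
public class OffsetProgressCheck {

    public static List<String> stalledPartitions(Map<String, Long> earlierCommitted,
                                                 Map<String, Long> laterCommitted,
                                                 Map<String, Long> laterLogEnd) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Long> entry : laterCommitted.entrySet()) {
            String partition = entry.getKey();
            long committedNow = entry.getValue();
            Long committedBefore = earlierCommitted.get(partition);
            // If the log-end offset is unknown, assume there is no lag.
            long logEnd = laterLogEnd.getOrDefault(partition, committedNow);

            boolean hasLag = logEnd > committedNow;
            // A partition seen for the first time is treated as having advanced.
            boolean advanced = committedBefore == null || committedNow > committedBefore;
            if (hasLag && !advanced) {
                stalled.add(partition);
            }
        }
        return stalled;
    }
}
```

A partition with zero lag is not reported, since an idle input is not a stability problem.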

Stability Checklist: Make Sure No Rebalances
last-rebalance-seconds-ago
mbean:kafka.consumer:type=consumer-coordinator-metrics,client-id={clientId}
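
If no external JMX exporter is available, the metric can also be read from inside the application's own JVM. The following is a minimal sketch using only the standard javax.management API; the class and method names are hypothetical, and it assumes the Kafka clients run in the same JVM with the default JMX reporter enabled.

```java
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical probe: collects last-rebalance-seconds-ago for every consumer
// whose coordinator metrics are registered in this JVM's platform MBean server.
public class RebalanceAgeProbe {

    public static Map<String, Object> lastRebalanceSecondsAgo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName pattern = new ObjectName(
                    "kafka.consumer:type=consumer-coordinator-metrics,client-id=*");
            Map<String, Object> result = new HashMap<>();
            for (ObjectName name : server.queryNames(pattern, null)) {
                Object value = server.getAttribute(name, "last-rebalance-seconds-ago");
                result.put(name.getKeyProperty("client-id"), value);
            }
            return result;
        } catch (JMException e) {
            throw new IllegalStateException("Could not read rebalance metrics", e);
        }
    }
}
```

The map is empty when no consumer is registered in the JVM. A value that keeps resetting to a small number suggests the group is rebalancing repeatedly.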

Stability Checklist: Make Sure Clients RUNNING
state
mbean:kafka.streams:type=stream-metrics,client-id={clientId}

Stability Checklist: Making Sure You’re Bounding Memory Usage
https://kafka.apache.org/37/documentation/streams/developer-guide/memory-mgmt
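
As a rough illustration of what bounding memory can look like in configuration, the sketch below sets the two Kafka Streams properties that the linked guide centers on: the record cache size and a RocksDB config setter. The helper class is hypothetical, the values are placeholders, and the config setter class name passed in is assumed to be one you have written yourself following the linked guide.

```java
import java.util.Properties;

// Hypothetical helper that assembles memory-related Kafka Streams settings.
public class BoundedMemoryConfig {

    public static Properties boundedMemoryProps(long recordCacheBytes,
                                                String rocksDbConfigSetterClass) {
        Properties props = new Properties();
        // Upper bound for the record cache shared across this instance's stream threads.
        // Older Kafka Streams releases use cache.max.bytes.buffering instead.
        props.setProperty("statestore.cache.max.bytes", Long.toString(recordCacheBytes));
        // Fully qualified name of your own RocksDBConfigSetter implementation,
        // which is where block cache and write buffer limits are applied.
        props.setProperty("rocksdb.config.setter", rocksDbConfigSetterClass);
        return props;
    }
}
```

For example, `boundedMemoryProps(64L * 1024 * 1024, "com.example.BoundedMemoryRocksDBConfig")` would cap the record cache at 64 MiB; both arguments here are illustrative values, not recommendations.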

Debugging an Unstable Application
Debugging Rebalances: https://www.responsive.dev/blog/guide-to-kafka-streams-rebalancing
Debugging State: https://www.responsive.dev/blog/guide-to-kafka-streams-state
Symptom: "Encountered the following exception during processing and the registered exception handler opted to SHUTDOWN_KAFKA_STREAMS_CLIENT"
  Remediation: Debug the reported exception
Symptom: "removing member <member_id> on heartbeat expiration"
  Remediation: Tune session.timeout.ms
Symptom: "consumer poll timeout has expired"
  Remediation: Tune max.poll.interval.ms or max.poll.records
Symptom: Streams app crashes with out-of-storage
  Remediation: Make sure you’re setting state.dir correctly
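
The remediations above map onto a handful of configuration properties. The sketch below shows where each one would be set; the helper class is hypothetical and the numeric values are placeholders for illustration only, not tuning advice. It uses the "consumer." prefix, which Kafka Streams accepts for passing settings to its embedded consumers.

```java
import java.util.Properties;

// Hypothetical helper collecting the settings named in the symptom list above.
public class StabilityTuning {

    public static Properties tunedProps(String stateDir) {
        Properties props = new Properties();
        // "removing member ... on heartbeat expiration": allow more time between heartbeats.
        props.setProperty("consumer.session.timeout.ms", "60000");
        // "consumer poll timeout has expired": allow more time between polls...
        props.setProperty("consumer.max.poll.interval.ms", "600000");
        // ...or fetch fewer records per poll so each batch finishes sooner.
        props.setProperty("consumer.max.poll.records", "500");
        // Out-of-storage crashes: point state stores at a volume sized for them.
        props.setProperty("state.dir", stateDir);
        return props;
    }
}
```

These properties would be merged with the rest of the application's Streams configuration before constructing the client.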

Sizing/Tuning

Sizing/Tuning: When To Scale or Tune
records-lag
mbean:kafka.consumer:type=consumer-fetch-manager-metrics,partition={partition},topic={topic},client-id={clientId}
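
A single records-lag reading says little; what matters is whether lag keeps growing. As a minimal sketch, assuming you sample records-lag at a fixed interval, the hypothetical helper below treats strictly increasing samples as the signal that the application is not keeping up. Real alerting would likely want a longer window and some tolerance for noise.

```java
// Hypothetical helper: decides whether a series of records-lag samples is growing.
public class LagTrend {

    /** Returns true only when every sample is larger than the one before it. */
    public static boolean isLagGrowing(long[] lagSamples) {
        if (lagSamples.length < 2) {
            return false; // not enough data to call it a trend
        }
        for (int i = 1; i < lagSamples.length; i++) {
            if (lagSamples[i] <= lagSamples[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```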

Is Kafka Streams the Problem?
Gold standard: wall-clock profile with async-profiler
Check “total time spent” metrics:
io-wait-time-ns-total
mbean:kafka.consumer:type=consumer-metrics,client-id={clientid}
blocked-time-ns-total
mbean:kafka.streams:type=stream-thread-metrics,thread-id={threadid}
Check external bottlenecks
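
Because these metrics are cumulative totals in nanoseconds, they are easiest to interpret as a fraction of wall-clock time between two readings. The sketch below shows that arithmetic for blocked-time-ns-total; the class and method names are hypothetical, and it assumes you sample the metric yourself at the start and end of a known interval.

```java
// Hypothetical helper: turns two readings of a cumulative "-ns-total" metric
// into the fraction of the sampling interval that the time was spent.
public class BlockedTimeRatio {

    public static double blockedFraction(double totalNsAtStart,
                                         double totalNsAtEnd,
                                         long intervalNanos) {
        if (intervalNanos <= 0) {
            throw new IllegalArgumentException("intervalNanos must be positive");
        }
        return (totalNsAtEnd - totalNsAtStart) / intervalNanos;
    }
}
```

For instance, a stream thread whose total grows by 3 seconds over a 10-second window spent roughly 30% of that window blocked on Kafka. A fraction close to 1 suggests the thread is mostly waiting on the Kafka clients rather than on its own processing.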

Do I Need More Memory For Reads?

It’s challenging to measure hit rate directly. There are metrics, but they’re all at DEBUG level.
Kafka Streams Cache:
hit-ratio-avg
mbean:kafka.streams:type=stream-record-cache-metrics,client-id={clientId},thread-id={threadid},task-id={taskid},record-cache-id={storeid}
RocksDB Cache:
block-cache-data-hit-ratio, block-cache-index-hit-ratio, block-cache-filter-hit-ratio
mbean:kafka.streams:type=stream-state-metrics,client-id={clientId},thread-id={threadid},task-id={taskid},state-id={storeid}

Usually it’s good enough to look at total cached memory and iostat.
Example from an application running on k8s with a memory limit of 4Gi:
# free -mh
        total    used    free   shared   buff/cache   available
Mem:     15Gi   3.3Gi   8.9Gi    2.0Mi        3.0Gi        11Gi
# iostat -kdx 10
Device       r/s      rkB/s   rrqm/s   %rrqm   r_await   rareq-sz ...
nvme1n1   3550.80   70363.20     0.00    0.00      0.88      19.82 ...
Device       r/s      rkB/s   rrqm/s   %rrqm   r_await   rareq-sz ...
nvme1n1   2177.80   27339.20     0.00    0.00      0.70      12.55 ...

Do I Need More Memory For Writes?

Do I Need More Disk Capacity (IOPS)?
- You tried read or write memory sizing changes, but didn’t notice much change to IOPS
- At this point you probably just need more IOPS

Tuning Thread Count
● In most cases, you’re going to be fine with ~2 threads per core
● If you have long blocking calls or slow disks, consider tuning up
● Remember, you can only add threads up to the number of tasks
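
The rules of thumb above reduce to a small calculation: start from about two threads per core and cap the result at the number of tasks. The helper below is a hypothetical sketch of that starting point only; it does not account for the blocking-call or slow-disk cases, where you would tune upward by hand.

```java
// Hypothetical helper: a starting value for num.stream.threads.
public class ThreadCountSketch {

    public static int suggestedThreads(int cores, int tasks) {
        int byCores = 2 * cores;                 // ~2 threads per core
        int capped = Math.min(byCores, tasks);   // threads beyond the task count sit idle
        return Math.max(1, capped);              // always run at least one thread
    }
}
```

The result would be passed to Kafka Streams through the num.stream.threads setting, with the core count taken from the container's CPU allocation rather than the host where the two differ.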

Scaling Options
Vertical Horizontal

Monitoring A Scale Up

Monitoring

Monitoring SLOs: Lag/Expected Latency

Monitoring SLOs: Utilization

Resources
- https://www.responsive.dev/blog/a-size-for-every-stream
- https://www.responsive.dev/blog/guide-to-kafka-streams-rebalancing
- https://www.responsive.dev/blog/guide-to-kafka-streams-state
- Async Profiler: https://github.com/async-profiler/async-profiler
- https://kafka.apache.org/10/documentation/streams/developer-guide/memory-mgmt
