Are you running Kafka at scale? Have you run into “voodoo problems” in your infrastructure? Our cluster handles 5M messages/sec, and it has taught us some valuable lessons. We have watched our Kafka clusters become sluggish or crash, taking our production services down with them, and we hope the insights we gained help you navigate your next production incident and keep your data pipelines running smoothly. We’ll tell the story of skews and anomalies in CPU and disk metrics, drawing graphs and conclusions along the way. You’ll see how compacted topics, partition distribution, and RAM can affect your cluster’s performance. Finally, we’ll look at how a small configuration drift can rattle your cluster. Our goal is to give you the tools and knowledge to navigate this uncharted territory.
6. The App Economy is a huge and fast-growing opportunity
140B apps downloaded globally in 2020
6.7B devices globally
407B: size of the App Economy by 2026
7. The ironSource platform unlocks business success for the two core constituents of the App Economy
APP: App Developers, with SDK code integrated in tens of thousands of Apps
DEVICE: Telecom Developers, integrated on 1B+ cumulative devices
As of December 31, 2021
8. A Kafka cluster at scale
>100TB of data
5M messages per second
>1,000 consumers & producers
9. 3 stories
Bits and bytes
Configuration time bombs
Brokers tell their stories
12. Our evening takes a turn
[Figure: three time-series panels. DISK I/O READ TIME (y-axis in seconds), DISK I/O READ BYTES (y-axis in MiB), and REQUEST QUEUE SIZE.]
19. Unexpected behavior
[Figure: two time-series panels, NORMALIZED LOAD AVERAGE and SERVER INTERRUPTS TOTAL, each comparing Old Brokers with New Brokers.]
24. Can you spot the difference in disk writes?
[Figure: two time-series panels, WRITE KB PER SECOND (AVG) and WRITE OPS PER SECOND (AVG).]
25. Can you spot the difference in network traffic?
[Figure: two time-series panels, BROKER BYTES IN (AVG) and BROKER BYTES OUT (AVG), y-axes in MB.]
26. iostat to the rescue
[Figure: three time-series panels, READ KB PER SECOND (MAX), WRITE OPS PER SECOND (MAX), and READ OPS PER SECOND (MAX).]
28. Looking at read/write processing time
[Figure: two time-series panels, WRITE PROCESSING TIME and READ PROCESSING TIME (y-axes in ms), with one line each for Broker 1, Broker 2, and Broker 3.]
31. 3 stories combined
Keep an aligned configuration
Monitor anomalies between brokers
Watch for disk performance
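One way to monitor anomalies between brokers is to compare each broker's value for a metric against the cluster median. The following is a minimal sketch of that idea, not the tooling used in the talk: the function name `find_outlier_brokers`, the 50% tolerance, and the sample load averages are all hypothetical.

```python
from statistics import median


def find_outlier_brokers(metric_by_broker, tolerance=0.5):
    """Return brokers whose value deviates from the cluster median
    by more than `tolerance` (relative deviation, hypothetical default)."""
    center = median(metric_by_broker.values())
    if center == 0:
        # Avoid dividing by zero: any non-zero value counts as an outlier.
        return sorted(b for b, v in metric_by_broker.items() if v != 0)
    return sorted(
        broker
        for broker, value in metric_by_broker.items()
        if abs(value - center) / center > tolerance
    )


# Hypothetical load-average samples, one per broker.
load_avg = {"broker-1": 38.0, "broker-2": 11.5, "broker-3": 12.0}
outliers = find_outlier_brokers(load_avg)
print(outliers)  # ['broker-1']
```

The median is used rather than the mean so that a single misbehaving broker does not drag the baseline toward itself.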
32. Elad Eldor
Data Infrastructure TL @ironSource
Works on stability and performance tuning of Spark, Presto, Druid and Kafka clusters
linkedin.com/in/elad-eldor/
Hi there 👋
33. Kafka needs (lots of) RAM
Kafka topic with a single partition
High retention for a compacted topic
How can disks affect your Kafka cluster?
36. What’s a compacted topic?
● A topic with log compaction enabled
● Log compaction runs periodically in the background
○ Deletes older duplicate records, keeping the latest value for each key
○ Removes keys whose latest value is null (tombstone records)
● Cleaning doesn’t block producers and consumers
● Log compaction requires both RAM and CPU cycles on the brokers
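The effect of log compaction on a topic's contents can be illustrated with a toy model. This is only a sketch of the end result: the real log cleaner works segment by segment, leaves the active segment alone, and keeps tombstones around for a configurable period before removing them. The function name `compact` and the sample records are hypothetical.

```python
def compact(log):
    """Toy model of log compaction.

    `log` is a list of (key, value) records in offset order.
    Keeps only the latest record per key, then drops keys whose
    latest value is None (a tombstone).
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    return sorted(survivors)  # restore offset order


log = [
    ("user-1", "a"),
    ("user-2", "b"),
    ("user-1", "c"),   # supersedes offset 0
    ("user-2", None),  # tombstone: deletes user-2
    ("user-3", "d"),
]
compacted = compact(log)
print(compacted)  # [(2, 'user-1', 'c'), (4, 'user-3', 'd')]
```

Surviving records keep their original offsets, which is why offsets in a compacted topic can have gaps.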
38. Troubleshooting
✔ High load average, sy%, disk util% ➜ disk contention
✔ No rogue broker
✔ Cluster hosts compacted topics
✔ Topic’s retention was 24 hours
✔ Root cause: a big compacted topic with high retention
✔ High retention ➜ higher kernel CPU time and higher disk utilization
Change the retention for the compacted topic!
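The slides do not say which setting was changed or to what value. As one possible illustration, the sketch below builds a `kafka-configs.sh` command that lowers a topic's `retention.ms`. The broker address, topic name, and the 6-hour value are hypothetical, and the helper `retention_command` is not part of Kafka. Note that in Kafka `retention.ms` only takes effect when the topic's `cleanup.policy` includes `delete` (for example `compact,delete`).

```python
def retention_command(bootstrap_server, topic, retention_ms):
    """Build the argument list for altering a topic's retention.ms
    with the kafka-configs.sh tool that ships with Kafka."""
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--entity-type", "topics",
        "--entity-name", topic,
        "--alter",
        "--add-config", f"retention.ms={retention_ms}",
    ]


SIX_HOURS_MS = 6 * 60 * 60 * 1000  # hypothetical target, not from the talk

cmd = retention_command("localhost:9092", "my-compacted-topic", SIX_HOURS_MS)
print(" ".join(cmd))
```

Building the command as a list keeps it ready to pass to `subprocess.run` without shell quoting issues.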
40. A rogue Kafka broker
[Figure: two time-series panels, LOAD AVG and USER TIME (y-axis in %).]
41. Same traffic - in & out
[Figure: two time-series panels, BYTES OUT OF BROKERS and BYTES IN OF BROKERS.]
42. Why does a single broker behave differently than the others?
43. Num partitions per topic per broker
[Figure: two bar charts. NUM PARTITIONS PER TOPIC PER BROKER (Topics A to E, one bar each for Broker 1, Broker 2, and Broker 3) and NUM PARTITIONS PER BROKER (Broker 1, Broker 2, Broker 3).]
44. Many consumers on a small topic
[Figure: NUM CONSUMERS VS. TOPIC SIZE. Bar chart comparing num consumers with topic throughput (in num events/sec) for Topics A to D.]
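A simple way to spot a topic with many consumers relative to its size is to normalize the consumer count by throughput. The sketch below is illustrative only: the function name `consumers_per_kevent` and all numbers are hypothetical, not measurements from the talk.

```python
def consumers_per_kevent(topics):
    """Map each topic to its number of consumers per 1,000 events/sec.

    `topics` maps topic name -> (num_consumers, events_per_sec).
    Topics with zero throughput are skipped.
    """
    return {
        name: consumers / (events_per_sec / 1000)
        for name, (consumers, events_per_sec) in topics.items()
        if events_per_sec > 0
    }


# Hypothetical (num consumers, events/sec) per topic.
topics = {
    "topic-a": (10, 200_000),
    "topic-b": (8, 150_000),
    "topic-c": (12, 300_000),
    "topic-d": (400, 5_000),
}
ratios = consumers_per_kevent(topics)
busiest = max(ratios, key=ratios.get)
print(busiest, ratios[busiest])  # topic-d 80.0
```

A topic that stands out on this ratio by orders of magnitude is a candidate for the "many consumers on a small topic" pattern.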
45. Troubleshooting
✔ Same traffic - in all brokers
✔ High load average and us% - in a single broker
✔ No partition skew (per broker)
✔ Found partition skew (per topic and broker)
✔ Found a rogue topic
➜ A single broker is overloaded
➜ May affect all consumers and producers
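The difference between partition skew per broker and partition skew per topic and broker can be shown with a small sketch. It assumes the partition assignment is already available as `(topic, partition, broker)` tuples; how that list is obtained, the helper names, and the sample layout are all hypothetical.

```python
from collections import Counter, defaultdict


def partitions_per_topic_per_broker(assignments):
    """Count partitions per topic per broker from (topic, partition, broker) tuples."""
    counts = defaultdict(Counter)
    for topic, _partition, broker in assignments:
        counts[topic][broker] += 1
    return counts


def skewed_topics(counts, brokers, max_spread=1):
    """Return topics whose per-broker partition counts differ by more than max_spread."""
    skewed = {}
    for topic, per_broker in counts.items():
        values = [per_broker.get(b, 0) for b in brokers]
        if max(values) - min(values) > max_spread:
            skewed[topic] = dict(zip(brokers, values))
    return skewed


brokers = ["broker-1", "broker-2", "broker-3"]
# Hypothetical layout: partitions each topic has on broker-1, broker-2, broker-3.
layout = {"topic-a": [4, 1, 1], "topic-b": [0, 3, 3], "topic-c": [2, 2, 2]}

assignments = []
for topic, per_broker in layout.items():
    partition = 0
    for broker, n in zip(brokers, per_broker):
        for _ in range(n):
            assignments.append((topic, partition, broker))
            partition += 1

totals = Counter(broker for _, _, broker in assignments)
skewed = skewed_topics(partitions_per_topic_per_broker(assignments), brokers)
print(dict(totals))    # every broker hosts 6 partitions in total
print(sorted(skewed))  # ['topic-a', 'topic-b']
```

In this layout the per-broker totals are perfectly balanced, yet two topics are unevenly spread, which is exactly the case a per-broker view alone would miss.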
47. Rogue broker - checklist
Don’t look only at traffic per broker
Partition skew - per topic and broker
Consuming rate per topic
Num consumers (connections) per topic
48. Num partitions per topic per broker - general case
[Figure: two bar charts. NUM PARTITIONS PER TOPIC PER BROKER (Topics A to E, one bar each for Broker 1, Broker 2, and Broker 3) and NUM PARTITIONS PER BROKER (Broker 1, Broker 2, Broker 3).]
54. Disk util % vs. page cache hit %
[Figure: HIGH DISK UTIL VS. PAGE CACHE HIT RATIO. Time series of disk util % and page cache hit %.]
55. More RAM, less disk util%
[Figure: DISK UTIL % over time, with 128GB RAM and then with 384GB RAM.]
Tripled the RAM from 128GB to 384GB
Immediate drop in disk util% from ~43% to ~13% at peak time
56. Scenarios causing lags
Replay of a big topic
Consumers are slow
A new consumer / producer that trashes the page cache
57. Kafka needs RAM (and lots of it)
● Once Kafka starts reading from disks, it’s hard to recover from it
○ Avoid reads from disks
○ That’s true for both SAS and SSD disks
[Diagram: consumer lag, page cache hit %, and IO throughput and IOPS, connected by arrows.]
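A back-of-the-envelope way to reason about how much RAM a broker needs is to estimate how far back in time the page cache reaches: consumers that lag further behind than that are likely to be served from disk. The sketch below uses the 128GB and 384GB RAM sizes from the slides, while the 80% page-cache share and the 200 MB/s write rate are assumptions chosen purely for illustration. It also ignores everything else competing for the cache.

```python
GB = 10**9
MB = 10**6


def page_cache_window_seconds(ram_bytes, write_bytes_per_sec, cache_fraction=0.8):
    """Rough estimate of how many seconds of recently written data fit in the page cache.

    cache_fraction is the assumed share of RAM available to the page cache
    (hypothetical default; the rest goes to the JVM heap, the OS, and so on).
    """
    return ram_bytes * cache_fraction / write_bytes_per_sec


WRITE_RATE = 200 * MB  # assumed bytes written per second on one broker

before = page_cache_window_seconds(128 * GB, WRITE_RATE)
after = page_cache_window_seconds(384 * GB, WRITE_RATE)
print(f"128GB RAM: ~{before / 60:.1f} minutes of data in page cache")
print(f"384GB RAM: ~{after / 60:.1f} minutes of data in page cache")
```

Under these assumptions, tripling the RAM triples the lag a consumer can accumulate before its reads start hitting the disks.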
58. Summary
✔ High load average, cpu sy%, disk util% ➜ disk contention
✔ Remember to change a compacted topic’s retention
✔ Rogue broker?
● Don’t look only at the incoming & outgoing traffic
● Num partitions per topic per broker
● Consumption rate & num consumers
✔ Monitor disk utilization & page cache hit ratio
✔ Do not save on RAM