The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad Eldor | Kafka Summit London 2022

The details that matter:
Kafka in production, at scale

Avoiding blind spots in
your Kafka infrastructure

Or Arnon
Promoting collaboration and DevOps culture |
Leading an amazing DevOps Team
@ironSource
linkedin.com/in/oarnon/
Hi there 👋

The App Economy is a huge and fast-growing opportunity
140B apps downloaded globally in 2020
6.7B devices globally
407B size of the App Economy by 2026

The ironSource platform unlocks business success for the two core constituents of the App Economy
APP: App Developers
SDK code integrated in tens of thousands of Apps
DEVICE: Telecom Developers
Integrated on 1B+ cumulative devices
1. As of December 31, 2021

A Kafka cluster at scale
>100TB of data
5M messages per second
>1,000 consumers & producers

3 stories
Bits and bytes
Configuration time bombs
Brokers tell their stories

Brokers tell their stories
How we discovered the gap by looking back

Our evening takes a turn
[Charts over time: disk I/O read time (s), disk I/O read bytes (MiB), request queue size]

The usual suspects
Consumer/producer deployments
Increased traffic
A misbehaving broker

Finding the gap, looking back
[Charts over time: server system CPU % and disk I/O read time (s)]

Finding the gap, looking back
[Charts over time: request queue size and 99th-percentile produce latency per broker (s)]

Lesson learned
Scale your graphs properly
Replace your broker
Detect anomalies
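
One way to act on "Detect anomalies" is to compare each broker against its peers instead of reading a single chart by eye. The sketch below is only an illustration: the function name, the threshold, and the latency figures are hypothetical and do not come from the talk. It flags brokers whose metric sits far from the cluster median, using the median absolute deviation so that one bad broker does not shift the baseline.

```python
from statistics import median


def find_outlier_brokers(metric_by_broker, threshold=3.0):
    """Return brokers whose metric deviates strongly from the cluster median.

    Uses the median absolute deviation (MAD) as the yardstick, so a single
    misbehaving broker does not distort the baseline it is compared against.
    """
    values = list(metric_by_broker.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Most brokers report the same value: flag any broker that differs.
        return sorted(b for b, v in metric_by_broker.items() if v != center)
    return sorted(
        b for b, v in metric_by_broker.items()
        if abs(v - center) / mad > threshold
    )


# Hypothetical 99th-percentile produce latency per broker, in seconds.
latency = {
    "broker-1": 0.21,
    "broker-2": 0.19,
    "broker-3": 4.80,
    "broker-4": 0.22,
    "broker-5": 0.20,
}
print(find_outlier_brokers(latency))  # ['broker-3']
```

A relative check like this also sidesteps the graph-scaling problem: a broker can be flagged even when the chart's axis hides the gap.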

Configuration time bombs
How a configuration change rattled our cluster

Peak traffic behavior
[Chart over time: normalized load average]

Unexpected behavior
[Two charts over time, each comparing old brokers with new brokers; one panel is titled SERVER INTERRUPTS TOTAL]

Talking about io.threads
High io.threads ➜ increased CPU load, increased interrupts, context switches

Back to normal
[Chart over time, annotated: Aligning io.threads to 2]

3 takeaways
Monitor for configuration drifts
Monitor your change during peak traffic
Persist to code when safe
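
Monitoring for configuration drift can be as simple as diffing the effective settings of every broker against the majority. The sketch below assumes the per-broker settings have already been collected into plain dictionaries by some other means; the function name and all values are hypothetical. The slides say io.threads; the broker property that controls the I/O thread count is `num.io.threads`, which is the key used here.

```python
from collections import Counter


def find_config_drift(configs_by_broker):
    """Report, per setting, the brokers whose value differs from the majority."""
    all_keys = set()
    for cfg in configs_by_broker.values():
        all_keys.update(cfg)

    drift = {}
    for key in sorted(all_keys):
        # A broker that lacks the key entirely is recorded as None.
        values = {b: cfg.get(key) for b, cfg in configs_by_broker.items()}
        majority, _count = Counter(values.values()).most_common(1)[0]
        outliers = {b: v for b, v in values.items() if v != majority}
        if outliers:
            drift[key] = {"majority": majority, "outliers": outliers}
    return drift


# Hypothetical settings: four older brokers and two newer ones.
configs = {
    "broker-1": {"num.io.threads": "8", "num.network.threads": "3"},
    "broker-2": {"num.io.threads": "8", "num.network.threads": "3"},
    "broker-3": {"num.io.threads": "8", "num.network.threads": "3"},
    "broker-4": {"num.io.threads": "8", "num.network.threads": "3"},
    "broker-5": {"num.io.threads": "16", "num.network.threads": "3"},
    "broker-6": {"num.io.threads": "16", "num.network.threads": "3"},
}
print(find_config_drift(configs))
```

Running a check like this on a schedule, and again right after any change, would surface a mismatch before peak traffic does.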

Bits and bytes
Uncovering an underlying disk issue

Can you spot the difference in disk writes?
[Charts over time: write KB per second (avg) and write ops per second (avg)]

Can you spot the difference in network traffic?
[Charts over time: broker bytes in (avg) and broker bytes out (avg)]

iostat to the rescue
[Charts over time: read KB per second (max), write ops per second (max), read ops per second (max)]

Looking at queue size
[Chart over time, per broker (Broker 1, Broker 2, Broker 3): write request queue size (w_await)]

Looking at read/write processing time
[Charts over time, per broker (Broker 1, Broker 2, Broker 3): write processing time (ms) and read processing time (ms)]

Putting things together
➜ Slow reads and writes
➜ Capped throughput
Queue size
Even distribution ➜

Learned lessons
Dig beyond aggregate metrics
Do not assume even IO performance

3 stories combined
Keep an aligned configuration
Monitor anomalies between brokers
Watch for disk performance

Elad Eldor
Data Infrastructure TL
@ironSource
Works with stability and performance tuning
of Spark, Presto, Druid and Kafka clusters
linkedin.com/in/elad-eldor/
Hi there 👋

How can disks affect your Kafka cluster?
Kafka needs (lots of) RAM
Kafka topic with a single partition
High retention for a compacted topic

High retention for a compacted topic

High retention for a compacted topic
[Charts over time: load average, us%, sy%, and disk util %]

What’s a compacted topic?
● A topic with log compaction
● Log compaction is done in the background periodically
○ Deletes older records whose key appears again later in the log
○ Removes keys with null value (Tombstone records)
● Cleaning doesn’t block producers and consumers
● Log compaction requires both RAM and CPU cycles on the brokers

Compacted topic
Log before compaction
Offset  0   1   2   3   4   5   6   7   8
Key     K1  K2  K1  K3  K2  K4  K5  K5  K6
Value   V1  V2  V3  V4  V5  V6  V7  V8  V9
Log after compaction
Offset  2   3   4   5   7   8
Key     K1  K3  K2  K4  K5  K6
Value   V3  V4  V5  V6  V8  V9
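
The before/after logs on this slide can be reproduced with a few lines of code. This is a simplified model for illustration only, not Kafka's log cleaner: it keeps the newest record per key and drops tombstones immediately, whereas a real broker retains tombstones for a configurable period and does not compact the active segment. The function name is made up for this sketch.

```python
def compact(log):
    """Keep only the latest record for each key, in offset order.

    `log` is a list of (offset, key, value) tuples sorted by offset.
    A value of None is treated as a tombstone, and that key is dropped.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    survivors = [rec for rec in latest.values() if rec[2] is not None]
    return sorted(survivors)  # offsets are unique, so this sorts by offset


# The log from the slide, before compaction.
log = [
    (0, "K1", "V1"), (1, "K2", "V2"), (2, "K1", "V3"),
    (3, "K3", "V4"), (4, "K2", "V5"), (5, "K4", "V6"),
    (6, "K5", "V7"), (7, "K5", "V8"), (8, "K6", "V9"),
]
for offset, key, value in compact(log):
    print(offset, key, value)
```

The surviving offsets are 2, 3, 4, 5, 7 and 8, matching the "Log after compaction" rows. Offsets are never renumbered, which is why gaps appear.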

Troubleshooting
✔ High load average, sy%, disk util% ➜ disk contention
✔ No rogue broker
✔ Cluster hosts compacted topics
✔ Topic’s retention was 24 hours
✔ Root cause: a big compacted topic with high retention
✔ High retention ➜ higher kernel CPU time and higher disk utilization
Change the retention for the compacted topic!

A Kafka topic with a single partition

A rogue Kafka broker
[Charts over time: load average and user time (%)]

Same traffic - in & out
[Charts over time: bytes out of brokers and bytes in of brokers]

Why does a single broker behave differently than the others?

Num partitions per topic per broker
[Bar charts: num partitions per topic per broker (Topics A to E) and num partitions per broker (Brokers 1 to 3)]

Many consumers on a small topic
[Bar chart, num consumers vs. topic size, for Topics A to D: num consumers and topic throughput (in num events/sec)]

Troubleshooting
✔ Same traffic - in all brokers
✔ High load average and us% - in a single broker
✔ No partition skew (per broker)
✔ Found partition skew (per topic and broker)
✔ Found a rogue topic
➜ A single broker is overloaded
➜ May affect all consumers and producers

Rogue topic
Low incoming traffic
Single partition
Many consumers

Rogue broker - checklist
Don’t look only at traffic per broker
Partition skew - per topic and broker
Consuming rate per topic
Num consumers (connections) per topic
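
The "partition skew - per topic and broker" check can be sketched as a count over the partition assignment. The code below assumes the assignment is already available as (topic, partition, broker) tuples; how it is fetched is left out, and the topic names, broker ids, and function names are hypothetical. It reports topics whose partitions do not reach every broker, which is how a single-partition topic shows up even when the total partition count per broker looks balanced.

```python
from collections import Counter, defaultdict


def partitions_per_topic_per_broker(assignments):
    """Count partitions per topic and broker from (topic, partition, broker) tuples."""
    counts = defaultdict(Counter)
    for topic, _partition, broker in assignments:
        counts[topic][broker] += 1
    return counts


def skewed_topics(assignments, brokers):
    """Return topics whose partitions are not spread over every broker."""
    result = {}
    for topic, per_broker in partitions_per_topic_per_broker(assignments).items():
        # Reading a missing key from a Counter yields 0 without inserting it.
        missing = sorted(b for b in brokers if per_broker[b] == 0)
        if missing:
            result[topic] = {"hosted_on": dict(per_broker), "missing_from": missing}
    return result


# Hypothetical assignment: two evenly spread topics and one single-partition topic.
assignments = [
    ("topic-A", 0, 1), ("topic-A", 1, 2), ("topic-A", 2, 3),
    ("topic-B", 0, 1), ("topic-B", 1, 2), ("topic-B", 2, 3),
    ("topic-D", 0, 2),
]
print(skewed_topics(assignments, brokers=[1, 2, 3]))
```

Combining this output with the consuming rate and the number of consumers per topic covers the rest of the checklist: a flagged topic with many consumers is the likely rogue.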

Num partitions per topic per broker - general case
[Bar charts: num partitions per topic per broker (Topics A to E) and num partitions per broker (Brokers 1 to 3)]

Kafka cluster needs (lots of) RAM

Consumer lag - all consumers are lagging
[Chart over time: consumer lag, all partitions (millions)]

iostat - throughput
[Chart over time: iostat rMB/s]

iostat - IOPS
[Chart over time: iostat r/s]

CPU iowait %
[Chart over time: IO wait %]

Disk util % vs. page cache hit %
[Chart over time: disk util % vs. page cache hit %]

More RAM, less disk util%
[Chart over time: disk util %, with 128GB RAM and then with 384GB RAM]
Tripled the RAM: 128GB ➜ 384GB
Immediate drop in disk util % from ~43% to ~13% at peak time

Scenarios causing lags
Replay of a big topic
Consumers are slow
A new consumer / producer that trashes the page cache

Kafka needs RAM (and lots of it)
● Once Kafka starts reading from disks, it’s hard to recover from it
○ Avoid reads from disks
○ That’s true for both SAS and SSD disks
[Cycle diagram linking consumer lag, IO throughput and IOPS, and page cache hit %]

Summary
✔ High load average, CPU sy%, disk util% ➜ disk contention
✔ Remember to change a compacted topic’s retention
✔ Rogue broker?
● Don’t look only at the incoming & outgoing traffic
● Num partitions per topic per broker
● Consumption rate & num consumers
✔ Monitor disk utilization & page cache hit ratio
✔ Do not save on RAM
