The details that matter:
Kafka in production, at scale
Avoiding blind spots in
your Kafka infrastructure
Big Scale
Or Arnon
Promoting collaboration and DevOps culture |
Leading an amazing DevOps Team
@ironSource
linkedin.com/in/oarnon/
Hi there 👋
Powering the
App Economy
The App Economy is a huge and fast-growing opportunity
140B apps downloaded globally in 2020²
6.7B devices globally²
407B size of the App Economy by 2026¹
The ironSource platform unlocks business success for the two core constituents of the App Economy
APP: App Developers
SDK code integrated in tens of thousands of apps
DEVICE: Telecom Developers
Integrated on 1B+ cumulative devices
1. As of December 31, 2021
A Kafka cluster at scale
>100TB of data
5M messages per second
>1,000 consumers & producers
3 stories
Bits and bytes
Configuration time bombs
Brokers tell their stories
Brokers tell their stories
How we discovered the gap by looking back
Our evening takes a turn
[Charts over time: disk I/O read time (s), disk I/O read bytes (MiB), request queue size]
The usual suspects
Consumer/producer deployments
Increased traffic
A misbehaving broker
Finding the gap, looking back
[Charts over time: server system CPU %, disk I/O read time (s)]
[Charts over time: request queue size, produce latency 99th percentile per broker (s)]
Lessons learned
Scale your graphs properly
Replace your broker
Detect anomalies
Configuration time bombs
How a configuration change rattled our cluster
Peak traffic behavior
[Chart over time: normalized load average]
Unexpected behavior
[Charts over time, old brokers vs. new brokers: normalized load average, server interrupts total]
Talking about io.threads (num.io.threads in the broker configuration)
High io.threads ➜ Increased CPU load, increased interrupts, context switches
Back to normal
[Chart over time: normalized load average]
Aligning io.threads to 2
3 takeaways
Monitor for configuration drifts
Monitor your change during peak traffic
Persist to code when safe
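One way to watch for configuration drift is to snapshot each broker's settings and report every setting whose value is not identical across the cluster. The sketch below assumes the snapshots have already been collected into plain dictionaries. The broker names, the helper name find_config_drift and the sample values are hypothetical; only the setting names num.io.threads and num.network.threads are real broker settings.

```python
def find_config_drift(broker_configs):
    """Return {setting: {broker: value}} for settings that differ between brokers.

    broker_configs maps a broker name to a dict of setting -> value.
    A setting missing on a broker is reported with the value None.
    """
    all_settings = set()
    for config in broker_configs.values():
        all_settings.update(config)

    drift = {}
    for setting in sorted(all_settings):
        values = {broker: config.get(setting)
                  for broker, config in broker_configs.items()}
        # More than one distinct value means the brokers are not aligned.
        if len(set(values.values())) > 1:
            drift[setting] = values
    return drift


# Hypothetical snapshot: one broker was deployed with a different value.
configs = {
    "broker-1": {"num.io.threads": "2", "num.network.threads": "3"},
    "broker-2": {"num.io.threads": "2", "num.network.threads": "3"},
    "broker-3": {"num.io.threads": "16", "num.network.threads": "3"},
}

drift = find_config_drift(configs)
for setting, values in drift.items():
    print(setting, values)
```

Running a check like this on a schedule, and alerting on a non-empty result, is one possible way to catch a drifted broker before peak traffic does.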
Bits and bytes
Uncovering an underlying disk issue
Can you spot the difference in disk writes?
[Charts over time: write KB per second (avg), write ops per second (avg)]
Can you spot the difference in network traffic?
[Charts over time: broker bytes in (avg, MB), broker bytes out (avg, MB)]
iostat to the rescue
[Charts over time: read KB per second (max), write ops per second (max), read ops per second (max)]
Looking at queue size
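[Chart over time, per broker (Brokers 1-3): write request queue size. The slide labels this w_await; in iostat, w_await is the average write wait time in ms, while the queue length is reported as aqu-sz (avgqu-sz in older versions).]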
Looking at read/write processing time
[Charts over time, per broker (Brokers 1-3): write processing time (ms), read processing time (ms)]
Putting things together
Slow reads and writes
Capped throughput
Queue size
Even distribution
Lessons learned
Dig beyond aggregate metrics
Do not assume even IO performance
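To illustrate why an aggregate can hide a slow disk, the sketch below compares each broker's value against the median across brokers instead of looking only at the cluster average. The function name find_outlier_brokers, the threshold of 2x and the sample write times are all assumptions made for illustration, not figures from the talk.

```python
from statistics import mean, median


def find_outlier_brokers(metric_by_broker, ratio_threshold=2.0):
    """Return brokers whose metric is more than ratio_threshold times the median."""
    baseline = median(metric_by_broker.values())
    if baseline <= 0:
        return []
    return sorted(broker for broker, value in metric_by_broker.items()
                  if value / baseline > ratio_threshold)


# Hypothetical per-broker write processing times, in milliseconds.
write_ms = {"broker-1": 4.0, "broker-2": 5.0, "broker-3": 28.0}

cluster_average = round(mean(write_ms.values()), 1)
outliers = find_outlier_brokers(write_ms)

print(cluster_average)  # 12.3, which does not say which broker is slow
print(outliers)         # ['broker-3']
```

The median is used as the baseline here because a single slow broker pulls the mean toward itself; other baselines may suit larger clusters better.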
3 stories combined
Keep an aligned configuration
Monitor anomalies between brokers
Watch for disk performance
Elad Eldor
Data Infrastructure TL
@ironSource
Works on stability and performance tuning
of Spark, Presto, Druid and Kafka clusters
linkedin.com/in/elad-eldor/
Hi there 👋
Kafka needs (lots of) RAM
Kafka topic with a single partition
High retention for a compacted topic
How can disks affect your Kafka cluster?
High retention for a compacted topic
[Charts over time: load average, us%, sy%, disk util %]
What’s a compacted topic?
● A topic with log compaction
● Log compaction is done in the background periodically
○ Deletes older duplicate records, keeping the latest record for each key
○ Removes keys with a null value (tombstone records)
● Cleaning doesn’t block producers and consumers
● Log compaction requires both RAM and CPU cycles on the brokers
Compacted topic
Log before compaction
Offset  0   1   2   3   4   5   6   7   8
Key     K1  K2  K1  K3  K2  K4  K5  K5  K6
Value   V1  V2  V3  V4  V5  V6  V7  V8  V9
➜ Compaction
Log after compaction
Offset  2   3   4   5   7   8
Key     K1  K3  K2  K4  K5  K6
Value   V3  V4  V5  V6  V8  V9
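The before/after log on this slide can be reproduced with a small model of compaction: keep only the record with the highest offset for each key, and drop keys whose latest value is null. This is a simplified sketch, not how the broker's log cleaner is implemented; the function name compact and the tuple layout are assumptions for illustration.

```python
def compact(log):
    """Keep the latest record per key and drop tombstones.

    log is a list of (offset, key, value) tuples in offset order.
    A value of None models a tombstone record.
    """
    latest_offset = {}
    for offset, key, _value in log:
        latest_offset[key] = offset  # later records overwrite earlier ones

    return [(offset, key, value)
            for offset, key, value in log
            if latest_offset[key] == offset and value is not None]


# The log from the slide: offsets 0-8, values V1-V9.
keys = ["K1", "K2", "K1", "K3", "K2", "K4", "K5", "K5", "K6"]
log = [(offset, key, f"V{offset + 1}") for offset, key in enumerate(keys)]

compacted = compact(log)
for record in compacted:
    print(record)
```

Surviving records keep their original offsets, which is why the compacted log has gaps at offsets 0, 1 and 6.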
Troubleshooting
✔ High load average, sy%, disk util% ➜ disk contention
✔ No rogue broker
✔ Cluster hosts compacted topics
✔ Topic’s retention was 24 hours
✔ Root cause: a big compacted topic with high retention
✔ High retention ➜ higher kernel CPU time and higher disk utilization
Change the retention for the compacted topic!
A Kafka topic with a single partition
A rogue Kafka broker
[Charts over time: load average, user time (%)]
Same traffic - in & out
[Charts over time: bytes out of brokers, bytes in to brokers]
Why does a single broker behave
differently from the others?
Num partitions per topic per broker
[Chart: number of partitions per topic per broker (Topics A-E, Brokers 1-3)]
[Chart: number of partitions per broker (Brokers 1-3)]
Many consumers on a small topic
[Chart: number of consumers vs. topic size (Topics A-D; series: number of consumers, topic throughput in events/sec)]
Troubleshooting
✔ Same traffic - in all brokers
✔ High load average and us% - in a single broker
✔ No partition skew (per broker)
✔ Found partition skew (per topic and broker)
✔ Found a rogue topic
➜ A single broker is overloaded
➜ May affect all consumers and producers
Rogue topic
Low incoming traffic
Single partition
Many consumers
Rogue broker - checklist
Don’t look only at traffic per broker
Partition skew - per topic and broker
Consuming rate per topic
Num consumers (connections) per topic
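The difference between skew per broker and skew per topic and broker can be sketched with two counters over the partition assignment. The example assumes the assignment is available as (topic, partition, broker) tuples, where broker is the one serving that partition. The topic names, broker ids and helper names are hypothetical.

```python
from collections import Counter


def partition_counts(assignments):
    """Count partitions per broker and per (topic, broker) pair."""
    per_broker = Counter()
    per_topic_broker = Counter()
    for topic, _partition, broker in assignments:
        per_broker[broker] += 1
        per_topic_broker[(topic, broker)] += 1
    return per_broker, per_topic_broker


def topics_not_on_all_brokers(assignments, brokers):
    """Return {topic: [brokers serving none of its partitions]}."""
    _per_broker, per_topic_broker = partition_counts(assignments)
    topics = sorted({topic for topic, _partition, _broker in assignments})
    result = {}
    for topic in topics:
        missing = [b for b in brokers if per_topic_broker[(topic, b)] == 0]
        if missing:
            result[topic] = missing
    return result


# Hypothetical assignment: totals per broker are even, but topic-d
# lives on a single broker.
assignments = [
    ("topic-a", 0, 1), ("topic-a", 1, 2), ("topic-a", 2, 3),
    ("topic-a", 3, 1), ("topic-a", 4, 3),
    ("topic-d", 0, 2),
]
brokers = [1, 2, 3]

per_broker, _ = partition_counts(assignments)
concentrated = topics_not_on_all_brokers(assignments, brokers)

print(dict(per_broker))  # every broker serves 2 partitions
print(concentrated)      # {'topic-d': [1, 3]}
```

A topic flagged this way is only a candidate: it becomes a problem when, as in the checklist above, it also has many consumers or a high consuming rate.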
Num partitions per topic per broker - general case
[Chart: number of partitions per topic per broker (Topics A-E, Brokers 1-3)]
[Chart: number of partitions per broker (Brokers 1-3)]
Kafka cluster needs (lots of) RAM
Consumer lag - all consumers are lagging
[Chart over time: consumer lag, all partitions]
iostat - throughput
[Chart over time: iostat rMB/s]
iostat - IOPS
[Chart over time: iostat r/s]
CPU iowait %
[Chart over time: IO wait %]
Disk util % vs. page cache hit %
[Chart over time: disk util % vs. page cache hit %]
More RAM, less disk util%
[Chart over time: disk util %, with 128GB RAM and then 384GB RAM]
Tripled the RAM: 128GB ➜ 384GB
Immediate drop from ~43% to ~13% at peak time
Scenarios causing lags
Replay of a big topic
Consumers are slow
A new consumer / producer that thrashes the page cache
Kafka needs RAM (and lots of it)
● Once Kafka starts reading from disks, it’s hard to recover
○ Avoid reads from disks
○ That’s true for both SAS and SSD disks
Consumer lag ➜ IO throughput and IOPS ➜ Page cache hit %
Summary
✔ High load average, cpu sy%, disk util% ➜ disk contention
✔ Remember to change a compacted topic’s retention
✔ Rogue broker?
● Don’t look only at the incoming & outgoing traffic
● Num partitions per topic per broker
● Consumption rate & num consumers
✔ Monitor disk utilization & page cache hit ratio
✔ Do not skimp on RAM
Questions?

Advanced Test Driven-Development @ php[tek] 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 

The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad Eldor | Kafka Summit London 2022

  • 1. The details that matter: Kafka in production, at scale
  • 2. Avoiding blind spots in your Kafka infrastructure
  • 4. Or Arnon: Promoting collaboration and DevOps culture | Leading an amazing DevOps Team @ironSource. linkedin.com/in/oarnon/ Hi there 👋
  • 6. The App Economy is a huge and fast-growing opportunity: 140B apps downloaded globally in 2020; 6.7B devices globally; 407B size of the App Economy by 2026
  • 7. The ironSource platform unlocks business success for the two core constituents of the App Economy. APP: App Developers, SDK code integrated in tens of thousands of apps. DEVICE: Telecom Developers, integrated on 1B+ cumulative devices. (As of December 31, 2021)
  • 8. A Kafka cluster at scale: >100TB of data; 5M messages per second; >1,000 consumers & producers
  • 9. 3 stories: Bits and bytes; Configuration time bombs; Brokers tell their stories
  • 10. Brokers tell their stories: how we discovered the gap by looking back
  • 12. Our evening takes a turn [charts over time: disk I/O read time, disk I/O read bytes, request queue size]
  • 13. The usual suspects: consumer/producer deployments; increased traffic; a misbehaving broker
  • 14. Finding the gap, looking back [charts over time: server system CPU %, disk I/O read time]
  • 15. Finding the gap, looking back [charts over time: request queue size, produce latency 99th percentile per broker]
  • 16. Lesson learned: scale your graphs properly; replace your broker; detect anomalies
  • 17. Configuration time bombs: how a configuration change rattled our cluster
  • 18. Peak traffic behavior [chart over time: normalized load average]
  • 19. Unexpected behavior [charts over time, old brokers vs. new brokers: normalized load average, server interrupts total]
  • 20. Talking about io.threads: high io.threads ➜ increased CPU load, increased interrupts, context switches
  • 21. Back to normal [chart over time: normalized load average], after aligning io.threads to 2
  • 22. 3 takeaways: monitor for configuration drifts; monitor your change during peak traffic; persist to code when safe
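The "monitor for configuration drifts" takeaway can be illustrated with a minimal sketch. It assumes the effective configuration of each broker has already been collected into a plain dictionary; the broker names and values are hypothetical. The slides say io.threads; the sketch uses `num.io.threads`, the Apache Kafka broker property for the number of I/O threads, on the assumption that this is the setting meant.

```python
def find_config_drift(broker_configs):
    """Return {key: {broker: value}} for every key whose value differs between brokers.

    broker_configs maps a broker name to its {property: value} dictionary.
    A property missing on a broker is reported as None for that broker.
    """
    all_keys = set()
    for cfg in broker_configs.values():
        all_keys.update(cfg)

    drift = {}
    for key in sorted(all_keys):
        values = {broker: cfg.get(key) for broker, cfg in broker_configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Hypothetical snapshot: broker-3 was added later with a different setting.
configs = {
    "broker-1": {"num.io.threads": "2", "num.network.threads": "3"},
    "broker-2": {"num.io.threads": "2", "num.network.threads": "3"},
    "broker-3": {"num.io.threads": "16", "num.network.threads": "3"},
}

drift = find_config_drift(configs)
for key, values in drift.items():
    print(key, values)
```

Running a check like this on a schedule, rather than only after a change, is one way to catch old and new brokers diverging before peak traffic does.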
  • 23. Bits and bytes: uncovering an underlying disk issue
  • 24. Can you spot the difference in disk writes? [charts over time: write KB per second (avg), write ops per second (avg)]
  • 25. Can you spot the difference in network traffic? [charts over time: broker bytes in (avg), broker bytes out (avg)]
  • 26. iostat to the rescue [charts over time: read KB per second (max), read ops per second (max), write ops per second (max)]
  • 27. Looking at queue size [chart over time, brokers 1-3: write request queue size (w_await)]
  • 28. Looking at read/write processing time [charts over time, brokers 1-3: write processing time, read processing time]
  • 29. Putting things together ➜ Slow reads and writes ➜ Capped throughput Queue size Even distribution ➜
  • 30. Lessons learned: dig beyond aggregate metrics; do not assume even IO performance
  • 31. 3 stories combined: keep an aligned configuration; monitor anomalies between brokers; watch for disk performance
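"Monitor anomalies between brokers" and "dig beyond aggregate metrics" can be sketched as a per-broker comparison against the cluster median. This is an illustration only: the metric values, the broker names and the 2x threshold are hypothetical, and the talk does not say how its anomaly detection is implemented.

```python
from statistics import mean, median


def find_outlier_brokers(metric_by_broker, max_ratio=2.0):
    """Return the brokers whose metric exceeds max_ratio times the cluster median."""
    baseline = median(metric_by_broker.values())
    if baseline <= 0:
        return []
    return sorted(
        broker
        for broker, value in metric_by_broker.items()
        if value / baseline > max_ratio
    )


# Hypothetical write processing time per broker, in milliseconds.
write_time_ms = {"broker-1": 9.0, "broker-2": 10.0, "broker-3": 31.0}

# The cluster-wide average blends the slow broker with the healthy ones.
print(round(mean(write_time_ms.values()), 1))

outliers = find_outlier_brokers(write_time_ms)
print(outliers)
```

The median is used as the baseline because a single slow broker shifts the mean towards itself, which is the same effect that hides the problem in averaged dashboards.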
  • 32. Elad Eldor: Data Infrastructure TL @ironSource. Works with stability and performance tuning of Spark, Presto, Druid and Kafka clusters. linkedin.com/in/elad-eldor/ Hi there 👋
  • 33. How can disks affect your Kafka cluster? Kafka needs (lots of) RAM; Kafka topic with a single partition; High retention for a compacted topic
  • 34. High retention for a compacted topic
  • 35. High retention for a compacted topic [charts over time: load average, us%, sy%, disk util %]
  • 36. What's a compacted topic? ● A topic with log compaction ● Log compaction is done in the background periodically ○ Deletes older records that share a key with a newer record ○ Removes keys with a null value (tombstone records) ● Cleaning doesn't block producers and consumers ● Log compaction requires both RAM and CPU cycles on the brokers
  • 37. Compacted topic
      Log before compaction:
        Offset  0   1   2   3   4   5   6   7   8
        Key     K1  K2  K1  K3  K2  K4  K5  K5  K6
        Value   V1  V2  V3  V4  V5  V6  V7  V8  V9
      Log after compaction:
        Offset  2   3   4   5   7   8
        Key     K1  K3  K2  K4  K5  K6
        Value   V3  V4  V5  V6  V8  V9
  • 38. Troubleshooting: ✔ High load average, sy%, disk util% ➜ disk contention ✔ No rogue broker ✔ Cluster hosts compacted topics ✔ Topic's retention was 24 hours ✔ Root cause: a big compacted topic with high retention ✔ High retention ➜ higher kernel CPU time and higher disk utilization. Change the retention for the compacted topic!
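The before/after log on the compacted-topic slide can be reproduced with a small in-memory sketch. It is a simplified model, not the broker's log cleaner: it keeps the latest record per key and drops tombstones immediately, whereas a real broker works segment by segment and retains tombstones for a configurable period. The function name `compact` is hypothetical.

```python
def compact(log):
    """Keep only the latest record for each key and drop tombstones.

    log is a list of (offset, key, value) tuples in offset order.
    A value of None represents a tombstone record.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones

    survivors = sorted(latest.values())  # restore offset order
    return [record for record in survivors if record[2] is not None]


# The log from the slide: offsets 0..8.
keys = ["K1", "K2", "K1", "K3", "K2", "K4", "K5", "K5", "K6"]
log = [(offset, key, f"V{offset + 1}") for offset, key in enumerate(keys)]

compacted = compact(log)
for record in compacted:
    print(record)
```

The surviving offsets are 2, 3, 4, 5, 7 and 8, matching the "log after compaction" on the slide. The more records a compacted topic retains, the more data each cleaning pass has to read and rewrite, which is consistent with the CPU and disk load described in the troubleshooting slide.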
  • 39. A Kafka topic with a single partition
  • 40. A rogue Kafka broker [charts over time: load avg, user time]
  • 41. Same traffic, in and out [charts over time: bytes in of brokers, bytes out of brokers]
  • 42. Why does a single broker behave differently from the others?
  • 43. Num partitions per topic per broker [charts for brokers 1-3: num partitions per broker; num partitions per topic per broker, topics A-E]
  • 44. Many consumers on a small topic [chart, topics A-D: num consumers vs. topic size, where topic size is throughput in events/sec]
  • 45. Troubleshooting: ✔ Same traffic in all brokers ✔ High load average and us% in a single broker ✔ No partition skew (per broker) ✔ Found partition skew (per topic and broker) ✔ Found a rogue topic ➜ A single broker is overloaded ➜ May affect all consumers and producers
  • 47. Rogue broker checklist: don't look only at traffic per broker; partition skew per topic and broker; consuming rate per topic; num consumers (connections) per topic
  • 48. Num partitions per topic per broker, general case [charts for brokers 1-3: num partitions per broker; num partitions per topic per broker, topics A-E]
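The difference between partition skew per broker and partition skew per topic and broker can be shown with a short sketch. It assumes the partition leaders have already been exported as (topic, partition, broker) tuples; the topic names, broker names and the 50% threshold are hypothetical.

```python
from collections import defaultdict


def partitions_per_topic_per_broker(assignments):
    """Count partitions for each (topic, broker) pair."""
    counts = defaultdict(lambda: defaultdict(int))
    for topic, _partition, broker in assignments:
        counts[topic][broker] += 1
    return {topic: dict(per_broker) for topic, per_broker in counts.items()}


def concentrated_topics(counts, max_share=0.5):
    """Return topics where a single broker holds more than max_share of the partitions."""
    flagged = []
    for topic, per_broker in counts.items():
        total = sum(per_broker.values())
        if max(per_broker.values()) / total > max_share:
            flagged.append(topic)
    return sorted(flagged)


# Hypothetical cluster: topic-a is spread evenly, topic-d has a single partition.
brokers = ["broker-1", "broker-2", "broker-3"]
assignments = [("topic-a", p, brokers[p % 3]) for p in range(6)]
assignments.append(("topic-d", 0, "broker-1"))

counts = partitions_per_topic_per_broker(assignments)

# Per-broker totals look almost balanced (3, 2, 2) ...
totals = {b: sum(c.get(b, 0) for c in counts.values()) for b in brokers}
print(totals)

# ... but topic-d lives entirely on one broker.
flagged = concentrated_topics(counts)
print(flagged)
```

A topic flagged this way is only a problem when it is also heavily consumed, so in practice the result would be combined with the consuming rate and number of consumers per topic from the checklist.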
  • 49. Kafka cluster needs (lots of) RAM
  • 50. Consumer lag: all consumers are lagging [chart over time: consumer lag, all partitions]
  • 51. iostat throughput [chart over time: iostat rMB/s]
  • 52. iostat IOPS [chart over time: iostat r/s]
  • 53. CPU iowait % [chart over time: IO wait %]
  • 54. Disk util % vs. page cache hit % [chart over time: high disk util vs. page cache hit ratio]
  • 55. More RAM, less disk util% [chart over time: disk util %]. Tripled the RAM from 128GB to 384GB: immediate drop from ~43% to ~13% in peak time
  • 56. Scenarios causing lags: replay of a big topic; consumers are slow; a new consumer / producer that trashes the page cache
  • 57. Kafka needs RAM (and lots of it): ● Once Kafka starts reading from disks, it's hard to recover from it ○ Avoid reads from disks ○ That's true for both SAS and SSD. Related metrics: consumer lag, IO throughput and IOPS, page cache hit %
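A back-of-the-envelope sketch of why lagging consumers end up reading from disk. It assumes that the given amount of memory is fully available to the page cache and that the cache holds the most recently written data; the sizes and rates are hypothetical and are not taken from the talk.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def page_cache_window_seconds(page_cache_bytes, write_bytes_per_sec):
    """Seconds of the most recent writes that fit in the page cache."""
    return page_cache_bytes / write_bytes_per_sec


def likely_reads_from_disk(consumer_lag_seconds, window_seconds):
    """A consumer lagging further back than the window is probably served from disk."""
    return consumer_lag_seconds > window_seconds


# Hypothetical broker: 100 GiB of page cache, 200 MiB/s of incoming data.
window = page_cache_window_seconds(100 * GIB, 200 * MIB)
print(window)

print(likely_reads_from_disk(60, window))    # consumer one minute behind
print(likely_reads_from_disk(3600, window))  # consumer replaying an hour of data
```

Under these assumptions, tripling the memory triples the window, so consumers can fall three times further behind before their reads reach the disks.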
  • 58. Summary: ✔ High load average, CPU sy%, disk util% ➜ disk contention ✔ Remember to change a compacted topic's retention ✔ Rogue broker? ● Don't look only at the incoming & outgoing traffic ● Num partitions per topic per broker ● Consumption rate & num consumers ✔ Monitor disk utilization & page cache hit ratio ✔ Do not save on RAM