"With a few tweaks under the hood of your Kafka Streams implementation, you can greatly improve performance. Sound too good to be true? Well, the secret lies in understanding storage engines.
You may already know that if you're using Kafka Streams you already have a storage engine in place, but do you know what options are available to tune it for optimal performance and scalability?
This presentation discusses the importance of choosing and optimizing storage engines for Kafka Streams applications.
Outline:
- What a storage engine is and how it relates to stateful Kafka Streams
- The importance of understanding storage engines for optimal performance and scalability
- Evaluation of storage engines: overview of popular engines, including LevelDB, RocksDB, and the open-source Speedb
- Review of the 5 most relevant configurable items and how they affect performance
- Practical ways to optimize and fine-tune your storage engine
- Showcase: 2-minute drop-in replacement demonstration"
1. Getting Under the Hood of Kafka Streams: Optimizing Storage Engines to Tune Up Performance
2. Agenda
★ What is a Storage Engine?
★ State store in Kafka Streams
★ 3 tips and 1 game-changing trick to improve performance today
★ What is Speedb
4. Storage Engine
- A software component responsible for managing how data is stored, retrieved, and updated in a database or data processing system
- Plays a crucial role in determining the performance and efficiency of data storage and retrieval operations
- Embedded in the application software stack
- Data management capabilities: snapshots, transactions, ordered data, iterators
6. Stateful Kafka Streams
- Kafka Streams uses a storage engine for storing state
- Stateful processing makes it possible to process the current event in the context of other events
- Tracks information over time, in real-time data processing
Examples of state-storing operations:
- count() - how many times have we seen a key in the event stream
- aggregate() - combines the data you've seen so far with incoming events; the events don't have to be of the same type
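Conceptually, a count() over a keyed event stream keeps a per-key counter in a state store and updates it one event at a time. A minimal plain-Java sketch of that idea, using a HashMap as a stand-in for the state store (in the real Kafka Streams DSL this would be groupByKey().count()):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatefulCountDemo {

    // Stand-in "state store": key -> number of times seen so far.
    static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> store = new HashMap<>();
        for (String key : keys) {
            // Read-modify-write against the state store, one event at a time.
            store.merge(key, 1L, Long::sum);
        }
        return store;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countByKey(List.of("user-a", "user-b", "user-a"));
        System.out.println(counts.get("user-a")); // 2
        System.out.println(counts.get("user-b")); // 1
    }
}
```

Every such read-modify-write goes through the storage engine, which is why its tuning matters so much for stateful workloads.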
7. RocksDB: Kafka's default storage engine
- The default storage engine in Kafka Streams is RocksDB
- RocksDB is a popular key-value storage engine, developed by Facebook
- A fork of LevelDB, by Google
- LSM-tree implementation
- Better than a B-tree on write-intensive workloads
- Very complex; it has dozens of parameters to configure
8. Why does it matter?
- The storage engine resides in the data path and has a major impact on overall performance
- Handles all put/get/delete/update operations
- Example: state restore during scale-out or recovery
(chart: state restore time, lower is better)
9. 3 tips to increase your stateful streaming performance today, and 1 game-changing trick
10. Speedb/RocksDB High-Level Architecture
(diagram: writes go to the memtable; when the memtable is full it becomes an immutable memtable and is flushed to an SST file in L0; levels L0 through Ln hold progressively more SST files; reads are served from the memtable, or from the SST files when the data is not in the memtable)
11. 1. Write buffer size (memtable)
- Data is written to the write buffer
- When the memtable is full, the data is flushed to disk
- The size of the memtable affects the flush frequency and the file size
(diagram: writes fill memtables mem1-mem3; a full memtable is flushed to SST files in L0)
12. Write Buffer Size - Considerations
Larger write buffer:
- Fewer flush operations
- Reduced write latency
- Lower write amplification (less data written to disk)
- Improved read performance for recent data
- Consumes extra memory
Smaller write buffer:
- More frequent flushes
- Higher write amplification
- Increased write latency
- May require more frequent disk access for recent data
13. Write Buffer Size - db_bench Test Results
Write amplification:
- Write amplification changed from 11 to 8 (30% lower)
Ops:
- Up to 50% increase in write performance
- 20% more on average
(chart: write amplification, 30% decrease, lower is better)
14. P99 Comparison: 64MB vs 16MB

TEST NAME                  | P99 - 64MB | P99 - 16MB | P99 - diff
readrandomwriterandom_5    | 1940.18    | 2588.6     | -25.05%
readrandomwriterandom_50   | 865.01     | 1285.75    | -32.72%
readrandomwriterandom_70   | 858.19     | 1404.23    | -38.89%

Up to ~40% decrease in P99 latency when the write buffer size is set to 64MB
15. Write Buffer Size
For a total increase in RAM footprint of (64MB recommended - 16MB default) × 4 (max memtable number) per CF, i.e. around 200MB, you gain a 30% performance increase.
So, if you have a low number of partitions or a lot of free memory, this would be good for you.
Note: it's not system-wide; you can increase one specific CF (partition) and not the rest.
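In Kafka Streams, RocksDB options like the write buffer size are set through a RocksDBConfigSetter. A minimal sketch, assuming the standard kafka-streams and rocksdbjni APIs (the class name and store-name prefix below are illustrative):

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

// Registered via StreamsConfig:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);
public class CustomRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // Raise the write buffer from the 16MB default to the recommended 64MB.
        // storeName lets you apply this to one specific store (CF/partition) only.
        if (storeName.startsWith("my-heavy-store")) { // illustrative store name
            options.setWriteBufferSize(64 * 1024 * 1024L);
        }
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing allocated here, so nothing to release.
    }
}
```

Because setConfig is called per store, the per-CF tuning mentioned in the note above falls out naturally from checking storeName.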
16. 2. Performance vs Memory Tradeoff
- In a heavy-write workload scenario the write buffer cannot handle the heavy write rate
- There are 2 options:
  - Consume extra memory to handle the new writes
  - Delay the writes, potentially running into application stalls
(diagram: max memtables = 2)
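The memory side of this tradeoff is governed by how many memtables a store may hold before writes stall. A hedged sketch using the rocksdbjni Options API (the values are illustrative, not a recommendation):

```java
import org.rocksdb.Options;

public class MemtableTradeoff {

    // Illustrative tuning of the memory-vs-stall tradeoff.
    static Options tuneMemtables(Options options) {
        // Allow more in-flight memtables: consumes extra memory,
        // but absorbs write bursts without stalling the application.
        options.setMaxWriteBufferNumber(4); // the slide's diagram shows 2
        // Start flushing as soon as one memtable is full.
        options.setMinWriteBufferNumberToMerge(1);
        return options;
    }
}
```

With a lower max the engine protects memory but delays writes sooner; with a higher max it trades RAM for burst absorption.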
18. Pinning vs. LRU: Performance vs Memory Tradeoff
Pinning - Performance:
- Pin index and filter blocks for better performance
- RocksDB consumes extra memory
- Can lead to OOM
LRU - Memory:
- Use the LRU cache for filter and index blocks
- Performance impact when data needs to be loaded into the cache
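Both sides of this tradeoff are configured on the table format. A minimal sketch, assuming a recent rocksdbjni BlockBasedTableConfig API (the cache size is illustrative):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

public class PinningTradeoff {

    static Options configure(Options options, boolean preferPerformance) {
        BlockBasedTableConfig table = new BlockBasedTableConfig();
        if (preferPerformance) {
            // Pinning side: keep index/filter blocks resident outside the
            // block cache for speed. Risk: unbounded extra memory, possible OOM.
            table.setCacheIndexAndFilterBlocks(false);
        } else {
            // LRU side: index/filter blocks compete for block-cache space.
            // Bounded memory, but reads may stall while blocks are reloaded.
            table.setCacheIndexAndFilterBlocks(true);
            table.setBlockCache(new LRUCache(32 * 1024 * 1024L)); // 32MB, illustrative
            // Keep at least L0's filter/index blocks pinned to soften the hit.
            table.setPinL0FilterAndIndexBlocksInCache(true);
        }
        return options.setTableFormatConfig(table);
    }
}
```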
19. 3. Compaction Method
- Periodic, multi-threaded background process
- Merges SST files: removes duplicate or overwritten keys
- Reorders the LSM tree for better performance and lower space usage (garbage collection)
20. Compaction Types: Universal vs. Leveled
- Write rate: Universal > Leveled
- Space amplification: Universal > Leveled
- Write amplification: Universal > Leveled
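Switching the compaction method is a one-line option in rocksdbjni; a sketch (inside a Kafka Streams RocksDBConfigSetter this would go in setConfig):

```java
import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;

public class CompactionChoice {

    static Options useLeveledCompaction(Options options) {
        // Leveled compaction: lower space and write amplification,
        // at the cost of some write throughput.
        return options.setCompactionStyle(CompactionStyle.LEVEL);
    }

    static Options useUniversalCompaction(Options options) {
        // Universal compaction: higher write rate, more space consumed.
        return options.setCompactionStyle(CompactionStyle.UNIVERSAL);
    }
}
```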
21. Universal vs. Leveled Compaction
Results (Leveled compaction):
- 60% less space amplification
- 22% less write amplification
Configuration:
- 80 million keys of 1KB
- WB size: 16MB
(chart: amplification, lower is better)
22. Universal vs. Leveled Compaction (WB size 16MB)
Results - Leveled compaction provides:
- 12% improvement in mixed workload
- 38% improvement in random seek
- 10% less write performance
Configuration:
- 80 million keys of 1KB
- WB size: 16MB
23. Universal vs. Leveled Compaction (WB size 64MB)
Results:
- 10% improvement with Leveled compaction + 64MB write buffer size
Test:
- Overwrite (100% random writes)
26. Speedb Open Source
- Speedb open source is a community-led project
- A fork of RocksDB, fully compatible
- Rebased onto the latest RocksDB regularly
(diagram: community-driven, resource utilization, high and stable performance, developer experience; WAL note: data that has not been flushed is secured in the WAL but not yet "cleaned")
27. Performance Stabilization: Delayed Write
- A mechanism to slow down writes when a certain threshold is reached
- Speedb delays writes gradually to avoid stalls and performance instability
28. Global Delayed Write
- Every Kafka Streams partition is translated to a RocksDB instance
- Partitions allow parallel processing
- The global delayed write takes all instances into account when deciding on write-rate changes
30. Write Buffer Manager
- The write buffer manager (WBM) limits the sum of all memtables so it does not exceed its configured size
- The problem: when the WBM reaches 90% usage it causes a huge performance issue - serious impact on performance, deadlocks, etc.
- Without stalls, the size is not enforced
- Speedb offers a new write buffer manager that keeps performance stable while staying within the memory boundaries
34. Static Pinning
- Pinning index and filter blocks in memory gives better performance than the LRU cache
- The risk with RocksDB is getting into an OOM condition
- Speedb adds a safety belt that provides the benefit of pinning without the memory risk
37. Speedb Enterprise
- Based on Speedb OSS
- For high-scale systems with SLAs
Includes:
- Adaptive multi-dimensional compaction
- Adaptive media throughput control
- Professional services and support
38. Why Speedb Enterprise?
- Removes the data size limitation (~100GB)
- Reduced CPU usage due to lower write amplification
- No IO hiccups during compaction
- Eliminates the performance degradation when the data set is ±20GB
- Improved SSD endurance due to lower write amplification
44. Summary
1. Increase the write buffer size from 16MB to 64MB
2. Memory vs. performance tradeoff
3. Change the compaction method to Leveled
4. Use the Speedb storage engine in your Kafka Streams application
45. How to replace RocksDB with Speedb?
1. Download a compiled Kafka Streams version with Speedb
2. Compile it yourself from source
3. Replace the RocksDB lib with the Speedb lib
Follow the doc for full instructions: Speedb GitHub
46. TheHive
Speedb serves as a hive where Speedb/RocksDB users and contributors can collaborate on the development of new storage-engine capabilities to address the needs of modern, data-intensive workloads
48. P99
Due to its optimization for small-object writes, RocksDB is typically most popular in state stores, metadata, caching, indexing, and similar applications: supporting databases (like Redis on Flash), applications, and event streaming platforms like Kafka Streams, Apache Flink, or Apache Spark.
RocksDB write performance suffers when the database size exceeds 50GB. In this case the write amplification may reach a very large number, which means the system needs to do many more reads and writes for each application write.
49. Compaction Method
- L0: stores recent data, sorted by flush time
- L1-Lmax: data sorted by key
- Lmax: oldest data
50. Bloom Filter
- The data is always sorted - optimal for sequential reads
- A Bloom filter improves random-read performance
- Reduces the number of reads from disk
- Returns either "possibly in set" or "definitely not in set"
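The two possible answers can be demonstrated with a tiny plain-Java sketch: a BitSet plus double hashing. The parameters here are illustrative and not RocksDB's actual filter implementation:

```java
import java.util.BitSet;

public class BloomDemo {
    private static final int NUM_BITS = 1024;
    private static final int NUM_HASHES = 3;

    private final BitSet bits = new BitSet(NUM_BITS);

    // Derive the i-th bit position for a key via simple double hashing.
    private static int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.reverse(h1) | 1; // make the second hash odd
        return Math.floorMod(h1 + i * h2, NUM_BITS);
    }

    public void add(String key) {
        for (int i = 0; i < NUM_HASHES; i++) bits.set(position(key, i));
    }

    // false => "definitely not in set" (no false negatives are possible)
    // true  => "possibly in set" (false positives can occur)
    public boolean mightContain(String key) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bits.get(position(key, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomDemo bloom = new BloomDemo();
        bloom.add("key-1");
        System.out.println(bloom.mightContain("key-1")); // true: possibly in set
    }
}
```

Because a negative answer is definitive, the storage engine can skip reading an SST file from disk entirely whenever its filter says "definitely not in set".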