(Hugh O'Brien, Jet.com) Kafka Summit SF 2018
You’re doing disk IO wrong, let ZFS show you the way. ZFS on Linux is now stable. Say goodbye to JBOD, to directories in your reassignment plans, to unevenly used disks. Instead, have 8K Cloud IOPS for $25, SSD speed reads on spinning disks, in-kernel LZ4 compression and the smartest page cache on the planet. (Fear compactions no more!)
Learn how Jet’s Kafka clusters squeeze every drop of disk performance out of Azure, all completely transparent to Kafka.
-Striping cheap disks to maximize instance IOPS
-Block compression to reduce disk usage by ~80% (JSON data)
-Instance SSD as the secondary read cache (storing compressed data), eliminating >99% of disk reads and safe across host redeployments
-Upcoming features: Compressed blocks in memory, potentially quadrupling your page cache (RAM) for free
We’ll cover:
-Basic Principles
-Adapting ZFS for cloud instances (gotchas)
-Performance tuning for Kafka
-Benchmarks
3. What You Should Tell Your Boss
1. ZFS makes Kafka faster
2. ZFS makes Kafka cheaper
3. ZFS works on Linux now
4. What You Should Say If They Ask How
1. Broker read perf dominated by the FS cache
a. ZFS’ algorithm improves hit rates
2. ZFS can make clever use of I/O devices
a. Use fast instance SSDs as a secondary cache
b. Stripe cheap HDDs to meet write needs
5. Who are you?
Why are you talking to me?
● Hugh [hew, hue], Irish
● Responsible for Kafka at Jet.com
● Opinions are my own, etc.
● Forgive me if I say zed-eff-ess
6. Overview
What don’t I already know?
1. Is Kafka Redis?
2. Broker I/O Modes
3. ZFS
4. I/O
5. HowTo
6. Demo
7. Caveats
8. Is Kafka Redis?
From redis.io:
“Redis is an open source ... in-memory data structure store, used as a ... message
broker... Redis has built-in replication ... LRU eviction, transactions ... on-disk
persistence, and provides high availability”
10. Why is Redis limited to memory?
● Memory is fast (bandwidth, latency, etc.)
● Memory is always fast
● Memory is volatile
● Memory is expensive
13. So, Is Kafka Redis?
No. Obviously. Disks change the equation.
But maybe, if we’re clever, sometimes it can be.
Brokers have memory too.
14. Broker I/O Modes
When are we Redis?
1. Log Appends
2. Live Consumers
3. Lagging Consumers
4. Downconversion Consumers
5. Compaction
15. Straw Man Filesystem Cache (pagecache)
● OS retains recently read disk data in memory
● Fast access if read again
● If no unused memory, no cache
● Cache discards old data as new data comes in
● Cache much smaller than disk, many reads miss the cache
16. I/O 1 - Log Appends
● Messages arrive over the network
● Kafka appends them directly to the active log segment, but may also:
○ Change timestamps
○ Convert from old MessageSet format to new RecordBatch (pre 0.11)
○ Compress/Decompress/Recompress the batch
● Async writes mean it’s up to the OS when to send to disks
● Consistent performance, limited by:
○ OS write buffer, size, utilisation
○ Disk throughput
● Are we Redis?
17. I/O 1 - Log Appends
● How many times is that data read?
○ Once per replica
○ Once per subscribed client
○ Once per compaction
● It’s definitely going to be in the cache, right?
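Because appends are asynchronous, the kernel's writeback thresholds decide when data actually reaches the disks on a stock Linux filesystem (ZFS instead batches writes into its own transaction groups). A quick way to inspect them; the /proc paths are standard on Linux, and the values you see are distribution defaults, not a recommendation:

```shell
# Print the kernel writeback (dirty page) thresholds that control
# when async writes are flushed to disk. The ratios are % of RAM.
for p in dirty_background_ratio dirty_ratio dirty_expire_centisecs; do
  printf '%s = %s\n' "$p" "$(cat /proc/sys/vm/$p)"
done
```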
18. I/O 2 - ISRs / Live Consumers (CGLag ~ 0)
● Client reads from recently written partition
○ Kafka uses java.nio TransferTo / a.k.a. sendfile(2)
○ OS level Zero-Copy file to socket transfer
○ Data very likely in pagecache
○ Really a memory -> network operation
● Are we Redis?
● Leaves disks free to focus on writes
19. I/O 3 - Lagging Consumers
● Client reads from partition
○ Kafka uses java.nio TransferTo / a.k.a. sendfile(2)
○ Zero-Copy file to socket transfer
○ Data almost certainly not in pagecache
○ Consumer is stalled on disk reads
● Are we Redis?
○ No, we’re NFS
● Disks now servicing reads instead of writes
20. I/O 4 - Downconversion Consumers
● Old consumer reads from partition
○ Consumer is on an old client
○ Data may or may not be in pagecache
○ Broker reads data from disk into broker heap, cache is reduced
○ Broker signals kernel to send data
○ Kernel copies data from broker heap to kernel space, cache is reduced
○ Kernel sends the data to the client, data is held until transfer completes
○ Process repeats for each old consumer, even for the same data
○ Slow consumers can eventually cause out-of-memory errors
● Are we Redis?
○ We’re MySQL
21. I/O 5 - Log Compaction
● Triggered when a log’s dirty ratio grows over the set tipping point (min.cleanable.dirty.ratio)
● Broker reads entire log segment
● Runs compaction process, consuming much heap (i.e. cache)
● Writes out compacted log segment
● How many times is this data read?
○ Is there a way to avoid caching this?
● Are we Redis?
○ ¯\_(ツ)_/¯
22. When Can We Be Redis?
1. Non-lagging consumers / replicas
2. Appends with write buffer capacity
Since one write is often read N times, reads tend to dominate
If we can serve from memory, what can we optimise so that we do?
23. Pathological Case
● Consumer performs full replay on old topic (maybe downconverting too)
● It experiences 100% pagecache miss rate
● Disk IOPS spent on reads not available for writes
○ Produce operations slow
● LRU pagecache caches this single use data, evicting recent data
○ Fast consumers now see increased pagecache misses
○ These hit disk
○ Fewer IOPS for writes as before, and now fewer for reads too, so more stalls
● Soon many users are stalled on disk IO, even for relatively recent writes
24. Ideal Case
● Most consumers stay relatively up to date with producers
○ Most reads are cache hits
○ Disks free to focus on writes
● Consumers who lag and miss cache do not impact write performance
○ Data comes from a secondary cache
● Log compactions do not cause cache evictions nor increase misses
○ Single scans of old data are not seen as cache worthy
● Full replay consumers do not impact cache performance for others
○ As above
26. ZFS Features
● Pooled Storage
● Automatic Checksumming
● Deduplication
● Compression
● Disk Striping
● ARC
● L2ARC
● RAID-N
● Copy-on-write (no fsck)
● Lightweight datasets
● Quotas
● Integrated CIFS, NFS, iSCSI
● Virtual Volumes
● Encryption
● ACLs
● Snapshots
● Clones
● Arbitrary device trees
● Send / Receive datasets
● SLOG
27. ZFS History
● 2001 - Originally from Solaris (Sun’s OS)
● 2005 - Open sourced (CDDL) as part of OpenSolaris
● 2006 - Linux didn’t use it as CDDL != GPL (FUSE port available)
● 2007 - Picked up by FreeBSD, Apple (briefly)
● 2010 - Oracle closed OpenSolaris, yielded Illumos, OpenZFS
● 2015 - Canonical hired lawyers, decided CDDL == GPL
● 2016 - Available natively in Ubuntu 16.04+
29. ZFS Superpower 1: The ARC
1. List of recently cached data
2. List of recently cached data accessed two or more times
3. List of data evicted from 1
4. List of data evicted from 2
● Take a given amount of cache space, partition it in two at a point c
○ Below c is used for list 1, above c for list 2
● Every time you miss, check lists 3 and 4 to see if you recently evicted that data
○ If you did, move c to favour keeping that type of data
● Scan resistant, protects from replays / compactions
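On a ZFS on Linux host you can watch the ARC's size, target, and hit/miss counters directly; a minimal sketch (the kstat path below is the standard ZFS on Linux interface and requires the ZFS module to be loaded):

```shell
# Report ARC size, target size (c), and hit/miss counters from the
# ZFS kstat interface. Column 3 of each kstat row is the value.
awk '$1 ~ /^(size|c|hits|misses|mru_hits|mfu_hits)$/ {print $1, $3}' \
  /proc/spl/kstat/zfs/arcstats
```

Comparing mru_hits against mfu_hits shows which side of the adaptive split point c is earning its keep under your workload.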
30. ZFS Superpower 1: The ARC
Credit: ARC paper, linked previously
● Results are extremely workload dependent
● Kafka’s workload is very favourable
31. ZFS Superpower 2: The L2ARC
● Set a storage device to act as a Level 2 ARC
● Temporary (ephemeral) instance SSDs on cloud VMs are perfect
● Increase ARC size by ~200GB
○ Slower than memory
○ Faster than disks
○ Does not steal throughput from disks
32. ZFS Superpower 2: The L2ARC
● Not strictly a second tier of the ARC
● Bad idea to tie ARC evictions to disk speed
● Instead, a process scavenges data that is likely to be evicted soon
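Attaching an instance SSD as an L2ARC is a single command; a sketch, assuming the pool is named kafka and the ephemeral SSD appears as /dev/sdb (both names are placeholders for your environment):

```shell
# Add the instance SSD as a Level 2 ARC (read cache) device.
# The 'cache' keyword is essential -- without it the disk joins
# the pool permanently (see the caveats later in this deck).
sudo zpool add kafka cache /dev/sdb

# Verify it appears under the 'cache' section of the pool layout.
zpool status kafka
```

Because the L2ARC holds only cached copies, losing the ephemeral disk on a host redeployment costs you warm cache, not data.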
34. ZFS Superpower 3: LZ4
● LZ4 is so fast it’s free
● Disk throughput is increased by compression factor
● UTF-8 JSON achieves around 5x
● Compressed blocks stored in ARC, L2ARC
○ Increases hit rates
● Still better to have producer compress first
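Enabling LZ4 and checking what it achieves is two commands; a sketch assuming a dataset named kafka/logs (a placeholder name):

```shell
# Turn on inline LZ4 compression for the Kafka log dataset.
sudo zfs set compression=lz4 kafka/logs

# After some data lands, report the achieved compression ratio,
# e.g. a value like '5.02x' for the ~5x JSON figure above.
zfs get -H -o value compressratio kafka/logs
```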
35. ZFS Superpower 4: Prefetch
● Uses idle disk time to pre-load the ARC
● Request a block? Get the next one just in case
● Read that? Better get the next two
● Read those? ...
● Extremely beneficial to sequential read streams
○ A.K.A. every Kafka consumer
● Increases hit rates
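Prefetch is on by default in ZFS on Linux; you can confirm that and watch its effectiveness via the module parameter and the zfetch kstats (both paths are the standard ZFS on Linux interfaces):

```shell
# 0 means the prefetcher is active.
cat /sys/module/zfs/parameters/zfs_prefetch_disable

# Prefetch effectiveness: hits vs misses.
awk '$1 ~ /^(hits|misses)$/ {print $1, $3}' \
  /proc/spl/kstat/zfs/zfetchstats
```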
37. Disk I/O in 30 seconds
● IOP = Disk read or write
● Disks have an IOP latency which bounds their IOPS
○ Local SSD : Very fast
○ Remote SSD: Less Fast
○ Remote HDD: Not Fast
● IOP max size determined by disk type
● Throughput = IOPS × IOP size
● Spinning disks care about IOPs to random vs. sequential sectors
● ZFS handles all of this for you
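The throughput formula is worth a worked example. Taking one standard 500-IOPS disk and assuming a 256 KB max IOP size (an assumed figure; check your disk tier's documentation):

```shell
# Theoretical ceiling: Throughput = IOPS x IOP size.
iops=500
iop_size_kb=256
echo "$(( iops * iop_size_kb / 1024 )) MB/s"   # prints "125 MB/s"
```

In practice the provider's per-disk throughput cap usually binds first, which is exactly why striping many small disks raises the real ceiling.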
38. Cloud I/O Options (Azure, East US 2) - 1TB
Make use of ZFS striping to use many disks
Note: Latency not shown. Transaction costs can be reduced with ZFS write batching.
Type          Layout             IOPS             Txn Fee    Total
Standard HDD  32x 32GB @ $1.54   32 x 500 = 16k   Yes, ~$25  $75
Standard SSD  8x 128GB @ $9.60   8 x 500 = 4k     Yes, ~$25  $102
Premium SSD   1x 1TB @ $123      5k               No         $123
Ultra SSD     ?                  ?                ?          Lots
Instance SSD  1x 200GB           12k              No         Free
40. Create the VM
1. Attach as many disks as possible
2. If using Azure, do not use the ‘S’ series
a. Reduced instance SSD size
3. If using Azure, do not enable ‘Host Disk Caching’
a. We have the better cache
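With the disks attached, building the striped pool is one command; a minimal sketch of the layout from the cost table, assuming four data disks at /dev/disk/azure/scsi1/lun0 through lun3 (device paths and the kafka name are placeholders -- a real 32-disk stripe simply lists more devices):

```shell
# Create a striped (RAID-0) pool across the attached data disks.
# No raidz/mirror keyword means plain striping; redundancy is the
# cloud provider's job here.
sudo zpool create kafka \
  /dev/disk/azure/scsi1/lun0 \
  /dev/disk/azure/scsi1/lun1 \
  /dev/disk/azure/scsi1/lun2 \
  /dev/disk/azure/scsi1/lun3

# Dataset for the broker's logs, with LZ4 on and atime updates off.
sudo zfs create -o compression=lz4 -o atime=off kafka/logs
```

Kafka's log.dirs would then point at the dataset's mountpoint (/kafka/logs by default), and the setup is completely transparent to the broker.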
55. Things not to do on ZFS 1
● Do not use a separate device for the Write-Ahead-Log
○ Called the ZFS Intent Log / ZIL
○ Basically Journalling
○ Separate device known as an SLOG
● Most Kafka writes are async, so it’s not going to benefit you
● If the device is lost it can be tricky to recover
56. Things not to do on ZFS 2
● Do not use the deduplicating feature
○ Huge memory hog
○ Means less ARC for pagecache
● Why is there duplicated data anyway?
○ Fix the problem at source
57. Things not to do on ZFS 3
● Do not add a temporary instance disk to your main pool
○ Easy to do if you forget the ‘cache’ keyword
● You cannot remove disks from a zpool
○ You’re forever bound to that particular host
58. Things not to do on ZFS 4
● Do not create ZFS snapshots if you use retention.bytes
● Deleted segments stay referenced by the snapshot, so space is never reclaimed
● You will run out of space
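A quick way to confirm no snapshots are pinning deleted segments (assuming the placeholder kafka pool name from earlier):

```shell
# List any snapshots under the pool; each one pins every log
# segment that existed when it was taken, defeating retention.bytes.
zfs list -t snapshot -r kafka

# Destroy a stray snapshot so space can be reclaimed
# (the snapshot name here is a placeholder):
# sudo zfs destroy kafka/logs@mysnap
```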
59. Things not to do on ZFS 5
● Do not create a raidz pool
○ Your cloud provider is handling data redundancy for you
○ Holdover from physical disks
60. Future Ideas
● A mirror pool of instance SSD and standard HDD
○ Limited size, but very fast and recoverable on VM loss. Like SSD Redis?
● Does setting copies=2 increase read speed with multiple disks?
○ At the cost of storage capacity
○ Could also do this with mirrors
● Can we use Kafka’s replication to safely have larger write buffers?
● Can Kafka skip startup verification given that the data is always consistent?
● As Kafka is append only, can the ZFS record size be increased efficiently?