(Hugh O'Brien, Jet.com) Kafka Summit SF 2018
You’re doing disk IO wrong, let ZFS show you the way. ZFS on Linux is now stable. Say goodbye to JBOD, to directories in your reassignment plans, to unevenly used disks. Instead, have 8K Cloud IOPS for $25, SSD speed reads on spinning disks, in-kernel LZ4 compression and the smartest page cache on the planet. (Fear compactions no more!)
Learn how Jet’s Kafka clusters squeeze every drop of disk performance out of Azure, all completely transparent to Kafka.
-Striping cheap disks to maximize instance IOPS
-Block compression to reduce disk usage by ~80% (JSON data)
-Instance SSD as the secondary read cache (storing compressed data), eliminating >99% of disk reads and safe across host redeployments
-Upcoming features: Compressed blocks in memory, potentially quadrupling your page cache (RAM) for free
We’ll cover:
-Basic Principles
-Adapting ZFS for cloud instances (gotchas)
-Performance tuning for Kafka
-Benchmarks
3. What You Should Tell Your Boss
1. ZFS makes Kafka faster
2. ZFS makes Kafka cheaper
3. ZFS works on Linux now
4. What You Should Say If They Ask How
1. Broker read perf dominated by the FS cache
a. ZFS’ algorithm improves hit rates
2. ZFS can make clever use of I/O devices
a. Use fast instance SSDs as a secondary cache
b. Stripe cheap HDDs to meet write needs
5. Who are you?
Why are you talking to me?
● Hugh [hew, hue], Irish
● Responsible for Kafka at Jet.com
● Opinions are my own, etc.
● Forgive me if I say zed-eff-ess
6. Overview
What don’t I already know?
1. Is Kafka Redis?
2. Broker I/O Modes
3. ZFS
4. I/O
5. HowTo
6. Demo
7. Caveats
8. Is Kafka Redis?
From redis.io:
“Redis is an open source ... in-memory data structure store, used as a ... message
broker... Redis has built-in replication ... LRU eviction, transactions ... on-disk
persistence, and provides high availability”
10. Why is Redis limited to memory?
● Memory is fast (bandwidth, latency, etc.)
● Memory is always fast
● Memory is volatile
● Memory is expensive
13. So, Is Kafka Redis?
No. Obviously. Disks change the equation.
But maybe, if we’re clever, sometimes it can be.
Brokers have memory too.
14. Broker I/O Modes
When are we Redis?
1. Log Appends
2. Live Consumers
3. Lagging Consumers
4. Downconversion Consumers
5. Compaction
15. Straw Man Filesystem Cache (pagecache)
● OS retains recently read disk data in memory
● Fast access if read again
● If no unused memory, no cache
● Cache discards old data as new data comes in
● Cache much smaller than disk, many reads miss the cache
16. I/O 1 - Log Appends
● Messages arrive over the network
● Kafka appends them directly to the active log segment, but may also:
○ Change timestamps
○ Convert from old MessageSet format to new RecordBatch (pre 0.11)
○ Compress/Decompress/Recompress the batch
● Async writes mean it’s up to the OS when to send to disks
● Consistent performance, limited by:
○ OS write buffer, size, utilisation
○ Disk throughput
● Are we Redis?
17. I/O 1 - Log Appends
● How many times is that data read?
○ Once per replica
○ Once per subscribed client
○ Once per compaction
● It’s definitely going to be in the cache, right?
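Because appends are asynchronous, the kernel's writeback thresholds decide when data actually reaches the disks on a stock Linux filesystem (ZFS instead batches writes into its own transaction groups). A quick way to inspect them; the /proc paths are standard on Linux, and the values you see are distribution defaults, not a recommendation:

```shell
# Print the kernel writeback (dirty page) thresholds that control
# when async writes are flushed to disk. The ratios are % of RAM.
for p in dirty_background_ratio dirty_ratio dirty_expire_centisecs; do
  printf '%s = %s\n' "$p" "$(cat /proc/sys/vm/$p)"
done
```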
18. I/O 2 - ISRs / Live Consumers (CGLag ~ 0)
● Client reads from recently written partition
○ Kafka uses java.nio TransferTo / a.k.a. sendfile(2)
○ OS level Zero-Copy file to socket transfer
○ Data very likely in pagecache
○ Really a memory -> network operation
● Are we Redis?
● Leaves disks free to focus on writes
19. I/O 3 - Lagging Consumers
● Client reads from partition
○ Kafka uses java.nio TransferTo / a.k.a. sendfile(2)
○ Zero-Copy file to socket transfer
○ Data almost certainly not in pagecache
○ Consumer is stalled on disk reads
● Are we Redis?
○ No, we’re NFS
● Disks now servicing reads instead of writes
20. I/O 4 - Downconversion Consumers
● Old consumer reads from partition
○ Consumer is on an old client
○ Data may or may not be in pagecache
○ Broker reads data from disk into broker heap, cache is reduced
○ Broker signals kernel to send data
○ Kernel copies data from broker heap to kernel space, cache is reduced
○ Kernel sends the data to the client, data is held until transfer completes
○ Process repeats for each old consumer, even for the same data
○ Slow consumers can eventually cause out-of-memory errors
● Are we Redis?
○ We’re MySQL
21. I/O 5 - Log Compaction
● Triggered when a log’s dirty ratio grows over the set tipping point (min.cleanable.dirty.ratio)
● Broker reads entire log segment
● Runs compaction process, consuming much heap (i.e. cache)
● Writes out compacted log segment
● How many times is this data read?
○ Is there a way to avoid caching this?
● Are we Redis?
○ ¯\_(ツ)_/¯
22. When Can We Be Redis?
1. Non-lagging consumers / replicas
2. Appends with write buffer capacity
Since one write is often read N times, reads tend to dominate
If we can serve from memory, what can we optimise so that we do?
23. Pathological Case
● Consumer performs full replay on old topic (maybe downconverting too)
● It experiences 100% pagecache miss rate
● Disk IOPS spent on reads not available for writes
○ Produce operations slow
● LRU pagecache caches this single use data, evicting recent data
○ Fast consumers now see increased pagecache misses
○ These hit disk
○ Fewer IOPS for writes as before, and now fewer for reads too, so more stalls
● Soon many users are stalled on disk IO, even for relatively recent writes
24. Ideal Case
● Most consumers stay relatively up to date with producers
○ Most reads are cache hits
○ Disks free to focus on writes
● Consumers who lag and miss cache do not impact write performance
○ Data comes from a secondary cache
● Log compactions do not cause cache evictions nor increase misses
○ Single scans of old data are not seen as cache worthy
● Full replay consumers do not impact cache performance for others
○ As above
26. ZFS Features
● Pooled Storage
● Automatic Checksumming
● Deduplication
● Compression
● Disk Striping
● ARC
● L2ARC
● RAID-N
● Copy-on-write (no fsck)
● Lightweight datasets
● Quotas
● Integrated CIFS, NFS, iSCSI
● Virtual Volumes
● Encryption
● ACLs
● Snapshots
● Clones
● Arbitrary device trees
● Send / Receive datasets
● SLOG
27. ZFS History
● 2001 - Originally from Solaris (Sun’s OS)
● 2005 - Open sourced (CDDL) as part of OpenSolaris
● 2006 - Linux didn’t use it as CDDL != GPL (FUSE port available)
● 2007 - Picked up by FreeBSD, Apple (briefly)
● 2010 - Oracle closed OpenSolaris, yielded Illumos, OpenZFS
● 2015 - Canonical hired lawyers, decided CDDL == GPL
● 2016 - Available natively in Ubuntu 16.04+
29. ZFS Superpower 1: The ARC
1. List of recently cached data
2. List of recently cached data accessed two or more times
3. List of data evicted from 1
4. List of data evicted from 2
● Take a given amount of cache space, partition it in two at a point c
○ Below c is used for list 1, above c for list 2
● Every time you miss, check lists 3 and 4 to see if you recently evicted that data
○ If you did, move c to favour keeping that type of data
● Scan resistant, protects from replays / compactions
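On a ZFS on Linux host you can watch the ARC's size, target, and hit/miss counters directly; a minimal sketch (the kstat path below is the standard ZFS on Linux interface and requires the ZFS module to be loaded):

```shell
# Report ARC size, target size (c), and hit/miss counters from the
# ZFS kstat interface. Column 3 of each kstat row is the value.
awk '$1 ~ /^(size|c|hits|misses|mru_hits|mfu_hits)$/ {print $1, $3}' \
  /proc/spl/kstat/zfs/arcstats
```

Comparing mru_hits against mfu_hits shows which side of the adaptive split point c is earning its keep under your workload.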
30. ZFS Superpower 1: The ARC
Credit: ARC paper, linked previously
● Results are extremely workload dependent
● Kafka’s workload is very favourable
31. ZFS Superpower 2: The L2ARC
● Set a storage device to act as a Level 2 ARC
● Temporary (ephemeral) instance SSDs on cloud VMs are perfect
● Increase ARC size by ~200GB
○ Slower than memory
○ Faster than disks
○ Does not steal throughput from disks
32. ZFS Superpower 2: The L2ARC
● Not strictly a second tier of the ARC
● Bad idea to tie ARC evictions to disk speed
● Instead, a process scavenges data that is likely to be evicted soon
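Attaching an instance SSD as an L2ARC is a single command; a sketch, assuming the pool is named kafka and the ephemeral SSD appears as /dev/sdb (both names are placeholders for your environment):

```shell
# Add the instance SSD as a Level 2 ARC (read cache) device.
# The 'cache' keyword is essential -- without it the disk joins
# the pool permanently (see the caveats later in this deck).
sudo zpool add kafka cache /dev/sdb

# Verify it appears under the 'cache' section of the pool layout.
zpool status kafka
```

Because the L2ARC holds only cached copies, losing the ephemeral disk on a host redeployment costs you warm cache, not data.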
34. ZFS Superpower 3: LZ4
● LZ4 is so fast it’s free
● Disk throughput is increased by compression factor
● UTF-8 JSON achieves around 5x
● Compressed blocks stored in ARC, L2ARC
○ Increases hit rates
● Still better to have producer compress first
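Enabling LZ4 and checking what it achieves is two commands; a sketch assuming a dataset named kafka/logs (a placeholder name):

```shell
# Turn on inline LZ4 compression for the Kafka log dataset.
sudo zfs set compression=lz4 kafka/logs

# After some data lands, report the achieved compression ratio,
# e.g. a value like '5.02x' for the ~5x JSON figure above.
zfs get -H -o value compressratio kafka/logs
```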
35. ZFS Superpower 4: Prefetch
● Uses idle disk time to pre-load the ARC
● Request a block? Get the next one just in case
● Read that? Better get the next two
● Read those? ...
● Extremely beneficial to sequential read streams
○ A.K.A. every Kafka consumer
● Increases hit rates
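Prefetch is on by default in ZFS on Linux; you can confirm that and watch its effectiveness via the module parameter and the zfetch kstats (both paths are the standard ZFS on Linux interfaces):

```shell
# 0 means the prefetcher is active.
cat /sys/module/zfs/parameters/zfs_prefetch_disable

# Prefetch effectiveness: hits vs misses.
awk '$1 ~ /^(hits|misses)$/ {print $1, $3}' \
  /proc/spl/kstat/zfs/zfetchstats
```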
37. Disk I/O in 30 seconds
● IOP = Disk read or write
● Disks have an IOP latency which bounds their IOPS
○ Local SSD : Very fast
○ Remote SSD: Less Fast
○ Remote HDD: Not Fast
● IOP max size determined by disk type
● Throughput = IOPS × IOP size
● Spinning disks care about IOPs to random vs. sequential sectors
● ZFS handles all of this for you
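The throughput formula is worth a worked example. Taking one standard 500-IOPS disk and assuming a 256 KB max IOP size (an assumed figure; check your disk tier's documentation):

```shell
# Theoretical ceiling: Throughput = IOPS x IOP size.
iops=500
iop_size_kb=256
echo "$(( iops * iop_size_kb / 1024 )) MB/s"   # prints "125 MB/s"
```

In practice the provider's per-disk throughput cap usually binds first, which is exactly why striping many small disks raises the real ceiling.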
38. Cloud I/O Options (Azure, East US 2) - 1TB
Make use of ZFS striping to use many disks
Note: Latency not shown. Transaction costs can be reduced with ZFS write batching.
Type          Layout             IOPS             Txn Fee    Total
Standard HDD  32x 32GB @ $1.54   32 x 500 = 16k   Yes, ~$25  $75
Standard SSD  8x 128GB @ $9.60   8 x 500 = 4k     Yes, ~$25  $102
Premium SSD   1x 1TB @ $123      5k               No         $123
Ultra SSD     ?                  ?                ?          Lots
Instance SSD  1x 200GB           12k              No         Free
40. Create the VM
1. Attach as many disks as possible
2. If using Azure, do not use the ‘S’ series
a. Reduced instance SSD size
3. If using Azure, do not enable ‘Host Disk Caching’
a. We have the better cache
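With the disks attached, building the striped pool is one command; a minimal sketch of the layout from the cost table, assuming four data disks at /dev/disk/azure/scsi1/lun0 through lun3 (device paths and the kafka name are placeholders -- a real 32-disk stripe simply lists more devices):

```shell
# Create a striped (RAID-0) pool across the attached data disks.
# No raidz/mirror keyword means plain striping; redundancy is the
# cloud provider's job here.
sudo zpool create kafka \
  /dev/disk/azure/scsi1/lun0 \
  /dev/disk/azure/scsi1/lun1 \
  /dev/disk/azure/scsi1/lun2 \
  /dev/disk/azure/scsi1/lun3

# Dataset for the broker's logs, with LZ4 on and atime updates off.
sudo zfs create -o compression=lz4 -o atime=off kafka/logs
```

Kafka's log.dirs would then point at the dataset's mountpoint (/kafka/logs by default), and the setup is completely transparent to the broker.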
55. Things not to do on ZFS 1
● Do not use a separate device for the Write-Ahead-Log
○ Called the ZFS Intent Log / ZIL
○ Basically Journalling
○ Separate device known as an SLOG
● Most Kafka writes are async, so it’s not going to benefit you
● If the device is lost it can be tricky to recover
56. Things not to do on ZFS 2
● Do not use the deduplicating feature
○ Huge memory hog
○ Means less ARC for pagecache
● Why is there duplicated data anyway?
○ Fix the problem at source
57. Things not to do on ZFS 3
● Do not add a temporary instance disk to your main pool
○ Easy to do if you forget the ‘cache’ keyword
● You cannot remove disks from a zpool
○ You’re forever bound to that particular host
58. Things not to do on ZFS 4
● Do not create ZFS snapshots if you use retention.bytes
● Deleted segments stay referenced by the snapshot, so space is never reclaimed
● You will run out of space
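A quick way to confirm no snapshots are pinning deleted segments (assuming the placeholder kafka pool name from earlier):

```shell
# List any snapshots under the pool; each one pins every log
# segment that existed when it was taken, defeating retention.bytes.
zfs list -t snapshot -r kafka

# Destroy a stray snapshot so space can be reclaimed
# (the snapshot name here is a placeholder):
# sudo zfs destroy kafka/logs@mysnap
```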
59. Things not to do on ZFS 5
● Do not create a raidz pool
○ Your cloud provider is handling data redundancy for you
○ Holdover from physical disks
60. Future Ideas
● A mirror pool of instance SSD and standard HDD
○ Limited size, but very fast and recoverable on VM loss. Like SSD Redis?
● Does setting copies=2 increase read speed with multiple disks?
○ At the cost of storage capacity
○ Could also do this with mirrors
● Can we use Kafka’s replication to safely have larger write buffers?
● Can Kafka skip startup verification given that the data is always consistent?
● As Kafka is append only, can the ZFS record size be increased efficiently?