Kafka on ZFS
Better Living Through Filesystems
Hugh O’Brien
mail@hughobrien.ie
Kafka Summit SF 2018
Takeaways
Suggested tweets
What You Should Tell Your Boss
1. ZFS makes Kafka faster
2. ZFS makes Kafka cheaper
3. ZFS works on Linux now
What You Should Say If They Ask How
1. Broker read perf dominated by the FS cache
a. ZFS's caching algorithm (the ARC) improves hit rates
2. ZFS can make clever use of I/O devices
a. Use fast instance SSDs as a secondary cache
b. Stripe cheap HDDs to meet write needs
Who are you?
Why are you talking to me?
● Hugh [hew, hue], Irish
● Responsible for Kafka at Jet.com
● Opinions are my own, etc.
● Forgive me if I say zed-eff-ess
Overview
What don’t I already know?
1. Is Kafka Redis?
2. Broker I/O Modes
3. ZFS
4. I/O
5. HowTo
6. Demo
7. Caveats
Is Kafka Redis?
Featuring Betteridge's Law of Headlines
Is Kafka Redis?
From redis.io:
“Redis is an open source ... in-memory data structure store, used as a ... message
broker... Redis has built-in replication ... LRU eviction, transactions and ... on-disk
persistence, and provides high availability”
Why is Redis limited to memory?
● Memory is fast (bandwidth, latency, etc.)
● Memory is always fast
● Memory is volatile
● Memory is expensive
Pricing
Credit: 2017 Cihan B. https://dzone.com/articles/hybrid-memory-using-ram-amp-flash-in-redis
Pricing
Credit: 2017 Andy Klein https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/
So, Is Kafka Redis?
No. Obviously. Disks change the equation.
But maybe, if we’re clever, sometimes it can be.
Brokers have memory too.
Broker I/O
Modes
When are we Redis?
1. Log Appends
2. Live Consumers
3. Lagging Consumers
4. Downconversion Consumers
5. Compaction
Straw Man Filesystem Cache (pagecache)
● OS retains recently read disk data in memory
● Fast access if read again
● If no unused memory, no cache
● Cache discards old data as new data comes in
● Cache much smaller than disk, many reads miss the cache
I/O 1 - Log Appends
● Messages arrive over the network
● Kafka appends them directly to the active log segment, but may also:
○ Change timestamps
○ Convert from old MessageSet format to new RecordBatch (pre 0.11)
○ Compress/Decompress/Recompress the batch
● Async writes mean it’s up to the OS when to send to disks
● Consistent performance, limited by:
○ OS write buffer size and utilisation
○ Disk throughput
● Are we Redis?
I/O 1 - Log Appends
● How many times is that data read?
○ Once per replica
○ Once per subscribed client
○ Once per compaction
● It’s definitely going to be in the cache, right?
I/O 2 - ISRs / Live Consumers (CGLag ~ 0)
● Client reads from recently written partition
○ Kafka uses java.nio FileChannel.transferTo, a.k.a. sendfile(2)
○ OS level Zero-Copy file to socket transfer
○ Data very likely in pagecache
○ Really a memory -> network operation
● Are we Redis?
● Leaves disks free to focus on writes
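A minimal sketch of the zero-copy idea, using Python's socket.sendfile as a stand-in for the Java call. The helper name, the temp file, and the socketpair standing in for a consumer connection are all assumptions made for illustration, not Kafka code.

```python
import os
import socket
import tempfile


def send_segment(path, sock):
    """Send a whole file to a socket; uses os.sendfile under the hood where available."""
    with open(path, "rb") as segment:
        return sock.sendfile(segment)


# Hypothetical "log segment": a small temp file standing in for real Kafka data.
payload = b"kafka-record-batch " * 100
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    segment_path = f.name

# A connected socket pair stands in for the broker <-> consumer connection.
broker_side, consumer_side = socket.socketpair()
try:
    sent = send_segment(segment_path, broker_side)
    broker_side.close()  # signal end of stream to the reader

    received = b""
    while True:
        chunk = consumer_side.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    consumer_side.close()
    os.unlink(segment_path)

print(sent, "bytes sent without copying through user-space buffers")
```

If the file data is already in the cache, the kernel hands it to the socket without touching the disk, which is the "memory -> network" case above.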
I/O 3 - Lagging Consumers
● Client reads from partition
○ Kafka uses java.nio FileChannel.transferTo, a.k.a. sendfile(2)
○ Zero-Copy file to socket transfer
○ Data almost certainly not in pagecache
○ Consumer is stalled on disk reads
● Are we Redis?
○ No, we’re NFS
● Disks now servicing reads instead of writes
I/O 4 - Downconversion Consumers
● Old consumer reads from partition
○ Consumer is on an old client
○ Data may or may not be in pagecache
○ Broker reads data from disk into broker heap, cache is reduced
○ Broker signals kernel to send data
○ Kernel copies data from broker heap to kernel space, cache is reduced
○ Kernel sends the data to the client, data is held until transfer completes
○ Process repeats for each old consumer even for same data
○ Slow consumers eventually cause out-of-memory exception
● Are we Redis?
○ We’re MySQL
I/O 5 - Log Compaction
● Triggered once the uncompacted (dirty) portion of the log passes the configured threshold
● Broker reads entire log segment
● Runs compaction process, consuming a lot of heap (i.e. memory taken from the cache)
● Writes out compacted log segment
● How many times is this data read?
○ Is there a way to avoid caching this?
● Are we Redis?
○ ¯_(ツ)_/¯
When Can We Be Redis?
1. Non-lagging consumers / replicas
2. Appends with write buffer capacity
Since one write is often read N times, reads tend to dominate
If we can serve from memory, what can we optimise so that we do?
Pathological Case
● Consumer performs full replay on old topic (maybe downconverting too)
● It experiences 100% pagecache miss rate
● Disk IOPS spent on reads not available for writes
○ Produce operations slow
● LRU pagecache caches this single use data, evicting recent data
○ Fast consumers now see increased pagecache misses
○ These hit disk
○ Fewer IOPS for writes, as before, and now fewer for reads too, so more stalls
● Soon many users are stalled on disk IO, even for relatively recent writes
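The pathological case can be reproduced with a toy LRU cache. This is a sketch under loud assumptions: the class, the cache size of 100 blocks, and the block names are invented, and a real pagecache is more nuanced than a strict LRU.

```python
from collections import OrderedDict


class LRUCache:
    """Toy strict-LRU cache standing in for the pagecache (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def read(self, block):
        """Return True on a hit; on a miss, load the block and return False."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return False


def hit_rate(cache, blocks):
    hits = sum(cache.read(b) for b in blocks)
    return hits / len(blocks)


cache = LRUCache(capacity=100)
hot = [f"recent-{i}" for i in range(50)]   # what live consumers keep reading
replay = [f"old-{i}" for i in range(1000)]  # one full replay of an old topic

for block in hot:                 # warm the cache
    cache.read(block)

before = hit_rate(cache, hot)          # live consumers, cache is warm
replay_rate = hit_rate(cache, replay)  # single-use data, never hits
after = hit_rate(cache, hot)           # live consumers right after the replay

print(f"before={before} replay={replay_rate} after={after}")
```

The replay itself never hits, and it also pushes every hot block out, so the live consumers' next pass goes to disk too.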
Ideal Case
● Most consumers stay relatively up to date with producers
○ Most reads are cache hits
○ Disks free to focus on writes
● Consumers who lag and miss cache do not impact write performance
○ Data comes from a secondary cache
● Log compactions do not cause cache evictions nor increase misses
○ Single scans of old data are not seen as cache worthy
● Full replay consumers do not impact cache performance for others
○ As above
ZFS
Sun’s other son
ZFS Features
● Pooled Storage
● Automatic Checksumming
● Deduplication
● Compression
● Disk Striping
● ARC
● L2ARC
● RAID-N
● Copy-on-write (no fsck)
● Lightweight datasets
● Quotas
● Integrated CIFS, NFS, iSCSI
● Virtual Volumes
● Encryption
● ACLs
● Snapshots
● Clones
● Arbitrary device trees
● Send / Receive datasets
● SLOG
ZFS History
● 2001 - Originally from Solaris (Sun’s OS)
● 2005 - Open sourced (CDDL) as part of OpenSolaris
● 2006 - Linux didn’t use it as CDDL != GPL (FUSE port available)
● 2007 - Picked up by FreeBSD, Apple (briefly)
● 2010 - Oracle closed OpenSolaris, yielded Illumos, OpenZFS
● 2015 - Canonical hired lawyers, decided CDDL == GPL
● 2016 - Available natively in Ubuntu 16.04+
ZFS Superpower 1: The ARC
Paper: http://www2.cs.uh.edu/~paris/7360/PAPERS03/arcfast.pdf
Cantrill Rant: https://www.youtube.com/watch?v=F8sZRBdmqc0
ZFS Superpower 1: The ARC
1. List of recently cached data
2. List of recently cached data accessed two or more times
3. List of data evicted from 1
4. List of data evicted from 2
● Take a given amount of cache space, partition it in two at a point c
○ Below c is used for list 1, above c for list 2
● Every time you miss, see if you recently evicted that data by checking 3,4
○ If you did, move c to favour keeping that type of data
● Scan resistant, protects from replays / compactions
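The four lists and the moving partition can be sketched in a few dozen lines. This follows the replacement policy from the linked ARC paper (which calls the partition point p and the cache size c), not ZFS's own implementation, which differs in detail. The class name, cache size, and block names are made up for illustration.

```python
from collections import OrderedDict


class ARCCache:
    """Sketch of the ARC policy from the paper; not the ZFS implementation."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0.0              # target size of t1 (the moving partition)
        self.t1 = OrderedDict()   # list 1: cached, seen once recently
        self.t2 = OrderedDict()   # list 2: cached, seen two or more times
        self.b1 = OrderedDict()   # list 3: keys recently evicted from t1
        self.b2 = OrderedDict()   # list 4: keys recently evicted from t2

    def _replace(self, in_b2):
        """Evict one cached block, remembering its key in a ghost list."""
        if self.t1 and (len(self.t1) > self.p
                        or (in_b2 and len(self.t1) == self.p)):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def read(self, key):
        """Return True on a cache hit, False on a miss."""
        if key in self.t1:                 # second access: promote to t2
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                 # evicted from t1 too early: grow t1
            self.p = min(self.c, self.p + max(1.0, len(self.b2) / len(self.b1)))
            self._replace(in_b2=False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                 # evicted from t2 too early: grow t2
            self.p = max(0.0, self.p - max(1.0, len(self.b1) / len(self.b2)))
            self._replace(in_b2=True)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Never seen (or long forgotten): make room, then insert into t1.
        l1 = len(self.t1) + len(self.b1)
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(in_b2=False)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(in_b2=False)
        self.t1[key] = None
        return False


cache = ARCCache(capacity=100)
hot = [f"recent-{i}" for i in range(50)]
for _ in range(2):                 # read twice so the hot set lands in t2
    for block in hot:
        cache.read(block)
for i in range(1000):              # one full replay of old, single-use data
    cache.read(f"old-{i}")

hot_hits_after_replay = sum(cache.read(block) for block in hot)
print(hot_hits_after_replay, "of", len(hot), "hot blocks survived the replay")
```

The replay only churns list 1. Blocks that were read more than once stay in list 2, which is the scan resistance that protects live consumers from replays and compactions.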
ZFS Superpower 1: The ARC
Credit: ARC paper, linked previously
● Results are extremely workload dependent
● Kafka’s workload is very favourable
ZFS Superpower 2: The L2ARC
● Set a storage device to act as a Level 2 ARC
● Temporary instance SSDs on cloud VMs are perfect
● Increase ARC size by ~200GB
○ Slower than memory
○ Faster than disks
○ Does not steal throughput from disks
ZFS Superpower 2: The L2ARC
● Not strictly a second tier of the ARC
● Bad idea to tie ARC evictions to disk speed
● Instead, a process scavenges data that is likely to be evicted soon
ZFS Superpower 3: LZ4
Credit: Vadim Tkachenko 2016 https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
ZFS Superpower 3: LZ4
● LZ4 is so fast it’s free
● Disk throughput is increased by compression factor
● UTF-8 JSON achieves around 5x
● Compressed blocks stored in ARC, L2ARC
○ Increases hit rates
● Still better to have producer compress first
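A rough way to see why compression multiplies throughput. Python's standard library has no LZ4, so this sketch swaps in zlib at its fastest level purely to measure a ratio on made-up JSON records; the ratio LZ4 achieves on real data will differ. The record fields and the 60 MB/s disk figure are assumptions.

```python
import json
import zlib

# Hypothetical UTF-8 JSON messages, invented for illustration.
records = [
    {"user_id": i, "event": "page_view", "url": "/products/widget", "ok": True}
    for i in range(1000)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# zlib level 1 is a stand-in here, NOT the LZ4 codec that ZFS uses.
compressed = zlib.compress(raw, 1)
ratio = len(raw) / len(compressed)


def effective_throughput(disk_mb_per_s, compression_ratio):
    """Logical data moved per second when blocks are compressed on disk."""
    return disk_mb_per_s * compression_ratio


print(f"ratio ~{ratio:.1f}x")
print(f"a 60 MB/s disk behaves like ~{effective_throughput(60, ratio):.0f} MB/s")
```

The same multiplier applies to the ARC and L2ARC: the compressed blocks take less room, so more of the log fits in cache.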
ZFS Superpower 4: Prefetch
● Uses idle disk time to pre-load the ARC
● Request a block? Get the next one just in case
● Read that? Better get the next two
● Read those? ...
● Extremely beneficial to sequential read streams
○ A.K.A. every Kafka consumer
● Increases hit rates
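The "get one, then two, then more" behaviour can be sketched as a read-ahead window that doubles on every hit. This is a toy model with invented names and an arbitrary window cap, not the actual ZFS prefetcher.

```python
def read_with_prefetch(block_numbers, max_window=8):
    """Count hits and misses for a reader with a doubling read-ahead window."""
    cached = set()
    window = 1
    hits = misses = 0
    for block in block_numbers:
        if block in cached:
            hits += 1
            window = min(window * 2, max_window)  # reader is streaming: fetch more
        else:
            misses += 1
            window = 1                            # pattern broken: start small
        # Prefetch the next `window` blocks after the one just read.
        cached.update(range(block + 1, block + 1 + window))
    return hits, misses


sequential = read_with_prefetch(range(100))       # a Kafka consumer
strided = read_with_prefetch(range(0, 2000, 20))  # a non-sequential reader

print("sequential (hits, misses):", sequential)
print("strided    (hits, misses):", strided)
```

A sequential reader misses once and then always finds its next block waiting, while a reader that jumps around gets no benefit.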
I/O
More than just a TLD
Disk I/O in 30 seconds
● IOP = Disk read or write
● Disks have an IOP latency which bounds their IOPs/sec
○ Local SSD : Very fast
○ Remote SSD: Less Fast
○ Remote HDD: Not Fast
● IOP max size determined by disk type
● Throughput = IOPs/second x IOP size
● Spinning disks care about IOPs to random vs. sequential sectors
● ZFS handles all of this for you
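The throughput formula as arithmetic. The 500 IOPS figure matches the table that follows; the 64 KiB I/O size is an assumption picked for the example, and the result ignores any per-VM or per-disk bandwidth caps a provider may impose.

```python
def throughput_mib_per_s(iops, io_size_kib):
    """Throughput = IOPs/second x IOP size, converted from KiB/s to MiB/s."""
    return iops * io_size_kib / 1024


single_disk = throughput_mib_per_s(iops=500, io_size_kib=64)
striped_32 = throughput_mib_per_s(iops=32 * 500, io_size_kib=64)

print(f"one disk:       {single_disk} MiB/s")
print(f"32-disk stripe: {striped_32} MiB/s")
```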
Cloud I/O Options (Azure, East US 2) - 1TB
Make use of ZFS striping to use many disks
Note: Latency not shown. Transaction costs can be reduced with ZFS write batching.
Type          Layout             IOPS             Txn Fee     Total
Standard HDD  32x 32GB @ $1.54   32 x 500 = 16k   Yes, ~$25   $75
Standard SSD  8x 128GB @ $9.60   8 x 500 = 4k     Yes, ~$25   $102
Premium SSD   1x 1TB @ $123      5k               No          $123
Ultra SSD     ?                  ?                ?           Lots
Instance SSD  1x 200GB           12k              No          Free
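The IOPS and Total columns follow from simple multiplication, sketched here with the table's own figures. The function and field names are invented, and the ~$25 transaction fee is the slide's estimate.

```python
def striped_layout(disks, price_per_disk, iops_per_disk, txn_fee=0.0):
    """Aggregate IOPS and cost for a stripe of identical disks."""
    return {
        "iops": disks * iops_per_disk,
        "total_cost": round(disks * price_per_disk + txn_fee, 2),
    }


standard_hdd = striped_layout(32, 1.54, 500, txn_fee=25)
standard_ssd = striped_layout(8, 9.60, 500, txn_fee=25)
premium_ssd = striped_layout(1, 123, 5000)

for name, layout in [("Standard HDD", standard_hdd),
                     ("Standard SSD", standard_ssd),
                     ("Premium SSD", premium_ssd)]:
    print(f"{name}: {layout['iops']} IOPS for ${layout['total_cost']}")
```

Striping many small, cheap disks gives more than three times the IOPS of the single premium disk at roughly 60% of the cost.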
HowTo
apt install zfsutils-linux
Create the VM
1. Attach as many disks as possible
2. If using Azure, do not use the ‘S’ series
a. Reduced instance SSD size
3. If using Azure, do not enable ‘Host Disk Caching’
a. We have the better cache
Create the pool
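The slide's commands were shown as an image, so what follows is a guess at their shape rather than a copy: a small helper that builds a zpool create command for a plain stripe of data disks with the instance SSD attached as a cache device. The pool name and device paths are hypothetical, and the helper only prints the command, it does not run it.

```python
import shlex


def zpool_create_command(pool, data_disks, cache_disks=(), compression="lz4"):
    """Build (not run) a `zpool create` command for a striped pool plus L2ARC."""
    if not data_disks:
        raise ValueError("at least one data disk is required")
    cmd = ["zpool", "create", "-O", f"compression={compression}", pool]
    cmd += list(data_disks)            # top-level disks are striped
    if cache_disks:
        # The 'cache' keyword marks what follows as L2ARC devices.
        # Leave it out and the SSD joins the main pool instead.
        cmd += ["cache", *cache_disks]
    return cmd


cmd = zpool_create_command(
    "kafka",                                                  # hypothetical pool name
    ["/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf"],         # hypothetical data disks
    cache_disks=["/dev/sdb"],                                 # hypothetical instance SSD
)
print(shlex.join(cmd))
```

Check the device names with lsblk before running anything like this; the ordering of attached disks varies between VMs.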
Configure Startup
Tune ZFS - /etc/modprobe.d/zfs.conf
Increase disk queue depth, unlimit L2ARC, limit ARC size based on -Xmx
Maybe also tweak write buffer (dirty data) size
Also: disable weekly scrub
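A sketch of how the ARC limit might be derived from the broker's -Xmx and written out as module options. The slide does not name the parameters or values it used; the ones below are real ZFS on Linux module parameters, but choosing them, and every number here, is an assumption to adapt rather than a recommendation.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def zfs_modprobe_conf(total_ram_gib, kafka_xmx_gib, os_reserve_gib=2):
    """Render /etc/modprobe.d/zfs.conf content, leaving room for the JVM heap."""
    arc_gib = total_ram_gib - kafka_xmx_gib - os_reserve_gib
    if arc_gib <= 0:
        raise ValueError("no memory left for the ARC")
    options = {
        "zfs_arc_max": arc_gib * GIB,            # cap ARC so the heap is never squeezed
        "l2arc_write_max": 256 * MIB,            # let the L2ARC fill faster (assumed value)
        "l2arc_noprefetch": 0,                   # allow prefetched (streaming) data into L2ARC
        "zfs_vdev_async_write_max_active": 32,   # deeper per-disk write queue (assumed value)
    }
    return "\n".join(f"options zfs {name}={value}" for name, value in options.items())


# Hypothetical VM: 28 GiB of RAM, broker started with -Xmx6g.
conf = zfs_modprobe_conf(total_ram_gib=28, kafka_xmx_gib=6)
print(conf)
```

Module options are read when the zfs module loads, so a change here takes effect after a reload or reboot.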
Demo
Here’s one I made earlier
Arcstat Tool - Old
Credit: Mike Harsch 2010 http://blog.harschsystems.com/2010/09/08/arcstat-pl-updated-for-l2arc-statistics/
Prometheus / Exporter / Grafana - New
L2ARC Tracking
Total Disk Throughput - ZFS
Just 16 Standard HDDs, $1.54 each, D12_v2 VMs
Total Disk Throughput - EXT4 on LVM
No performance loss
Messages Per Second
1K messages from kafka-producer-perf-test, details in appendix
Caveats
The Wise Man Learns from the Mistakes of Others
Things not to do on ZFS 1
● Do not use a separate device for the Write-Ahead-Log
○ Called the ZFS Intent Log / ZIL
○ Basically Journalling
○ Separate device known as an SLOG
● Most Kafka writes are async, so it’s not going to benefit you
● If the device is lost it can be tricky to recover
Things not to do on ZFS 2
● Do not use the deduplicating feature
○ Huge memory hog
○ Means less ARC to act as your pagecache
● Why is there duplicated data anyway?
○ Fix the problem at source
Things not to do on ZFS 3
● Do not add a temporary instance disk to your main pool
○ Easy to do if you forget the ‘cache’ keyword
● You cannot remove disks from a zpool
○ You’re forever bound to that particular host
Things not to do on ZFS 4
● Do not create ZFS snapshots if you use retention.bytes
● Data will never be deleted
● You will run out of space
Things not to do on ZFS 5
● Do not create a raidz pool
○ Your cloud provider is handling data redundancy for you
○ Holdover from physical disks
Future Ideas
● A mirror pool of instance SSD and standard HDD
○ Limited size, but very fast and recoverable on VM loss. Like SSD Redis?
● Does setting copies=2 increase read speed with multiple disks?
○ At the cost of storage capacity
○ Could also do this with mirrors
● Can we use Kafka’s replication to safely have larger write buffers?
● Can Kafka skip startup verification given that the data is always consistent?
● As Kafka is append only, can the ZFS record size be increased efficiently?
Thank You
EOF
Appendix
Benchmark Command
Software Versions
FreeBSD ARC info
Lab Setup

More Related Content

What's hot

What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disksiammutex
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephScyllaDB
 
Hive partitioning best practices
Hive partitioning  best practicesHive partitioning  best practices
Hive partitioning best practicesNabeel Moidu
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @JavaPeter Lawrey
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use caseDavin Abraham
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisCapacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisHostedbyConfluent
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL AdministrationEDB
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaScyllaDB
 
Low level java programming
Low level java programmingLow level java programming
Low level java programmingPeter Lawrey
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
Docker storage drivers by Jérôme Petazzoni
Docker storage drivers by Jérôme PetazzoniDocker storage drivers by Jérôme Petazzoni
Docker storage drivers by Jérôme PetazzoniDocker, Inc.
 

What's hot (20)

What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disks
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
 
Hive partitioning best practices
Hive partitioning  best practicesHive partitioning  best practices
Hive partitioning best practices
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, DigitalisCapacity Planning Your Kafka Cluster | Jason Bell, Digitalis
Capacity Planning Your Kafka Cluster | Jason Bell, Digitalis
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
 
Docker internals
Docker internalsDocker internals
Docker internals
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
 
Low level java programming
Low level java programmingLow level java programming
Low level java programming
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
Docker storage drivers by Jérôme Petazzoni
Docker storage drivers by Jérôme PetazzoniDocker storage drivers by Jérôme Petazzoni
Docker storage drivers by Jérôme Petazzoni
 

Similar to Kafka on ZFS: Better Living Through Filesystems

Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data DeduplicationRedWireServices
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightGluster.org
 
Bsdtw17: allan jude: zfs: advanced integration
Bsdtw17: allan jude: zfs: advanced integrationBsdtw17: allan jude: zfs: advanced integration
Bsdtw17: allan jude: zfs: advanced integrationScott Tsai
 
Hybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxioHybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxioThai Bui
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed_Hat_Storage
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionStreamNative
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)MongoDB
 
Scaling Cassandra for Big Data
Scaling Cassandra for Big DataScaling Cassandra for Big Data
Scaling Cassandra for Big DataDataStax Academy
 
Tuning Linux Windows and Firebird for Heavy Workload
Tuning Linux Windows and Firebird for Heavy WorkloadTuning Linux Windows and Firebird for Heavy Workload
Tuning Linux Windows and Firebird for Heavy WorkloadMarius Adrian Popa
 
505 kobal exadata
505 kobal exadata505 kobal exadata
505 kobal exadataKam Chan
 
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...Lucidworks
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems Baruch Osoveskiy
 
Backing up Wikipedia Databases
Backing up Wikipedia DatabasesBacking up Wikipedia Databases
Backing up Wikipedia DatabasesJaime Crespo
 
Deployment Strategies
Deployment StrategiesDeployment Strategies
Deployment StrategiesMongoDB
 
Boosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uringBoosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uringShapeBlue
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneErik Krogen
 

Similar to Kafka on ZFS: Better Living Through Filesystems (20)

Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
 
Bsdtw17: allan jude: zfs: advanced integration
Bsdtw17: allan jude: zfs: advanced integrationBsdtw17: allan jude: zfs: advanced integration
Bsdtw17: allan jude: zfs: advanced integration
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
 
Five steps perform_2009 (1)
Five steps perform_2009 (1)Five steps perform_2009 (1)
Five steps perform_2009 (1)
 
Hybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxioHybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxio
 
Red Hat Gluster Storage Performance
Red Hat Gluster Storage PerformanceRed Hat Gluster Storage Performance
Red Hat Gluster Storage Performance
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless Evolution
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)
 
Scaling Cassandra for Big Data
Scaling Cassandra for Big DataScaling Cassandra for Big Data
Scaling Cassandra for Big Data
 
SNIA SDC 2016 final
SNIA SDC 2016 finalSNIA SDC 2016 final
SNIA SDC 2016 final
 
Tuning Linux Windows and Firebird for Heavy Workload
Tuning Linux Windows and Firebird for Heavy WorkloadTuning Linux Windows and Firebird for Heavy Workload
Tuning Linux Windows and Firebird for Heavy Workload
 
505 kobal exadata
505 kobal exadata505 kobal exadata
505 kobal exadata
 
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
 
Tuning Solr & Pipeline for Logs
Tuning Solr & Pipeline for LogsTuning Solr & Pipeline for Logs
Tuning Solr & Pipeline for Logs
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems
 
Backing up Wikipedia Databases
Backing up Wikipedia DatabasesBacking up Wikipedia Databases
Backing up Wikipedia Databases
 
Deployment Strategies
Deployment StrategiesDeployment Strategies
Deployment Strategies
 
Boosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uringBoosting I/O Performance with KVM io_uring
Boosting I/O Performance with KVM io_uring
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 

More from confluent

Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Eraconfluent
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flinkconfluent
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flinkconfluent
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...confluent
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluentconfluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkconfluent
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloudconfluent
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Diveconfluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluentconfluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernizationconfluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataconfluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesisconfluent
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023confluent
 

More from confluent (20)

Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Kafka on ZFS: Better Living Through Filesystems

  • 1. Kafka on ZFS Better Living Through Filesystems Hugh O’Brien mail@hughobrien.ie Kafka Summit SF 2018
  • 3. What You Should Tell Your Boss 1. ZFS makes Kafka faster 2. ZFS makes Kafka cheaper 3. ZFS works on Linux now
  • 4. What You Should Say If They Ask How 1. Broker read perf dominated by the FS cache a. ZFS’ algorithm improves hit rates 2. ZFS can make clever use of I/O devices a. Use fast instance SSDs as a secondary cache b. Stripe cheap HDDs to meet write needs
  • 5. Who are you? Why are you talking to me? ● Hugh [hew, hue], Irish ● Responsible for Kafka at Jet.com ● Opinions are my own, etc. ● Forgive me if I say zed-eff-ess
  • 6. Overview What don’t I already know? 1. Is Kafka Redis? 2. Broker I/O Modes 3. ZFS 4. I/O 5. HowTo 6. Demo 7. Caveats
  • 7. Is Kafka Redis? Featuring Betteridge's Law of Headlines
  • 8. Is Kafka Redis? From redis.io: “Redis is an open source ... in-memory data structure store, used as a ... message broker... Redis has built-in replication ... LRU eviction, transactions ... on-disk persistence, and provides high availability”
  • 9. Is Kafka Redis? From redis.io: “Redis is an open source ... in-memory data structure store, used as a ... message broker... Redis has built-in replication ... LRU eviction, transactions and ... on-disk persistence, and provides high availability”
  • 10. Why is Redis limited to memory? ● Memory is fast (bandwidth, latency, etc.) ● Memory is always fast ● Memory is volatile ● Memory is expensive
  • 11. Pricing Credit: 2017 Cihan B. https://dzone.com/articles/hybrid-memory-using-ram-amp-flash-in-redis
  • 12. Pricing Credit: 2017 Andy Klein https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/
  • 13. So, Is Kafka Redis? No. Obviously. Disks change the equation. But maybe, if we’re clever, sometimes it can be. Brokers have memory too.
  • 14. Broker I/O Modes When are we Redis? 1. Log Appends 2. Live Consumers 3. Lagging Consumers 4. Downconversion Consumers 5. Compaction
  • 15. Straw Man Filesystem Cache (pagecache) ● OS retains recently read disk data in memory ● Fast access if read again ● If no unused memory, no cache ● Cache discards old data as new data comes in ● Cache much smaller than disk, many reads miss the cache
  • 16. I/O 1 - Log Appends ● Messages arrive over the network ● Kafka appends them directly to the active log segment, but may also: ○ Change timestamps ○ Convert from old MessageSet format to new RecordBatch (pre 0.11) ○ Compress/Decompress/Recompress the batch ● Async writes mean it’s up to the OS when data is sent to the disks ● Consistent performance, limited by: ○ OS write buffer size and utilisation ○ Disk throughput ● Are we Redis?
  • 17. I/O 1 - Log Appends ● How many times is that data read? ○ Once per replica ○ Once per subscribed client ○ Once per compaction ● It’s definitely going to be in the cache, right?
  • 18. I/O 2 - ISRs / Live Consumers (CGLag ~ 0) ● Client reads from recently written partition ○ Kafka uses java.nio transferTo, a.k.a. sendfile(2) ○ OS level Zero-Copy file to socket transfer ○ Data very likely in pagecache ○ Really a memory -> network operation ● Are we Redis? ● Leaves disks free to focus on writes
  • 19. I/O 3 - Lagging Consumers ● Client reads from partition ○ Kafka uses java.nio transferTo, a.k.a. sendfile(2) ○ Zero-Copy file to socket transfer ○ Data almost certainly not in pagecache ○ Consumer is stalled on disk reads ● Are we Redis? ○ No, we’re NFS ● Disks now servicing reads instead of writes
  • 20. I/O 4 - Downconversion Consumers ● Old consumer reads from partition ○ Consumer is on an old client ○ Data may or may not be in pagecache ○ Broker reads data from disk into broker heap, cache is reduced ○ Broker signals kernel to send data ○ Kernel copies data from broker heap to kernel space, cache is reduced ○ Kernel sends the data to the client, data is held until transfer completes ○ Process repeats for each old consumer even for same data ○ Slow consumers eventually cause out-of-memory exception ● Are we Redis? ○ We’re MySQL
  • 21. I/O 5 - Log Compaction ● Triggered by log segment growing over the set tipping point ● Broker reads entire log segment ● Runs compaction process, consuming much heap (i.e. cache) ● Writes out compacted log segment ● How many times is this data read? ○ Is there a way to avoid caching this? ● Are we Redis? ○ ¯_(ツ)_/¯
  • 22. When Can We Be Redis? 1. Non-lagging consumers / replicas 2. Appends with write buffer capacity Since one write is often read N times, reads tend to dominate If we can serve from memory, what can we optimise so that we do?
  • 23. Pathological Case ● Consumer performs full replay on old topic (maybe downconverting too) ● It experiences 100% pagecache miss rate ● Disk IOPS spent on reads not available for writes ○ Produce operations slow ● LRU pagecache caches this single-use data, evicting recent data ○ Fast consumers now see increased pagecache misses ○ These hit disk ○ Fewer IOPS for writes, as before, and now fewer for reads too, so more stalls ● Soon many users are stalled on disk I/O, even for relatively recent writes
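The eviction problem on this slide can be reproduced with a toy model. The sketch below is not Kafka or kernel code: it assumes a plain LRU cache of block ids standing in for the pagecache, and the cache size and block ranges are made-up numbers chosen only to show the effect.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache of block ids, standing in for a naive pagecache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def read(self, block):
        """Return True on a cache hit, False on a miss (the block is then cached)."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return False


def hit_rate(cache, blocks):
    """Fraction of reads in `blocks` served from the cache."""
    blocks = list(blocks)
    return sum(cache.read(b) for b in blocks) / len(blocks)


lru = LRUCache(capacity=100)
recent = range(1000, 1100)            # hypothetical hot tail of the log
hit_rate(lru, recent)                 # warm the cache
lru_before = hit_rate(lru, recent)    # live consumers before the replay
hit_rate(lru, range(0, 500))          # full replay of old data, each block read once
lru_after = hit_rate(lru, recent)     # live consumers after the replay
print(lru_before, lru_after)
```

In this model the live consumers go from hitting the cache on every read to missing on every read, purely because a single replay pushed their blocks out.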
  • 24. Ideal Case ● Most consumers stay relatively up to date with producers ○ Most reads are cache hits ○ Disks free to focus on writes ● Consumers who lag and miss cache do not impact write performance ○ Data comes from a secondary cache ● Log compactions do not cause cache evictions nor increase misses ○ Single scans of old data are not seen as cache worthy ● Full replay consumers do not impact cache performance for others ○ As above
  • 26. ZFS Features ● Pooled Storage ● Automatic Checksumming ● Deduplication ● Compression ● Disk Striping ● ARC ● L2ARC ● RAID-N ● Copy-on-write (no fsck) ● Lightweight datasets ● Quotas ● Integrated CIFS, NFS, iSCSI ● Virtual Volumes ● Encryption ● ACLs ● Snapshots ● Clones ● Arbitrary device trees ● Send / Receive datasets ● SLOG
  • 27. ZFS History ● 2001 - Originally from Solaris (Sun’s OS) ● 2005 - Open sourced (CDDL) as part of OpenSolaris ● 2006 - Linux didn’t use it as CDDL != GPL (FUSE port available) ● 2007 - Picked up by FreeBSD, Apple (briefly) ● 2010 - Oracle closed OpenSolaris, yielded Illumos, OpenZFS ● 2015 - Canonical hired lawyers, decided CDDL == GPL ● 2016 - Available natively in Ubuntu 16.04+
  • 28. ZFS Superpower 1: The ARC Paper: http://www2.cs.uh.edu/~paris/7360/PAPERS03/arcfast.pdf Cantrill Rant: https://www.youtube.com/watch?v=F8sZRBdmqc0
  • 29. ZFS Superpower 1: The ARC 1. List of recently cached data 2. List of recently cached data accessed two or more times 3. List of data evicted from 1 4. List of data evicted from 2 ● Take a given amount of cache space, partition it in two at a point c ○ Below c is used for list 1, above c for list 2 ● Every time you miss, see if you recently evicted that data by checking lists 3 and 4 ○ If you did, move c to favour keeping that type of data ● Scan resistant, protects from replays / compactions
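The four lists and the moving split point can be sketched in a few dozen lines. This is a simplified model that follows the structure of the published ARC paper, not ZFS's actual implementation: it tracks block ids only, and the names `t1`, `t2`, `b1`, `b2` and `p` are the paper's (here `p` plays the role the slide calls c, and `c` is the total cache size). The workload numbers are illustrative assumptions.

```python
from collections import OrderedDict


class ARC:
    """Simplified Adaptive Replacement Cache over block ids (no data is held)."""

    def __init__(self, c):
        self.c = c                # total cache size in blocks
        self.p = 0                # adaptive target size for t1 (the split point)
        self.t1 = OrderedDict()   # list 1: cached, seen once recently
        self.t2 = OrderedDict()   # list 2: cached, seen two or more times
        self.b1 = OrderedDict()   # list 3: ids recently evicted from t1
        self.b2 = OrderedDict()   # list 4: ids recently evicted from t2

    def _replace(self, in_b2):
        """Evict one cached block, remembering its id in the matching ghost list."""
        t1_over = len(self.t1) > self.p or (in_b2 and len(self.t1) == self.p)
        if self.t1 and (t1_over or not self.t2):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def read(self, key):
        """Return True on a cache hit, False on a miss."""
        if key in self.t1:                       # second access: promote to t2
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                       # evicted from t1 too early: grow t1
            self.p = min(self.c, self.p + max(1, len(self.b2) // len(self.b1)))
            self._replace(False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                       # evicted from t2 too early: grow t2
            self.p = max(0, self.p - max(1, len(self.b1) // len(self.b2)))
            self._replace(True)
            del self.b2[key]
            self.t2[key] = None
            return False
        l1 = len(self.t1) + len(self.b1)         # completely new block
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)
        else:
            total = l1 + len(self.t2) + len(self.b2)
            if total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(False)
        self.t1[key] = None
        return False


def hit_rate(cache, blocks):
    """Fraction of reads in `blocks` served from the cache."""
    blocks = list(blocks)
    return sum(cache.read(b) for b in blocks) / len(blocks)


arc = ARC(c=100)
recent = range(1000, 1100)            # hypothetical hot tail of the log
hit_rate(arc, recent)                 # first pass: blocks land in t1
arc_before = hit_rate(arc, recent)    # second pass: hits, blocks promoted to t2
hit_rate(arc, range(0, 500))          # full replay of old data, each block read once
arc_after = hit_rate(arc, recent)     # live consumers after the replay
print(arc_before, arc_after)
```

In this model the replay churns through the single-use list while the blocks that were read twice stay cached, so the live consumers keep nearly all of their hits. That is the scan resistance the slide refers to.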
  • 30. ZFS Superpower 1: The ARC Credit: ARC paper, linked previously ● Results are extremely workload dependent ● Kafka’s workload is very favourable
  • 31. ZFS Superpower 2: The L2ARC ● Set a storage device to act as a Level 2 ARC ● Temporary, Instance SSDs on Cloud VMs are perfect ● Increase ARC size by ~200GB ○ Slower than memory ○ Faster than disks ○ Does not steal throughput from disks
  • 32. ZFS Superpower 2: The L2ARC ● Not strictly a second tier of the ARC ● Bad idea to tie ARC evictions to disk speed ● Instead, a process scavenges data that is likely to be evicted soon
  • 33. ZFS Superpower 3: LZ4 Credit: Vadim Tkachenko 2016 https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
  • 34. ZFS Superpower 3: LZ4 ● LZ4 is so fast it’s free ● Disk throughput is increased by compression factor ● UTF-8 JSON achieves around 5x ● Compressed blocks stored in ARC, L2ARC ○ Increases hit rates ● Still better to have producer compress first
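The claim that throughput scales with the compression factor is simple arithmetic, sketched below. Python's standard library has no LZ4 binding, so `zlib` is used here purely as a stand-in to measure a ratio on some made-up JSON records; the ratio it reports says nothing about what LZ4 achieves on real Kafka data, and the 60 MB/s disk figure is a hypothetical input.

```python
import json
import zlib

# Hypothetical, repetitive JSON records of the kind often sent through Kafka.
records = [
    {"user_id": i, "event": "page_view", "path": "/products/%d" % (i % 50), "ok": True}
    for i in range(1000)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
packed = zlib.compress(raw)           # stand-in compressor, NOT LZ4
ratio = len(raw) / len(packed)


def effective_throughput(disk_mb_s, compression_ratio):
    """Logical MB/s delivered when the disk moves compressed blocks."""
    return disk_mb_s * compression_ratio


print("measured ratio: %.1fx" % ratio)
print("effective MB/s at the slide's 5x:", effective_throughput(60, 5))
```

The same multiplier applies to cache capacity, since the slide notes that blocks are held compressed in the ARC and L2ARC.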
  • 35. ZFS Superpower 4: Prefetch ● Uses idle disk time to pre-load the ARC ● Request a block? Get the next one just in case ● Read that? Better get the next two ● Read those? ... ● Extremely beneficial to sequential read streams ○ A.K.A. every Kafka consumer ● Increases hit rates
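The ramping behaviour described on this slide can be modelled as a read-ahead window that grows while a stream stays sequential. This is a toy sketch, not the ZFS prefetcher: the doubling rule, the `max_window` cap and the reset on non-sequential access are all assumptions made for illustration.

```python
def simulate_prefetch(reads, max_window=8):
    """Count demand misses for a read stream with a ramping read-ahead window."""
    cached = set()
    window = 1
    last = None
    misses = 0
    for block in reads:
        if block not in cached:
            misses += 1
            cached.add(block)
        if last is not None and block == last + 1:
            window = min(window * 2, max_window)  # stream confirmed: read further ahead
        else:
            window = 1                            # new or random access: start small
        # Read ahead: load the next `window` blocks before they are requested.
        cached.update(range(block + 1, block + 1 + window))
        last = block
    return misses


sequential_misses = simulate_prefetch(range(100))         # a Kafka-like consumer
random_misses = simulate_prefetch([0, 50, 10, 90, 30])    # scattered reads
print(sequential_misses, random_misses)
```

The sequential stream misses only on its first block, while the scattered reads gain nothing, which is why the slide singles out Kafka consumers as the ideal case.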
  • 37. Disk I/O in 30 seconds ● IOP = Disk read or write ● Disks have an IOP latency which bounds their IOPs/sec ○ Local SSD : Very fast ○ Remote SSD: Less Fast ○ Remote HDD: Not Fast ● IOP max size determined by disk type ● Throughput = IOPs/second x IOP size ● Spinning disks care about IOPs to random vs. sequential sectors ● ZFS handles all of this for you
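The throughput formula on this slide is easy to check by hand. The sketch below only restates it; the 500 IOPS figure matches the per-disk number used on the next slide, while the 64 KiB I/O size is a hypothetical value picked for the example.

```python
def throughput_mib_s(iops, io_size_kib):
    """Throughput = IOPs/second x IOP size, converted from KiB/s to MiB/s."""
    return iops * io_size_kib / 1024


one_disk = throughput_mib_s(iops=500, io_size_kib=64)        # a single disk
stripe = throughput_mib_s(iops=32 * 500, io_size_kib=64)     # 32 such disks striped
print(one_disk, stripe)
```

Because the IOPS of striped disks add up, many slow disks can reach a throughput that no single one of them offers.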
  • 38. Cloud I/O Options (Azure, East US 2) - 1TB Make use of ZFS striping to use many disks Note: Latency not shown. Transaction costs can be reduced with ZFS write batching.
      Type          | Layout            | IOPS            | Txn Fee    | Total
      Standard HDD  | 32x 32GB @ $1.54  | 32 x 500 = 16k  | Yes ~ $25  | $75
      Standard SSD  | 8x 128GB @ $9.60  | 8 x 500 = 4k    | Yes ~ $25  | $102
      Premium SSD   | 1x 1TB @ $123     | 5k              | No         | $123
      Ultra SSD     | ?                 | ?               | ?          | Lots
      Instance SSD  | 1x 200GB          | 12k             | No         | Free
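The IOPS and Total figures for the two striped layouts on this slide follow from per-disk numbers. The sketch below recomputes them from the slide's own inputs (disk count, price per disk, 500 IOPS per disk, roughly $25 in transaction fees); the function name and the rounding are mine, and the prices are the 2018 figures quoted in the talk, not current ones.

```python
def striped_pool(disks, price_per_disk, iops_per_disk, txn_fee=0.0):
    """Return (aggregate IOPS, cost) for a pool striped over identical disks."""
    return disks * iops_per_disk, disks * price_per_disk + txn_fee


hdd_iops, hdd_cost = striped_pool(32, 1.54, 500, txn_fee=25)
ssd_iops, ssd_cost = striped_pool(8, 9.60, 500, txn_fee=25)
print(hdd_iops, round(hdd_cost, 2))   # the slide quotes 16k IOPS and a total of $75
print(ssd_iops, round(ssd_cost, 2))   # the slide quotes 4k IOPS and a total of $102
```

The computed costs come out slightly under the slide's totals, which appear to be rounded up.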
  • 40. Create the VM 1. Attach as many disks as possible 2. If using Azure, do not use the ‘S’ series a. Reduced instance SSD size 3. If using Azure, do not enable ‘Host Disk Caching’ a. We have the better cache
  • 46. Tune ZFS - /etc/modprobe.d/zfs.conf Increase disk queue depth, unlimit L2ARC, limit ARC size based on -Xmx Maybe also tweak write buffer (dirty data) size Also: disable weekly scrub
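Limiting the ARC based on -Xmx amounts to giving ZFS whatever memory the broker heap does not need. The sketch below shows one way to compute that and render it as a line for /etc/modprobe.d/zfs.conf using the `zfs_arc_max` module parameter, which takes a size in bytes. The 2 GiB headroom and the example machine sizes are assumptions, not values from the talk, and the other tunables the slide mentions are left out because the transcript does not give their settings.

```python
GIB = 1024 ** 3


def arc_max_bytes(total_ram_gib, heap_gib, headroom_gib=2):
    """RAM left for the ARC after the JVM heap (-Xmx) and some OS headroom."""
    arc_gib = total_ram_gib - heap_gib - headroom_gib
    if arc_gib <= 0:
        raise ValueError("no memory left for the ARC with these settings")
    return arc_gib * GIB


def modprobe_line(arc_bytes):
    """Render the ARC limit as a module option line for zfs.conf."""
    return "options zfs zfs_arc_max=%d" % arc_bytes


# Hypothetical broker: 28 GiB of RAM with the JVM started with -Xmx6g.
line = modprobe_line(arc_max_bytes(total_ram_gib=28, heap_gib=6))
print(line)
```

Module options in this file are read when the zfs module loads, so a change normally takes effect after a module reload or reboot.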
  • 47. Demo Here’s one I made earlier
  • 48. Arcstat Tool - Old Credit: Mike Harsch 2010 http://blog.harschsystems.com/2010/09/08/arcstat-pl-updated-for-l2arc-statistics/
  • 49. Prometheus / Exporter / Grafana - New
  • 51. Total Disk Throughput - ZFS Just 16 Standard HDDs, $1.54 each, D12_v2 VMs
  • 52. Total Disk Throughput - EXT4 on LVM No performance loss
  • 53. Messages Per Second 1K messages from kafka-producer-perf-test, details in appendix
  • 54. Caveats The Wise Man Learns from the Mistakes of Others
  • 55. Things not to do on ZFS 1 ● Do not use a separate device for the Write-Ahead-Log ○ Called the ZFS Intent Log / ZIL ○ Basically Journalling ○ Separate device known as an SLOG ● Most Kafka writes are async, so it’s not going to benefit you ● If the device is lost it can be tricky to recover
  • 56. Things not to do on ZFS 2 ● Do not use the deduplicating feature ○ Huge memory hog ○ Means less ARC for pagecache ● Why is there duplicated data anyway? ○ Fix the problem at source
  • 57. Things not to do on ZFS 3 ● Do not add a temporary instance disk to your main pool ○ Easy to do if you forget the ‘cache’ keyword ● You cannot remove disks from a zpool ○ You’re forever bound to that particular host
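The mistake on this slide is a single missing word, which is easiest to see with the commands side by side. The sketch below only builds the `zpool` argument lists and prints them; it does not run anything. The pool name `tank` and the device paths are hypothetical placeholders.

```python
def create_striped_pool_cmd(pool, data_disks):
    """zpool create with plain devices stripes data across all of them."""
    return ["zpool", "create", pool, *data_disks]


def add_l2arc_cmd(pool, instance_ssd):
    """The 'cache' keyword makes the device an L2ARC, not part of the data pool."""
    return ["zpool", "add", pool, "cache", instance_ssd]


def add_data_disk_cmd(pool, disk):
    """Without 'cache' the device becomes a data disk in the pool."""
    return ["zpool", "add", pool, disk]


print(" ".join(create_striped_pool_cmd("tank", ["/dev/sdc", "/dev/sdd"])))
print(" ".join(add_l2arc_cmd("tank", "/dev/sdb")))       # intended
print(" ".join(add_data_disk_cmd("tank", "/dev/sdb")))   # the mistake from the slide
```

Generating the command from a helper that always includes `cache` for instance SSDs is one way to keep the keyword from being forgotten.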
  • 58. Things not to do on ZFS 4 ● Do not create ZFS snapshots if you use retention.bytes ● Data will never be deleted ● You will run out of space
  • 59. Things not to do on ZFS 5 ● Do not create a raidz pool ○ Your cloud provider is handling data redundancy for you ○ Holdover from physical disks
  • 60. Future Ideas ● A mirror pool of instance SSD and standard HDD ○ Limited size, but very fast and recoverable on VM loss. Like SSD Redis? ● Does setting copies=2 increase read speed with multiple disks? ○ At the cost of storage capacity ○ Could also do this with mirrors ● Can we use Kafka’s replication to safely have larger write buffers? ● Can Kafka skip startup verification given that the data is always consistent? ● As Kafka is append only, can the ZFS record size be increased efficiently?