© 2019 Joyent, Inc.
RAIDZ on-disk format
vs. small blocks
Mike Gerdts
mike.gerdts@joyent.com
June, 2019
Agenda
● Introduction
○ Problem statement
○ Advanced Format (AF) drives
○ I/O inflation overview
○ Common SmartOS configurations
● RAIDZ1 block allocation
● RAIDZ2 block allocation
● Future SmartOS configurations
Terminology
sector: The fundamental allocation unit on a hard disk, or on a device that acts like a hard disk
block: An allocation unit at a higher level, such as in a file system or database
page size: What databases often call the block size
recordsize: The ZFS property for file systems that indicates which block size is used for new files
volblocksize: The ZFS property for volumes that indicates which block size is used for the volume's data
dataset: A ZFS filesystem, volume, snapshot, and potentially more. This
presentation is concerned with filesystems and volumes.
The Problem
● Empirical evidence shows that ZFS does not automatically size
refreservation correctly on RAIDZ2. See OS-7824.
○ SmartOS relies on the accuracy of refreservation to set quota
○ When quota is undersized, writes fail in the host with EDQUOT,
which surfaces as EIO in guests
● Investigation into the problem above shows further concerns
○ Until volblocksize is 4x the disk sector size:
■ RAIDZ is as space hungry as mirroring
■ RAIDZ2 is as space hungry as triple mirroring
Advanced Format Drives
● The world of storage is moving from 512 to 4096 bytes per sector
○ Seen most in flash-based storage (SSD, NVMe, etc.)
○ Also with hard disk drives (e.g. Ultrastar DC HC530)
https://www.seagate.com/tech-insights/advanced-format-4k-sector-hard-drives-master-ti/
I/O Inflation
● VMs, databases, etc. have their own block sizes
○ Most Linux and Windows file systems use 4k blocks
○ Postgres, Oracle use 8k; sqlite 1k
● I/O inflation comes from
○ Partial sector reads or writes
○ Unaligned reads or writes
○ Writes turn into read-modify-write (RMW)
● Applies to ZFS blocks too
○ Checksums and parity calculated per zfs block
○ RMW may be just modify-write (MW) if the block is already in the ARC
[Diagram: read extents marked "R" illustrating a partial sector read, an unaligned read, and an aligned read]
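For example, when a guest issues a 4 KiB write into a zvol with volblocksize=8k, ZFS checksums (and, on RAIDZ, computes parity over) the whole 8 KiB block. Unless that block is already in the ARC, it must read the full 8 KiB, merge the 4 KiB change, and write the whole block back out, multiplying the physical I/O behind a single logical write.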
SmartOS zpool configuration
● Default pool configuration by disklayout.js
○ if one disk, single
○ else if fewer than 16 disks, (2-way) mirror
○ else raidz2
● The fundamental allocation size is 512 or 4096 bytes
○ ashift=9 or ashift=12 (2^9 = 512, 2^12 = 4096)
○ A single 4Kn drive causes the pool to have ashift=12
○ If ashift=9, it is not possible to add a 4Kn disk
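A minimal sketch of the selection described above, paraphrasing the behavior rather than the actual disklayout.js code (function names here are illustrative):

    # Sketch of the default SmartOS pool layout choice described above;
    # a paraphrase of disklayout.js behavior, not the actual code.
    def default_layout(num_disks):
        if num_disks == 1:
            return "single"      # one disk: no redundancy
        if num_disks < 16:
            return "mirror"      # fewer than 16 disks: 2-way mirrors
        return "raidz2"          # 16 or more disks: raidz2

    # One 4Kn (4096-byte sector) drive forces the whole pool to ashift=12.
    # 2**9 == 512, 2**12 == 4096.
    def pool_ashift(sector_sizes):
        return 12 if 4096 in sector_sizes else 9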
SmartOS ZFS Configuration
● Filesystems generally use 128K recordsize
● Volumes generally use 8K volblocksize
○ Some images use 4K volblocksize
● bhyve tries hard to reserve space and prevent excessive use
○ refreservation on volumes (vdisks) and zone dataset
■ No one else can use space reserved for this dataset
○ quota on zone dataset
■ Ensures snapshots do not lead to excessive data use
○ See RFD 154 Flexible Disk Space
RAIDZ1 block
allocation
RAIDZ - 128K block on 4Kn
● This is a deceptively simple example
● Optimized for 128 KiB blocks
○ RAIDZ{1,2,3} have 1, 2, or 3 parity disks
○ dataset referenced is 128K
○ pool alloc is 160K
● This is the baseline used for other calculations;
it establishes the "raidz deflation" ratio.
P0 D0 D8 D16 D24
P1 D1 D9 D17 D25
P2 D2 D10 D18 D26
P3 D3 D11 D19 D27
P4 D4 D12 D20 D28
P5 D5 D13 D21 D29
P6 D6 D14 D22 D30
P7 D7 D15 D23 D31
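Worked numbers for the diagram above: 8 parity sectors (P0-P7) plus 32 data sectors (D0-D31) at 4K each is 160K allocated for 128K of data. The 128K/160K = 0.8 ratio is the "raidz deflation" factor this vdev uses to scale referenced for every other block size.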
RAIDZ - 4K blocks on 4Kn
● At least one set of parity per block
○ Also one set per stripe (as seen before)
● Important observations:
○ Same space consumption as mirroring
○ But supports only 50% of the IOPS
● Stored 80K of data in space of 128K block
○ referenced increases by 128K
○ 10 GiB of data in 4K blocks → referenced=16g
P0 D0 P0 D0 P0
D0 P0 D0 P0 D0
P0 D0 P0 D0 P0
D0 P0 D0 P0 D0
P0 D0 P0 D0 P0
D0 P0 D0 P0 D0
P0 D0 P0 D0 P0
D0 P0 D0 P0 D0
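Worked numbers: each 4K block needs one data sector plus one parity sector, so 8K is allocated but the dataset is charged 8K × 128K/160K = 6.4K of referenced. Writing 10 GiB of data in 4K blocks therefore shows up as roughly 10 GiB × 1.6 = 16 GiB referenced, the referenced=16g above.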
RAIDZ - 8K blocks on 4Kn
● Skip blocks
○ RAIDZ requires that each allocation be a
multiple of (nparity + 1) sectors
● Stored 80K of data in space of 128K block
○ referenced increases by 128K
○ Same as 4K blocks
■ But more efficient for metadata
P0 D0 D1 S0 P0
D0 D1 S0 P0 D0
D1 S0 P0 D0 D1
S0 P0 D0 D1 S0
P0 D0 D1 S0 P0
D0 D1 S0 P0 D0
D1 S0 P0 D0 D1
S0 P0 D0 D1 S0
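Worked numbers: each 8K block allocates P0 + D0 + D1 + S0 = four sectors (16K) and is charged 16K × 128K/160K = 12.8K of referenced, so ten blocks store 80K of data while referenced grows by 128K, the same 1.6× inflation as 4K blocks.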
RAIDZ - 32K blocks on 4Kn
● Important observations:
○ Just as good as 128K
○ But not necessarily true with more disks
● Stored 128k of data in space of 128K block
○ referenced increases by 128K
P0 D0 D2 D4 D6
P1 D1 D3 D5 D7
P0 D0 D2 D4 D6
P1 D1 D3 D5 D7
P0 D0 D2 D4 D6
P1 D1 D3 D5 D7
P0 D0 D2 D4 D6
P1 D1 D3 D5 D7
RAIDZ - Parity Overhead
volblocksize raidz1-5 raidz1-6 raidz1-7 raidz1-8 raidz1-9 raidz1-10
4k 1.60 1.60 1.69 1.69 1.78 1.78
8k 1.60 1.60 1.69 1.69 1.78 1.78
16k 1.20 1.20 1.26 1.26 1.33 1.33
32k 1.00 1.00 1.05 1.05 1.11 1.11
64k 1.00 1.00 1.05 1.05 1.00 1.00
128k 1.00 1.00 1.00 1.00 1.00 1.00
referenced relative to storing same amount of data with 128k blocks
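The table can be reproduced with a short calculation that follows the allocation rules shown in this deck: round the data up to whole sectors, add one parity sector per data row, and pad to a multiple of (nparity + 1) sectors. The sketch below is illustrative Python with made-up helper names, not the ZFS implementation; nparity=2 reproduces the RAIDZ2 table later in the deck.

    # Sketch: reproduce the parity-overhead table above.
    # Assumes ashift=12 (4K sectors) and the allocation rules described in
    # this deck; this is an illustration, not the actual ZFS code.

    def raidz_sectors(block_size, ndisks, nparity, sector=4096):
        """Sectors allocated (data + parity + skip) for one block."""
        data = -(-block_size // sector)            # ceil(block_size / sector)
        rows = -(-data // (ndisks - nparity))      # data rows across the vdev
        total = data + rows * nparity              # add parity sectors
        return total + (-total) % (nparity + 1)    # pad with skip sectors

    def overhead_vs_128k(block_size, ndisks, nparity):
        """referenced relative to storing the same data in 128K blocks."""
        small = raidz_sectors(block_size, ndisks, nparity) / block_size
        base = raidz_sectors(128 * 1024, ndisks, nparity) / (128 * 1024)
        return small / base

    for bs in (4, 8, 16, 32, 64, 128):
        cells = [f"{overhead_vs_128k(bs * 1024, n, nparity=1):.2f}"
                 for n in range(5, 11)]
        print(f"{bs:>4}k  " + "  ".join(cells))

For example, overhead_vs_128k(8 * 1024, 5, 1) returns 1.60, matching the 8k / raidz1-5 cell above.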
RAIDZ2 block
allocation
RAIDZ2 - 128K block on 4Kn
● This too is a deceptively simple example
● Optimized for 128 KiB blocks
○ dataset referenced is 128K
○ pool alloc is 192K vs 160K for RAIDZ1
● RAIDZ1 → RAIDZ2 appears to increase
overhead by one disk (20% in this case)
P0 P8 D0 D8 D16 D24
P1 P9 D1 D9 D17 D25
P2 P10 D2 D10 D18 D26
P3 P11 D3 D11 D19 D27
P4 P12 D4 D12 D20 D28
P5 P13 D5 D13 D21 D29
P6 P14 D6 D14 D22 D30
P7 P15 D7 D15 D23 D31
RAIDZ2 - 4K block on 4Kn
● Two parity blocks for every data block
○ Same space as triple mirroring!
○ But ⅓ of the read performance
● Stored 64K in space of 128K block
○ referenced increases by 128K
P0 P1 D0 P0 P1 D0
P0 P1 D0 P0 P1 D0
P0 P1 D0 P0 P1 D0
P0 P1 D0 P0 P1 D0
P0 P1 D0 P0 P1 D0
P0 P1 D0 P0 P1 D0
P0 P1 D0 P0 P1 D0
P0 P1 D0 P0 P1 D0
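Worked numbers: each 4K block needs one data sector plus two parity sectors, so 12K is allocated for every 4K of data, the same footprint as a 3-way mirror, while a read of that block can be served only by the single disk holding its data sector.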
RAIDZ2 - 8K block on 4Kn
● Two parity blocks for every data block
○ Same space as triple mirroring!
○ But ⅓ the read performance
● Skip blocks chew up tons of space
● Stored 64K in space of 128K block
○ referenced increases by 128K
P0 P1 D0 D1 S0 S1
P0 P1 D0 D1 S0 S1
P0 P1 D0 D1 S0 S1
P0 P1 D0 D1 S0 S1
P0 P1 D0 D1 S0 S1
P0 P1 D0 D1 S0 S1
P0 P1 D0 D1 S0 S1
P0 P1 D0 D1 S0 S1
RAIDZ2 - 16K block on 4Kn
● Important observations:
○ Woot! Just as good as 128K
○ But not necessarily true with more disks
● Stored 128k of data in space of 128K block
○ referenced increases by 128K
P0 P1 D0 D1 D2 D3
P0 P1 D0 D1 D2 D3
P0 P1 D0 D1 D2 D3
P0 P1 D0 D1 D2 D3
P0 P1 D0 D1 D2 D3
P0 P1 D0 D1 D2 D3
P0 P1 D0 D1 D2 D3
P0 P1 D0 D1 D2 D3
RAIDZ2 - Parity Overhead
volblocksize raidz2-5 raidz2-6 raidz2-7 raidz2-8 raidz2-9 raidz2-10
4k 1.78 2.00 2.00 2.14 2.29 2.29
8k 1.78 2.00 2.00 2.14 2.29 2.29
16k 1.33 1.00 1.00 1.07 1.15 1.15
32k 1.11 1.00 1.00 1.07 1.14 1.14
64k 1.11 1.00 1.00 1.07 1.14 1.00
128k 1.00 1.00 1.00 1.00 1.00 1.00
referenced relative to storing same amount of data with 128k blocks
Future SmartOS
configurations
Bugs and improvements
● OS-7824 zfs volume can reference more data than
refreservation=auto calculates
○ That is, fix refreservation=auto calculation
○ aka illumos#9318
● In the future maybe add per-dataset tracking of skip blocks
○ Particularly informative when compression used
● Explore draid (in development)
○ Is allocation and/or accounting different, and should it be?
Default Pool Layout
● If the pool will have ashift=12, always default to mirroring
○ This is consistent with the current focus on bhyve
○ Likely relevant for containers with…
■ postgres with recordsize=8k without compression
■ postgres with recordsize=16k+ with compression
■ most files larger than 112 bytes but smaller than 16K
● Improve documentation to include explanation of nuanced
behavior
Default volblocksize
● Explore impact of volblocksize=16k without compression
● Explore impact of volblocksize=32k+ with compression
● But keep in mind that a send stream includes the block size
○ Images
○ Migration
Discuss
References
● Ironically, justification for much of the current worry about RAIDZ
https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
● RAIDZ parity (and padding) cost, linked from above blog post
https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674
● This presentation as a big theory comment
https://github.com/mgerdts/illumos-joyent/blob/ed6682683d1c15bcb87971c2430d16cf5a1e52d6/usr/src/lib/libzfs/common/libzfs_dataset.c#L5115-L5193
/*
* The theory of raidz space accounting
*
* The "referenced" property of RAIDZ vdevs is scaled such that a 128KB block
* will "reference" 128KB, even though it allocates more than that, to store the
* parity information (and perhaps skip sectors). This concept of the
* "referenced" (and other DMU space accounting) being lower than the allocated
* space by a constant factor is called "raidz deflation."
*
* As mentioned above, the constant factor for raidz deflation assumes a 128KB
* block size. However, zvols typically have a much smaller block size (default
* 8KB). These smaller blocks may require proportionally much more parity
* information (and perhaps skip sectors). In this case, the change to the
* "referenced" property may be much more than the logical block size.
*
* Suppose a raidz vdev has 5 disks with ashift=12. A 128k block may be written
* as follows.
*
* +-------+-------+-------+-------+-------+
* | disk1 | disk2 | disk3 | disk4 | disk5 |
* +-------+-------+-------+-------+-------+
* | P0 | D0 | D8 | D16 | D24 |
* | P1 | D1 | D9 | D17 | D25 |
* | P2 | D2 | D10 | D18 | D26 |
* | P3 | D3 | D11 | D19 | D27 |
* | P4 | D4 | D12 | D20 | D28 |
* | P5 | D5 | D13 | D21 | D29 |
* | P6 | D6 | D14 | D22 | D30 |
* | P7 | D7 | D15 | D23 | D31 |
* +-------+-------+-------+-------+-------+
*
* Above, notice that 160k was allocated: 8 x 4k parity sectors + 32 x 4k data
* sectors. The dataset's referenced will increase by 128k and the pool's
* allocated and free properties will be adjusted by 160k.
*
* A 4k block written to the same raidz vdev will require two 4k sectors. The
* blank cells represent unallocated space.
*
* +-------+-------+-------+-------+-------+
* | disk1 | disk2 | disk3 | disk4 | disk5 |
* +-------+-------+-------+-------+-------+
* | P0 | D0 | | | |
* +-------+-------+-------+-------+-------+
*
* Above, notice that the 4k block required one sector for parity and another
* for data. vdev_raidz_asize() will return 8k and as such the pool's allocated
* and free properties will be adjusted by 8k. The dataset will not be charged
* 8k. Rather, it will be charged a value that is scaled according to the
* overhead of the 128k block on the same vdev. This 8k allocation will be
* charged 8k * 128k / 160k. 128k is from SPA_OLD_MAXBLOCKSIZE and 160k is as
* calculated in the 128k block example above.
*
* Every raidz allocation is sized to be a multiple of nparity+1 sectors. That
* is, every raidz1 allocation will be a multiple of 2 sectors, raidz2
* allocations are a multiple of 3 sectors, and raidz3 allocations are a
 * multiple of 4 sectors. When a block does not fill the required number of
* sectors, skip blocks (sectors) are used.
*
* An 8k block being written to a raidz vdev may be written as follows:
*
* +-------+-------+-------+-------+-------+
* | disk1 | disk2 | disk3 | disk4 | disk5 |
* +-------+-------+-------+-------+-------+
* | P0 | D0 | D1 | S0 | |
* +-------+-------+-------+-------+-------+
*
* In order to maintain the nparity+1 allocation size, a skip block (S0) was
* added. For this 8k block, the pool's allocated and free properties are
* adjusted by 16k and the dataset's referenced is increased by 16k * 128k /
* 160k. Again, 128k is from SPA_OLD_MAXBLOCKSIZE and 160k is as calculated in
* the 128k block example above.
*
* Compression may lead to a variety of block sizes being written for the same
* volume or file. There is no clear way to reserve just the amount of space
* that will be required, so the worst case (no compression) is assumed.
* Note that metadata blocks will typically be compressed, so the reservation
* size returned by zvol_volsize_to_reservation() will generally be slightly
* larger than the maximum that the volume can reference.
*/
