ZFS Tutorial LISA 2011

Presentation version of my USENIX LISA'11 Tutorial on ZFS
Comments
  • I like the simplicity and fun of slide 9, but visualising volumes within a pool is debatably inconsistent with ZFS pooled storage eliminating the notion of volumes ;-)
  • Very useful, thanks.

    Side note: the later 218-slide set at http://www.slideshare.net/relling/usenix-lisa11-tutorial-zfs-a is without the blank space at slide 116, but this 219-slide set appears better in places (slide 10, for example).
Speaker notes (recovered from the uploaded deck)

  • Remember this picture, it will help you make sense of it all
  • Disabling the ZIL is a bad idea
  • Chickens and eggs at Richard's ranch
  • Notice the architecture gets simpler when we go for more speed. This is a speed vs cost trade-off.

ZFS Tutorial LISA 2011: Presentation Transcript

  • ZFS: A File System for Modern Hardware. Richard.Elling@RichardElling.com. USENIX LISA’11 Conference, December 2011
  • Agenda • Overview • Foundations • Pooled Storage Layer • Transactional Object Layer • ZFS commands • Sharing • Properties • Other goodies • Performance • Troubleshooting ZFS Tutorial USENIX LISA’11 2
  • ZFS History• Announced September 14, 2004• Integration history ✦ SXCE b27 (November 2005) ✦ FreeBSD (April 2007) ✦ Mac OSX Leopard ✤ Preview shown, but removed from Snow Leopard ✤ Disappointed community reforming as the zfs-macos google group (Oct 2009) ✦ OpenSolaris 2008.05 ✦ Solaris 10 6/06 (June 2006) ✦ Linux FUSE (summer 2006) ✦ greenBytes ZFS+ (September 2008) ✦ Linux native port funded by the US DOE (2010)• More than 45 patents, contributed to the CDDL Patents CommonZFS Tutorial USENIX LISA’11 3
  • ZFS Design Goals • Figure out why storage has gotten so complicated • Blow away 20+ years of obsolete assumptions • Gotta replace UFS • Design an integrated system from scratch End the sufferingZFS Tutorial USENIX LISA’11 4
  • Limits• 2^48 — Number of entries in any individual directory• 2^56 — Number of attributes of a file [*]• 2^56 — Number of files in a directory [*]• 16 EiB (2^64 bytes) — Maximum size of a file system• 16 EiB — Maximum size of a single file• 16 EiB — Maximum size of any attribute• 2^64 — Number of devices in any pool• 2^64 — Number of pools in a system• 2^64 — Number of file systems in a pool• 2^64 — Number of snapshots of any file system• 256 ZiB (2^78 bytes) — Maximum size of any pool [*] actually constrained to 2^48 for the number of files in a ZFS file system ZFS Tutorial USENIX LISA’11 5
  • Understanding Builds • Build is often referenced when speaking of feature/bug integration • Short-hand notation: b### • Distributions derived from Solaris NV (Nevada) ✦ NexentaStor ✦ Nexenta Core Platform ✦ SmartOS ✦ Solaris 11 (nee OpenSolaris) ✦ OpenIndiana ✦ StormOS ✦ BelleniX ✦ SchilliX ✦ MilaX • OpenSolaris builds ✦ Binary builds died at b134 ✦ Source releases continued through b147 • illumos stepping up to fill void left by OpenSolaris’ demiseZFS Tutorial USENIX LISA’11 6
  • Community Links • Community links ✦ nexenta.org ✦ nexentastor.org ✦ freebsd.org ✦ zfsonlinux.org ✦ zfs-fuse.net ✦ groups.google.com/group/zfs-macos • ZFS Community ✦ hub.opensolaris.org/bin/view/Community+Group+zfs/ • IRC channels at irc.freenode.net ✦ #zfsZFS Tutorial USENIX LISA’11 7
  • ZFS Foundations 8
  • Overhead View of a Pool Pool File System Configuration Information Volume File System Volume DatasetZFS Tutorial USENIX LISA’11 9
  • Hybrid Storage Pool Adaptive Replacement Cache (ARC) separate Main Pool Main Pool Level 2 ARC intent log Write optimized HDD HDD Read optimized device (SSD) HDD device (SSD) Size (GBytes) 1 - 10 GByte large big Cost write iops/$ size/$ size/$ Use sync writes persistent storage read cache Performance secondary low-latency writes low-latency reads optimization Need more speed? stripe more, faster devices stripeZFS Tutorial USENIX LISA’11 10
  • Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ??November 8, 2010 USENIX LISA’10 11
  • Source Code Structure File system Mgmt Device Consumer Consumer libzfs Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV ConfigurationNovember 8, 2010 USENIX LISA’10 12
  • Acronyms • ARC – Adaptive Replacement Cache • DMU – Data Management Unit • DSL – Dataset and Snapshot Layer • JNI – Java Native Interface • ZPL – ZFS POSIX Layer (traditional file system interface) • VDEV – Virtual Device • ZAP – ZFS Attribute Processor • ZIL – ZFS Intent Log • ZIO – ZFS I/O layer • Zvol – ZFS volume (raw/cooked block device interface)ZFS Tutorial USENIX LISA’11 13
  • NexentaStor Rosetta Stone NexentaStor OpenSolaris/ZFS Volume Storage pool ZVol Volume Folder File systemZFS Tutorial USENIX LISA’11 14
  • nvlists • name=value pairs • libnvpair(3LIB) • Allows ZFS capabilities to change without changing the physical on-disk format • Data stored is XDR encoded • A good thing, used oftenZFS Tutorial USENIX LISA’11 15
  • Versioning • Features can be added and identified by nvlist entries • Changes in pool or dataset versions do not change the physical on-disk format (!) ✦ does change nvlist parameters • Older versions can be used ✦ might see warning messages, but harmless • Available versions and features can be easily viewed ✦ zpool upgrade -v ✦ zfs upgrade -v • Online references (broken?) ✦ zpool: hub.opensolaris.org/bin/view/Community+Group+zfs/N ✦ zfs: hub.opensolaris.org/bin/view/Community+Group+zfs/N-1 Don’t confuse zpool and zfs versions ZFS Tutorial USENIX LISA’11 16
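  • Example (not in the original deck): a minimal sketch of checking versions, assuming a hypothetical pool named “tank”.
    # zpool upgrade -v        # list the zpool versions this software supports, with feature summaries
    # zfs upgrade -v          # list the zfs (dataset) versions this software supports
    # zpool get version tank  # current on-disk version of the pool
    # zfs get version tank    # current version of the pool’s top-level dataset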
  • zpool Versions VER DESCRIPTION --- ------------------------------------------------ 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit support Continued...ZFS Tutorial USENIX LISA’11 17
  • More zpool Versions VER DESCRIPTION --- ------------------------------------------------ 15 user/group space accounting 16 stmf property support 17 Triple-parity RAID-Z 18 snapshot user holds 19 Log device removal 20 Compression using zle (zero-length encoding) 21 Deduplication 22 Received properties 23 Slim ZIL 24 System attributes 25 Improved scrub stats 26 Improved snapshot deletion performance 27 Improved snapshot creation performance 28 Multiple vdev replacements For Solaris 10, version 21 is “reserved”ZFS Tutorial USENIX LISA’11 18
  • zfs Versions VER DESCRIPTION ---------------------------------------------- 1 Initial ZFS filesystem version 2 Enhanced directory entries 3 Case insensitive and File system unique identifier (FUID) 4 userquota, groupquota properties 5 System attributesZFS Tutorial USENIX LISA’11 19
  • Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & freeZFS Tutorial USENIX LISA’11 20
  • COW Notes• COW works on blocks, not files• ZFS reserves 32 MBytes or 1/64 of pool size ✦ COWs need some free space to remove files ✦ need space for ZIL• For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched• Spatial distribution is good fodder for performance speculation ✦ affects HDDs ✦ moot for SSDs ZFS Tutorial USENIX LISA’11 21
  • To fsck or not to fsck • fsck was created to fix known inconsistencies in file system metadata ✦ UFS is not transactional ✦ metadata inconsistencies must be reconciled ✦ does NOT repair data – how could it? • ZFS doesnt need fsck, as-is ✦ all on-disk changes are transactional ✦ COW means previously existing, consistent metadata is not overwritten ✦ ZFS can repair itself ✤ metadata is at least dual-redundant ✤ data can also be redundant • Reality check – this does not mean that ZFS is not susceptible to corruption ✦ nor is any other file systemZFS Tutorial USENIX LISA’11 22
  • Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ??November 8, 2010 USENIX LISA’10 23
  • vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type = disk type = disk type = disk type = disk children[0] children[0] children[0] children[0] Physical or leaf vdevsZFS Tutorial USENIX LISA’11 24
  • vdev Labels • vdev labels != disk labels • Four 256 kByte labels written to every physical vdev • Two-stage update process ✦ write label0 & label2 ✦ flush cache & check for errors ✦ write label1 & label3 ✦ flush cache & check for errors N = 256k * (size % 256k) M = 128k / MIN(1k, sector size) 0 256k 512k 4M N-512k N-256k N label0 label1 boot block label2 label3 ... Boot Name=Value Blank header Pairs M-slot Uberblock Array 0 8k 16k 128k 256k 25ZFS Tutorial USENIX LISA’11
  • Observing Labels# zdb -l /dev/rdsk/c0t0d0s0--------------------------------------------LABEL 0-------------------------------------------- version=14 name=rpool state=0 txg=13152 pool_guid=17111649328928073943 hostid=8781271 hostname= top_guid=11960061581853893368 guid=11960061581853893368 vdev_tree type=disk id=0 guid=11960061581853893368 path=/dev/dsk/c0t0d0s0 devid=id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a phys_path=/pci@0,0/pci1458,b002@11/disk@0,0:a whole_disk=0 metaslab_array=24 metaslab_shift=30 ashift=9 asize=157945167872 is_log=0 ZFS Tutorial USENIX LISA’11 26
  • Uberblocks • Sized based on minimum device block size • Stored in 128-entry circular queue • Only one uberblock is active at any time ✦ highest transaction group number ✦ correct SHA-256 checksum • Stored in machines native format ✦ A magic number is used to determine endian format when imported • Contains pointer to Meta Object Set (MOS) Device Block Size Uberblock Size Queue Entries 512 Bytes,1 KB 1 KB 128 2 KB 2 KB 64 4 KB 4 KB 32ZFS Tutorial USENIX LISA’11 27
  • About Sizes • Sizes are dynamic • LSIZE = logical size • PSIZE = physical size after compression • ASIZE = allocated size including: ✦ physical size ✦ raidz parity ✦ gang blocks Old notions of size reporting confuse peopleZFS Tutorial USENIX LISA’11 28
  • VDEVZFS Tutorial USENIX LISA’11 29
  • Dynamic Striping • RAID-0 ✦ SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern • Dynamic Stripe ✦ Data is dynamically mapped to member disks ✦ No fixed-length sequences ✦ Allocate up to ~1 MByte/vdev before changing vdev ✦ vdevs can be different size ✦ Good combination of the concatenation feature with RAID-0 performanceZFS Tutorial USENIX LISA’11 30
  • Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes ZFS Tutorial USENIX LISA’11 31
  • Mirroring • Straightforward: put N copies of the data on N vdevs • Unlike RAID-1 ✦ No 1:1 mapping at the block level ✦ vdev labels are still at beginning and end ✦ vdevs can be of different size ✤ effective space is that of smallest vdev • Arbitration: ZFS does not blindly trust either side of mirror ✦ Most recent, correct view of data wins ✦ Checksums validate dataZFS Tutorial USENIX LISA’11 32
  • Dynamic vdev Replacement • zpool replace poolname vdev [vdev] • Today, replacing vdev must be same size or larger ✦ NexentaStor 2 ‒ as measured by blocks ✦ NexentaStor 3 ‒ as measured by metaslabs • Replacing all vdevs in a top-level vdev with larger vdevs results in top-level vdev resizing • Expansion policy controlled by: ✦ NexentaStor 2 ‒ resize on import ✦ NexentaStor 3 ‒ zpool autoexpand property 15G 10G 10G 15G 10G 20G 15G 20G 10G 15G 20G 20G 20G 20G 10G 10G Mirror 10G Mirror 15G Mirror 15G Mirror 20G MirrorZFS Tutorial USENIX LISA’11 33
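  • Example (not from the slides): a sketch of growing a mirror by replacing both sides with larger disks; pool and device names are hypothetical.
    # zpool set autoexpand=on tank       # newer ZFS / NexentaStor 3: grow automatically after replacement
    # zpool replace tank c1t0d0 c2t0d0   # swap in the first larger disk
    # zpool status tank                  # wait for resilvering to complete before touching the other side
    # zpool replace tank c1t1d0 c2t1d0   # swap in the second larger disk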
  • RAIDZ • RAID-5 ✦ Parity check data is distributed across the RAID arrays disks ✦ Must read/modify/write when data is smaller than stripe width • RAIDZ ✦ Dynamic data placement ✦ Parity added as needed ✦ Writes are full-stripe writes ✦ No read/modify/write (write hole) • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If checksum fails, read parity Space used is dependent on how usedZFS Tutorial USENIX LISA’11 34
  • RAID-5 vs RAIDZ DiskA DiskB DiskC DiskD DiskE D0:0 D0:1 D0:2 D0:3 P0 RAID-5 P1 D1:0 D1:1 D1:2 D1:3 D2:3 P2 D2:0 D2:1 D2:2 D3:2 D3:3 P3 D3:0 D3:1 DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 RAIDZ P1 D1:0 D1:1 P2:0 D2:0 D2:1 D2:2 D2:3 P2:1 D2:4 D2:5 Gap P3 D3:0ZFS Tutorial USENIX LISA’11 35
  • RAIDZ and Block Size If block size >> N * sector size, space consumption is like RAID-5 If block size = sector size, space consumption is like mirroring PSIZE=2KBASIZE=2.5KB DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 P1 D1:0 D1:1 P2:0 D2:0 PSIZE=1KB D2:1 D2:2 D2:3 P2:1 D2:4ASIZE=1.5KB D2:5 Gap P3 D3:0 PSIZE=3KB PSIZE=512 bytes ASIZE=4KB + Gap ASIZE=1KB Sector size = 512 bytes Sector size can impact space savingsZFS Tutorial USENIX LISA’11 36
  • RAID-5 Write Hole • Occurs when data to be written is smaller than stripe size • Must read unallocated columns to recalculate the parity or the parity must be read/modify/write • Read/modify/write is risky for consistency ✦ Multiple disks ✦ Reading independently ✦ Writing independently ✦ System failure before all writes are complete to media could result in data loss • Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disksZFS Tutorial USENIX LISA’11 37
  • RAIDZ2 and RAIDZ3 • RAIDZ2 = double parity RAIDZ • RAIDZ3 = triple parity RAIDZ • Sorta like RAID-6 ✦ Parity 1: XOR ✦ Parity 2: another Reed-Solomon syndrome ✦ Parity 3: yet another Reed-Solomon syndrome • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If data not valid, read parity ✦ If data still not valid, read other parity Space used is dependent on how used ZFS Tutorial USENIX LISA’11 38
  • Evaluating Data Retention • MTTDL = Mean Time To Data Loss • Note: MTBF is not constant in the real world, but keeps math simple • MTTDL[1] is a simple MTTDL model • No parity (single vdev, striping, RAID-0) ✦ MTTDL[1] = MTBF / N • Single Parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) • Double Parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2) • Triple Parity (4-way mirror, RAIDZ3) ✦ MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3) ZFS Tutorial USENIX LISA’11 39
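  • Worked example (numbers are illustrative, not from the slides): for a single-parity set of N = 8 disks with MTBF = 1,000,000 hours and MTTR = 24 hours, MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) = 10^12 / (8 * 7 * 24) ≈ 7.4 * 10^8 hours, roughly 85,000 years. The same eight disks with no parity give MTTDL[1] = MTBF / N = 125,000 hours, about 14 years.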
  • Another MTTDL Model • MTTDL[1] model doesn’t take into account unrecoverable reads • But unrecoverable reads (UER) are becoming the dominant failure mode ✦ UER specified as errors per bits read ✦ More bits = higher probability of loss per vdev • MTTDL[2] model considers UER ZFS Tutorial USENIX LISA’11 40
  • Why Worry about UER? • Richard’s study ✦ 3,684 hosts with 12,204 LUNs ✦ 11.5% of all LUNs reported read errors • Bairavasundaram et al., FAST’08, www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf ✦ 1.53M LUNs over 41 months ✦ RAID reconstruction discovers 8% of checksum mismatches ✦ “For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined” ZFS Tutorial USENIX LISA’11 41
  • Why Worry about UER? • RAID array studyZFS Tutorial USENIX LISA’11 42
  • Why Worry about UER? • RAID array study Unrecoverable Disk Disappeared Reads “disk pull” “Disk pull” tests aren’t very usefulZFS Tutorial USENIX LISA’11 43
  • MTTDL[2] Model • Probability that a reconstruction will fail ✦ Precon_fail = (N-1) * size / UER • Model doesn’t work for non-parity schemes ✦ single vdev, striping, RAID-0 • Single Parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[2] = MTBF / (N * Precon_fail) • Double Parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail) • Triple Parity (4-way mirror, RAIDZ3) ✦ MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail) ZFS Tutorial USENIX LISA’11 44
  • Practical View of MTTDL[1]ZFS Tutorial USENIX LISA’11 45
  • MTTDL[1] ComparisonZFS Tutorial USENIX LISA’11 46
  • MTTDL Models: Mirror Spares are not always better...ZFS Tutorial USENIX LISA’11 47
  • MTTDL Models: RAIDZ2ZFS Tutorial USENIX LISA’11 48
  • Space, Dependability, and PerformanceZFS Tutorial USENIX LISA’11 49
  • Dependability Use Case • Customer has 15+ TB of read-mostly data • 16-slot, 3.5” drive chassis • 2 TB HDDs • Option 1: one raidz2 set ✦ 24 TB available space ✤ 12 data ✤ 2 parity ✤ 2 hot spares, 48 hour disk replacement time ✦ MTTDL[1] = 1,790,000 years • Option 2: two raidz2 sets ✦ 24 TB available space (12 TB per set) ✤ 6 data ✤ 2 parity ✤ no hot spares ✦ MTTDL[1] = 7,450,000 years ZFS Tutorial USENIX LISA’11 50
  • Ditto Blocks • Recall that each blkptr_t contains 3 DVAs • Dataset property used to indicate how many copies (aka ditto blocks) of data is desired ✦ Write all copies ✦ Read any copy ✦ Recover corrupted read from a copy • Not a replacement for mirroring ✦ For single disk, can handle data loss on approximately 1/8 contiguous space • Easier to describe in pictures... copies parameter Data copies Metadata copies copies=1 (default) 1 2 copies=2 2 3 copies=3 3 3ZFS Tutorial USENIX LISA’11 51
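  • Example (not from the slides): requesting extra copies on a hypothetical dataset; the property only affects data written after the change.
    # zfs set copies=2 tank/important
    # zfs get copies tank/important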
  • Copies in PicturesNovember 8, 2010 USENIX LISA’10 52
  • Copies in PicturesZFS Tutorial USENIX LISA’11 53
  • When Good Data Goes Bad File system If it’s a metadata Or we get does bad read block FS panics back bad Can not tell does disk rebuild dataZFS Tutorial USENIX LISA’11 54
  • Checksum Verification ZFS verifies checksums for every read Repairs data when possible (mirror, raidz, copies>1) Read bad data Read good data Repair bad dataZFS Tutorial USENIX LISA’11 55
  • ZIO - ZFS I/O Layer 56
  • ZIO Framework • All physical disk I/O goes through ZIO Framework • Translates DVAs into Logical Block Address (LBA) on leaf vdevs ✦ Keeps free space maps (spacemap) ✦ If contiguous space is not available: ✤ Allocate smaller blocks (the gang) ✤ Allocate gang block, pointing to the gang • Implemented as multi-stage pipeline ✦ Allows extensions to be added fairly easily • Handles I/O errorsZFS Tutorial USENIX LISA’11 57
  • ZIO Write Pipeline ZIO State Compression Checksum DVA vdev I/O open compress if savings > 12.5% generate allocate start start start done done done assess assess assess done Gang and deduplication activity elided, for clarity ZFS Tutorial USENIX LISA’11 58
  • ZIO Read Pipeline ZIO State Compression Checksum DVA vdev I/O open start start start done done done assess assess assess verify decompress done Gang and deduplication activity elided, for clarity ZFS Tutorial USENIX LISA’11 59
  • VDEV – Virtual Device Subsystem • Where mirrors, RAIDZ, and RAIDZ2 are implemented ✦ Surprisingly few lines of code needed to implement RAID • Leaf vdev (physical device) I/O management ✦ Number of outstanding iops ✦ Read-ahead cache • Priority scheduling (Name, Priority): NOW 0, SYNC_READ 0, SYNC_WRITE 0, FREE 0, CACHE_FILL 0, LOG_WRITE 0, ASYNC_READ 4, ASYNC_WRITE 4, RESILVER 10, SCRUB 20 ZFS Tutorial USENIX LISA’11 60
  • ARC - AdaptiveReplacement Cache 61
  • Object Cache • UFS uses page cache managed by the virtual memory system • ZFS does not use the page cache, except for mmapped files • ZFS uses an Adaptive Replacement Cache (ARC) • ARC used by DMU to cache DVA data objects • Only one ARC per system, but caching policy can be changed on a per-dataset basis • Seems to work much better than page cache ever did for UFS ZFS Tutorial USENIX LISA’11 62
  • Traditional Cache • Works well when data being accessed was recently added • Doesnt work so well when frequently accessed data is evicted Misses cause insert MRU Dynamic caches can change Cache size size by either not evicting or aggressively evicting LRU Evict the oldestZFS Tutorial USENIX LISA’11 63
  • ARC – Adaptive Replacement Cache Evict the oldest single-use entry LRU Recent Cache Miss MRU Evictions and dynamic MFU size resizing needs to choose best Hit cache to evict (shrink) Frequent Cache LFU Evict the oldest multiple accessed entryZFS Tutorial USENIX LISA’11 64
  • ARC with Locked Pages Evict the oldest single-use entry Cannot evict LRU locked pages! Recent Cache Miss MRU MFU size Hit Frequent If hit occurs Cache within 62 ms LFU Evict the oldest multiple accessed entry ZFS ARC handles mixed-size pagesZFS Tutorial USENIX LISA’11 65
  • L2ARC – Level 2 ARC • Data soon to be evicted from the ARC is added to a queue to be sent to cache vdev ✦ Another thread sends queue to cache vdev ARC ✦ Data is copied to the cache vdev with a throttle data soon to to limit bandwidth consumption be evicted ✦ Under heavy memory pressure, not all evictions will arrive in the cache vdev • ARC directory remains in memory • Good idea - optimize cache vdev for fast reads ✦ lower latency than pool disks ✦ inexpensive way to “increase memory” cache • Content considered volatile, no raid needed • Monitor usage with zpool iostat and ARC kstatsZFS Tutorial USENIX LISA’11 66
  • ARC Directory • Each ARC directory entry contains arc_buf_hdr structs ✦ Info about the entry ✦ Pointer to the entry • Directory entries have size, ~200 bytes • ZFS block size is dynamic, sector size to 128 kBytes • Disks are large • Suppose we use a Seagate LP 2 TByte disk for the L2ARC ✦ Disk has 3,907,029,168 512 byte sectors, guaranteed ✦ Workload uses 8 kByte fixed record size ✦ RAM needed for arc_buf_hdr entries ✤ Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes • Dont underestimate the RAM needed for large L2ARCsZFS Tutorial USENIX LISA’11 67
  • ARC Tips • In general, it seems to work well for most workloads • ARC size will vary, based on usage ✦ Default target max is 7/8 of physical memory or (memory - 1 GByte) ✦ Target min is 64 MB ✦ Metadata capped at 1/4 of max ARC size • Dynamic size can be reduced when: ✦ page scanner is running ✤ freemem < lotsfree + needfree + desfree ✦ swapfs does not have enough space so that anonymous reservations can succeed ✤ availrmem < swapfs_minfree + swapfs_reserve + desfree ✦ [x86 only] kernel heap space more than 75% full • Can limit at boot timeZFS Tutorial USENIX LISA’11 68
  • Observing ARC • ARC statistics stored in kstats • kstat -n arcstats • Interesting statistics: ✦ size = current ARC size ✦ p = size of MFU cache ✦ c = target ARC size ✦ c_max = maximum target ARC size ✦ c_min = minimum target ARC size ✦ l2_hdr_size = space used in ARC by L2ARC ✦ l2_size = size of data in L2ARCZFS Tutorial USENIX LISA’11 69
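  • Example (not from the slides): pulling a few of the interesting ARC statistics with kstat on a Solaris-derived system.
    # kstat -p zfs:0:arcstats:size     # current ARC size, bytes
    # kstat -p zfs:0:arcstats:c        # target ARC size
    # kstat -p zfs:0:arcstats:c_max    # maximum target ARC size
    # kstat -p zfs:0:arcstats:l2_size  # size of data held in the L2ARC, if configured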
  • General Status - ARCZFS Tutorial USENIX LISA’11 70
  • More ARC Tips • Performance ✦ Prior to b107, L2ARC fill rate was limited to 8 MB/sec ✦ After b107, cold L2ARC fill rate increases to 16 MB/sec • Internals tracked by kstats in Solaris ✦ Use memory_throttle_count to observe pressure to evict • Dedup Table (DDT) also uses ARC ✦ lots of dedup objects need lots of RAM ✦ field reports that L2ARC can help with dedup L2ARC keeps its directory in kernel memoryZFS Tutorial USENIX LISA’11 71
  • TransactionalObject Layer 72
  • flash Source Code Structure File system Mgmt Device Consumer Consumer libzfs Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration November 8, 2010 USENIX LISA’10 73
  • Transaction Engine • Manages physical I/O • Transactions grouped into transaction group (txg) ✦ txg updates ✦ All-or-nothing ✦ Commit interval ✤ Older versions: 5 seconds ✤ Less old versions: 30 seconds ✤ b143 and later: 5 seconds • Delay committing data to physical storage ✦ Improves performance ✦ A bad thing for sync workload performance – hence the ZFS Intent Log (ZIL) 30 second delay can impact failure detection timeZFS Tutorial USENIX LISA’11 74
  • ZIL – ZFS Intent Log • DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers ✦ NFS ✦ Databases • ZIL recordsize inflation can occur for some workloads ✦ May cause larger than expected actual I/O for sync workloads ✦ Oracle redo logs ✦ No slog: can tune zfs_immediate_write_sz, zvol_immediate_write_sz ✦ With slog: use logbias property instead • Never read, except at import (eg reboot), when transactions may need to be rolled forwardZFS Tutorial USENIX LISA’11 75
  • Separate Logs (slogs) • ZIL competes with pool for IOPS ✦ Applications wait for sync writes to be on nonvolatile media ✦ Very noticeable on HDD JBODs • Put ZIL on separate vdev, outside of pool ✦ ZIL writes tend to be sequential ✦ No competition with pool for IOPS ✦ Downside: slog device required to be operational at import ✦ NexentaStor 3 allows slog device removal ✦ Size of separate log < than size of RAM (duh) • 10x or more performance improvements possible ✦ Nonvolatile RAM card ✦ Write-optimized SSD ✦ Nonvolatile write cache on RAID arrayZFS Tutorial USENIX LISA’11 76
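  • Example (not from the slides): adding a mirrored slog to a hypothetical pool, then removing it later (log device removal needs zpool version 19 / NexentaStor 3).
    # zpool add tank log mirror c4t0d0 c4t1d0   # attach a mirrored separate intent log
    # zpool iostat -v tank 5                    # watch slog activity under a sync-heavy workload
    # zpool remove tank mirror-1                # remove the log vdev, using the name zpool status reports for it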
  • zilstat • http://www.richardelling.com/Home/scripts-and-programs-1/ zilstat • Integrated into NexentaStor 3.0.3 ✦ nmc: show performance zilZFS Tutorial USENIX LISA’11 77
  • Synchronous Write Destination Without separate log Sync I/O size > ZIL Destination zfs_immediate_write_sz ? no ZIL log yes bypass to pool With separate log logbias? ZIL Destination latency (default) log device throughput bypass to pool Default zfs_immediate_write_sz = 32 kBytesZFS Tutorial USENIX LISA’11 78
  • ZIL Synchronicity Project • All-or-nothing policies don’t work well, in general • ZIL Synchronicity project proposed by Robert Milkowski ✦ http://milek.blogspot.com • Adds new sync property to datasets • Arrived in b140 sync Parameter Behaviour Policy follows previous design: write standard (default) immediate size and separate logs always All writes become synchronous (slow) disabled Synchronous write requests are ignoredZFS Tutorial USENIX LISA’11 79
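  • Example (not from the slides): the per-dataset sync policy on hypothetical datasets.
    # zfs set sync=standard tank/db       # default: honor synchronous write requests as before
    # zfs set sync=always tank/ledger     # treat every write as synchronous (slow, safest)
    # zfs set sync=disabled tank/scratch  # ignore synchronous write requests; see the next slide first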
  • Disabling the ZIL • Preferred method: change dataset sync property • Rule 0: Don’t disable the ZIL • If you love your data, do not disable the ZIL • You can find references to this as a way to speed up ZFS ✦ NFS workloads ✦ “tar -x” benchmarks • Golden Rule: Don’t disable the ZIL • Can set via mdb, but need to remount the file system • Friends don’t let friends disable the ZIL • Older Solaris - can set in /etc/system • NexentaStor has checkbox for disabling ZIL • Nostradamus wrote, “disabling the ZIL will lead to the apocalypse”ZFS Tutorial USENIX LISA’11 80
  • DSL - Dataset and Snapshot Layer 81
  • Dataset & Snapshot Layer • Object ✦ Allocated storage ✦ dnode describes collection of blocks • Object Set Dataset Directory ✦ Group of related objects Dataset • Dataset Object Set Childmap ✦ Snapmap: snapshot relationships Object Object ✦ Space usage Object Properties • Dataset directory Snapmap ✦ Childmap: dataset relationships ✦ PropertiesZFS Tutorial USENIX LISA’11 82
  • flash Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free z ZFS Tutorial USENIX LISA’11 83
  • zfs snapshot • Create a read-only, point-in-time window into the dataset (file system or Zvol) • Computationally free, because of COW architecture • Very handy feature ✦ Patching/upgrades • Basis for time-related snapshot interfaces ✦ Solaris Time Slider ✦ NexentaStor Delorean Plugin ✦ NexentaStor Virtual Machine Data CenterZFS Tutorial USENIX LISA’11 84
  • Snapshot • Create a snapshot by not freeing COWed blocks • Snapshot creation is fast and easy • Number of snapshots determined by use – no hardwired limit • Recursive snapshots also possible Snapshot tree Current tree root rootZFS Tutorial USENIX LISA’11 85
  • auto-snap serviceZFS Tutorial USENIX LISA’11 86
  • Clones • Snapshots are read-only • Clones are read-write based upon a snapshot • Child depends on parent ✦ Cannot destroy parent without destroying all children ✦ Can promote children to be parents • Good ideas ✦ OS upgrades ✦ Change control ✦ Replication ✤ zones ✤ virtual disksZFS Tutorial USENIX LISA’11 87
  • zfs clone • Create a read-write file system from a read-only snapshot • Solaris boot environment administration (diagram: install, checkpoint, and clone steps cycling OS rev1 boot environments rootfs-nmu-001 and rootfs-nmu-002 through patch/upgrade, selected via the grub boot manager) Origin snapshot cannot be destroyed, if clone exists ZFS Tutorial USENIX LISA’11 88
  • Deduplication 89
  • What is Deduplication? • A $2.1 Billion feature • 2009 buzzword of the year • Technique for improving storage space efficiency ✦ Trades big I/Os for small I/Os ✦ Does not eliminate I/O • Implementation styles ✦ offline or post processing ✤ data written to nonvolatile storage ✤ process comes along later and dedupes data ✤ example: tape archive dedup ✦ inline ✤ data is deduped as it is being allocated to nonvolatile storage ✤ example: ZFSZFS Tutorial USENIX LISA’11 90
  • Dedup how-to • Given a bunch of data • Find data that is duplicated • Build a lookup table of references to data • Replace duplicate data with a pointer to the entry in the lookup table • Granularity ✦ file ✦ block ✦ byte ZFS Tutorial USENIX LISA’11 91
  • Dedup in ZFS • Leverage block-level checksums ✦ Identify blocks which might be duplicates ✦ Variable block size is ok • Synchronous implementation ✦ Data is deduped as it is being written • Scalable design ✦ No reference count limits • Works with existing features ✦ compression ✦ copies ✦ scrub ✦ resilver • Implemented in ZIO pipelineZFS Tutorial USENIX LISA’11 92
  • Deduplication Table (DDT) • Internal implementation ✦ Adelson-Velskii, Landis (AVL) tree ✦ Typical table entry ~270 bytes ✤ checksum ✤ logical size ✤ physical size ✤ references ✦ Table entry size increases as the number of references increasesZFS Tutorial USENIX LISA’11 93
  • Reference Counts Eggs courtesy of Richard’s chickensZFS Tutorial USENIX LISA’11 94
  • Reference Counts • Problem: loss of the referenced data affects all referrers • Solution: make additional copies of referred data based upon a threshold count of referrers ✦ leverage copies (ditto blocks) ✦ pool-level threshold for automatically adding ditto copies ✤ set via dedupditto pool property # zpool set dedupditto=50 zwimming ✤ add 2nd copy when dedupditto references (50) reached ✤ add 3rd copy when dedupditto2 references (2500) reachedZFS Tutorial USENIX LISA’11 95
  • Verification write() compress checksum DDT entry lookup yes no DDT verify? match? no yes read data data yes add reference match? no new entryZFS Tutorial USENIX LISA’11 96
  • Enabling Dedup • Set dedup property for each dataset to be deduped • Remember: properties are inherited • Remember: only applies to newly written data dedup checksum verify? on SHA256 no sha256 on,verify SHA256 yes sha256,verify Fletcher is considered too weak, without verifyZFS Tutorial USENIX LISA’11 97
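  • Example (not from the slides): enabling deduplication with verification on a hypothetical dataset; only data written afterward is deduplicated.
    # zfs set dedup=sha256,verify tank/vmimages
    # zpool list tank   # the DEDUP column shows the pool-wide dedup ratio
    # zdb -DD tank      # DDT summary and histogram, as shown on the following slides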
  • Dedup Accounting • ...and you thought compression accounting was hard... • Remember: dedup works at pool level ✦ dataset-level accounting doesn’t see other datasets ✦ pool-level accounting is always correct zfs list NAME USED AVAIL REFER MOUNTPOINT bar 7.56G 449G 22K /bar bar/ws 7.56G 449G 7.56G /bar/ws dozer 7.60G 455G 22K /dozer dozer/ws 7.56G 455G 7.56G /dozer/ws tank 4.31G 456G 22K /tank tank/ws 4.27G 456G 4.27G /tank/ws zpool list NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT bar 464G 7.56G 456G 1% 1.00x ONLINE - dozer 464G 1.43G 463G 0% 5.92x ONLINE - tank 464G 957M 463G 0% 5.39x ONLINE - Data courtesy of the ZFS team ZFS Tutorial USENIX LISA’11 98
  • DDT Histogram # zdb -DD tank DDT-sha256-zap-duplicate: 110173 entries, size 295 on disk, 153 in core DDT-sha256-zap-unique: 302 entries, size 42194 on disk, 52827 in core DDT histogram (aggregated over all DDTs): bucket! allocated! referenced ______ ___________________________ ___________________________ refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE ------ ------ ----- ----- ----- ------ ----- ----- ----- 1 302 7.26M 4.24M 4.24M 302 7.26M 4.24M 4.24M 2 103K 1.12G 712M 712M 216K 2.64G 1.62G 1.62G 4 3.11K 30.0M 17.1M 17.1M 14.5K 168M 95.2M 95.2M 8 503 11.6M 6.16M 6.16M 4.83K 129M 68.9M 68.9M 16 100 4.22M 1.92M 1.92M 2.14K 101M 45.8M 45.8MZFS Tutorial USENIX LISA’11 Data courtesy of the ZFS team 99
  • DDT Histogram$ zdb -DD zwimmingDDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in coreDDT-sha256-zap-unique: 52369639 entries, size 284 on disk, 159 in coreDDT histogram (aggregated over all DDTs):bucket allocated referenced______ ______________________________ ______________________________refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE------ ------ ----- ----- ----- ------ ----- ----- ----- 1 49.9M 25.0G 25.0G 25.0G 49.9M 25.0G 25.0G 25.0G 2 16.7K 8.33M 8.33M 8.33M 33.5K 16.7M 16.7M 16.7M 4 610 305K 305K 305K 3.33K 1.66M 1.66M 1.66M 8 661 330K 330K 330K 6.67K 3.34M 3.34M 3.34M 16 242 121K 121K 121K 5.34K 2.67M 2.67M 2.67M 32 131 65.5K 65.5K 65.5K 5.54K 2.77M 2.77M 2.77M 64 897 448K 448K 448K 84K 42M 42M 42M 128 125 62.5K 62.5K 62.5K 18.0K 8.99M 8.99M 8.99M 8K 1 512 512 512 12.5K 6.27M 6.27M 6.27M Total 50.0M 25.0G 25.0G 25.0G 50.1M 25.1G 25.1G 25.1Gdedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00 ZFS Tutorial USENIX LISA’11 100
  • Over-the-wire Dedup • Dedup is also possible over the send/receive pipe ✦ Blocks with no checksum are considered duplicates (no verify option) ✦ First copy sent as usual ✦ Subsequent copies sent by reference • Independent of dedup status of originating pool ✦ Receiving pool knows about blocks which have already arrived • Can be a win for dedupable data, especially over slow wires • Remember: send/receive version rules still apply # zfs send -DR zwimming/stuffZFS Tutorial USENIX LISA’11 101
  • Dedup Performance • Dedup can save space and bandwidth • Dedup increases latency ✦ Caching data improves latency ✦ More memory → more data cached ✦ Cache performance hierarchy ✤ RAM: fastest ✤ L2ARC on SSD: slower ✤ Pool HDD: dreadfully slow • ARC is currently not deduped • Difficult to predict ✦ Dependent variable: number of blocks ✦ Estimate 270 bytes per unique block ✦ Example: ✤ 50M blocks * 270 bytes/block = 13.5 GBytes ZFS Tutorial USENIX LISA’11 102
  • Deduplication Use Cases Data type Dedupe Compression Home directories ✔✔ ✔✔ Internet content ✔ ✔ Media and video ✔✔ ✔ Life sciences ✘ ✔✔ Oil and Gas (seismic) ✘ ✔✔ Virtual machines ✔✔ ✘ Archive ✔✔✔✔ ✔ZFS Tutorial USENIX LISA’11 103
  • zpool Command 104
  • Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ??November 8, 2010 USENIX LISA’10 105
  • zpool create • zpool create poolname vdev-configuration • nmc: setup volume create ✦ vdev-configuration examples ✤ mirror c0t0d0 c3t6d0 ✤ mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 ✤ mirror disk1s0 disk2s0 cache disk4s0 log disk5 ✤ raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 • Solaris ✦ Additional checks for disk/slice overlaps or in use ✦ Whole disks are given EFI labels • Can set initial pool or dataset properties • By default, creates a file system with the same name ✦ poolname pool → /poolname file system People get confused by a file system with same name as the poolZFS Tutorial USENIX LISA’11 106
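  • Example (not from the slides): creating a small mirrored pool and setting an initial file system property; device names are hypothetical.
    # zpool create -O compression=on tank mirror c0t0d0 c0t1d0
    # zpool status tank
    # zfs list tank     # the pool’s root file system is mounted at /tank by default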
  • zpool destroy • Destroy the pool and all datasets therein • zpool destroy poolname ✦ Can (try to) force with “-f” ✦ There is no “are you sure?” prompt – if you weren’t sure, you would not have typed “destroy” • nmc: destroy volume volumename ✦ nmc prompts for confirmation, by default zpool destroy is destructive... really! Use with caution! ZFS Tutorial USENIX LISA’11 107
  • zpool add • Adds a device to the pool as a top-level vdev • Does NOT add columns to a raidz set • Does NOT attach a mirror – use zpool attach instead • zpool add poolname vdev-configuration ✦ vdev-configuration can be any combination also used for zpool create ✦ Complains if the added vdev-configuration would cause a different data protection scheme than is already in use ✤ use “-f” to override ✦ Good idea: try with “-n” flag first ✤ will show final configuration without actually performing the add • nmc: setup volume volumename grow Do not add a device which is in use as a cluster quorum device ZFS Tutorial USENIX LISA’11 108
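  • Example (not from the slides): a dry run before growing a hypothetical pool, then the real add.
    # zpool add -n tank mirror c2t0d0 c2t1d0   # show the resulting configuration without changing anything
    # zpool add tank mirror c2t0d0 c2t1d0      # add the new top-level mirror vdev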
  • zpool remove • Remove a top-level vdev from the pool • zpool remove poolname vdev • nmc: setup volume volumename remove-lun • Today, you can only remove the following vdevs: ✦ cache ✦ hot spare ✦ separate log (b124, NexentaStor 3.0) Don’t confuse “remove” with “detach” ZFS Tutorial USENIX LISA’11 109
  • zpool attach • Attach a vdev as a mirror to an existing vdev • zpool attach poolname existing-vdev vdev • nmc: setup volume volumename attach-lun • Attaching vdev must be the same size or larger than the existing vdev vdev Configurations ok simple vdev → mirror ok mirror ok log → mirrored log no RAIDZ no RAIDZ2 no RAIDZ3ZFS Tutorial USENIX LISA’11 110
  • zpool detach • Detach a vdev from a mirror • zpool detach poolname vdev • nmc: setup volume volumename detach-lun • A resilvering vdev will wait until resilvering is completeZFS Tutorial USENIX LISA’11 111
  • zpool replace • Replaces an existing vdev with a new vdev • zpool replace poolname existing-vdev vdev • nmc: setup volume volumename replace-lun • Effectively, a shorthand for “zpool attach” followed by “zpool detach” • Attaching vdev must be the same size or larger than the existing vdev • Works for any top-level vdev-configuration, including RAIDZ “Same size” literally means the same number of blocks until b117. Many “same size” disks have different number of available blocks.ZFS Tutorial USENIX LISA’11 112
  • zpool import • Import a pool and mount all mountable datasets • Import a specific pool ✦ zpool import poolname ✦ zpool import GUID ✦ nmc: setup volume import • Scan LUNs for pools which may be imported ✦ zpool import • Can set options, such as alternate root directory or other properties ✦ alternate root directory important for rpool or syspool Beware of zpool.cache interactions Beware of artifacts, especially partial artifactsZFS Tutorial USENIX LISA’11 113
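  • Example (not from the slides): scanning for importable pools, then importing one under an alternate root directory.
    # zpool import              # scan attached LUNs and list pools that can be imported
    # zpool import tank         # import by name (or by GUID)
    # zpool import -R /a rpool  # import with an alternate root, as installers and recovery media do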
  • zpool export • Unmount datasets and export the pool • zpool export poolname • nmc: setup volume volumename export • Removes pool entry from zpool.cache ✦ useful when unimported pools remain in zpool.cacheZFS Tutorial USENIX LISA’11 114
  • zpool upgrade • Display current versions ✦ zpool upgrade • View available upgrade versions, with features, but don’t actually upgrade ✦ zpool upgrade -v • Upgrade pool to latest version ✦ zpool upgrade poolname ✦ nmc: setup volume volumename version-upgrade • Upgrade pool to specific version Once you upgrade, there is no downgrade Beware of grub and rollback issues ZFS Tutorial USENIX LISA’11 115
  • zpool history • Show history of changes made to the pool • nmc and Solaris use same command# zpool history rpoolHistory for rpool:2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -ocachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s02009-03-04.07:29:47 zfs set canmount=noauto rpool2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_1062009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_1062009-03-04.07:29:51 zfs set canmount=on rpool2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export2009-03-04.07:29:51 zfs create rpool/export/home2009-03-04.00:21:42 zpool import -f -R /a 171116493289280739432009-03-04.00:21:42 zpool export rpool2009-03-04.08:47:08 zpool set bootfs=rpool rpool2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b1082009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108...ZFS Tutorial USENIX LISA’11 116
  • zpool status • Shows the status of the current pools, including their configuration • Important troubleshooting step • nmc and Solaris use same command # zpool status … pool: zwimming state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using zpool upgrade. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM zwimming ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be trickyZFS Tutorial USENIX LISA’11 117
  • zpool clear • Clears device errors • Clears device error counters • Starts any resilvering, as needed • Improves sysadmin sanity and reduces sweating • zpool clear poolname • nmc: setup volume volumename clear-errorsZFS Tutorial USENIX LISA’11 118
  • zpool iostat • Show pool physical I/O activity, in an iostat-like manner • Solaris: fsstat will show I/O activity looking into a ZFS file system • Especially useful for showing slog activity • nmc and Solaris use same command # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- zwimming 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latencyZFS Tutorial USENIX LISA’11 119
  • zpool scrub • Manually starts scrub ✦ zpool scrub poolname • Scrubbing performed in background • Use zpool status to track scrub progress • Stop scrub ✦ zpool scrub -s poolname • How often to scrub? ✦ Depends on level of paranoia ✦ Once per month seems reasonable ✦ After a repair or recovery procedure • NexentaStor auto-scrub features easily manages scrubs and schedules Estimated scrub completion time improves over timeZFS Tutorial USENIX LISA’11 120
  • auto-scrub serviceZFS Tutorial USENIX LISA’11 121
  • zfs Command 122
  • Dataset Management raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ??November 8, 2010 USENIX LISA’10 123
  • zfs create, destroy • By default, a file system with the same name as the pool is created by zpool create • Dataset name format is: pool/name[/name ...] • File system / folder ✦ zfs create dataset-name ✦ nmc: create folder ✦ zfs destroy dataset-name ✦ nmc: destroy folder • Zvol ✦ zfs create -V size dataset-name ✦ nmc: create zvol ✦ zfs destroy dataset-name ✦ nmc: destroy zvolZFS Tutorial USENIX LISA’11 124
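  • Example (not from the slides): creating a file system and a 10 GB Zvol in a hypothetical pool, then destroying the Zvol.
    # zfs create tank/home
    # zfs create -V 10g tank/vol1   # block device nodes appear under /dev/zvol/dsk and /dev/zvol/rdsk on Solaris-derived systems
    # zfs destroy tank/vol1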
  • zfs mount, unmount • Note: mount point is a file system parameter ✦ zfs get mountpoint fs-name • Rarely used subcommand (!) • Display mounted file systems ✦ zfs mount • Mount a file system ✦ zfs mount fs-name ✦ zfs mount -a • Unmount (not umount) ✦ zfs unmount fs-name ✦ zfs unmount -aZFS Tutorial USENIX LISA’11 125
  • zfs list • List mounted datasets • NexentaStor 2: listed everything • NexentaStor 3: do not list snapshots ✦ See zpool listsnapshots property • Examples ✦ zfs list ✦ zfs list -t snapshot ✦ zfs list -H -o nameZFS Tutorial USENIX LISA’11 126
  • Replication Services Days Traditional Backup NDMP Hours Auto-TierRecovery rsync Point Text Auto-Sync ZFS send/receiveObjective Seconds Auto-CDP Application Level AVS (SNDR) Mirror Replication Slower Faster System I/O Performance ZFS Tutorial USENIX LISA’11 127
  • zfs send, receive • Send ✦ send a snapshot to stdout ✦ data is decompressed • Receive ✦ receive a snapshot from stdin ✦ receiving file system parameters apply (compression, et al.) • Can incrementally send snapshots in time order • Handy way to replicate dataset snapshots • NexentaStor ✦ simplifies management ✦ manages snapshots and send/receive to remote systems • Only method for replicating dataset properties, except quotas • NOT a replacement for traditional backup solutions ZFS Tutorial USENIX LISA’11 128
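  • Example (not from the slides): replicating a dataset to another host with a full send followed by an incremental; host and dataset names are hypothetical.
    # zfs snapshot tank/home@mon
    # zfs send tank/home@mon | ssh backuphost zfs receive -F backup/home
    # zfs snapshot tank/home@tue
    # zfs send -i @mon tank/home@tue | ssh backuphost zfs receive backup/home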
  • auto-sync ServiceZFS Tutorial USENIX LISA’11 129
  • zfs upgrade • Display current versions ✦ zfs upgrade • View available upgrade versions, with features, but don’t actually upgrade ✦ zfs upgrade -v • Upgrade dataset to latest version ✦ zfs upgrade dataset • Upgrade dataset to specific version ✦ zfs upgrade -V version dataset • NexentaStor: not needed until 3.0 You can upgrade, there is no downgrade Beware of grub and rollback issues ZFS Tutorial USENIX LISA’11 130
  • Sharing 131
  • Sharing • zfs share dataset • Type of sharing set by parameters ✦ shareiscsi = [on | off] ✦ sharenfs = [on | off | options] ✦ sharesmb = [on | off | options] • Shortcut to manage sharing ✦ Uses external services (nfsd, iscsi target, smbshare, etc) ✦ Importing pool will also share ✦ Implementation is OS-specific ✤ sharesmb uses in-kernel SMB server for Solaris-derived OSes ✤ sharesmb uses Samba for FreeBSDZFS Tutorial USENIX LISA’11 132
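  • Example (not from the slides): sharing a hypothetical file system over NFS and SMB through properties; the shares follow the dataset when the pool is imported.
    # zfs set sharenfs=on tank/home
    # zfs set sharesmb=name=home tank/home   # in-kernel SMB server on Solaris-derived OSes
    # zfs get sharenfs,sharesmb tank/home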
  • Properties 133
  • Properties • Properties are stored in an nvlist • By default, properties are inherited • Some properties are common to all datasets, but a specific dataset type may have additional properties • Easily set or retrieved via scripts • In general, properties affect future file system activity zpool get doesn’t script as nicely as zfs get ZFS Tutorial USENIX LISA’11 134
  • Getting Properties• zpool get all poolname• nmc: show volume volumename property propertyname• zpool get propertyname poolname• zfs get all dataset-name• nmc: show folder foldername property• nmc: show zvol zvolname propertyZFS Tutorial USENIX LISA’11 135
  • Setting Properties• zpool set propertyname=value poolname• nmc: setup volume volumename property propertyname• zfs set propertyname=value dataset-name• nmc: setup folder foldername property propertynameZFS Tutorial USENIX LISA’11 136
  • User-defined Properties • Names ✦ Must include colon : ✦ Can contain lower case alphanumerics or “+” “.” “_” ✦ Max length = 256 characters ✦ By convention, module:property ✤ com.sun:auto-snapshot • Values ✦ Max length = 1024 characters • Examples ✦ com.sun:auto-snapshot=true ✦ com.richardelling:important_files=trueZFS Tutorial USENIX LISA’11 137
  • Clearing Properties • Reset to inherited value ✦ zfs inherit compression export/home/relling • Clear user-defined parameter ✦ zfs inherit com.sun:auto-snapshot export/ home/relling • NexentaStor doesn’t offer method in nmcZFS Tutorial USENIX LISA’11 138
  • Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoexpand Policy for expanding when vdev size changes autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool Cache file to use other than /etc/zfs/ cachefile zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policyZFS Tutorial USENIX LISA’11 139
  • More Pool Properties Property Change? Brief Description guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk versionZFS Tutorial USENIX LISA’11 140
  • Common Dataset Properties Property Change? Brief Description available readonly Space available to dataset & children checksum Checksum algorithm compression Compression algorithm Compression ratio – logical compressratio readonly size:referenced physical copies Number of copies of user data creation readonly Dataset creation time dedup Deduplication policy logbias Separate log write policy mlslabel Multilayer security label origin readonly For clones, origin snapshotZFS Tutorial USENIX LISA’11 141
  • More Common Dataset Properties Property Change? Brief Description primarycache ARC caching policy readonly Is dataset in readonly mode? referenced readonly Size of data accessible by this dataset Minimum space guaranteed to a refreservation dataset, excluding descendants (snapshots & clones) Minimum space guaranteed to dataset, reservation including descendants secondarycache L2ARC caching policy sync Synchronous write policy Type of dataset (filesystem, snapshot, type readonly volume)ZFS Tutorial USENIX LISA’11 142
  • More Common Dataset Properties Property Change? Brief Description used readonly Sum of usedby* (see below) usedbychildren readonly Space used by descendants usedbydataset readonly Space used by dataset Space used by a refreservation for usedbyrefreservation readonly this dataset Space used by all snapshots of this usedbysnapshots readonly dataset Is dataset added to non-global zone zoned readonly (Solaris)ZFS Tutorial USENIX LISA’11 143
  • Volume Dataset Properties Property Change? Brief Description shareiscsi iSCSI service (not COMSTAR) volblocksize creation fixed block size volsize Implicit quota Set if dataset delegated to non-global zoned readonly zone (Solaris)ZFS Tutorial USENIX LISA’11 144
  • File System Properties Property Change? Brief Description ACL inheritance policy, when files or aclinherit directories are created ACL modification policy, when chmod is aclmode used atime Disable access time metadata updates canmount Mount policy Filename matching algorithm (CIFS client casesensitivity creation feature) devices Device opening policy for dataset exec File execution policy for dataset mounted readonly Is file system currently mounted?ZFS Tutorial USENIX LISA’11 145
  • More File System Properties Property Change? Brief Description export/ File system should be mounted with non-blocking nbmand import mandatory locks (CIFS client feature)normalization creation Unicode normalization of file names for matching quota Max space dataset and descendants can consume recordsize Suggested maximum block size for files Max space dataset can consume, not including refquota descendants setuid setuid mode policy sharenfs NFS sharing options sharesmb Files system shared with CIFSZFS Tutorial USENIX LISA’11 146
  • File System Properties Property Change? Brief Description snapdir Controls whether .zfs directory is hidden utf8only creation UTF-8 character file name policy vscan Virus scan enabled xattr Extended attributes policyZFS Tutorial USENIX LISA’11 147
  • Forking Properties Pool Properties Release Property Brief Description illumos comment Human-readable comment field Dataset Properties Release Property Brief Description Solaris 11 encryption Dataset encryption Delphix/illumos clones Clone descendants Delphix/illumos refratio Compression ratio for references Solaris 11 share Combines sharenfs & sharesmb Solaris 11 shadow Shadow copy NexentaOS/illumos worm WORM feature Amount of data written since last Delphix/illumos written snapshotZFS Tutorial USENIX LISA’11 148
  • More Goodies 149
  • Dataset Space Accounting • used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation • Lazy updates, may not be correct until txg commits • ls and du will show size of allocated files which includes all copies of a file • Shorthand report available$ zfs list -o spaceNAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILDrpool 126G 18.3G 0 35.5K 0 18.3Grpool/ROOT 126G 15.3G 0 18K 0 15.3Grpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0rpool/dump 126G 1.00G 0 1.00G 0 0rpool/export 126G 37K 0 19K 0 18Krpool/export/home 126G 18K 0 18K 0 0rpool/swap 128G 2G 0 193M 1.81G 0 ZFS Tutorial USENIX LISA’11 150
  • Pool Space Accounting • Pool space accounting changed in b128, along with deduplication • Compression, deduplication, and raidz complicate pool accounting (the numbers are correct, the interpretation is suspect) • Capacity planning for remaining free space can be challenging $ zpool list zwimming NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT zwimming 100G 43.9G 56.1G 43% 1.00x ONLINE -ZFS Tutorial USENIX LISA’11 151
  • zfs vs zpool Space Accounting • zfs list != zpool list • zfs list shows space used by the dataset plus space for internal accounting • zpool list shows physical space available to the pool • For simple pools and mirrors, they are nearly the same • For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space available for parity Users will be confused about reported space availableZFS Tutorial USENIX LISA’11 152
  • NexentaStor Snapshot ServicesZFS Tutorial USENIX LISA’11 153
  • Accessing Snapshots • By default, snapshots are accessible in .zfs directory • Visibility of .zfs directory is tunable via snapdir property ✦ Dont really want find to find the .zfs directory • Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS) # zfs snapshot rpool/export/home/relling@20090415 # ls -a /export/home/relling … .Xsession .xsession-errors # ls /export/home/relling/.zfs shares snapshot # ls /export/home/relling/.zfs/snapshot 20090415 # ls /export/home/relling/.zfs/snapshot/20090415 Desktop Documents Downloads PublicZFS Tutorial USENIX LISA’11 154
  • Time Slider - Automatic Snapshots • Solaris feature similar to OSXs Time Machine • SMF service for managing snapshots • SMF properties used to specify policies: frequency (interval) and number to keep • Creates cron jobs • GUI tool makes it easy to select individual file systems • Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion Service Name Interval (default) Keep (default) auto-snapshot:frequent 15 minutes 4 auto-snapshot:hourly 1 hour 24 auto-snapshot:daily 1 day 31 auto-snapshot:weekly 7 days 4 auto-snapshot:monthly 1 month 12ZFS Tutorial USENIX LISA’11 155
  • Nautilus • File system views which can go back in timeZFS Tutorial USENIX LISA’11 156
  • Resilver & Scrub • Can be read IOPS bound • Resilver can also be bandwidth bound to the resilvering device • Both work at lower I/O scheduling priority than normal work, but that may not matter for read IOPS bound devices • Dueling RFEs: ✦ Resilver should go faster ✦ Resilver should go slower ✤ Integrated in b140ZFS Tutorial USENIX LISA’11 157
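  Scrubs and resilvers are monitored with the same commands, as in this sketch for a hypothetical pool named tank; zpool status reports progress:
  # zpool scrub tank
  # zpool status tank
  # zpool scrub -s tank     (stop a scrub that is hurting production I/O)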
  • Time-based Resilvering
  • Block pointers contain birth txg number
  • Resilvering begins with oldest blocks first
  • Interrupted resilver will still result in a valid file system view
  [diagram: block tree with blocks labeled by birth txg = 27, 68, and 73]
ZFS Tutorial USENIX LISA’11 158
  • ACL – Access Control List • Based on NFSv4 ACLs • Similar to Windows NT ACLs • Works well with CIFS services • Supports ACL inheritance • Change using chmod • View using ls • Some changes in b146 to make behaviour more consistentZFS Tutorial USENIX LISA’11 159
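  A small example of viewing and adding an ACL entry on Solaris; the user webadmin and the file report.txt are hypothetical:
  $ ls -V report.txt
  $ chmod A+user:webadmin:read_data/write_data:allow report.txt
  $ ls -V report.txt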
  • Checksums for Data • DVA contains 256 bits for checksum • Checksum is in the parent, not in the block itself • Types ✦ none ✦ fletcher2: truncated 2nd order Fletcher-like algorithm ✦ fletcher4: 4th order Fletcher-like algorithm ✦ SHA-256 • There are open proposals for better algorithmsZFS Tutorial USENIX LISA’11 160
  • Checksum Use
  Pool          Algorithm                      Notes
  Uberblock     SHA-256                        self-checksummed
  Metadata      fletcher4
  Labels        SHA-256
  Gang block    SHA-256                        self-checksummed
  Dataset       Algorithm                      Notes
  Metadata      fletcher4
  Data          fletcher2, fletcher4 (b114)    zfs checksum parameter
  ZIL log       fletcher2, fletcher4 (b135)    self-checksummed
  Send stream   fletcher4
  Note: ZIL log has additional checking beyond the checksum
ZFS Tutorial USENIX LISA’11 161
  • Compression • Built-in ✦ lzjb, Lempel-Ziv by Jeff Bonwick ✦ gzip, levels 1-9 ✦ zle, zero-length encoding (runs of zeros) • Extensible ✦ new compressors can be added ✦ backwards compatibility issues • Uses taskqs to take advantage of multi-processor systems • Do you have a better compressor in mind? ✦ http://richardelling.blogspot.com/2009/08/justifying-new-compression-algorithms.html Solaris cannot boot from a gzip-compressed root (RFE is open) ZFS Tutorial USENIX LISA’11 162
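  A sketch of turning compression on and checking how well it does, assuming hypothetical datasets tank/data and tank/archive; the compressratio property reports the achieved ratio:
  # zfs set compression=lzjb tank/data
  # zfs get compression,compressratio tank/data
  # zfs set compression=gzip-9 tank/archive     (heavier CPU cost, usually better ratio)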
  • Encryption • Placeholder – details TBD, reported putback in b148 • http://opensolaris.org/os/project/zfs-crypto • Complicated by: ✦ Oracle ✦ Block pointer rewrites ✦ DeduplicationZFS Tutorial USENIX LISA’11 163
  • Quotas • File system quotas ✦ quota includes descendants (snapshots, clones) ✦ refquota does not include descendants • User and group quotas ✦ b114, Solaris 10 10/09 (patch 141444-03 or 141445-03) ✦ Works like refquota, descendants don’t count ✦ Not inherited ✦ zfs userspace and groupspace subcommands show quotas ✤ Users can only see their own and group quota, but can delegate ✦ Managed like properties ✤ [user|group]quota@[UID|username|SID name|SID number] ✤ not visible via zfs get all ZFS Tutorial USENIX LISA’11 164
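  A sketch of the quota flavors, assuming a hypothetical pool tank and user alice:
  # zfs set quota=100G tank/home           (includes snapshots and clones)
  # zfs set refquota=80G tank/home/alice   (excludes descendants)
  # zfs set userquota@alice=10G tank/home
  # zfs userspace tank/home
  # zfs get userquota@alice tank/home      (per-user quotas do not appear in zfs get all)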
  • zpool.cache • Old way ✦ mount / ✦ read /etc/[v]fstab ✦ mount file systems • ZFS ✦ import pool(s) ✦ find mountable datasets and mount them • /etc/zfs/zpool.cache is a cache of pools to be imported at boot time ✦ No scanning of all available LUNs for pools to import ✦ Binary: dump contents with zdb -C ✦ cachefile property permits selecting an alternate zpool.cache ✤ Useful for OS installers ✤ Useful for clusters, where you don’t want a booting node to automatically import pools owned by the other node ZFS Tutorial USENIX LISA’11 165
  • Mounting ZFS File Systems • By default, mountable file systems are mounted when the pool is imported ✦ Controlled by canmount policy (not inherited) ✤ on – (default) file system is mountable ✤ off – file system is not mountable (useful if you want children to be mountable, but not the parent) ✤ noauto – file system must be explicitly mounted (boot environment) • Can zfs set mountpoint=legacy to use /etc/vfstab • By default, cannot mount on top of non-empty directory ✦ Can override explicitly using zfs mount -O or legacy mountpoint • Mount properties are persistent, use zfs mount -o for temporary changes ZFS Tutorial USENIX LISA’11 166
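  A sketch of common mount operations, assuming a hypothetical dataset tank/web:
  # zfs set mountpoint=/export/web tank/web    (persistent)
  # zfs set canmount=noauto tank/web           (must be mounted explicitly)
  # zfs mount tank/web
  # zfs mount -o ro tank/web                   (temporary read-only mount instead)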
  • Solaris Swap and Dump • Swap ✦ Solaris does not have automatic swap resizing ✦ Swap as a separate dataset ✦ Swap device is raw, with a refreservation ✦ Blocksize matched to pagesize: 8 KB SPARC, 4 KB x86 ✦ Don’t really need or want snapshots or clones ✦ Can resize while online, manually • Dump ✦ Only used during crash dump ✦ Preallocated ✦ No refreservation ✦ Checksum off ✦ Compression off (dumps are already compressed) ZFS Tutorial USENIX LISA’11 167
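  A sketch of adding a swap volume and pointing dump at a zvol, assuming an rpool on x86; the volume name swap2 is hypothetical, and SPARC would use -b 8k:
  # zfs create -V 2G -b 4k rpool/swap2
  # swap -a /dev/zvol/dsk/rpool/swap2
  # dumpadm -d /dev/zvol/dsk/rpool/dump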
  • recordsize • Dynamic ✦ Max: 128 kBytes ✦ Min: sector size (usually 512 bytes) ✦ Power of 2 • For most workloads, don’t worry about it • For fixed size workloads, can set to match workloads ✦ Databases ✦ iSCSI Zvols serving NTFS or ext3 (use 4 or 8 KB) • File systems or Zvols • zfs set recordsize=8k dataset ZFS Tutorial USENIX LISA’11 168
  • Delegated Administration • Fine grain control ✦ users or groups of users ✦ subcommands, parameters, or sets • Similar to Solaris Role Based Access Control (RBAC) • Enable/disable at the pool level ✦ zpool set delegation=on mypool (default) • Allow/unallow at the dataset level ✦ zfs allow relling snapshot mypool/relling ✦ zfs allow @backupusers snapshot,send mypool/sw ✦ zfs allow mypool/rellingZFS Tutorial USENIX LISA’11 169
  • Delegation Inheritance • Beware of inheritance • Local ✦ zfs allow -l username snapshot mypool • Local + descendants ✦ zfs allow -d username mount mypool Make sure permissions are set at the correct levelZFS Tutorial USENIX LISA’11 170
  • Delegatable Subcommands • allow • receive • clone • rename • create • rollback • destroy • send • groupquota • share • groupused • snapshot • mount • userquota • promote • userusedZFS Tutorial USENIX LISA’11 171
  • Delegatable Parameters• aclinherit • nbmand • sharesmb• aclmode • normalization • snapdir• atime • quota • userprop• canmount • readonly • utf8only• casesensitivity • recordsize • version• checksum • refquota • volsize• compression • refreservation • vscan• copies • reservation • xattr• devices • setuid • zoned• exec • shareiscsi• mountpoint • sharenfsZFS Tutorial USENIX LISA’11 172
  • Nondelegatable Parameters • mlslabel ZFS Tutorial USENIX LISA’11 173
  • Browser User Interface • Solaris 10 – WebConsole • Nexenta • OpenStorageZFS Tutorial USENIX LISA’11 174
  • Solaris WebConsoleZFS Tutorial USENIX LISA’11 175
  • Solaris WebConsoleZFS Tutorial USENIX LISA’11 176
  • NexentaStorZFS Tutorial USENIX LISA’11 177
  • Oracle’s Sun OpenStorageZFS Tutorial USENIX LISA’11 178
  • Performance 179
  • General Comments • In general, performs well out of the box • Standard performance improvement techniques apply • Lots of DTrace knowledge available • Typical areas of concern: ✦ ZIL ✤ check with zilstat, improve with slogs ✦ COW “fragmentation” ✤ check iostat, improve with L2ARC ✦ Memory consumption ✤ check with arcstat ✤ set primarycache property ✤ can be capped ✤ can compete with large page aware apps ✦ Compression, or lack thereofZFS Tutorial USENIX LISA’11 180
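  If the ARC needs to be capped (for example, to leave room for applications that want large pages), one hedged approach on Solaris is the zfs_arc_max tunable in /etc/system; the 4 GB value below is arbitrary and takes effect after a reboot:
  set zfs:zfs_arc_max = 0x100000000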
  • Hybrid Storage Pool
  Adaptive Replacement Cache (ARC)
                            separate intent log            Main Pool              Level 2 ARC
  Device                    Write-optimized device (SSD)   HDD                    Read-optimized device (SSD)
  Size (GBytes)             1 - 10 GByte                   large                  big
  Cost                      write iops/$                   size/$                 size/$
  Use                       sync writes                    persistent storage     read cache
  Performance optimization  low-latency writes             secondary              low-latency reads
  Need more speed?          stripe                         more, faster devices   stripe
ZFS Tutorial USENIX LISA’11 181
  • Storage Architecture
  [diagram: Log / Main Pool / L2ARC device choices for Minimal, Good, Better, and Best tiers: an HDD-only pool at the low end, adding SSD log and L2ARC devices, HDD mirrors, raidz/raidz2/raidz3, and striped mirrors as the tiers improve]
November 8, 2010 USENIX LISA’10 182
  • Storage Architecture
  [diagram continued: Better and Best tiers as above, plus Extreme (all-SSD raidz2/raidz3 main pool) and Over the Top (a stripe of SSD/HDD mirrors), with SSD log and L2ARC devices]
November 8, 2010 USENIX LISA’10 183
  • ZIL Performance : NFS • Big performance increases demonstrated ✦ especially with SSDs ✦ for RAID arrays with nonvolatile RAM cache, not so much • NFS servers ✦ 32kByte threshold (zfs_immediate_write_sz) also corresponds to NFSv3 write size ✤ May cause more work than needed ✤ See CR6686887 ✦ Use fast, separate log deviceZFS Tutorial USENIX LISA’11 184
  • iSCSI Performance Hiccup • Prior to b107, iSCSI over Zvols didn’t properly handle sync writes • b107-b113, iSCSI over Zvols made all writes sync (read: slow) ✦ Workaround: set write cache enable in the iSCSI target, see CR6770534 ✦ OpenSolaris 2009.06 is b111 • b114, write cache enable works automatically for iSCSI over Zvol • NexentaStor ✦ not affected by hiccup ✦ easy management of iSCSI performance tunables ZFS Tutorial USENIX LISA’11 185
  • ARC Size • The ARC maximum size of 7/8 memory or memory - 1 GB is effectively 7/8 memory • For storage servers, this can be too conservative • If RAM > 8 GB, consider changing swapfs limits ✦ swapfs_minfree limit is 1/8 memory, by default ✦ units are pages (4 KB for x86, 8 KB for SPARC) ✦ set swapfs_minfree=8192ZFS Tutorial USENIX LISA’11 186
  • ZIL Performance : Databases • The logbias property can be set on a dataset to control the threshold for writing to the pool when a slog is used ✦ logbias=latency (default) all writes go to slog ✦ logbias=throughput all writes go to pool ✦ Settable on-the-fly ✤ Consider changing policy during database loads • Can have different sync policies for logs and data ✦ Oracle, separate latency-sensitive redo log traffic from data file traffic: ✤ Redo logs: logbias=latency ✤ Indexes: logbias=latency ✤ Data files: logbias=throughput ✦ MySQL with InnoDB ✤ logbias=latency ZFS Tutorial USENIX LISA’11 187
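  A sketch of applying those policies, using hypothetical datasets for an Oracle layout:
  # zfs set logbias=latency tank/db/redo
  # zfs set logbias=latency tank/db/index
  # zfs set logbias=throughput tank/db/data
  # zfs get logbias tank/db/data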
  • More ZIL Performance : Databases
  • I/O size inflation
    ✦ Once a file grows to use a block size, it will keep that block size
      ✤ Block size is capped by recordsize
      ✤ recordsize is a power of 2: 512 bytes, 1 KB, 2 KB, 4 KB, ... 128 KB
    ✦ Can be inefficient if the workload is sync and writes variable sized data
  • Oracle performance work: Roch reports 40% improvement for JBOD (HDD) + separate log (SSD) with:
  File system or Zvol Role   recordsize         logbias
  data files                 8 KB               throughput
  redo logs                  128 KB (default)   latency (default)
  indices                    8-32 KB?           latency (default)
ZFS Tutorial USENIX LISA’11 188
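  The table above can be expressed at dataset creation time roughly as follows; the dataset names are hypothetical, and the redo log dataset keeps the 128 KB default recordsize:
  # zfs create -o recordsize=8k -o logbias=throughput tank/oradata
  # zfs create tank/oraredo
  # zfs create -o recordsize=16k tank/oraindex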
  • Network Architecture
  [diagram: NAS network clients and server configurations for Minimal, Good, Better (HA cluster), and Best (HA cluster)]
November 8, 2010 USENIX LISA’10
  • Compression Performance • Metadata – you won’t notice • Data ✦ LZJB is barely noticeable ✦ gzip-9 can be very noticeable • Geriatric hardware ??? ZFS Tutorial USENIX LISA’11 190
  • vdev Cache
  • vdev cache occurs at the SPA level
    ✦ readahead
    ✦ 10 MBytes per vdev
    ✦ only caches metadata (b70 or later)
  • Stats collected as Solaris kstats
 # kstat -n vdev_cache_stats
 module: zfs    instance: 0
 name: vdev_cache_stats    class: misc
        crtime        38.83342625
        delegations   14030
        hits          105169
        misses        59452
        snaptime      4564628.18130739
  Hit rate = 59%, not bad...
ZFS Tutorial USENIX LISA’11 191
  • Intelligent Prefetching • Intelligent file-level prefetching occurs at the DMU level • Feeds the ARC • In a nutshell, prefetch hits cause more prefetching ✦ Read a block, prefetch a block ✦ If we used the prefetched block, read 2 more blocks ✦ Up to 256 blocks • Recognizes strided reads ✦ 2 sequential reads of same length and a fixed distance will be coalesced • Fetches backwards • Seems to work pretty well, as-is, for most workloadsZFS Tutorial USENIX LISA’11 192
  • Unintelligent Prefetch? • Some workloads don’t do so well with intelligent prefetch ✦ CR6859997, zfs caching performance problem, fixed in NV b124 • Look for time spent in zfetch_* functions using lockstat ✦ lockstat -I sleep 10 • Easy to disable in mdb for testing on Solaris ✦ echo zfs_prefetch_disable/W0t1 | mdb -kw • Re-enable with ✦ echo zfs_prefetch_disable/W0t0 | mdb -kw • Set via /etc/system ✦ set zfs:zfs_prefetch_disable = 1 ZFS Tutorial USENIX LISA’11 193
  • Impedance Matching • RAID arrays & columns • format label offsets ✦ Older Solaris starting block = 34 ✦ Newer Solaris starting block = 256 • Verify with format or prtvtocZFS Tutorial USENIX LISA’11 194
  • I/O Queues • By default, up to 35 I/Os are queued to each vdev ✦ b127 changes default to 4-10, limit based on response ✦ Upper limit tunable with zfs_vdev_max_pending, set to 10 with: ✦ echo zfs_vdev_max_pending/W0t10 | mdb -kw • Implies that more vdevs is better ✦ Consider avoiding RAID array with a single, large LUN ✦ Beware: ZFS doesn’t know two vdevs can be on same disk • ZFS I/O scheduler loses control once iops are queued ✦ CR6471212 proposes reserved slots for high-priority iops • May need to match queues for the entire data path ✦ zfs_vdev_max_pending ✦ Fibre channel, SCSI, SAS, SATA driver ✦ RAID array controller • Fast disks → small queues, slow disks → larger queuesZFS Tutorial USENIX LISA’11 195
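  For a persistent version of the mdb change above, the same tunable can be set in /etc/system on releases that still expose zfs_vdev_max_pending (an assumption worth verifying on your build):
  set zfs:zfs_vdev_max_pending = 10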
  • Slow disk?
  • What is going on here?
                    extended device statistics
    r/s    w/s     kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
 5948.9  349.3  40322.3  5238.1   0.1  16.7     0.0     2.7   0  330  c9
    3.7    0.0    230.7     0.0   0.0   0.1     0.0    13.5   0    2  c9t1d0
  845.0    0.0   5497.4     0.0   0.0   0.9     0.0     1.1   1   32  c9t2d0
    3.8    0.0    230.7     0.0   0.0   0.0     0.0    10.6   0    1  c9t3d0
  845.2    0.0   5495.4     0.0   0.0   0.9     0.0     1.1   1   32  c9t4d0
    3.8    0.0    237.1     0.0   0.0   0.0     0.0    10.4   0    1  c9t5d0
  841.4    0.0   5519.7     0.0   0.0   0.9     0.0     1.1   1   32  c9t6d0
    3.8    0.0    237.3     0.0   0.0   0.0     0.0     9.2   0    1  c9t7d0
  843.5    0.0   5485.2     0.0   0.0   0.9     0.0     1.1   1   31  c9t8d0
    3.7    0.0    230.8     0.0   0.0   0.1     0.0    15.2   0    2  c9t9d0
  850.2    0.0   5488.6     0.0   0.0   0.9     0.0     1.1   1   31  c9t10d0
    3.1    0.0    211.2     0.0   0.0   0.0     0.0    13.2   0    1  c9t11d0
  847.9    0.0   5523.4     0.0   0.0   0.9     0.0     1.1   1   31  c9t12d0
    3.1    0.0    204.9     0.0   0.0   0.0     0.0     9.6   0    1  c9t13d0
  847.2    0.0   5506.0     0.0   0.0   0.9     0.0     1.1   1   31  c9t14d0
    3.4    0.0    224.1     0.0   0.0   0.0     0.0    12.3   0    1  c9t15d0
    0.0  349.3      0.0  5238.1   0.0   9.9     0.0    28.4   1  100  c9t16d0
ZFS Tutorial USENIX LISA’11 196
  • Slow disk?
  NAME               STATE     READ WRITE CKSUM
  tank               DEGRADED     0     0     0
    raidz2-0         ONLINE       0     0     0
      c9t1d0         ONLINE       0     0     0
      c9t3d0         ONLINE       0     0     0
      c9t5d0         ONLINE       0     0     0
      c9t7d0         ONLINE       0     0     0
      c9t9d0         ONLINE       0     0     0
      c9t11d0        ONLINE       0     0     0
      c9t13d0        ONLINE       0     0     0
      c9t15d0        ONLINE       0     0     0
    raidz2-1         DEGRADED     0     0     0
      c9t2d0         ONLINE       0     0     0
      c9t4d0         ONLINE       0     0     0
      c9t6d0         ONLINE       0     0     0
      c9t8d0         ONLINE       0     0     0
      c9t10d0        ONLINE       0     0     0
      c9t12d0        ONLINE       0     0     0
      c9t14d0        ONLINE       0     0     0
      replacing-7    DEGRADED     0     0     0
        c9t16d0s0/o  FAULTED      0     0     0  corrupted data
        c9t16d0      ONLINE       0     0     0  1.28G resilvered
ZFS Tutorial USENIX LISA’11 197
  • Slow disk?
  • IOPS bound, big time
  • zfs_vdev_max_pending too big?
                    extended device statistics
    r/s    w/s     kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
  428.0  103.8  27361.9    65.9   0.0  14.4     0.0    27.1   0  446  c9
   28.5    0.0   1824.1     0.0   0.0   0.3     0.0    10.6   0   23  c9t1d0
   28.7    0.0   1836.9     0.0   0.0   0.3     0.0     9.6   0   23  c9t2d0
   28.2    0.0   1804.9     0.0   0.0   0.3     0.0    10.6   0   23  c9t3d0
   28.5    0.0   1824.1     0.0   0.0   0.3     0.0    10.6   0   24  c9t4d0
   28.5    0.0   1824.1     0.0   0.0   0.3     0.0    10.8   0   23  c9t5d0
   28.3    0.0   1805.0     0.0   0.0   0.3     0.0    10.1   0   22  c9t6d0
   28.8    0.0   1843.3     0.0   0.0   0.3     0.0    10.8   0   23  c9t7d0
   29.2    0.0   1862.6     0.0   0.0   0.3     0.0    10.0   0   23  c9t8d0
   28.0    0.0   1792.1     0.0   0.0   0.3     0.0    10.7   0   23  c9t9d0
   28.3    0.0   1805.0     0.0   0.0   0.3     0.0     9.8   0   21  c9t10d0
   28.3    0.0   1811.3     0.0   0.0   0.3     0.0    10.8   0   24  c9t11d0
   29.2    0.0   1862.6     0.0   0.0   0.3     0.0     9.8   0   23  c9t12d0
   27.8    0.0   1779.3     0.0   0.0   0.3     0.0    10.2   0   23  c9t13d0
   29.1    0.0   1856.2     0.0   0.0   0.3     0.0     9.8   0   23  c9t14d0
   28.6    0.0   1830.5     0.0   0.0   0.3     0.0    10.9   0   23  c9t15d0
    0.0  103.8      0.0    65.9   0.0  10.0     0.0    96.2   0  100  c9t16d0
ZFS Tutorial USENIX LISA’11 198
  • COW Penalty • COW can negatively affect workloads which have updates and sequential reads ✦ Initial writes will be sequential ✦ Updates (writes) will cause seeks to read data • Lots of people seem to worry a lot about this • Only affects HDDs • Very difficult to speculate about the impact on real-world apps ✦ Large sequential scans of random data hurt anyway ✦ Reads are cached in many places in the data path ✦ Databases can COW, too • Sysbench benchmark used to test on MySQL w/InnoDB engine ✦ One hour read/write testZFS Tutorial USENIX LISA’11 199
  • COW Penalty Performance seems to level at about 25% penalty Results compliments of Allan Packer & Neelakanth Nadgir http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdfZFS Tutorial USENIX LISA’11 200
  • About Disks...
  • Disks still the most important performance bottleneck
    ✦ Modern processors are multi-core
    ✦ Default checksums and compression are computationally efficient
  Disk     Size   RPM     Max Size (GBytes)   Average Rotational Latency (ms)   Average Seek (ms)
  HDD      2.5”   5,400   500                 5.5                               11
  HDD      3.5”   5,900   2,000               5.1                               16
  HDD      3.5”   7,200   1,500               4.2                               8 - 8.5
  HDD      2.5”   10,000  300                 3                                 4.2 - 4.6
  HDD      2.5”   15,000  146                 2                                 3.2 - 3.5
  SSD (w)  2.5”   N/A     800                 0                                 0.02 - 0.25
  SSD (r)  2.5”   N/A     1,000               0                                 0.02 - 0.15
ZFS Tutorial USENIX LISA’11 201
  • Where is my disk?ZFS Tutorial USENIX LISA’11 202
  • DirectIO
  • UFS forcedirectio option brought the early 1980s design of UFS up to the 1990s
  • ZFS designed to run on modern multiprocessors
  • Databases or applications which manage their data cache may benefit by disabling file system caching
  • Expect L2ARC to improve random reads (secondarycache)
  • Prefetch disabled by primarycache=none|metadata
  UFS DirectIO                   ZFS
  Unbuffered I/O                 primarycache=metadata, primarycache=none
  Concurrency                    Available at inception
  Improved Async I/O code path   Available at inception
ZFS Tutorial USENIX LISA’11 203
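  A sketch of the ZFS knobs that approximate directio-style behavior, for a hypothetical dataset tank/db:
  # zfs set primarycache=metadata tank/db    (cache only metadata in the ARC)
  # zfs set primarycache=none tank/db        (or bypass the ARC for this dataset entirely)
  # zfs set secondarycache=none tank/db      (keep the L2ARC for other datasets)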
  • RAID-Z Bandwidth • Traditional RAID-Z had a “mind the gap” feature • Impacts possible bandwidth • Mirrors could show higher bandwidth • Now RAID-Z shows better bandwidth, when channel bandwidth is the constrained resourceZFS Tutorial USENIX LISA’11 204
  • Troubleshooting 205
  • Checking Status • zpool status • zpool status -v • Solaris ✦ fmadm faulty ✦ fmdump ✦ fmdump -ev or fmdump -eV ✦ format or rmformatZFS Tutorial USENIX LISA’11 206
  • Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free ZFS Tutorial USENIX LISA’11 207
  • Copy on Write What happens if the uberblock is updated prior to leaves? 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free ZFS Tutorial USENIX LISA’11 208
  • What if flush is ignored? • Some devices ignore cache flush commands (!) ✦ Virtualization defaults can ignore flushes: VirtualBox, others? ✦ Some USB/Firewire to IDE/SATA converters • Problem: uberblock could be updated before leaves • Symptom: can’t import pool, uberblock points to random data • Affected systems ✦ Many OSes and file systems ✦ Laptops - rarely because of battery ✦ Enterprise-class systems - rarely because of power redundancy and solid design ✦ Desktops - more frequently ZFS Tutorial USENIX LISA’11 209
  • What if flush is ignored? • Solution ✦ Check integrity of recent transaction groups ✦ If damaged, rollback to older uberblock ✤ zpool import -F ✦ Best recovery methods in later releasesZFS Tutorial USENIX LISA’11 210
  • Can’t Import Pool? • Check device paths with zpool import ✦ Be aware of /etc/zfs/zpool.cache ✦ May need the zpool import -d directory option ✦ “phantom paths”? • Check for 4 labels ✦ zdb -l /dev/dsk/c0t0d0s0 • Try rollback import ✦ zpool import -F poolname Beware of device short names: c0d0 != c0d0s0 ZFS Tutorial USENIX LISA’11 211
  • Slow Pool Import? • Case: zvols with snapshots ✦ Symptom: reboot or zpool import is really slllooooowwwwwww... ✦ Cause: inefficient iteration over all zvols when creating entries in /dev/zvol/dsk ✦ Cure: CR6761786 integrated in b125 • Case: system is iSCSI initiator ✦ Symptom: format or zpool import waits for tens of seconds to start ✦ Cause: iSCSI target presents in-band management LUs that are slow to respond to SCSI inquiry ✦ Cure: map iSCSI targets so that in-band management is not presented to ZFS hosts ZFS Tutorial USENIX LISA’11 212
  • File System Mounts B0rken? • Prevention ✦ Avoid complex hierarchies (KISS) ✦ Be aware of legacy mounts ✦ Be aware of alternate boot environments (Solaris) • Check mountpoint properties ✦ zfs list -o name,mountpoint • Shared file systems ✦ Be aware of inherited shares ✦ Some clients do not mirror mount (Linux) ✦ NFS version differences? ✦ Check name services ZFS Tutorial USENIX LISA’11 213
  • Can’t Boot? • Check if BIOS/OBP supports booting from device • Make sure LUN has SMI label, not EFI ✦ Common mistake when mirroring root ✦ OK: zpool attach rpool c0t0d0s0 c0t1d0s7 ✦ Not OK: zpool attach rpool c0t0d0s0 c0t1d0 • installboot? • grub issues ✦ Boot environments usually handled by grub ✦ Check grub menu.lst • Know how to do a failsafe boot • Be aware of LiveCD import • Be aware of zpool.cache interactions ZFS Tutorial USENIX LISA’11 214
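  For reference, the commands that put boot blocks on the second half of a mirrored root on Solaris-era releases; the disk names are hypothetical, and the slice (not the whole disk) must be used:
  # installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t1d0s0    (SPARC)
  # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0                   (x86)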
  • Stuck at grub> Prompt? • Latent fault • Typical scenario ✦ Upgrade NexentaStor ✦ Go to expert mode shell ✦ # zpool upgrade syspool ✦ ... wait a few months ... ✦ Reboot ✦ grub> • What happened? ✦ grub has a miniature ZFS implementation that can’t grok the new syspool version, and the menu.lst file lives in the syspool • Recovery ✦ Boot CD ZFS Tutorial USENIX LISA’11 215
  • Future Plans • Stay tuned...ZFS Tutorial USENIX LISA’11 216
  • Now you know... • ZFS structure: pools, datasets • Data redundancy: mirrors, RAIDZ, copies • Data verification: checksums • Data replication: snapshots, clones, send, receive • Hybrid storage: separate logs, cache devices, ARC • Security: allow, deny, encryption • Resource management: quotas, references, I/O scheduler • Performance: latency, COW, zilstat, arcstat, logbias, recordsize • Troubleshooting: FMA, zdb, importance of cache flushesZFS Tutorial USENIX LISA’11 217
  • Thank You! Questions?Richard.Elling@RichardElling.com Richard.Elling@Nexenta.com 218