USENIX LISA11 Tutorial: ZFS
  1. 1. ZFS: A File System for Modern (USENIX LISA'11 Conference, December 2011)
  2. 2. Agenda • Overview • Foundations • Pooled Storage Layer • Transactional Object Layer • ZFS commands • Sharing • Properties • Other goodies • Performance • Troubleshooting
  3. 3. ZFS History • Announced September 14, 2004 • Integration history ✦ SXCE b27 (November 2005) ✦ FreeBSD (April 2007) ✦ Mac OS X Leopard ✤ Preview shown, but removed from Snow Leopard ✤ Disappointed community reformed as the zfs-macos Google group (Oct 2009) ✦ OpenSolaris 2008.05 ✦ Solaris 10 6/06 (June 2006) ✦ Linux FUSE (summer 2006) ✦ greenBytes ZFS+ (September 2008) ✦ Linux native port funded by the US DOE (2010) • More than 45 patents, contributed to the CDDL Patents Common
  4. 4. ZFS Design Goals • Figure out why storage has gotten so complicated • Blow away 20+ years of obsolete assumptions • Gotta replace UFS • Design an integrated system from scratch • End the suffering
  5. 5. Limits • 2^48 — Number of entries in any individual directory • 2^56 — Number of attributes of a file [*] • 2^56 — Number of files in a directory [*] • 16 EiB (2^64 bytes) — Maximum size of a file system • 16 EiB — Maximum size of a single file • 16 EiB — Maximum size of any attribute • 2^64 — Number of devices in any pool • 2^64 — Number of pools in a system • 2^64 — Number of file systems in a pool • 2^64 — Number of snapshots of any file system • 256 ZiB (2^78 bytes) — Maximum size of any pool [*] actually constrained to 2^48 for the number of files in a ZFS file system
  6. 6. Understanding Builds • Build is often referenced when speaking of feature/bug integration • Short-hand notation: b### • Distributions derived from Solaris NV (Nevada) ✦ NexentaStor ✦ Nexenta Core Platform ✦ SmartOS ✦ Solaris 11 (née OpenSolaris) ✦ OpenIndiana ✦ StormOS ✦ BelleniX ✦ SchilliX ✦ MilaX • OpenSolaris builds ✦ Binary builds died at b134 ✦ Source releases continued through b147 • illumos stepping up to fill the void left by OpenSolaris' demise
  7. 7. Community Links • Community links • ZFS Community • IRC channel #zfs
  8. 8. ZFS Foundations 8
  9. 9. Overhead View of a Pool Pool File System Configuration Information Volume File System Volume DatasetZFS Tutorial USENIX LISA’11 9
  10. 10. Hybrid Storage Pool • Adaptive Replacement Cache (ARC) in main memory, plus three classes of pool devices:
      Separate intent log: write-optimized device (SSD); size: 1 - 10 GByte; cost: write iops/$; use: sync writes; optimization: low-latency writes; need more speed? stripe more, faster devices
      Main pool: HDDs; size: large; cost: size/$; use: persistent storage; need more speed? stripe
      Level 2 ARC: read-optimized device (SSD); size: big; cost: size/$; use: read cache; optimization: low-latency reads
  11. 11. Layer View [diagram] Consumers (raw, swap, dump, iSCSI) sit atop the ZFS Volume Emulator (Zvol); consumers (NFS, CIFS) sit atop the ZFS POSIX Layer (ZPL); pNFS and Lustre are other potential consumers; all rest on the Transactional Object Layer, the Pooled Storage Layer, and the Block Device Driver (HDD, SSD, iSCSI)
  12. 12. Source Code Structure [diagram] Interface Layer: ZPL, ZVol, and /dev/zfs (file system, device, and management consumers via libzfs); Transactional Object Layer: ZIL, ZAP, Traversal, DMU, DSL; Pooled Storage Layer: ARC, ZIO, VDEV, Configuration
  13. 13. Acronyms • ARC – Adaptive Replacement Cache • DMU – Data Management Unit • DSL – Dataset and Snapshot Layer • JNI – Java Native Interface • ZPL – ZFS POSIX Layer (traditional file system interface) • VDEV – Virtual Device • ZAP – ZFS Attribute Processor • ZIL – ZFS Intent Log • ZIO – ZFS I/O layer • Zvol – ZFS volume (raw/cooked block device interface)ZFS Tutorial USENIX LISA’11 13
  14. 14. NexentaStor Rosetta Stone NexentaStor OpenSolaris/ZFS Volume Storage pool ZVol Volume Folder File systemZFS Tutorial USENIX LISA’11 14
  15. 15. nvlists • name=value pairs • libnvpair(3LIB) • Allows ZFS capabilities to change without changing the physical on-disk format • Data stored is XDR encoded • A good thing, used oftenZFS Tutorial USENIX LISA’11 15
  16. 16. Versioning • Features can be added and identified by nvlist entries • Changes in pool or dataset versions do not change the physical on-disk format (!) ✦ does change nvlist parameters • Older versions can be used ✦ might see warning messages, but harmless • Available versions and features can be easily viewed ✦ zpool upgrade -v ✦ zfs upgrade -v • Online references for zpool and zfs versions • Don't confuse zpool and zfs versions
  17. 17. zpool Versions
      VER  DESCRIPTION
        1  Initial ZFS version
        2  Ditto blocks (replicated metadata)
        3  Hot spares and double parity RAID-Z
        4  zpool history
        5  Compression using the gzip algorithm
        6  bootfs pool property
        7  Separate intent log devices
        8  Delegated administration
        9  refquota and refreservation properties
       10  Cache devices
       11  Improved scrub performance
       12  Snapshot properties
       13  snapused property
       14  passthrough-x aclinherit support
      Continued...
  18. 18. More zpool Versions
      VER  DESCRIPTION
       15  user/group space accounting
       16  stmf property support
       17  Triple-parity RAID-Z
       18  snapshot user holds
       19  Log device removal
       20  Compression using zle (zero-length encoding)
       21  Deduplication
       22  Received properties
       23  Slim ZIL
       24  System attributes
       25  Improved scrub stats
       26  Improved snapshot deletion performance
       27  Improved snapshot creation performance
       28  Multiple vdev replacements
      For Solaris 10, version 21 is “reserved”
  19. 19. zfs Versions
      VER  DESCRIPTION
        1  Initial ZFS filesystem version
        2  Enhanced directory entries
        3  Case insensitive and File system unique identifier (FUID)
        4  userquota, groupquota properties
        5  System attributes
  20. 20. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update uberblocks & free
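The four COW steps on the slide can be sketched in a few lines. This is an illustrative model only (a toy block tree, not ZFS source): an update never overwrites a block in place; new blocks are written and the parent chain is copied up to the root, so the old tree remains intact and shareable with snapshots.

```python
# Toy model of copy-on-write block-tree updates (illustrative, not ZFS code).

class Block:
    def __init__(self, data=None, children=()):
        self.data = data
        self.children = list(children)

def cow_update(root, path, new_data):
    """Return a NEW root; blocks on `path` are copied, the rest are shared."""
    if not path:
        return Block(data=new_data, children=root.children)
    i = path[0]
    new_children = list(root.children)        # share untouched children
    new_children[i] = cow_update(root.children[i], path[1:], new_data)
    return Block(data=root.data, children=new_children)

old_root = Block(children=[Block(data="a"), Block(data="b")])
new_root = cow_update(old_root, [0], "a2")

assert old_root.children[0].data == "a"               # old tree untouched
assert new_root.children[0].data == "a2"              # new tree sees the update
assert new_root.children[1] is old_root.children[1]   # unchanged block is shared
```

Because the old root stays valid until the uberblock is switched, step 4 ("update uberblocks & free") is what atomically publishes the new tree.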
  21. 21. COW Notes• COW works on blocks, not files• ZFS reserves 32 MBytes or 1/64 of pool size ✦ COWs need some free space to remove files ✦ need space for ZIL• For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched• Spatial distribution is good fodder for performance speculation ✦ affects HDDs ✦ moot for SSDs ZFS Tutorial USENIX LISA’11 21
  22. 22. To fsck or not to fsck • fsck was created to fix known inconsistencies in file system metadata ✦ UFS is not transactional ✦ metadata inconsistencies must be reconciled ✦ does NOT repair data – how could it? • ZFS doesn't need fsck, as-is ✦ all on-disk changes are transactional ✦ COW means previously existing, consistent metadata is not overwritten ✦ ZFS can repair itself ✤ metadata is at least dual-redundant ✤ data can also be redundant • Reality check – this does not mean that ZFS is not susceptible to corruption ✦ nor is any other file system
  23. 23. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ??November 8, 2010 USENIX LISA’10 23
  24. 24. vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type = disk type = disk type = disk type = disk children[0] children[0] children[0] children[0] Physical or leaf vdevsZFS Tutorial USENIX LISA’11 24
  25. 25. vdev Labels • vdev labels != disk labels • Four 256 kByte labels written to every physical vdev • Two-stage update process ✦ write label0 & label2 ✦ flush cache & check for errors ✦ write label1 & label3 ✦ flush cache & check for errors • [diagram] Device layout: label0 at offset 0, label1 at 256k, boot block from 512k to 4M, label2 at N-512k and label3 at N-256k, where N = device size rounded down to a multiple of 256k • Label layout: blank header (0-8k), boot header (8k-16k), name=value pairs (16k-128k), M-slot uberblock array (128k-256k), where M = 128k / MAX(1k, sector size)
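The label placement described above can be computed directly. A small sketch, assuming the layout on the slide (two 256 KiB labels at the front, two at the back, back pair aligned to a 256 KiB boundary); offsets here are derived from the slide, not from the on-disk format specification:

```python
# Sketch: offsets of the four vdev labels for a device of a given size.
K = 1024
LABEL = 256 * K

def label_offsets(dev_size):
    # usable size rounded down to a 256 KiB multiple (the "N" in the diagram)
    n = dev_size - (dev_size % LABEL)
    return [0, LABEL, n - 2 * LABEL, n - LABEL]

offs = label_offsets(500 * 1024 * K)   # a hypothetical ~500 MiB device
assert offs[0] == 0 and offs[1] == 256 * K
assert offs[3] - offs[2] == 256 * K    # label2 and label3 are adjacent at the end
```

Placing copies at both ends is why a truncated or partially overwritten device can still be identified from its surviving labels.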
  26. 26. Observing Labels
      # zdb -l /dev/rdsk/c0t0d0s0
      --------------------------------------------
      LABEL 0
      --------------------------------------------
          version=14
          name=rpool
          state=0
          txg=13152
          pool_guid=17111649328928073943
          hostid=8781271
          hostname=
          top_guid=11960061581853893368
          guid=11960061581853893368
          vdev_tree
              type=disk
              id=0
              guid=11960061581853893368
              path=/dev/dsk/c0t0d0s0
              devid=id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a
              phys_path=/pci@0,0/pci1458,b002@11/disk@0,0:a
              whole_disk=0
              metaslab_array=24
              metaslab_shift=30
              ashift=9
              asize=157945167872
              is_log=0
  27. 27. Uberblocks • Sized based on minimum device block size • Stored in 128-entry circular queue • Only one uberblock is active at any time ✦ highest transaction group number ✦ correct SHA-256 checksum • Stored in the machine's native format ✦ a magic number is used to determine endian format when imported • Contains pointer to the Meta Object Set (MOS)
      Device Block Size   Uberblock Size   Queue Entries
      512 bytes, 1 KB     1 KB             128
      2 KB                2 KB             64
      4 KB                4 KB             32
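The active-uberblock rule above (highest txg with a correct checksum) is easy to sketch. The structures here are hypothetical stand-ins, not the on-disk format; only the selection logic mirrors the slide:

```python
# Sketch: pick the active uberblock from the ring of slots.
import hashlib

def checksum(payload: bytes) -> str:
    # ZFS also uses SHA-256 for uberblock validation
    return hashlib.sha256(payload).hexdigest()

def active_uberblock(slots):
    """Highest-txg slot whose checksum verifies; None if nothing is valid."""
    valid = [s for s in slots
             if s is not None and s["cksum"] == checksum(s["payload"])]
    return max(valid, key=lambda s: s["txg"], default=None)

def make(txg, corrupt=False):
    payload = f"uberblock txg={txg}".encode()
    ck = "0" * 64 if corrupt else checksum(payload)
    return {"txg": txg, "payload": payload, "cksum": ck}

slots = [make(10), make(12, corrupt=True), make(11), None]
assert active_uberblock(slots)["txg"] == 11   # txg 12 fails its checksum
```

This is also why a torn write to the newest uberblock is harmless: the previous valid entry in the ring simply wins.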
  28. 28. About Sizes • Sizes are dynamic • LSIZE = logical size • PSIZE = physical size after compression • ASIZE = allocated size including: ✦ physical size ✦ raidz parity ✦ gang blocks Old notions of size reporting confuse peopleZFS Tutorial USENIX LISA’11 28
  29. 29. VDEVZFS Tutorial USENIX LISA’11 29
  30. 30. Dynamic Striping • RAID-0 ✦ SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern • Dynamic Stripe ✦ Data is dynamically mapped to member disks ✦ No fixed-length sequences ✦ Allocate up to ~1 MByte/vdev before changing vdev ✦ vdevs can be different size ✦ Good combination of the concatenation feature with RAID-0 performanceZFS Tutorial USENIX LISA’11 30
  31. 31. Dynamic Striping [diagram comparing RAID-0 (column size = 128 kBytes, stripe width = 384 kBytes) with a ZFS dynamic stripe (recordsize = 128 kBytes); total write size = 2816 kBytes]
  32. 32. Mirroring • Straightforward: put N copies of the data on N vdevs • Unlike RAID-1 ✦ No 1:1 mapping at the block level ✦ vdev labels are still at beginning and end ✦ vdevs can be of different size ✤ effective space is that of smallest vdev • Arbitration: ZFS does not blindly trust either side of mirror ✦ Most recent, correct view of data wins ✦ Checksums validate dataZFS Tutorial USENIX LISA’11 32
  33. 33. Dynamic vdev Replacement • zpool replace poolname vdev [vdev] • Today, the replacing vdev must be the same size or larger ✦ NexentaStor 2 ‒ as measured by blocks ✦ NexentaStor 3 ‒ as measured by metaslabs • Replacing all vdevs in a top-level vdev with larger vdevs results in top-level vdev resizing • Expansion policy controlled by: ✦ NexentaStor 2 ‒ resize on import ✦ NexentaStor 3 ‒ zpool autoexpand property [diagram: a 10G mirror whose sides are replaced by 15G and 20G vdevs grows from a 10G mirror to a 15G mirror to a 20G mirror]
  34. 34. RAIDZ • RAID-5 ✦ Parity check data is distributed across the RAID array's disks ✦ Must read/modify/write when data is smaller than stripe width • RAIDZ ✦ Dynamic data placement ✦ Parity added as needed ✦ Writes are full-stripe writes ✦ No read/modify/write (no write hole) • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If checksum fails, read parity • Space used is dependent on how it is used
  35. 35. RAID-5 vs RAIDZ
      RAID-5 (DiskA..DiskE):
        D0:0  D0:1  D0:2  D0:3  P0
        P1    D1:0  D1:1  D1:2  D1:3
        D2:3  P2    D2:0  D2:1  D2:2
        D3:2  D3:3  P3    D3:0  D3:1
      RAIDZ (DiskA..DiskE):
        P0    D0:0  D0:1  D0:2  D0:3
        P1    D1:0  D1:1  P2:0  D2:0
        D2:1  D2:2  D2:3  P2:1  D2:4
        D2:5  Gap   P3    D3:0
  36. 36. RAIDZ and Block Size • If block size >> N * sector size, space consumption is like RAID-5 • If block size = sector size, space consumption is like mirroring • [diagram, sector size = 512 bytes: PSIZE=2KB allocates ASIZE=2.5KB; PSIZE=1KB allocates ASIZE=1.5KB; PSIZE=512 bytes allocates ASIZE=1KB; PSIZE=3KB allocates ASIZE=4KB + gap] • Sector size can impact space savings
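The PSIZE-to-ASIZE examples on this slide can be checked with a rough model: one parity sector per row of data sectors, with rows spanning the non-parity columns. This is an approximation of the allocator's behavior (padding/gap handling is simplified), not the exact ZFS implementation:

```python
# Rough model of single/multi-parity RAIDZ space consumption.
import math

def raidz_asize(psize, ndisks, nparity=1, sector=512):
    data = math.ceil(psize / sector)              # data sectors needed
    rows = math.ceil(data / (ndisks - nparity))   # stripe rows needed
    return (data + rows * nparity) * sector       # data plus parity sectors

# The slide's 5-disk, single-parity, 512-byte-sector examples:
assert raidz_asize(2048, 5) == 2560    # PSIZE=2 KB  -> ASIZE=2.5 KB
assert raidz_asize(1024, 5) == 1536    # PSIZE=1 KB  -> ASIZE=1.5 KB
assert raidz_asize(512, 5) == 1024     # PSIZE=512 B -> ASIZE=1 KB (mirror-like)
```

The 512-byte case shows the slide's point: at block size = sector size, every data sector carries a full parity sector, so consumption matches mirroring rather than RAID-5.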
  37. 37. RAID-5 Write Hole • Occurs when data to be written is smaller than stripe size • Must read unallocated columns to recalculate the parity or the parity must be read/modify/write • Read/modify/write is risky for consistency ✦ Multiple disks ✦ Reading independently ✦ Writing independently ✦ System failure before all writes are complete to media could result in data loss • Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disksZFS Tutorial USENIX LISA’11 37
  38. 38. RAIDZ2 and RAIDZ3 • RAIDZ2 = double parity RAIDZ • RAIDZ3 = triple parity RAIDZ • Sorta like RAID-6 ✦ Parity 1: XOR ✦ Parity 2: another Reed-Solomon syndrome ✦ Parity 3: yet another Reed-Solomon syndrome • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If data not valid, read parity ✦ If data still not valid, read other parity • Space used is dependent on how it is used
  39. 39. Evaluating Data Retention • MTTDL = Mean Time To Data Loss • Note: MTBF is not constant in the real world, but keeps the math simple • MTTDL[1] is a simple MTTDL model • No parity (single vdev, striping, RAID-0) ✦ MTTDL[1] = MTBF / N • Single parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) • Double parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2) • Triple parity (4-way mirror, RAIDZ3) ✦ MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
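The MTTDL[1] formulas above transcribe directly into code. The example inputs (MTBF, MTTR, disk count) are hypothetical; units are hours in, hours out:

```python
# MTTDL[1] for 0..3 parity, per the formulas on the slide.

def mttdl1(mtbf, n, mttr=0.0, parity=0):
    if parity == 0:
        return mtbf / n
    denom = 1.0
    for k in range(parity + 1):      # N * (N-1) * ... * (N-parity)
        denom *= (n - k)
    return mtbf ** (parity + 1) / (denom * mttr ** parity)

# Hypothetical example: 8 disks, 1M-hour MTBF, 24-hour MTTR
assert mttdl1(1e6, 8) == 1e6 / 8
assert mttdl1(1e6, 8, 24, parity=1) == 1e6**2 / (8 * 7 * 24)
assert mttdl1(1e6, 8, 24, parity=2) == 1e6**3 / (8 * 7 * 6 * 24**2)
```

Note how each extra parity level multiplies the numerator by MTBF but the denominator only by (N-k) * MTTR, which is why added parity helps so dramatically when MTTR is small.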
  40. 40. Another MTTDL Model • The MTTDL[1] model doesn't take into account unrecoverable reads • But unrecoverable reads (UER) are becoming the dominant failure mode ✦ UER specified as errors per bits read ✦ More bits = higher probability of loss per vdev • The MTTDL[2] model considers UER
  41. 41. Why Worry about UER? • Richard's study ✦ 3,684 hosts with 12,204 LUNs ✦ 11.5% of all LUNs reported read errors • Bairavasundaram FAST08 ✦ 1.53M LUNs over 41 months ✦ RAID reconstruction discovers 8% of checksum mismatches ✦ “For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined”
  42. 42. Why Worry about UER? • RAID array studyZFS Tutorial USENIX LISA’11 42
  43. 43. Why Worry about UER? • RAID array study [chart: unrecoverable reads vs. disks that disappeared during “disk pull” tests] • “Disk pull” tests aren't very useful
  44. 44. MTTDL[2] Model • Probability that a reconstruction will fail ✦ Precon_fail = (N-1) * size / UER • Model doesn't work for non-parity schemes ✦ single vdev, striping, RAID-0 • Single parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[2] = MTBF / (N * Precon_fail) • Double parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail) • Triple parity (4-way mirror, RAIDZ3) ✦ MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
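The MTTDL[2] formulas can likewise be coded up. The inputs below are hypothetical (8 x 2 TB disks, a 1-in-1e15-bits UER spec, 24-hour MTTR); `size_bits` is the data read per surviving disk during reconstruction, expressed in bits to match the UER units:

```python
# MTTDL[2] per the formulas on the slide: MTTDL[1] structure, with one MTTR
# factor replaced by the reconstruction-failure probability Precon_fail.

def mttdl2(mtbf, n, mttr, size_bits, uer, parity=1):
    precon_fail = (n - 1) * size_bits / uer
    denom = n * precon_fail
    for k in range(1, parity):           # extra (N-k) * MTTR factors
        denom *= (n - k) * mttr
    return mtbf ** parity / denom

size_bits = 2e12 * 8                     # 2 TB per disk, in bits
single = mttdl2(1e6, 8, 24, size_bits, 1e15, parity=1)
double = mttdl2(1e6, 8, 24, size_bits, 1e15, parity=2)
assert double > single                   # added parity still wins
```

Unlike MTTDL[1], bigger disks directly hurt here: `size_bits` grows, Precon_fail grows, and MTTDL[2] falls even if MTBF is unchanged.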
  45. 45. Practical View of MTTDL[1]ZFS Tutorial USENIX LISA’11 45
  46. 46. MTTDL[1] ComparisonZFS Tutorial USENIX LISA’11 46
  47. 47. MTTDL Models: Mirror Spares are not always better...ZFS Tutorial USENIX LISA’11 47
  48. 48. MTTDL Models: RAIDZ2ZFS Tutorial USENIX LISA’11 48
  49. 49. Space, Dependability, and PerformanceZFS Tutorial USENIX LISA’11 49
  50. 50. Dependability Use Case • Customer has 15+ TB of read-mostly data • 16-slot, 3.5” drive chassis • 2 TB HDDs • Option 1: one raidz2 set ✦ 24 TB available space ✤ 12 data ✤ 2 parity ✤ 2 hot spares, 48 hour disk replacement time ✦ MTTDL[1] = 1,790,000 years • Option 2: two raidz2 sets ✦ 24 TB available space total ✤ 6 data + 2 parity in each set ✤ no hot spares ✦ MTTDL[1] = 7,450,000 years
  51. 51. Ditto Blocks • Recall that each blkptr_t contains 3 DVAs • Dataset property used to indicate how many copies (aka ditto blocks) of data are desired ✦ Write all copies ✦ Read any copy ✦ Recover corrupted read from a copy • Not a replacement for mirroring ✦ For a single disk, can handle data loss in approximately 1/8 of the contiguous space • Easier to describe in pictures...
      copies parameter     Data copies   Metadata copies
      copies=1 (default)   1             2
      copies=2             2             3
      copies=3             3             3
  52. 52. Copies in PicturesNovember 8, 2010 USENIX LISA’10 52
  53. 53. Copies in PicturesZFS Tutorial USENIX LISA’11 53
  54. 54. When Good Data Goes Bad [diagram: a traditional file system does a bad read; if it is a metadata block the FS panics, or we get back bad data; the file system cannot tell, and a disk rebuild propagates the bad data]
  55. 55. Checksum Verification ZFS verifies checksums for every read Repairs data when possible (mirror, raidz, copies>1) Read bad data Read good data Repair bad dataZFS Tutorial USENIX LISA’11 55
  56. 56. ZIO - ZFS I/O Layer 56
  57. 57. ZIO Framework • All physical disk I/O goes through ZIO Framework • Translates DVAs into Logical Block Address (LBA) on leaf vdevs ✦ Keeps free space maps (spacemap) ✦ If contiguous space is not available: ✤ Allocate smaller blocks (the gang) ✤ Allocate gang block, pointing to the gang • Implemented as multi-stage pipeline ✦ Allows extensions to be added fairly easily • Handles I/O errorsZFS Tutorial USENIX LISA’11 57
  58. 58. ZIO Write Pipeline • Stages (ZIO state / compression / checksum / DVA / vdev I/O): open → compress if savings > 12.5% → generate checksum → allocate DVA → vdev I/O (start, done, assess) → done • Gang and deduplication activity elided, for clarity
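The "compress if savings > 12.5%" decision in the write pipeline is worth pinning down: the compressed copy is kept only if it beats the original by more than 1/8. A minimal sketch, using zlib as a stand-in for ZFS's lzjb/gzip and a hypothetical stage function:

```python
# Sketch of the write pipeline's compression decision (illustrative only).
import zlib

def compress_stage(block: bytes):
    """Return (data, compressed?) the way a pipeline stage might."""
    packed = zlib.compress(block)            # stand-in compressor
    if len(packed) <= len(block) * 7 // 8:   # keep only if savings > 12.5%
        return packed, True
    return block, False

_, did = compress_stage(b"A" * 4096)         # highly compressible
assert did
_, did = compress_stage(bytes(range(256)))   # incompressible: keep original
assert not did
```

The threshold avoids paying decompression cost on reads for blocks that barely shrank.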
  59. 59. ZIO Read Pipeline • Stages (ZIO state / compression / checksum / vdev I/O): open → vdev I/O (start, done, assess) → verify checksum → decompress → done • Gang and deduplication activity elided, for clarity
  60. 60. VDEV – Virtual Device Subsystem • Where mirrors, RAIDZ, and RAIDZ2 are implemented ✦ Surprisingly few lines of code needed to implement RAID • Leaf vdev (physical device) I/O management ✦ Number of outstanding iops ✦ Read-ahead cache • Priority scheduling:
      Name         Priority
      NOW          0
      SYNC_READ    0
      SYNC_WRITE   0
      FREE         0
      CACHE_FILL   0
      LOG_WRITE    0
      ASYNC_READ   4
      ASYNC_WRITE  4
      RESILVER     10
      SCRUB        20
  61. 61. ARC - AdaptiveReplacement Cache 61
  62. 62. Object Cache • UFS uses a page cache managed by the virtual memory system • ZFS does not use the page cache, except for mmap'ed files • ZFS uses an Adaptive Replacement Cache (ARC) • ARC used by DMU to cache DVA data objects • Only one ARC per system, but caching policy can be changed on a per-dataset basis • Seems to work much better than the page cache ever did for UFS
  63. 63. Traditional Cache • Works well when data being accessed was recently added • Doesn't work so well when frequently accessed data is evicted [diagram: misses cause inserts at the MRU end, the oldest entry is evicted at the LRU end; dynamic caches can change size by either not evicting or aggressively evicting]
  64. 64. ARC – Adaptive Replacement Cache Evict the oldest single-use entry LRU Recent Cache Miss MRU Evictions and dynamic MFU size resizing needs to choose best Hit cache to evict (shrink) Frequent Cache LFU Evict the oldest multiple accessed entryZFS Tutorial USENIX LISA’11 64
  65. 65. ARC with Locked Pages Evict the oldest single-use entry Cannot evict LRU locked pages! Recent Cache Miss MRU MFU size Hit Frequent If hit occurs Cache within 62 ms LFU Evict the oldest multiple accessed entry ZFS ARC handles mixed-size pagesZFS Tutorial USENIX LISA’11 65
  66. 66. L2ARC – Level 2 ARC • Data soon to be evicted from the ARC is added to a queue to be sent to cache vdev ✦ Another thread sends queue to cache vdev ARC ✦ Data is copied to the cache vdev with a throttle data soon to to limit bandwidth consumption be evicted ✦ Under heavy memory pressure, not all evictions will arrive in the cache vdev • ARC directory remains in memory • Good idea - optimize cache vdev for fast reads ✦ lower latency than pool disks ✦ inexpensive way to “increase memory” cache • Content considered volatile, no raid needed • Monitor usage with zpool iostat and ARC kstatsZFS Tutorial USENIX LISA’11 66
  67. 67. ARC Directory • Each ARC directory entry contains arc_buf_hdr structs ✦ Info about the entry ✦ Pointer to the entry • Directory entries have size, ~200 bytes • ZFS block size is dynamic, sector size to 128 kBytes • Disks are large • Suppose we use a Seagate LP 2 TByte disk for the L2ARC ✦ Disk has 3,907,029,168 512 byte sectors, guaranteed ✦ Workload uses 8 kByte fixed record size ✦ RAM needed for arc_buf_hdr entries ✤ Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes • Don't underestimate the RAM needed for large L2ARCs
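The slide's arithmetic generalizes to a small helper: divide device capacity by the record size to get a header count, then multiply by the ~200-byte header estimate. The function name and the label-overhead subtraction being ignored are simplifications of the slide's worked example:

```python
# RAM estimate for ARC directory entries covering an L2ARC device,
# assuming ~200 bytes per arc_buf_hdr (per the slide).

def l2arc_header_ram(sectors, sector=512, recordsize=8192, hdr=200):
    records = sectors * sector // recordsize   # headers needed, one per record
    return records * hdr                       # bytes of RAM

ram = l2arc_header_ram(3_907_029_168)          # 2 TB disk, 8 KB records
assert 48e9 < ram < 49e9                       # ~48 GBytes, matching the slide
```

Halving the record size doubles the header count, so small-record workloads make large L2ARC devices disproportionately RAM-hungry.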
  68. 68. ARC Tips • In general, it seems to work well for most workloads • ARC size will vary, based on usage ✦ Default target max is 7/8 of physical memory or (memory - 1 GByte) ✦ Target min is 64 MB ✦ Metadata capped at 1/4 of max ARC size • Dynamic size can be reduced when: ✦ page scanner is running ✤ freemem < lotsfree + needfree + desfree ✦ swapfs does not have enough space so that anonymous reservations can succeed ✤ availrmem < swapfs_minfree + swapfs_reserve + desfree ✦ [x86 only] kernel heap space more than 75% full • Can limit at boot timeZFS Tutorial USENIX LISA’11 68
  69. 69. Observing ARC • ARC statistics stored in kstats • kstat -n arcstats • Interesting statistics: ✦ size = current ARC size ✦ p = size of MFU cache ✦ c = target ARC size ✦ c_max = maximum target ARC size ✦ c_min = minimum target ARC size ✦ l2_hdr_size = space used in ARC by L2ARC ✦ l2_size = size of data in L2ARCZFS Tutorial USENIX LISA’11 69
  70. 70. General Status - ARCZFS Tutorial USENIX LISA’11 70
  71. 71. More ARC Tips • Performance ✦ Prior to b107, L2ARC fill rate was limited to 8 MB/sec ✦ After b107, cold L2ARC fill rate increases to 16 MB/sec • Internals tracked by kstats in Solaris ✦ Use memory_throttle_count to observe pressure to evict • Dedup Table (DDT) also uses ARC ✦ lots of dedup objects need lots of RAM ✦ field reports that L2ARC can help with dedup L2ARC keeps its directory in kernel memoryZFS Tutorial USENIX LISA’11 71
  72. 72. TransactionalObject Layer 72
  73. 73. Source Code Structure [diagram repeated from slide 12: Interface Layer (ZPL, ZVol, /dev/zfs, libzfs); Transactional Object Layer (ZIL, ZAP, Traversal, DMU, DSL); Pooled Storage Layer (ARC, ZIO, VDEV, Configuration)]
  74. 74. Transaction Engine • Manages physical I/O • Transactions grouped into transaction group (txg) ✦ txg updates ✦ All-or-nothing ✦ Commit interval ✤ Older versions: 5 seconds ✤ Less old versions: 30 seconds ✤ b143 and later: 5 seconds • Delay committing data to physical storage ✦ Improves performance ✦ A bad thing for sync workload performance – hence the ZFS Intent Log (ZIL) 30 second delay can impact failure detection timeZFS Tutorial USENIX LISA’11 74
  75. 75. ZIL – ZFS Intent Log • DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers ✦ NFS ✦ Databases • ZIL recordsize inflation can occur for some workloads ✦ May cause larger than expected actual I/O for sync workloads ✦ Oracle redo logs ✦ No slog: can tune zfs_immediate_write_sz, zvol_immediate_write_sz ✦ With slog: use logbias property instead • Never read, except at import (eg reboot), when transactions may need to be rolled forwardZFS Tutorial USENIX LISA’11 75
  76. 76. Separate Logs (slogs) • ZIL competes with pool for IOPS ✦ Applications wait for sync writes to be on nonvolatile media ✦ Very noticeable on HDD JBODs • Put ZIL on separate vdev, outside of pool ✦ ZIL writes tend to be sequential ✦ No competition with pool for IOPS ✦ Downside: slog device required to be operational at import ✦ NexentaStor 3 allows slog device removal ✦ Size of separate log is less than the size of RAM (duh) • 10x or more performance improvements possible ✦ Nonvolatile RAM card ✦ Write-optimized SSD ✦ Nonvolatile write cache on RAID array
  77. 77. zilstat • zilstat • Integrated into NexentaStor 3.0.3 ✦ nmc: show performance zilZFS Tutorial USENIX LISA’11 77
  78. 78. Synchronous Write Destination
      Without separate log: is sync I/O size > zfs_immediate_write_sz? no: write to ZIL log; yes: bypass to pool
      With separate log: logbias=latency (default): write to log device; logbias=throughput: bypass to pool
      Default zfs_immediate_write_sz = 32 kBytes
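The routing flowchart above reduces to one function. Names here are illustrative (the return strings are not ZFS terminology), but the decision structure and the 32 KB default follow the slide:

```python
# Sketch of synchronous-write routing per the flowchart above.
ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024     # default threshold, 32 kBytes

def zil_destination(size, has_slog, logbias="latency"):
    if not has_slog:
        # large sync writes bypass the in-pool ZIL and go straight to the pool
        return "pool" if size > ZFS_IMMEDIATE_WRITE_SZ else "zil-log"
    # with a separate log device, the logbias property decides
    return "slog" if logbias == "latency" else "pool"

assert zil_destination(8 * 1024, has_slog=False) == "zil-log"
assert zil_destination(64 * 1024, has_slog=False) == "pool"
assert zil_destination(64 * 1024, has_slog=True) == "slog"
assert zil_destination(64 * 1024, has_slog=True, logbias="throughput") == "pool"
```

This also shows why logbias=throughput exists: it keeps large streaming sync writes (e.g. Oracle redo logs, as noted earlier) from flooding a small slog device.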
  79. 79. ZIL Synchronicity Project • All-or-nothing policies don't work well, in general • ZIL Synchronicity project proposed by Robert Milkowski • Adds new sync property to datasets • Arrived in b140
      sync parameter       Behaviour
      standard (default)   Policy follows previous design: write size and separate logs
      always               All writes become synchronous (slow)
      disabled             Synchronous write requests are ignored
  80. 80. Disabling the ZIL • Preferred method: change dataset sync property • Rule 0: Don’t disable the ZIL • If you love your data, do not disable the ZIL • You can find references to this as a way to speed up ZFS ✦ NFS workloads ✦ “tar -x” benchmarks • Golden Rule: Don’t disable the ZIL • Can set via mdb, but need to remount the file system • Friends don’t let friends disable the ZIL • Older Solaris - can set in /etc/system • NexentaStor has checkbox for disabling ZIL • Nostradamus wrote, “disabling the ZIL will lead to the apocalypse”ZFS Tutorial USENIX LISA’11 80
  81. 81. DSL - Dataset and Snapshot Layer 81
  82. 82. Dataset & Snapshot Layer • Object ✦ Allocated storage ✦ dnode describes collection of blocks • Object Set Dataset Directory ✦ Group of related objects Dataset • Dataset Object Set Childmap ✦ Snapmap: snapshot relationships Object Object ✦ Space usage Object Properties • Dataset directory Snapmap ✦ Childmap: dataset relationships ✦ PropertiesZFS Tutorial USENIX LISA’11 82
  83. 83. Copy on Write [repeated from slide 20] 1. Initial block tree 2. COW some data 3. COW metadata 4. Update uberblocks & free
  84. 84. zfs snapshot • Create a read-only, point-in-time window into the dataset (file system or Zvol) • Computationally free, because of COW architecture • Very handy feature ✦ Patching/upgrades • Basis for time-related snapshot interfaces ✦ Solaris Time Slider ✦ NexentaStor Delorean Plugin ✦ NexentaStor Virtual Machine Data CenterZFS Tutorial USENIX LISA’11 84
  85. 85. Snapshot • Create a snapshot by not freeing COWed blocks • Snapshot creation is fast and easy • Number of snapshots determined by use – no hardwired limit • Recursive snapshots also possible Snapshot tree Current tree root rootZFS Tutorial USENIX LISA’11 85
  86. 86. auto-snap serviceZFS Tutorial USENIX LISA’11 86
  87. 87. Clones • Snapshots are read-only • Clones are read-write based upon a snapshot • Child depends on parent ✦ Cannot destroy parent without destroying all children ✦ Can promote children to be parents • Good ideas ✦ OS upgrades ✦ Change control ✦ Replication ✤ zones ✤ virtual disksZFS Tutorial USENIX LISA’11 87
  88. 88. zfs clone • Create a read-write file system from a read-only snapshot • Solaris boot environment administration [diagram: install → checkpoint → clone → checkpoint; OS rev1 boot environments rootfs-nmu-001 and rootfs-nmu-002, with patch/upgrade applied to a clone and selected via the GRUB boot manager] • Origin snapshot cannot be destroyed if a clone exists
  89. 89. Deduplication 89
  90. 90. What is Deduplication? • A $2.1 Billion feature • 2009 buzzword of the year • Technique for improving storage space efficiency ✦ Trades big I/Os for small I/Os ✦ Does not eliminate I/O • Implementation styles ✦ offline or post processing ✤ data written to nonvolatile storage ✤ process comes along later and dedupes data ✤ example: tape archive dedup ✦ inline ✤ data is deduped as it is being allocated to nonvolatile storage ✤ example: ZFSZFS Tutorial USENIX LISA’11 90
  91. 91. Dedup how-to • Given a bunch of data • Find data that is duplicated • Build a lookup table of references to data • Replace duplicate data with a pointer to the entry in the lookup table • Granularity ✦ file ✦ block ✦ byte
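The steps above can be sketched as a minimal block-level dedup table: key blocks by checksum, store each unique block once, and let duplicate writes just bump a reference count. A toy model, not the ZFS DDT implementation, though ZFS likewise keys its table on the block checksum:

```python
# Toy block-level deduplication table: checksum -> (block, refcount).
import hashlib

class DedupTable:
    def __init__(self):
        self.table = {}                  # checksum -> [block, refcount]

    def write(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        entry = self.table.setdefault(key, [block, 0])
        entry[1] += 1                    # a duplicate just bumps the refcount
        return key                       # caller stores the reference

ddt = DedupTable()
a = ddt.write(b"x" * 4096)
b = ddt.write(b"x" * 4096)               # duplicate block
c = ddt.write(b"y" * 4096)
assert a == b and a != c
assert ddt.table[a][1] == 2              # two references, one stored copy
```

The refcount matters on free as well as write: the stored block can only be released when its last referrer goes away, which is exactly the risk the "Reference Counts" slides below address.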
  92. 92. Dedup in ZFS • Leverage block-level checksums ✦ Identify blocks which might be duplicates ✦ Variable block size is ok • Synchronous implementation ✦ Data is deduped as it is being written • Scalable design ✦ No reference count limits • Works with existing features ✦ compression ✦ copies ✦ scrub ✦ resilver • Implemented in ZIO pipelineZFS Tutorial USENIX LISA’11 92
  93. 93. Deduplication Table (DDT) • Internal implementation ✦ Adelson-Velskii, Landis (AVL) tree ✦ Typical table entry ~270 bytes ✤ checksum ✤ logical size ✤ physical size ✤ references ✦ Table entry size increases as the number of references increasesZFS Tutorial USENIX LISA’11 93
  94. 94. Reference Counts Eggs courtesy of Richard’s chickensZFS Tutorial USENIX LISA’11 94
  95. 95. Reference Counts • Problem: loss of the referenced data affects all referrers • Solution: make additional copies of referred data based upon a threshold count of referrers ✦ leverage copies (ditto blocks) ✦ pool-level threshold for automatically adding ditto copies ✤ set via dedupditto pool property # zpool set dedupditto=50 zwimming ✤ add 2nd copy when dedupditto references (50) reached ✤ add 3rd copy when dedupditto^2 references (2500) reached
  96. 96. Verification write() compress checksum DDT entry lookup yes no DDT verify? match? no yes read data data yes match? add reference no new entryZFS Tutorial USENIX LISA’11 96
  97. 97. Enabling Dedup • Set dedup property for each dataset to be deduped • Remember: properties are inherited • Remember: only applies to newly written data
      dedup           checksum   verify?
      on              SHA256     no
      sha256          SHA256     no
      on,verify       SHA256     yes
      sha256,verify   SHA256     yes
      Fletcher is considered too weak, without verify
  98. 98. Dedup Accounting • ...and you thought compression accounting was hard... • Remember: dedup works at the pool level ✦ dataset-level accounting doesn't see other datasets ✦ pool-level accounting is always correct
      zfs list
      NAME       USED  AVAIL  REFER  MOUNTPOINT
      bar       7.56G   449G    22K  /bar
      bar/ws    7.56G   449G  7.56G  /bar/ws
      dozer     7.60G   455G    22K  /dozer
      dozer/ws  7.56G   455G  7.56G  /dozer/ws
      tank      4.31G   456G    22K  /tank
      tank/ws   4.27G   456G  4.27G  /tank/ws
      zpool list
      NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
      bar    464G  7.56G   456G   1%  1.00x  ONLINE  -
      dozer  464G  1.43G   463G   0%  5.92x  ONLINE  -
      tank   464G   957M   463G   0%  5.39x  ONLINE  -
      Data courtesy of the ZFS team
  99. 99. DDT Histogram
      # zdb -DD tank
      DDT-sha256-zap-duplicate: 110173 entries, size 295 on disk, 153 in core
      DDT-sha256-zap-unique: 302 entries, size 42194 on disk, 52827 in core
      DDT histogram (aggregated over all DDTs):
      bucket          allocated                      referenced
      refcnt  blocks  LSIZE  PSIZE  DSIZE   blocks  LSIZE  PSIZE  DSIZE
      ------  ------  -----  -----  -----   ------  -----  -----  -----
           1     302  7.26M  4.24M  4.24M      302  7.26M  4.24M  4.24M
           2    103K  1.12G   712M   712M     216K  2.64G  1.62G  1.62G
           4   3.11K  30.0M  17.1M  17.1M    14.5K   168M  95.2M  95.2M
           8     503  11.6M  6.16M  6.16M    4.83K   129M  68.9M  68.9M
          16     100  4.22M  1.92M  1.92M    2.14K   101M  45.8M  45.8M
      Data courtesy of the ZFS team
  100. 100. DDT Histogram$ zdb -DD zwimmingDDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in coreDDT-sha256-zap-unique: 52369639 entries, size 284 on disk, 159 in coreDDT histogram (aggregated over all DDTs):bucket allocated referenced______ ______________________________ ______________________________refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE------ ------ ----- ----- ----- ------ ----- ----- ----- 1 49.9M 25.0G 25.0G 25.0G 49.9M 25.0G 25.0G 25.0G 2 16.7K 8.33M 8.33M 8.33M 33.5K 16.7M 16.7M 16.7M 4 610 305K 305K 305K 3.33K 1.66M 1.66M 1.66M 8 661 330K 330K 330K 6.67K 3.34M 3.34M 3.34M 16 242 121K 121K 121K 5.34K 2.67M 2.67M 2.67M 32 131 65.5K 65.5K 65.5K 5.54K 2.77M 2.77M 2.77M 64 897 448K 448K 448K 84K 42M 42M 42M 128 125 62.5K 62.5K 62.5K 18.0K 8.99M 8.99M 8.99M 8K 1 512 512 512 12.5K 6.27M 6.27M 6.27M Total 50.0M 25.0G 25.0G 25.0G 50.1M 25.1G 25.1G 25.1Gdedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00 ZFS Tutorial USENIX LISA’11 100
101. 101. Over-the-wire Dedup • Dedup is also possible over the send/receive pipe ✦ Blocks with the same checksum are considered duplicates (no verify option) ✦ First copy sent as usual ✦ Subsequent copies sent by reference • Independent of dedup status of originating pool ✦ Receiving pool knows about blocks which have already arrived • Can be a win for dedupable data, especially over slow wires • Remember: send/receive version rules still apply # zfs send -DR zwimming/stuff@snapZFS Tutorial USENIX LISA’11 101
102. 102. Dedup Performance • Dedup can save space and bandwidth • Dedup increases latency ✦ Caching data improves latency ✦ More memory → more data cached ✦ Cache performance hierarchy ✤ RAM: fastest ✤ L2ARC on SSD: slower ✤ Pool HDD: dreadfully slow • ARC is currently not deduped • Difficult to predict ✦ Dependent variable: number of blocks ✦ Estimate 270 bytes per unique block ✦ Example: ✤ 50M blocks * 270 bytes/block = 13.5 GBytesZFS Tutorial USENIX LISA’11 102
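The sizing example above is simple arithmetic; a sketch of the estimate (the 270 bytes/entry figure is the rule of thumb from the slide, not an exact number):

```shell
#!/bin/sh
# Estimate DDT memory footprint: unique blocks x ~270 bytes per entry.
# 50M unique blocks at 128K recordsize is roughly 6 TB of unique data.
awk 'BEGIN {
    blocks = 50e6
    bytes_per_entry = 270
    printf "%.1f GB\n", blocks * bytes_per_entry / 1e9
}'
```

If a DDT of that size does not fit in RAM (or at least L2ARC), every dedup write can incur pool I/O for the table lookup.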
  103. 103. Deduplication Use Cases Data type Dedupe Compression Home directories ✔✔ ✔✔ Internet content ✔ ✔ Media and video ✔✔ ✔ Life sciences ✘ ✔✔ Oil and Gas (seismic) ✘ ✔✔ Virtual machines ✔✔ ✘ Archive ✔✔✔✔ ✔ZFS Tutorial USENIX LISA’11 103
  104. 104. zpool Command 104
105. 105. Pooled Storage Layer • (diagram) ZFS stack: consumers — raw/swap/dump/iSCSI via the ZFS Volume Emulator (Zvol); NFS/CIFS via the ZFS POSIX Layer (ZPL); pNFS/Lustre/others — sit atop the Transactional Object Layer, which sits atop the Pooled Storage Layer and the block device drivers (HDD, SSD, iSCSI, ...)November 8, 2010 USENIX LISA’10 105
106. 106. zpool create • zpool create poolname vdev-configuration • nmc: setup volume create ✦ vdev-configuration examples ✤ mirror c0t0d0 c3t6d0 ✤ mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 ✤ mirror disk1s0 disk2s0 cache disk4s0 log disk5 ✤ raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 • Solaris ✦ Additional checks for disk/slice overlaps or devices in use ✦ Whole disks are given EFI labels • Can set initial pool or dataset properties • By default, creates a file system with the same name ✦ poolname pool → /poolname file system People get confused by a file system with the same name as the poolZFS Tutorial USENIX LISA’11 106
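A hedged sketch of the create workflow; the pool name and device names below are hypothetical:

```shell
# Preview the layout first with -n (nothing is written), then create.
zpool create -n tank mirror c0t0d0 c1t0d0 log c2t0d0 cache c3t0d0

# -o sets initial pool properties, -O sets properties on the
# root file system created with the pool
zpool create -o autoreplace=on -O compression=on \
    tank mirror c0t0d0 c1t0d0 log c2t0d0 cache c3t0d0
```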
107. 107. zpool destroy • Destroy the pool and all datasets therein • zpool destroy poolname ✦ Can (try to) force with “-f” ✦ There is no “are you sure?” prompt – if you weren't sure, you would not have typed “destroy” • nmc: destroy volume volumename ✦ nmc prompts for confirmation, by default zpool destroy is destructive... really! Use with caution!ZFS Tutorial USENIX LISA’11 107
108. 108. zpool add • Adds a device to the pool as a top-level vdev • Does NOT add columns to a raidz set • Does NOT attach a mirror – use zpool attach instead • zpool add poolname vdev-configuration ✦ vdev-configuration can be any combination also used for zpool create ✦ Complains if the added vdev-configuration would cause a different data protection scheme than is already in use ✤ use “-f” to override ✦ Good idea: try with “-n” flag first ✤ will show final configuration without actually performing the add • nmc: setup volume volumename grow Do not add a device which is in use as a cluster quorum deviceZFS Tutorial USENIX LISA’11 108
  109. 109. zpool remove • Remove a top-level vdev from the pool • zpool remove poolname vdev • nmc: setup volume volumename remove-lun • Today, you can only remove the following vdevs: ✦ cache ✦ hot spare ✦ separate log (b124, NexentaStor 3.0) Dont confuse “remove” with “detach”ZFS Tutorial USENIX LISA’11 109
  110. 110. zpool attach • Attach a vdev as a mirror to an existing vdev • zpool attach poolname existing-vdev vdev • nmc: setup volume volumename attach-lun • Attaching vdev must be the same size or larger than the existing vdev vdev Configurations ok simple vdev → mirror ok mirror ok log → mirrored log no RAIDZ no RAIDZ2 no RAIDZ3ZFS Tutorial USENIX LISA’11 110
  111. 111. zpool detach • Detach a vdev from a mirror • zpool detach poolname vdev • nmc: setup volume volumename detach-lun • A resilvering vdev will wait until resilvering is completeZFS Tutorial USENIX LISA’11 111
  112. 112. zpool replace • Replaces an existing vdev with a new vdev • zpool replace poolname existing-vdev vdev • nmc: setup volume volumename replace-lun • Effectively, a shorthand for “zpool attach” followed by “zpool detach” • Attaching vdev must be the same size or larger than the existing vdev • Works for any top-level vdev-configuration, including RAIDZ “Same size” literally means the same number of blocks until b117. Many “same size” disks have different number of available blocks.ZFS Tutorial USENIX LISA’11 112
  113. 113. zpool import • Import a pool and mount all mountable datasets • Import a specific pool ✦ zpool import poolname ✦ zpool import GUID ✦ nmc: setup volume import • Scan LUNs for pools which may be imported ✦ zpool import • Can set options, such as alternate root directory or other properties ✦ alternate root directory important for rpool or syspool Beware of zpool.cache interactions Beware of artifacts, especially partial artifactsZFS Tutorial USENIX LISA’11 113
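Typical import variants can be sketched as below; the pool names are hypothetical, and the GUID is the one shown on the zpool history slide:

```shell
# Scan attached LUNs for pools which may be imported
zpool import

# Import under an alternate root so mountpoints land under /a --
# important when importing a root pool (rpool/syspool) for repair
zpool import -R /a rpool

# Import by GUID, renaming the pool to avoid a name collision
zpool import 17111649328928073943 rpool2
```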
  114. 114. zpool export • Unmount datasets and export the pool • zpool export poolname • nmc: setup volume volumename export • Removes pool entry from zpool.cache ✦ useful when unimported pools remain in zpool.cacheZFS Tutorial USENIX LISA’11 114
115. 115. zpool upgrade• Display current versions ✦ zpool upgrade• View available upgrade versions, with features, but don't actually upgrade ✦ zpool upgrade -v• Upgrade pool to latest version ✦ zpool upgrade poolname ✦ nmc: setup volume volumename version-upgrade• Upgrade pool to specific version ✦ zpool upgrade -V version poolname Once you upgrade, there is no downgrade Beware of grub and rollback issuesZFS Tutorial USENIX LISA’11 115
  116. 116. zpool history • Show history of changes made to the pool • nmc and Solaris use same command# zpool history rpoolHistory for rpool:2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -ocachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s02009-03-04.07:29:47 zfs set canmount=noauto rpool2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_1062009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_1062009-03-04.07:29:51 zfs set canmount=on rpool2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export2009-03-04.07:29:51 zfs create rpool/export/home2009-03-04.00:21:42 zpool import -f -R /a 171116493289280739432009-03-04.00:21:42 zpool export rpool2009-03-04.08:47:08 zpool set bootfs=rpool rpool2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b1082009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108...ZFS Tutorial USENIX LISA’11 116
  117. 117. zpool status • Shows the status of the current pools, including their configuration • Important troubleshooting step • nmc and Solaris use same command # zpool status … pool: zwimming state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using zpool upgrade. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM zwimming ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be trickyZFS Tutorial USENIX LISA’11 117
  118. 118. zpool clear • Clears device errors • Clears device error counters • Starts any resilvering, as needed • Improves sysadmin sanity and reduces sweating • zpool clear poolname • nmc: setup volume volumename clear-errorsZFS Tutorial USENIX LISA’11 118
  119. 119. zpool iostat • Show pool physical I/O activity, in an iostat-like manner • Solaris: fsstat will show I/O activity looking into a ZFS file system • Especially useful for showing slog activity • nmc and Solaris use same command # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- zwimming 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latencyZFS Tutorial USENIX LISA’11 119
  120. 120. zpool scrub • Manually starts scrub ✦ zpool scrub poolname • Scrubbing performed in background • Use zpool status to track scrub progress • Stop scrub ✦ zpool scrub -s poolname • How often to scrub? ✦ Depends on level of paranoia ✦ Once per month seems reasonable ✦ After a repair or recovery procedure • NexentaStor auto-scrub features easily manages scrubs and schedules Estimated scrub completion time improves over timeZFS Tutorial USENIX LISA’11 120
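On systems without NexentaStor's auto-scrub service, a monthly scrub can be driven by a small script like this sketch (pool name hypothetical):

```shell
#!/bin/sh
# Start a scrub and poll until it finishes; suitable for a monthly
# cron job. The "scrub in progress" text matches older zpool status
# output and may differ by release.
pool=tank
zpool scrub "$pool"
while zpool status "$pool" | grep -q 'scrub in progress'; do
    sleep 300
done
zpool status -v "$pool"
```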
  121. 121. auto-scrub serviceZFS Tutorial USENIX LISA’11 121
  122. 122. zfs Command 122
123. 123. Dataset Management • (diagram) ZFS stack: raw/swap/dump/iSCSI consumers via the ZFS Volume Emulator (Zvol); NFS/CIFS via the ZFS POSIX Layer (ZPL); pNFS/Lustre/others — atop the Transactional Object Layer, the Pooled Storage Layer, and the block device drivers (HDD, SSD, iSCSI, ...)November 8, 2010 USENIX LISA’10 123
  124. 124. zfs create, destroy • By default, a file system with the same name as the pool is created by zpool create • Dataset name format is: pool/name[/name ...] • File system / folder ✦ zfs create dataset-name ✦ nmc: create folder ✦ zfs destroy dataset-name ✦ nmc: destroy folder • Zvol ✦ zfs create -V size dataset-name ✦ nmc: create zvol ✦ zfs destroy dataset-name ✦ nmc: destroy zvolZFS Tutorial USENIX LISA’11 124
  125. 125. zfs mount, unmount • Note: mount point is a file system parameter ✦ zfs get mountpoint fs-name • Rarely used subcommand (!) • Display mounted file systems ✦ zfs mount • Mount a file system ✦ zfs mount fs-name ✦ zfs mount -a • Unmount (not umount) ✦ zfs unmount fs-name ✦ zfs unmount -aZFS Tutorial USENIX LISA’11 125
126. 126. zfs list • List mounted datasets • NexentaStor 2: listed everything • NexentaStor 3: does not list snapshots by default ✦ See zpool listsnapshots property • Examples ✦ zfs list ✦ zfs list -t snapshot ✦ zfs list -H -o nameZFS Tutorial USENIX LISA’11 126
127. 127. Replication Services • (chart) Recovery Point Objective (days → seconds) vs. system I/O performance (slower → faster) ✦ Days: traditional backup, NDMP ✦ Hours: auto-tier (rsync), auto-sync (ZFS send/receive) ✦ Seconds: auto-CDP (AVS/SNDR), application-level mirror replicationZFS Tutorial USENIX LISA’11 127
128. 128. zfs send, receive • Send ✦ send a snapshot to stdout ✦ data is decompressed • Receive ✦ receive a snapshot from stdin ✦ receiving file system parameters apply (compression, etc.) • Can incrementally send snapshots in time order • Handy way to replicate dataset snapshots • NexentaStor ✦ simplifies management ✦ manages snapshots and send/receive to remote systems • Only method for replicating dataset properties, except quotas • NOT a replacement for traditional backup solutionsZFS Tutorial USENIX LISA’11 128
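The incremental replication cycle above can be sketched as follows; hostnames, dataset names, and snapshot names are hypothetical:

```shell
# Day 1: full send of the first snapshot
zfs snapshot zwimming/stuff@mon
zfs send zwimming/stuff@mon | ssh backuphost zfs receive -d backup

# Day 2: send only the blocks changed between @mon and @tue
zfs snapshot zwimming/stuff@tue
zfs send -i @mon zwimming/stuff@tue | ssh backuphost zfs receive -d backup
```

The receive -d flag preserves the sent dataset's path under the target pool, so repeated incrementals land in the same place.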
  129. 129. auto-sync ServiceZFS Tutorial USENIX LISA’11 129
130. 130. zfs upgrade• Display current versions ✦ zfs upgrade• View available upgrade versions, with features, but don't actually upgrade ✦ zfs upgrade -v• Upgrade dataset to latest version ✦ zfs upgrade dataset• Upgrade dataset to specific version ✦ zfs upgrade -V version dataset• NexentaStor: not needed until 3.0 Once you upgrade, there is no downgrade Beware of grub and rollback issuesZFS Tutorial USENIX LISA’11 130
  131. 131. Sharing 131
  132. 132. Sharing • zfs share dataset • Type of sharing set by parameters ✦ shareiscsi = [on | off] ✦ sharenfs = [on | off | options] ✦ sharesmb = [on | off | options] • Shortcut to manage sharing ✦ Uses external services (nfsd, iscsi target, smbshare, etc) ✦ Importing pool will also share ✦ Implementation is OS-specific ✤ sharesmb uses in-kernel SMB server for Solaris-derived OSes ✤ sharesmb uses Samba for FreeBSDZFS Tutorial USENIX LISA’11 132
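Since sharing is property-driven, enabling it is a zfs set away. A sketch; the dataset name, subnet, admin host, and share name are hypothetical:

```shell
# NFS: read/write for one subnet, root access for one admin host
zfs set sharenfs='rw=@192.168.1.0/24,root=adminhost' tank/export

# SMB/CIFS: share under the name "export" (served by the in-kernel
# SMB server on Solaris-derived OSes)
zfs set sharesmb=name=export tank/export
```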
  133. 133. Properties 133
134. 134. Properties • Properties are stored in an nvlist • By default, properties are inherited • Some properties are common to all datasets, but a specific dataset type may have additional properties • Easily set or retrieved via scripts • In general, properties affect future file system activity zpool get doesn't script as nicely as zfs getZFS Tutorial USENIX LISA’11 134
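The scripting note above is mostly about output format: zfs get -H drops the header and emits one dataset per line (tab-separated name, property, value, source), so it pipes cleanly into awk. A sketch using canned output in place of a live pool:

```shell
#!/bin/sh
# List datasets where compression is not "off". The printf stands in
# for `zfs get -H compression -r tank`; real output is tab-separated,
# but awk's default field splitting handles both.
printf '%s\n' \
    'tank compression off default' \
    'tank/ws compression on local' \
    'tank/logs compression gzip local' |
awk '$3 != "off" { print $1 }'
```

For the canned input this prints tank/ws and tank/logs only.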
  135. 135. Getting Properties• zpool get all poolname• nmc: show volume volumename property propertyname• zpool get propertyname poolname• zfs get all dataset-name• nmc: show folder foldername property• nmc: show zvol zvolname propertyZFS Tutorial USENIX LISA’11 135
  136. 136. Setting Properties• zpool set propertyname=value poolname• nmc: setup volume volumename property propertyname• zfs set propertyname=value dataset-name• nmc: setup folder foldername property propertynameZFS Tutorial USENIX LISA’11 136
  137. 137. User-defined Properties • Names ✦ Must include colon : ✦ Can contain lower case alphanumerics or “+” “.” “_” ✦ Max length = 256 characters ✦ By convention, module:property ✤ com.sun:auto-snapshot • Values ✦ Max length = 1024 characters • Examples ✦ com.sun:auto-snapshot=true ✦ com.richardelling:important_files=trueZFS Tutorial USENIX LISA’11 137
138. 138. Clearing Properties • Reset to inherited value ✦ zfs inherit compression export/home/relling • Clear user-defined parameter ✦ zfs inherit com.sun:auto-snapshot export/home/relling • NexentaStor doesn’t offer a method in nmcZFS Tutorial USENIX LISA’11 138
139. 139. Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoexpand Policy for expanding when vdev size changes autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policyZFS Tutorial USENIX LISA’11 139
  140. 140. More Pool Properties Property Change? Brief Description guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk versionZFS Tutorial USENIX LISA’11 140
141. 141. Common Dataset Properties Property Change? Brief Description available readonly Space available to dataset & children checksum Checksum algorithm compression Compression algorithm compressratio readonly Compression ratio – logical size:referenced physical size copies Number of copies of user data creation readonly Dataset creation time dedup Deduplication policy logbias Separate log write policy mlslabel Multilayer security label origin readonly For clones, origin snapshotZFS Tutorial USENIX LISA’11 141
142. 142. More Common Dataset Properties Property Change? Brief Description primarycache ARC caching policy readonly Is dataset in readonly mode? referenced readonly Size of data accessible by this dataset refreservation Minimum space guaranteed to a dataset, excluding descendants (snapshots & clones) reservation Minimum space guaranteed to dataset, including descendants secondarycache L2ARC caching policy sync Synchronous write policy type readonly Type of dataset (filesystem, snapshot, volume)ZFS Tutorial USENIX LISA’11 142
143. 143. More Common Dataset Properties Property Change? Brief Description used readonly Sum of usedby* (see below) usedbychildren readonly Space used by descendants usedbydataset readonly Space used by dataset usedbyrefreservation readonly Space used by a refreservation for this dataset usedbysnapshots readonly Space used by all snapshots of this dataset zoned readonly Is dataset added to non-global zone (Solaris)ZFS Tutorial USENIX LISA’11 143
144. 144. Volume Dataset Properties Property Change? Brief Description shareiscsi iSCSI service (not COMSTAR) volblocksize creation fixed block size volsize Implicit quota zoned readonly Set if dataset delegated to non-global zone (Solaris)ZFS Tutorial USENIX LISA’11 144
145. 145. File System Properties Property Change? Brief Description aclinherit ACL inheritance policy, when files or directories are created aclmode ACL modification policy, when chmod is used atime Disable access time metadata updates canmount Mount policy casesensitivity creation Filename matching algorithm (CIFS client feature) devices Device opening policy for dataset exec File execution policy for dataset mounted readonly Is file system currently mounted?ZFS Tutorial USENIX LISA’11 145
146. 146. More File System Properties Property Change? Brief Description nbmand export/import File system should be mounted with non-blocking mandatory locks (CIFS client feature) normalization creation Unicode normalization of file names for matching quota Max space dataset and descendants can consume recordsize Suggested maximum block size for files refquota Max space dataset can consume, not including descendants setuid setuid mode policy sharenfs NFS sharing options sharesmb File system shared with CIFSZFS Tutorial USENIX LISA’11 146
  147. 147. File System Properties Property Change? Brief Description snapdir Controls whether .zfs directory is hidden utf8only creation UTF-8 character file name policy vscan Virus scan enabled xattr Extended attributes policyZFS Tutorial USENIX LISA’11 147
148. 148. Forking Properties Pool Properties Release Property Brief Description illumos comment Human-readable comment field Dataset Properties Release Property Brief Description Solaris 11 encryption Dataset encryption Delphix/illumos clones Clone descendants Delphix/illumos refratio Compression ratio for references Solaris 11 share Combines sharenfs & sharesmb Solaris 11 shadow Shadow copy NexentaOS/illumos worm WORM feature Delphix/illumos written Amount of data written since last snapshotZFS Tutorial USENIX LISA’11 148
  149. 149. More Goodies 149
  150. 150. Dataset Space Accounting • used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation • Lazy updates, may not be correct until txg commits • ls and du will show size of allocated files which includes all copies of a file • Shorthand report available$ zfs list -o spaceNAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILDrpool 126G 18.3G 0 35.5K 0 18.3Grpool/ROOT 126G 15.3G 0 18K 0 15.3Grpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0rpool/dump 126G 1.00G 0 1.00G 0 0rpool/export 126G 37K 0 19K 0 18Krpool/export/home 126G 18K 0 18K 0 0rpool/swap 128G 2G 0 193M 1.81G 0 ZFS Tutorial USENIX LISA’11 150
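The identity above can be sanity-checked against the rpool/ROOT/snv_b108 row of the listing; the report rounds to GiB, so the sum is slightly off from the reported 15.2G:

```shell
#!/bin/sh
# used ?= usedbysnapshots + usedbydataset + usedbyrefreservation
#         + usedbychildren
# Row rpool/ROOT/snv_b108: USEDSNAP=5.89G, USEDDS=9.28G, others 0
awk 'BEGIN {
    sum = 5.89 + 9.28 + 0 + 0
    printf "%.2fG computed vs 15.2G reported\n", sum
}'
```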
  151. 151. Pool Space Accounting • Pool space accounting changed in b128, along with deduplication • Compression, deduplication, and raidz complicate pool accounting (the numbers are correct, the interpretation is suspect) • Capacity planning for remaining free space can be challenging $ zpool list zwimming NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT zwimming 100G 43.9G 56.1G 43% 1.00x ONLINE -ZFS Tutorial USENIX LISA’11 151
  152. 152. zfs vs zpool Space Accounting • zfs list != zpool list • zfs list shows space used by the dataset plus space for internal accounting • zpool list shows physical space available to the pool • For simple pools and mirrors, they are nearly the same • For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space available for parity Users will be confused about reported space availableZFS Tutorial USENIX LISA’11 152
  153. 153. NexentaStor Snapshot ServicesZFS Tutorial USENIX LISA’11 153
154. 154. Accessing Snapshots • By default, snapshots are accessible in the .zfs directory • Visibility of .zfs directory is tunable via snapdir property ✦ Don't really want find to find the .zfs directory • Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS) # zfs snapshot rpool/export/home/relling@20090415 # ls -a /export/home/relling … .Xsession .xsession-errors # ls /export/home/relling/.zfs shares snapshot # ls /export/home/relling/.zfs/snapshot 20090415 # ls /export/home/relling/.zfs/snapshot/20090415 Desktop Documents Downloads PublicZFS Tutorial USENIX LISA’11 154
155. 155. Time Slider - Automatic Snapshots • Solaris feature similar to OS X's Time Machine • SMF service for managing snapshots • SMF properties used to specify policies: frequency (interval) and number to keep • Creates cron jobs • GUI tool makes it easy to select individual file systems • Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion Service Name Interval (default) Keep (default) auto-snapshot:frequent 15 minutes 4 auto-snapshot:hourly 1 hour 24 auto-snapshot:daily 1 day 31 auto-snapshot:weekly 7 days 4 auto-snapshot:monthly 1 month 12ZFS Tutorial USENIX LISA’11 155
  156. 156. Nautilus • File system views which can go back in timeZFS Tutorial USENIX LISA’11 156
  157. 157. Resilver & Scrub • Can be read IOPS bound • Resilver can also be bandwidth bound to the resilvering device • Both work at lower I/O scheduling priority than normal work, but that may not matter for read IOPS bound devices • Dueling RFEs: ✦ Resilver should go faster ✦ Resilver should go slower ✤ Integrated in b140ZFS Tutorial USENIX LISA’11 157