ZFS Tutorial USENIX June 2009

ZFS presentation delivered as a tutorial at the 2009 USENIX technical conference by Richard Elling

Presentation Transcript

  • USENIX 2009 ZFS Tutorial Richard.Elling@RichardElling.com
  • Agenda ● Overview ● Foundations ● Pooled Storage Layer ● Transactional Object Layer ● Commands – zpool – zfs ● Sharing ● Properties ● More goodies ● Performance ● Wrap June 13, 2009 © 2009 Richard Elling 2
  • History ● Announced September 14, 2004 ● Integration history – SXCE b27 (November 2005) – FreeBSD (April 2007) – Mac OSX Leopard (~ June 2007) – OpenSolaris 2008.05 – Solaris 10 6/06 (June 2006) – Linux FUSE (summer 2006) – greenBytes ZFS+ (September 2008) ● More than 45 patents, contributed to the CDDL Patents Common June 13, 2009 © 2009 Richard Elling 3
  • Brief List of Features ● Future-proof ● Cutting-edge data integrity ● High performance ● Simplified administration ● Eliminates need for volume managers ● Reduced costs ● Compatibility with POSIX file system & block devices ● Self-healing ● Marketing quotes (2 drink minimum): “No silent data corruption ever” – “Mind-boggling scalability” – “Breathtaking speed” – “Near zero administration” – “Radical new architecture” – “Greatly simplifies support issues” – “RAIDZ saves money” June 13, 2009 © 2009 Richard Elling 4
  • ZFS Design Goals ● Figure out why storage has gotten so complicated ● Blow away 20+ years of obsolete assumptions ● Gotta replace UFS ● Design an integrated system from scratch ● End the suffering June 13, 2009 © 2009 Richard Elling 5
  • Limits 2^48 — Number of entries in any individual directory 2^56 — Number of attributes of a file [1] 2^56 — Number of files in a directory [1] 16 EiB (2^64 bytes) — Maximum size of a file system 16 EiB — Maximum size of a single file 16 EiB — Maximum size of any attribute 2^64 — Number of devices in any pool 2^64 — Number of pools in a system 2^64 — Number of file systems in a pool 2^64 — Number of snapshots of any file system 256 ZiB (2^78 bytes) — Maximum size of any pool [1] actually constrained to 2^48 for the number of files in a ZFS file system June 13, 2009 © 2009 Richard Elling 6
  • Sidetrack: Understanding Builds ● Build is often referenced when speaking of feature/bug integration ● Short-hand notation: b# ● OpenSolaris and SXCE are based on NV ● ZFS development done for NV – Bi-weekly build cycle – Schedule at http://opensolaris.org/os/community/on/schedule/ ● ZFS is ported to Solaris 10 and other OSes June 13, 2009 © 2009 Richard Elling 7
  • Foundations June 13, 2009 © 2009 Richard Elling 8
  • Overhead View of a Pool Pool File System Configuration Information Volume File System Volume Dataset June 13, 2009 © 2009 Richard Elling 9
  • Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 10
  • Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration June 13, 2009 © 2009 Richard Elling 11
  • Acronyms ● ARC – Adaptive Replacement Cache ● DMU – Data Management Unit ● DSL – Dataset and Snapshot Layer ● JNI – Java Native Interface ● ZPL – ZFS POSIX Layer (traditional file system interface) ● VDEV – Virtual Device layer ● ZAP – ZFS Attribute Processor ● ZIL – ZFS Intent Log ● ZIO – ZFS I/O layer ● Zvol – ZFS volume (raw/cooked block device interface) June 13, 2009 © 2009 Richard Elling 12
  • nvlists ● name=value pairs ● libnvpair(3LIB) ● Allows ZFS capabilities to change without changing the physical on- disk format ● Data stored is XDR encoded ● A good thing, used often June 13, 2009 © 2009 Richard Elling 13
  • Versioning ● Features can be added and identified by nvlist entries ● Changes in pool or dataset version do not change the physical on-disk format (!) – they do change nvlist parameters ● Older versions can be used – might see warning messages, but harmless ● Available versions and features can be easily viewed – zpool upgrade -v – zfs upgrade -v ● Online references – zpool: www.opensolaris.org/os/community/zfs/version/N – zfs: www.opensolaris.org/os/community/zfs/version/zpl/N Don't confuse zpool and zfs versions June 13, 2009 © 2009 Richard Elling 14
  • zpool versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit support 15 user and group quotas 16 COMSTAR support June 13, 2009 © 2009 Richard Elling 15
  • zfs versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS filesystem version 2 Enhanced directory entries 3 Case insensitive and File system unique identifier (FUID) 4 user and group quotas June 13, 2009 © 2009 Richard Elling 16
  • Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free June 13, 2009 © 2009 Richard Elling 17
  • COW Notes ● COW works on blocks, not files ● ZFS reserves 32 MBytes or 1/64 of pool size – COW needs some free space to remove files – need space for ZIL ● For fixed record-size workloads, “fragmentation” and “poor performance” can occur if the recordsize is not matched ● Spatial distribution is good fodder for performance speculation – affects HDDs – moot for SSDs June 13, 2009 © 2009 Richard Elling 18
  • Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 19
  • vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type=disk type=disk type=disk type=disk children[0] children[1] children[0] children[1] Physical or leaf vdevs June 13, 2009 © 2009 Richard Elling 20
  • vdev Labels ● vdev labels != disk labels ● 4 labels written to every physical vdev ● Label size = 256kBytes ● Two-stage update process – write label0 & label2 – check for errors – write label1 & label3 0 256k 512k 4M N-512k N-256k N Boot label0 label1 label2 label3 Block June 13, 2009 © 2009 Richard Elling 21
  • vdev Label Contents 0 256k 512k 4M N-512k N-256k N Boot label0 label1 label2 label3 Block Boot Name=Value Blank Header Pairs 128-slot Uberblock Array 0 8k 16k 128k 256k June 13, 2009 © 2009 Richard Elling 22
  • Observing Labels # zdb -l /dev/rdsk/c0t0d0s0 -------------------------------------------- LABEL 0 -------------------------------------------- version=14 name='rpool' state=0 txg=13152 pool_guid=17111649328928073943 hostid=8781271 hostname='' top_guid=11960061581853893368 guid=11960061581853893368 vdev_tree type='disk' id=0 guid=11960061581853893368 path='/dev/dsk/c0t0d0s0' devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a' phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a' whole_disk=0 metaslab_array=24 metaslab_shift=30 ashift=9 asize=157945167872 is_log=0 June 13, 2009 © 2009 Richard Elling 23
  • Uberblocks ● 1 kByte ● Stored in 128-entry circular queue ● Only one uberblock is active at any time – highest transaction group number – correct SHA-256 checksum ● Stored in machine's native format – A magic number is used to determine endian format when imported ● Contains pointer to MOS June 13, 2009 © 2009 Richard Elling 24
  • MOS – Meta Object Set ● Only one MOS per pool ● Contains object directory pointers – root_dataset – references all top-level datasets in the pool – config – nvlist describing the pool configuration – sync_bplist – list of block pointers which need to be freed during the next transaction June 13, 2009 © 2009 Richard Elling 25
  • Block Pointers ● blkptr_t structure ● 128 bytes ● contents: – 3x data virtual address (DVA) – endianness – level of indirection – DMU object type – checksum function – compression function – physical size – logical size – birth txg – fill count – checksum (256 bits) June 13, 2009 © 2009 Richard Elling 26
  • DVA – Data Virtual Address ● Contains – vdev id – offset in sectors – grid (future) – allocated size – gang block indicator ● Physical block address = (offset << 9) + 4 MBytes June 13, 2009 © 2009 Richard Elling 27
  • Gang Blocks ● Gang blocks contain block pointers ● Used when space requested is not available in a contiguous block ● 512 bytes ● self checksummed ● contains 3 block pointers June 13, 2009 © 2009 Richard Elling 28
  • To fsck or not to fsck ● fsck was created to fix known inconsistencies in file system metadata – UFS is not transactional – metadata inconsistencies must be reconciled – does NOT repair data – how could it? ● ZFS doesn't need fsck, as-is – all on-disk changes are transactional – COW means previously existing, consistent metadata is not overwritten – ZFS can repair itself ● metadata is at least dual-redundant ● data can also be redundant ● Reality check – this does not mean that ZFS is not susceptible to corruption – nor is any other file system June 13, 2009 © 2009 Richard Elling 29
  • VDEV June 13, 2009 © 2009 Richard Elling 30
  • Dynamic Striping ● RAID-0 – SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern ● Dynamic Stripe – Data is dynamically mapped to member disks – No fixed-length sequences – Allocate up to ~1 MByte/vdev before changing vdev – vdevs can be different size – Good combination of the concatenation feature with RAID-0 performance June 13, 2009 © 2009 Richard Elling 31
  • Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes June 13, 2009 © 2009 Richard Elling 32
  • Mirroring ● Straightforward: put N copies of the data on N vdevs ● Unlike RAID-1 – No 1:1 mapping at the block level – vdev labels are still at beginning and end – vdevs can be of different size ● effective space is that of smallest vdev ● Arbitration: ZFS does not blindly trust either side of mirror – Most recent, correct view of data wins – Checksums validate data June 13, 2009 © 2009 Richard Elling 33
  • Mirroring June 13, 2009 © 2009 Richard Elling 34
  • Dynamic vdev Replacement ● zpool replace poolname vdev [vdev] ● Today, replacing vdev must be same size or larger (as measured by blocks) ● Replacing all vdevs in a top-level vdev with larger vdevs results in (automatic?) top-level vdev resizing 15G 10G 10G 15G 10G 20G 15G 20G 10G 15G 20G 20G 20G 20G 10G 10G Mirror 10G Mirror 15G Mirror 15G Mirror 20G Mirror June 13, 2009 © 2009 Richard Elling 35
  • RAIDZ ● RAID-5 – Parity check data is distributed across the RAID array's disks – Must read/modify/write when data is smaller than stripe width ● RAIDZ – Dynamic data placement – Parity added as needed – Writes are full-stripe writes – No read/modify/write (write hole) ● Arbitration: ZFS does not blindly trust any device – Does not rely on disk reporting read error – Checksums validate data – If checksum fails, read parity Space used is dependent on how used June 13, 2009 © 2009 Richard Elling 36
  • RAID-5 vs RAIDZ DiskA DiskB DiskC DiskD DiskE D0:0 D0:1 D0:2 D0:3 P0 RAID-5 P1 D1:0 D1:1 D1:2 D1:3 D2:3 P2 D2:0 D2:1 D2:2 D3.2 D3:3 P3 D3:0 D3:1 DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 RAIDZ P1 D1:0 D1:1 P2:0 D2:0 D2:1 D2:2 D2:3 P2:1 D2:4 D2:5 June 13, 2009 © 2009 Richard Elling 37
  • RAID-5 Write Hole ● Occurs when data to be written is smaller than stripe size ● Must read unallocated columns to recalculate the parity or the parity must be read/modify/write ● Read/modify/write is risky for consistency – Multiple disks – Reading independently – Writing independently – System failure before all writes are complete to media could result in data loss ● Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disks June 13, 2009 © 2009 Richard Elling 38
  • RAIDZ2 ● RAIDZ2 = double parity RAIDZ – Can recover data if any 2 leaf vdevs fail ● Sorta like RAID-6 – Parity 1: XOR – Parity 2: another Reed-Solomon syndrome ● More computationally expensive than RAIDZ ● Arbitration: ZFS does not blindly trust any device – Does not rely on disk reporting read error – Checksums validate data – If data not valid, read parity – If data still not valid, read other parity Space used is dependent on how used June 13, 2009 © 2009 Richard Elling 39
  • Evaluating Data Retention ● MTTDL = Mean Time To Data Loss ● Note: MTBF is not constant in the real world, but assuming it is keeps the math simple ● MTTDL[1] is a simple MTTDL model ● No parity (single vdev, striping, RAID-0) – MTTDL[1] = MTBF / N ● Single Parity (mirror, RAIDZ, RAID-1, RAID-5) – MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) ● Double Parity (3-way mirror, RAIDZ2, RAID-6) – MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2) June 13, 2009 © 2009 Richard Elling 40
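    A worked example with illustrative numbers (not from the slides): N = 8 disks, MTBF = 1,000,000 hours, MTTR = 24 hours.
    No parity: MTTDL[1] = 1,000,000 / 8 = 125,000 hours (roughly 14 years)
    Single parity: MTTDL[1] = 1,000,000^2 / (8 * 7 * 24) ≈ 7.4 * 10^8 hours (roughly 85,000 years)
    Double parity: MTTDL[1] = 1,000,000^3 / (8 * 7 * 6 * 24^2) ≈ 5.2 * 10^12 hours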
  • Another MTTDL Model ● MTTDL[1] model doesn't take into account unrecoverable reads ● But unrecoverable reads (UER) are becoming the dominant failure mode – UER specified as errors per bits read – More bits = higher probability of loss per vdev ● MTTDL[2] model considers UER June 13, 2009 © 2009 Richard Elling 41
  • Why Worry about UER? ● Richard's study – 3,684 hosts with 12,204 LUNs – 11.5% of all LUNs reported read errors ● Bairavasundaram et al., FAST08 www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf – 1.53M LUNs over 41 months – RAID reconstruction discovers 8% of checksum mismatches – 4% of disks studied developed checksum errors over 17 months June 13, 2009 © 2009 Richard Elling 42
  • Why Worry about UER? ● RAID array study June 13, 2009 © 2009 Richard Elling 43
  • MTTDL[2] Model ● Probability that a reconstruction will fail – Precon_fail = (N-1) * size / UER ● Model doesn't work for non-parity schemes (single vdev, striping, RAID-0) ● Single Parity (mirror, RAIDZ, RAID-1, RAID-5) – MTTDL[2] = MTBF / (N * Precon_fail) ● Double Parity (3-way mirror, RAIDZ2, RAID-6) – MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail) June 13, 2009 © 2009 Richard Elling 44
  • Practical View of MTTDL[1] June 13, 2009 © 2009 Richard Elling 45
  • MTTDL Models: Mirror June 13, 2009 © 2009 Richard Elling 46
  • MTTDL Models: RAIDZ2 June 13, 2009 © 2009 Richard Elling 47
  • Ditto Blocks ● Recall that each blkptr_t contains 3 DVAs ● Allows up to 3 physical copies of the data ZFS copies parameter Data copies Metadata copies default 1 2 copies=2 2 3 copies=3 3 3 June 13, 2009 © 2009 Richard Elling 48
  • Copies ● Dataset property used to indicate how many copies (aka ditto blocks) of data is desired – Write all copies – Read any copy – Recover corrupted read from a copy ● By default – data copies=1 – metadata copies=data copies +1 or max=3 ● Not a replacement for mirroring ● Easier to describe in pictures... June 13, 2009 © 2009 Richard Elling 49
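    A minimal example of setting the copies property (dataset name is illustrative); it affects only blocks written after the property is set:
    # zfs set copies=2 tank/important
    # zfs get copies tank/important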
  • Copies in Pictures June 13, 2009 © 2009 Richard Elling 50
  • Copies in Pictures June 13, 2009 © 2009 Richard Elling 51
  • ZIO – ZFS I/O Layer June 13, 2009 © 2009 Richard Elling 52
  • ZIO Framework ● All physical disk I/O goes through ZIO Framework ● Translates DVAs into Logical Block Address (LBA) on leaf vdevs – Keeps free space maps (spacemap) – If contiguous space is not available: ● Allocate smaller blocks (the gang) ● Allocate gang block, pointing to the gang ● Implemented as multi-stage pipeline – Allows extensions to be added fairly easily ● Handles I/O errors June 13, 2009 © 2009 Richard Elling 53
  • SpaceMap from Space June 13, 2009 © 2009 Richard Elling 54
  • ZIO Write Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open compress if savings > 12.5% encrypt generate allocate start start start done done done assess assess assess done Gang activity elided, for clarity June 13, 2009 © 2009 Richard Elling 55
  • ZIO Read Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open start start start done done done assess assess assess verify decrypt decompress done Gang activity elided, for clarity June 13, 2009 © 2009 Richard Elling 56
  • VDEV – Virtual Device Subsystem ● Where mirrors, RAIDZ, and RAIDZ2 are implemented – Surprisingly few lines of code needed to implement RAID ● Leaf vdev (physical device) I/O management – Number of outstanding iops – Read-ahead cache ● Priority scheduling (name, priority): NOW 0, SYNC_READ 0, SYNC_WRITE 0, FREE 0, CACHE_FILL 0, LOG_WRITE 0, ASYNC_READ 4, ASYNC_WRITE 4, RESILVER 10, SCRUB 20 June 13, 2009 © 2009 Richard Elling 57
  • ARC – Adaptive Replacement Cache June 13, 2009 © 2009 Richard Elling 58
  • Object Cache ● UFS uses a page cache managed by the virtual memory system ● ZFS does not use the page cache, except for mmap'ed files ● ZFS uses an Adaptive Replacement Cache (ARC) ● ARC used by DMU to cache DVA data objects ● Only one ARC per system, but caching policy can be changed on a per-dataset basis ● Seems to work much better than the page cache ever did for UFS June 13, 2009 © 2009 Richard Elling 59
  • Traditional Cache ● Works well when data being accessed was recently added ● Doesn't work so well when frequently accessed data is evicted [diagram: misses cause inserts at the MRU end, the oldest entry is evicted at the LRU end; dynamic caches can change size by either not evicting or aggressively evicting] June 13, 2009 © 2009 Richard Elling 60
  • ARC – Adaptive Replacement Cache [diagram: a Recent Cache (misses insert at MRU, the oldest single-use entry is evicted at LRU) feeds a Frequent Cache (hits promote entries, the oldest multiply-accessed entry is evicted at LRU); evictions and dynamic resizing need to choose the best cache to evict (shrink)] June 13, 2009 © 2009 Richard Elling 61
  • ZFS ARC – Adaptive Replacement Cache with Locked Pages [diagram: same Recent/Frequent cache structure, but locked pages cannot be evicted; hits are promoted to the Frequent Cache subject to a 62 ms window] ZFS ARC handles mixed-size pages June 13, 2009 © 2009 Richard Elling 62
  • ARC Directory ● Each ARC directory entry contains arc_buf_hdr structs – Info about the entry – Pointer to the entry ● Directory entries are ~200 bytes each ● ZFS block size is dynamic, 512 bytes – 128 kBytes ● Disks are large ● Suppose we use a Seagate LP 2 TByte disk for the L2ARC – Disk has 3,907,029,168 512-byte sectors, guaranteed – Workload uses 8 kByte fixed record size – RAM needed for arc_buf_hdr entries ● Need = 3,907,029,168 * 200 / 16 = 45 GBytes ● Don't underestimate the RAM needed for large L2ARCs June 13, 2009 © 2009 Richard Elling 63
  • L2ARC – Level 2 ARC ● ARC evictions are sent to cache vdev ● ARC directory remains in memory ARC ● Works well when cache vdev is optimized for fast reads – lower latency than pool disks evicted – inexpensive way to “increase memory” data ● Content considered volatile, no ZFS data protection allowed ● Monitor usage with zpool iostat “cache” “cache” “cache” vdev vdev vdev June 13, 2009 © 2009 Richard Elling 64
  • ARC Tips ● In general, it seems to work well for most workloads ● ARC size will vary, based on usage ● Internals tracked by kstats in Solaris – Use memory_throttle_count to observe pressure to evict ● Can limit at boot time – Solaris – set zfs:zfs_arc_max in /etc/system ● Performance – Prior to b107, L2ARC fill rate was limited to 8 MBytes/s L2ARC keeps its directory in kernel memory June 13, 2009 © 2009 Richard Elling 65
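    A sketch of capping the ARC on Solaris (the 2 GByte value is illustrative and takes effect after a reboot); the arcstats kstats show current size and activity:
    /etc/system entry:  set zfs:zfs_arc_max = 0x80000000
    # kstat -n arcstats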
  • Transactional Object Layer June 13, 2009 © 2009 Richard Elling 66
  • Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration June 13, 2009 © 2009 Richard Elling 67
  • ZAP – ZFS Attribute Processor ● Module sits on top of DMU ● Important component for managing everything ● Operates on ZAP objects – Contain name=value pairs ● FatZAP – Flexible architecture for storing large numbers of attributes ● MicroZAP – Lightweight version of fatzap – Uses 1 block – All name=value pairs must fit in block – Names <= 50 chars (including NULL terminator) – Values are type uint64_t June 13, 2009 © 2009 Richard Elling 68
  • DMU – Data Management Layer ● Datasets issue transactions to the DMU ● Transactional based object model ● Transactions are – Atomic – Grouped (txg = transaction group) ● Responsible for on-disk data ● ZFS Attribute Processor (ZAP) ● Dataset and Snapshot Layer (DSL) ● ZFS Intent Log (ZIL) June 13, 2009 © 2009 Richard Elling 69
  • Transaction Engine ● Manages physical I/O ● Transactions grouped into transaction group (txg) – txg updates – All-or-nothing – Commit interval ● Older versions: 5 seconds (zfs_ ● Now: 30 seconds max, dynamically scale based on time required to commit txg ● Delay committing data to physical storage – Improves performance – A bad thing for sync workloads – hence the ZFS Intent Log (ZIL) 30 second delay could impact failure detection time June 13, 2009 © 2009 Richard Elling 70
  • ZIL – ZFS Intent Log ● DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle the “write it now” desire of sync writers – NFS – Databases ● If I/O < 32 kBytes – write it (now) to ZIL (allocated from pool) – write it later as part of the txg commit ● If I/O > 32 kBytes, write it to pool now – Should be faster for large, sequential writes ● Never read, except at import (e.g. reboot), when transactions may need to be rolled forward June 13, 2009 © 2009 Richard Elling 71
  • Separate Logs (slogs) ● ZIL competes with pool for iops – Applications will wait for sync writes to be on nonvolatile media – Very noticeable on HDD JBODs ● Put ZIL on separate vdev, outside of pool – ZIL writes tend to be sequential – No competition with pool for iops – Downside: slog device required to be operational at import ● 10x or more performance improvements possible – Better if using write-optimized SSD or non-volatile write cache on RAID array ● Use zilstat to observe ZIL activity June 13, 2009 © 2009 Richard Elling 72
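    A minimal sketch of adding a separate log device (pool and device names are illustrative); mirroring the slog guards against losing the log device:
    # zpool add tank log c4t0d0                      (single slog)
    # zpool add tank log mirror c5t0d0 c6t0d0        (mirrored slog, alternative)
    # zpool iostat -v tank 1                         (watch log vdev activity)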
  • DSL – Dataset and Snapshot Layer June 13, 2009 © 2009 Richard Elling 73
  • Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free June 13, 2009 © 2009 Richard Elling 74
  • zfs snapshot ● Create a read-only, point-in-time window into the dataset (file system or Zvol) ● Computationally free, because of COW architecture ● Very handy feature – Patching/upgrades – Basis for Time Slider June 13, 2009 © 2009 Richard Elling 75
  • Snapshot Snapshot tree root Current tree root ● Create a snapshot by not freeing COWed blocks ● Snapshot creation is fast and easy ● Number of snapshots determined by use – no hardwired limit ● Recursive snapshots also possible June 13, 2009 © 2009 Richard Elling 76
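    Minimal snapshot examples (dataset names are illustrative):
    # zfs snapshot tank/home/relling@before-patch
    # zfs snapshot -r tank@nightly-20090613          (recursive: the pool and all descendants)
    # zfs list -t snapshot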
  • Clones ● Snapshots are read-only ● Clones are read-write based upon a snapshot ● Child depends on parent – Cannot destroy parent without destroying all children – Can promote children to be parents ● Good ideas – OS upgrades – Change control – Replication ● zones ● virtual disks June 13, 2009 © 2009 Richard Elling 77
  • zfs clone ● Create a read-write file system from a read-only snapshot ● Used extensively for OpenSolaris upgrades OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 snapshot snapshot snapshot OS rev1 upgrade OS rev2 clone boot manager Origin snapshot cannot be destroyed, if clone exists June 13, 2009 © 2009 Richard Elling 78
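    A sketch of cloning (names are illustrative); the clone is read-write, and the origin snapshot cannot be destroyed while the clone exists:
    # zfs snapshot tank/zone1@gold
    # zfs clone tank/zone1@gold tank/zone2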
  • zfs promote OS b104 OS b104 clone OS rev1 rpool/ROOT/b104 rpool/ROOT/b104 OS b104 OS b105 OS rev2 snapshot snapshot snapshot rpool/ROOT/b104@today rpool/ROOT/b105@today OS b105 OS b105 clone promote OS rev2 rpool/ROOT/b105 rpool/ROOT/b105 June 13, 2009 © 2009 Richard Elling 79
  • zfs rollback OS b104 OS b104 rpool/ROOT/b104 rpool/ROOT/b104 OS b104 OS b104 snapshot rollback snapshot rpool/ROOT/b104@today rpool/ROOT/b104@today June 13, 2009 © 2009 Richard Elling 80
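    A minimal rollback example (names illustrative); -r also destroys any snapshots newer than the target:
    # zfs rollback tank/home/relling@before-patch
    # zfs rollback -r tank/home/relling@last-week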
  • Commands June 13, 2009 © 2009 Richard Elling 81
  • zpool(1m) raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 82
  • Dataset & Snapshot Layer ● Object – Allocated storage – dnode describes collection Dataset Directory of blocks Dataset ● Object Set Object Set Childmap – Group of related objects ● Dataset Object Object Object Properties – Snapmap: snapshot relationships Snapmap – Space usage ● Dataset directory – Childmap: dataset relationships – Properties June 13, 2009 © 2009 Richard Elling 83
  • zpool create ● zpool create poolname vdev-configuration – vdev-configuration examples ● mirror c0t0d0 c3t6d0 ● mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 ● mirror disk1s0 disk2s0 cache disk4s0 log disk5 ● raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 ● Solaris – Additional checks to see if disk/slice overlaps or is currently in use – Whole disks are given EFI labels ● Can set initial pool or dataset properties ● By default, creates a file system with the same name – poolname pool → /poolname file system People get confused by a file system with the same name as the pool June 13, 2009 © 2009 Richard Elling 84
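    A sketch of creating a mirrored pool with initial properties (names illustrative; assumes a build where -o sets pool properties and -O sets root dataset properties):
    # zpool create -o failmode=continue -O compression=on tank mirror c0t0d0 c1t0d0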
  • zpool destroy ● Destroy the pool and all datasets therein ● zpool destroy poolname ● Can (try to) force with “-f” ● There is no “are you sure?” prompt – if you weren't sure, you would not have typed “destroy” zpool destroy is destructive... really! Use with caution! June 13, 2009 © 2009 Richard Elling 85
  • zpool add ● Adds a device to the pool as a top-level vdev ● zpool add poolname vdev-configuration ● vdev-configuration can be any combination also used for zpool create ● Complains if the added vdev-configuration would cause a different data protection scheme than is already in use – use “-f” to override ● Good idea: try with “-n” flag first – will show final configuration without actually performing the add Do not add a device which is in use as a quorum device June 13, 2009 © 2009 Richard Elling 86
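    A dry-run sketch followed by the real add (names illustrative):
    # zpool add -n tank mirror c2t0d0 c3t0d0
    # zpool add tank mirror c2t0d0 c3t0d0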
  • zpool remove ● Remove a top-level vdev from the pool ● zpool remove poolname vdev ● Today, you can only remove the following vdevs: – cache – hot spare ● An RFE is open to allow removal of other top-level vdevs Don't confuse “remove” with “detach” June 13, 2009 © 2009 Richard Elling 87
  • zpool attach ● Attach a vdev as a mirror to an existing vdev ● zpool attach poolname existing-vdev vdev ● Attaching vdev must be the same size or larger than the existing vdev ● Note: today this is not available for RAIDZ or RAIDZ2 vdevs vdev Configurations ok simple vdev → mirror ok mirror ok log → mirrored log no RAIDZ no RAIDZ2 “Same size” literally means the same number of blocks. Beware that many “same size” disks have different number of available blocks. June 13, 2009 © 2009 Richard Elling 88
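    A sketch converting a single-disk vdev into a mirror (names illustrative):
    # zpool attach tank c0t0d0 c1t0d0
    # zpool status tank                              (watch the resilver)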
  • zpool detach ● Detach a vdev from a mirror ● zpool detach poolname vdev ● A resilvering vdev will wait until resilvering is complete June 13, 2009 © 2009 Richard Elling 89
  • zpool replace ● Replaces an existing vdev with a new vdev ● zpool replace poolname existing-vdev vdev ● Effectively, a shorthand for “zpool attach” followed by “zpool detach” ● Attaching vdev must be the same size or larger than the existing vdev ● Works for any top-level vdev-configuration, including RAIDZ and RAIDZ2 vdev Configurations ok simple vdev ok mirror ok log ok RAIDZ ok RAIDZ2 “Same size” literally means the same number of blocks. Beware that many “same size” disks have different number of available blocks. June 13, 2009 © 2009 Richard Elling 90
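    A minimal replace example (names illustrative):
    # zpool replace tank c2t0d0 c4t0d0
    # zpool status tank                              (watch the resilver)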
  • zpool import ● Import a pool and mount all mountable datasets ● Import a specific pool – zpool import poolname – zpool import GUID ● Scan LUNs for pools which may be imported – zpool import ● Can set options, such as alternate root directory or other properties Beware of zpool.cache interactions Beware of artifacts, especially partial artifacts June 13, 2009 © 2009 Richard Elling 91
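    Import sketches (names illustrative); -d points the scan at a device directory, -R sets an alternate root:
    # zpool import
    # zpool import -R /a tank
    # zpool import -d /dev/dsk tank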
  • zpool export ● Unmount datasets and export the pool ● zpool export poolname ● Removes pool entry from zpool.cache June 13, 2009 © 2009 Richard Elling 92
  • zpool upgrade ● Display current versions – zpool upgrade ● View available upgrade versions, with features, but don't actually upgrade – zpool upgrade -v ● Upgrade pool to latest version – zpool upgrade poolname ● Upgrade pool to specific version – zpool upgrade -V version poolname Once you upgrade, there is no downgrade June 13, 2009 © 2009 Richard Elling 93
  • zpool history ● Show history of changes made to the pool # zpool history rpool History for 'rpool': 2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0 2009-03-04.07:29:47 zfs set canmount=noauto rpool 2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool 2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT 2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap 2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump 2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106 2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106 2009-03-04.07:29:51 zfs set canmount=on rpool 2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export 2009-03-04.07:29:51 zfs create rpool/export/home 2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943 2009-03-04.00:21:42 zpool export rpool 2009-03-04.08:47:08 zpool set bootfs=rpool rpool 2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108 2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108 ... June 13, 2009 © 2009 Richard Elling 94
  • zpool status ● Shows the status of the current pools, including their configuration ● Important troubleshooting step # zpool status … pool: stuff state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM stuff ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be tricky June 13, 2009 © 2009 Richard Elling 95
  • zpool clear ● Clears device errors ● Clears device error counters ● Improves sysadmin sanity and reduces sweating June 13, 2009 © 2009 Richard Elling 96
  • zpool iostat ● Show pool physical I/O activity, in an iostat-like manner ● Solaris: fsstat will show I/O activity looking into a ZFS file system ● Especially useful for showing slog activity # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- stuff 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latency June 13, 2009 © 2009 Richard Elling 97
  • zpool scrub ● Manually starts scrub – zpool scrub poolname ● Scrubbing performed in background ● Use zpool status to track scrub progress ● Stop scrub – zpool scrub -s poolname Estimated scrub completion time improves over time June 13, 2009 © 2009 Richard Elling 98
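    A minimal scrub example (pool name illustrative):
    # zpool scrub tank
    # zpool status tank                              (shows scrub progress)
    # zpool scrub -s tank                            (stop the scrub)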
  • zfs(1m) ● Manages file systems (ZPL) and Zvols ● Can proxy to other, related commands – iSCSI, NFS, CIFS raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer June 13, 2009 © 2009 Richard Elling 99
  • zfs create, destroy ● By default, a file system with the same name as the pool is created by zpool create ● Name format is: pool/name[/name ...] ● File system – zfs create fs-name – zfs destroy fs-name ● Zvol – zfs create -V size vol-name – zfs destroy vol-name ● Parameters can be set at create time June 13, 2009 © 2009 Richard Elling 100
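    Minimal create/destroy sketches (names and sizes illustrative); -r on destroy also removes descendants:
    # zfs create -o compression=on tank/home
    # zfs create -V 10g tank/vol1
    # zfs destroy -r tank/home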
  • zfs mount, unmount ● Note: mount point is a file system parameter – zfs get mountpoint fs-name ● Rarely used subcommand (!) ● Display mounted file systems – zfs mount ● Mount a file system – zfs mount fs-name – zfs mount -a ● Unmount – zfs unmount fs-name – zfs unmount -a June 13, 2009 © 2009 Richard Elling 101
  • zfs list ● List mounted datasets ● Old versions: listed everything ● New versions: do not list snapshots ● Examples – zfs list – zfs list -t snapshot – zfs list -H -o name June 13, 2009 © 2009 Richard Elling 102
  • zfs send, receive ● Send – send a snapshot to stdout – data is decompressed ● Receive – receive a snapshot from stdin – receiving file system parameters apply (compression, etc.) ● Can incrementally send snapshots in time order ● Handy way to replicate dataset snapshots ● NOT a replacement for traditional backup solutions – All-or-nothing design per snapshot – In general, does not send files (!) – Today, no per-file management Send streams from b35 (or older) no longer supported after b89 June 13, 2009 © 2009 Richard Elling 103
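    A replication sketch (host and dataset names illustrative); the -i form sends only the changes between two snapshots:
    # zfs send tank/home@snap1 | zfs receive backup/home
    # zfs send -i tank/home@snap1 tank/home@snap2 | ssh host2 zfs receive backup/home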
  • zfs rename ● Renames a file system, volume, or snapshot – zfs rename export/home/relling export/home/richard June 13, 2009 © 2009 Richard Elling 104
  • zfs upgrade ● Display current versions – zfs upgrade ● View available upgrade versions, with features, but don't actually upgrade – zfs upgrade -v ● Upgrade dataset to latest version – zfs upgrade dataset ● Upgrade dataset to specific version – zfs upgrade -V version dataset Once you upgrade, there is no downgrade June 13, 2009 © 2009 Richard Elling 105
  • Sharing June 13, 2009 © 2009 Richard Elling 106
  • Sharing ● zfs share dataset ● Type of sharing set by parameters – shareiscsi = [on | off] – sharenfs = [on | off | options] – sharesmb = [on | off | options] ● Shortcut to manage sharing – Uses external services (nfsd, COMSTAR, etc) – Importing pool will also share ● May vary by OS June 13, 2009 © 2009 Richard Elling 107
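    Sharing is normally driven by properties rather than the zfs share subcommand (names illustrative; option syntax and available services vary by OS):
    # zfs set sharenfs=on tank/export
    # zfs set sharesmb=on tank/export
    # zfs set shareiscsi=on tank/vol1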
  • NFS ● ZFS file systems work as expected – use ACLs based on NFSv4 ACLs ● Parallel NFS, aka pNFS, aka NFSv4.1 – Still a work-in-progress – http://opensolaris.org/os/project/nfsv41/ – zfs create -t pnfsdata mypnfsdata pNFS Client pNFS Data Server pNFS Data Server pnfsdata pnfsdata pNFS dataset dataset Metadata Server pool pool June 13, 2009 © 2009 Richard Elling 108
  • CIFS ● UID mapping ● casesensitivity parameter – Good idea, set when file system is created – zfs create -o casesensitivity=insensitive mypool/Shared ● Shadow Copies for Shared Folders (VSS) supported – CIFS clients cannot create shadow copies remotely (yet) CIFS features vary by OS, Samba, etc. June 13, 2009 © 2009 Richard Elling 109
  • iSCSI ● SCSI over IP ● Block-level protocol ● Uses Zvols as storage ● Solaris has 2 iSCSI target implementations – shareiscsi enables old, klunky iSCSI target – To use COMSTAR, enable using itadm(1m) – b116 adds COMSTAR support (zpool version 16) June 13, 2009 © 2009 Richard Elling 110
  • Properties June 13, 2009 © 2009 Richard Elling 111
  • Properties ● Properties are stored in an nvlist ● By default, are inherited ● Some properties are common to all datasets, but a specific dataset type may have additional properties ● Easily set or retrieved via scripts ● In general, properties affect future file system activity zpool get doesn't script as nicely as zfs get June 13, 2009 © 2009 Richard Elling 112
  • User-defined Properties ● Names – Must include colon ':' – Can contain lower case alphanumerics or “+” “.” “_” – Max length = 256 characters – By convention, module:property ● com.sun:auto-snapshot ● Values – Max length = 1024 characters ● Examples – com.sun:auto-snapshot=true – com.richardelling:important_files=true June 13, 2009 © 2009 Richard Elling 113
  • set & get properties ● Set – zfs set compression=on export/home/relling ● Get – zfs get compression export/home/relling ● Reset to inherited value – zfs inherit compression export/home/relling ● Clear user-defined parameter – zfs inherit com.sun:auto-snapshot export/home/relling June 13, 2009 © 2009 Richard Elling 114
  • Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policy guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk version June 13, 2009 © 2009 Richard Elling 115
  • Common Dataset Properties Property Change? Brief Description available readonly Space available to dataset & children checksum Checksum algorithm compression Compression algorithm compressratio readonly Compression ratio – logical size:referenced physical copies Number of copies of user data creation readonly Dataset creation time origin readonly For clones, origin snapshot primarycache ARC caching policy readonly Is dataset in readonly mode? referenced readonly Size of data accessible by this dataset refreservation Max space guaranteed to a dataset, not including descendants (snapshots & clones) reservation Minimum space guaranteed to dataset, including descendants June 13, 2009 © 2009 Richard Elling 116
  • Common Dataset Properties Property Change? Brief Description secondarycache L2ARC caching policy type readonly Type of dataset (filesystem, snapshot, volume) used readonly Sum of usedby* (see below) usedbychildren readonly Space used by descendants usedbydataset readonly Space used by dataset usedbyrefreservation readonly Space used by a refreservation for this dataset usedbysnapshots readonly Space used by all snapshots of this dataset zoned readonly Is dataset added to non-global zone (Solaris) June 13, 2009 © 2009 Richard Elling 117
  • File System Dataset Properties Property Change? Brief Description aclinherit ACL inheritance policy, when files or directories are created aclmode ACL modification policy, when chmod is used atime Disable access time metadata updates canmount Mount policy casesensitivity creation Filename matching algorithm devices Device opening policy for dataset exec File execution policy for dataset mounted readonly Is file system currently mounted? nbmand export/ File system should be mounted with non-blocking import mandatory locks (CIFS client feature) normalization creation Unicode normalization of file names for matching June 13, 2009 © 2009 Richard Elling 118
  • File System Dataset Properties Property Change? Brief Description quota Max space dataset and descendants can consume recordsize Suggested maximum block size for files refquota Max space dataset can consume, not including descendants setuid setuid mode policy sharenfs NFS sharing options sharesmb CIFS sharing options snapdir Controls whether .zfs directory is hidden utf8only UTF-8 character file name policy vscan Virus scan enabled xattr Extended attributes policy June 13, 2009 © 2009 Richard Elling 119
  • More Goodies... June 13, 2009 © 2009 Richard Elling 120
  • Dataset Space Accounting ● used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation ● Lazy updates, may not be correct until txg commits ● ls and du will show size of allocated files which includes all copies of a file ● Shorthand report available $ zfs list -o space NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD rpool 126G 18.3G 0 35.5K 0 18.3G rpool/ROOT 126G 15.3G 0 18K 0 15.3G rpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0 rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0 rpool/dump 126G 1.00G 0 1.00G 0 0 rpool/export 126G 37K 0 19K 0 18K rpool/export/home 126G 18K 0 18K 0 0 rpool/swap 128G 2G 0 193M 1.81G 0 June 13, 2009 © 2009 Richard Elling 121
  • zfs vs zpool Space Accounting ● zfs list != zpool list ● zfs list shows space used by the dataset plus space for internal accounting ● zpool list shows physical space available to the pool ● For simple pools and mirrors, they are nearly the same ● For RAIDZ or RAIDZ2, zpool list will show space available for parity Users will be confused about reported space available June 13, 2009 © 2009 Richard Elling 122
  • Testing ● ztest ● fstest June 13, 2009 © 2009 Richard Elling 123
  • Accessing Snapshots ● By default, snapshots are accessible in .zfs directory ● Visibility of .zfs directory is tunable via snapdir property – Don't really want find to find the .zfs directory ● Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS) # zfs snapshot rpool/export/home/relling@20090415 # ls -a /export/home/relling … .Xsession .xsession-errors # ls /export/home/relling/.zfs shares snapshot # ls /export/home/relling/.zfs/snapshot 20090415 # ls /export/home/relling/.zfs/snapshot/20090415 Desktop Documents Downloads Public June 13, 2009 © 2009 Richard Elling 124
  • Resilver & Scrub ● Can be read iops bound ● Resilver can also be bandwidth bound to the resilvering device ● Both work at lower I/O scheduling priority than normal work, but that may not matter for read iops bound devices June 13, 2009 © 2009 Richard Elling 125
  • Time-based Resilvering ● Block pointers contain birth txg number ● Resilvering begins with oldest blocks first 73 73 ● Interrupted resilver will still result in a valid file system view 73 55 73 27 68 73 27 27 73 68 Birth txg = 27 Birth txg = 68 Birth txg = 73 June 13, 2009 © 2009 Richard Elling 126
  • Time Slider – Automatic Snapshots ● Underpinnings for Solaris feature similar to OSX's Time Machine ● SMF service for managing snapshots ● SMF properties used to specify policies – Frequency – Number to keep ● Creates cron jobs ● GUI tool makes it easy to select individual file systems Service Name Interval (default) Keep (default) auto-snapshot:frequent 15 minutes 4 auto-snapshot:hourly 1 hour 24 auto-snapshot:daily 1 day 31 auto-snapshot:weekly 7 days 4 auto-snapshot:monthly 1 month 12 June 13, 2009 © 2009 Richard Elling 127
  • Nautilus ● File system views which can go back in time June 13, 2009 © 2009 Richard Elling 128
  • ACL – Access Control List ● Based on NFSv4 ACLs ● Similar to Windows NT ACLs ● Works well with CIFS services ● Supports ACL inheritance ● Change using chmod ● View using ls June 13, 2009 © 2009 Richard Elling 129
  • Checksums ● DVA contains 256 bits for checksum ● Checksum is in the parent, not in the block itself ● Types – none – fletcher2: truncated Fletcher algorithm – fletcher4: full Fletcher algorithm – SHA-256 ● There are open proposals for better algorithms June 13, 2009 © 2009 Richard Elling 130
  • Checksum Use Use Algorithm Notes Uberblock SHA-256 self-checksummed Metadata fletcher4 Labels SHA-256 Data fletcher2 (default) zfs compression parameter ZIL log fletcher2 self-checksummed Gang block SHA-256 self-checksummed June 13, 2009 © 2009 Richard Elling 131
  • Checksum Performance ● Metadata – you won't notice ● Data – LZJB is barely noticeable – gzip-9 can be very noticeable ● Geriatric hardware ??? June 13, 2009 © 2009 Richard Elling 132
  • Compression ● Builtin – lzjb, Lempel-Ziv by Jeff Bonwick – gzip, levels 1-9 ● Extensible – new compressors can be added – backwards compatibility issues ● Uses taskqs to take advantage of multi-processor systems Cannot boot from gzip compressed root (RFE is open) June 13, 2009 © 2009 Richard Elling 133
  • Encryption ● Placeholder – details TBD ● http://opensolaris.org/os/project/zfs-crypto ● Complicated by: – Block pointer rewrites – Deduplication June 13, 2009 © 2009 Richard Elling 134
  • Impedance Matching ● RAID arrays & columns ● Label offsets – Older Solaris starting block = 34 – Newer Solaris starting block = 256 June 13, 2009 © 2009 Richard Elling 135
  • Quotas ● File system quotas – quota includes descendants (snapshots, clones) – refquota does not include descendants ● User and group quotas (b114) – Works like refquota, descendants don't count – Not inherited – zfs userspace and groupspace subcommands show quotas ● Users can only see their own and group quota, but can delegate – Managed via properties ● [user|group]quota@[UID|username|SID name|SID number] ● not visible via zfs get all June 13, 2009 © 2009 Richard Elling 136
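    Quota sketches (names and sizes illustrative):
    # zfs set quota=100g tank/home
    # zfs set refquota=10g tank/home/relling
    # zfs set userquota@relling=5g tank/home         (b114 or later)
    # zfs userspace tank/home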
  • zpool.cache ● Old way – mount / – read /etc/[v]fstab – mount file systems ● ZFS – import pool(s) – find mountable datasets and mount them ● /etc/zfs/zpool.cache is a cache of pools to be imported at boot time – No scanning of all available LUNs for pools to import – cachefile property permits selecting an alternate zpool.cache ● Useful for OS installers ● Useful for clusters, where you don't want a booting node to automatically import a pool ● Not persistent (!) June 13, 2009 © 2009 Richard Elling 137
  • Mounting ZFS File Systems ● By default, mountable file systems are mounted when the pool is imported – Controlled by canmount policy (not inherited) ● on – (default) file system is mountable ● off – file system is not mountable – if you want children to be mountable, but not the parent ● noauto – file system must be explicitly mounted (boot environment) ● Can zfs set mountpoint=legacy to use /etc/[v]fstab ● By default, cannot mount on top of non-empty directory – Can override explicitly using zfs mount -O or legacy mountpoint ● Mount properties are persistent, use zfs mount -o for temporary changes Imports are done in parallel, beware of mountpoint races June 13, 2009 © 2009 Richard Elling 138
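    Mount policy sketches (names illustrative); zfs mount -o sets temporary, non-persistent options:
    # zfs set mountpoint=/export/web tank/web
    # zfs set canmount=noauto tank/web
    # zfs mount -o ro tank/web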
  • recordsize ● Dynamic – Max 128 kBytes – Min 512 Bytes – Power of 2 ● For most workloads, don't worry about it ● For fixed size workloads, can set to match workloads – Databases ● File systems or Zvols ● zfs set recordsize=8k dataset June 13, 2009 © 2009 Richard Elling 139
  • Delegated Administration ● Fine grain control – users or groups of users – subcommands, parameters, or sets ● Similar to Solaris' Role Based Access Control (RBAC) ● Enable/disable at the pool level – zpool set delegation=on mypool (default) ● Allow/unallow at the dataset level – zfs allow relling snapshot mypool/relling – zfs allow @backupusers snapshot,send mypool/relling – zfs allow mypool/relling June 13, 2009 © 2009 Richard Elling 140
  • Delegation Inheritance Beware of inheritance ● Local – zfs allow -l relling snapshot mypool ● Local + descendants – zfs allow -d relling mount mypool Make sure permissions are set at the correct level June 13, 2009 © 2009 Richard Elling 141
  • Delegatable Subcommands ● allow ● receive ● clone ● rename ● create ● rollback ● destroy ● send ● groupquota ● share ● groupused ● snapshot ● mount ● userquota ● promote ● userused June 13, 2009 © 2009 Richard Elling 142
  • Delegatable Parameters ● aclinherit ● nbmand ● sharenfs ● aclmode ● normalization ● sharesmb ● atime ● quota ● snapdir ● canmount ● readonly ● userprop ● casesensitivity ● recordsize ● utf8only ● checksum ● refquota ● version ● compression ● refreservation ● volsize ● copies ● reservation ● vscan ● devices ● setuid ● xattr ● exec ● shareiscsi ● zoned ● mountpoint June 13, 2009 © 2009 Richard Elling 143
  • Browser User Interface ● Solaris – WebConsole ● Nexenta ● OSX ● OpenStorage June 13, 2009 © 2009 Richard Elling 144
  • Solaris WebConsole June 13, 2009 © 2009 Richard Elling 145
  • Solaris WebConsole June 13, 2009 © 2009 Richard Elling 146
  • Solaris Swap and Dump ● Swap – Solaris does not have automatic swap resizing – Swap as a separate dataset – Swap device is raw, with a refreservation – Blocksize matched to pagesize – Don't really need or want snapshots or clones – Can resize while online, manually ● Dump – Only used during crash dump – Preallocated – No refreservation – Checksum off – Compression off (dumps are already compressed) June 13, 2009 © 2009 Richard Elling 147
  • Performance June 13, 2009 © 2009 Richard Elling 148
  • General Comments ● In general, performs well out of the box ● Standard performance improvement techniques apply ● Lots of DTrace knowledge available ● Typical areas of concern: – ZIL ● check with zilstat, improve with slogs – COW “fragmentation” ● check iostat, improve with L2ARC – Memory consumption ● check with arcstat ● set primarycache property ● can be capped ● can compete with large page aware apps – Compression, or lack thereof June 13, 2009 © 2009 Richard Elling 149
  • ZIL Performance ● Big performance increases demonstrated, especially with SSDs ● NFS servers – 32kByte threshold (zfs_immediate_write_sz) also corresponds to NFSv3 write size ● May cause more work than needed ● See CR6686887 ● Databases – May want different sync policies for logs and data – Current ZIL is pool-wide and enabled for all sync writes – CR6832481 proposes a separate intent log bypass property on a per-dataset basis June 13, 2009 © 2009 Richard Elling 150
  • vdev Cache ● vdev cache occurs at the SPA level – readahead – 10 MBytes per vdev – only caches metadata (as of b70) ● Stats collected as Solaris kstats # kstat -n vdev_cache_stats module: zfs instance: 0 name: vdev_cache_stats class: misc crtime 38.83342625 delegations 14030 hits 105169 misses 59452 snaptime 4564628.18130739 Hit rate = 59%, not bad... June 13, 2009 © 2009 Richard Elling 151
  • Intelligent Prefetching ● Intelligent file-level prefetching occurs at the DMU level ● Feeds the ARC ● In a nutshell, prefetch hits cause more prefetching – Read a block, prefetch a block – If we used the prefetched block, read 2 more blocks – Up to 256 blocks ● Recognizes strided reads – 2 sequential reads of same length and a fixed distance will be coalesced ● Fetches backwards ● Seems to work pretty well, as-is, for most workloads ● Easy to disable in mdb for testing on Solaris – echo zfs_prefetch_disable/W0t1 | mdb -kw June 13, 2009 © 2009 Richard Elling 152
  • I/O Queues ● By default, for devices which can support it, 35 iops are queued to each vdev – Tunable with zfs_vdev_max_pending – echo zfs_vdev_max_pending/W0t10 | mdb -kw ● Implies that more vdevs is better – Consider avoiding RAID array with a single, large LUN ● ZFS I/O scheduler loses control once iops are queued – CR6471212 proposes reserved slots for high-priority iops ● May need to match queues for the entire data path – zfs_vdev_max_pending – Fibre channel, SCSI, SAS, SATA driver – RAID array controller ● Fast disks → small queues, slow disks → larger queues June 13, 2009 © 2009 Richard Elling 153
  • COW Penalty ● COW can negatively affect workloads which have updates and sequential reads – Initial writes will be sequential – Updates (writes) will cause seeks to read data ● Lots of people seem to worry a lot about this ● Only affects HDDs ● Very difficult to speculate about the impact on real-world apps – Large sequential scans of random data hurt anyway – Reads are cached in many places in the data path ● Sysbench benchmark used to test on MySQL w/InnoDB engine – One hour read/write test – select count(*) – repeat, for a week June 13, 2009 © 2009 Richard Elling 154
  • COW Penalty Performance seems to level at about 25% penalty Results compliments of Allan Packer & Neelakanth Nadgir http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf June 13, 2009 © 2009 Richard Elling 155
  • About Disks... ● Disks still the most important performance bottleneck – Modern processors are multi-core – Default checksums and compression are computationally efficient Average Max Size Rotational Disk Size RPM (GBytes) Latency (ms) Average Seek (ms) HDD 2.5” 5,400 500 5.5 11 HDD 3.5” 5,900 2,000 5.1 16 HDD 3.5” 7,200 1,500 4.2 8 - 8.5 HDD 2.5” 10,000 300 3 4.2 - 4.6 HDD 2.5” 15,000 146 2 3.2 - 3.5 SSD (w) 2.5” N/A 73 0 0.02 - 0.15 SSD (r) 2.5” N/A 500 0 0.02 - 0.15 June 13, 2009 © 2009 Richard Elling 156
  • DirectIO ● UFS forcedirectio option brought the early 1980s design of UFS up to the 1990s ● ZFS designed to run on modern multiprocessors ● Databases or applications which manage their data cache may benefit by disabling file system caching ● Expect L2ARC to improve random reads UFS DirectIO ZFS Unbuffered I/O primarycache=metadata primarycache=none Concurrency Available at inception Improved Async I/O code path Available at inception June 13, 2009 © 2009 Richard Elling 157
  • Hybrid Storage Pool SPA separate log L2ARC Main Pool device cache device Write optimized HDD HDD Read optimized device (SSD) HDD device (SSD) Size (GBytes) < 1 GByte large big Cost write iops/$ size/$ size/$ Performance low-latency writes - low-latency reads June 13, 2009 © 2009 Richard Elling 158
  • Future Plans ● Announced enhancements OpenSolaris Town Hall 2009.06 – de-duplication (see also GreenBytes ZFS+) – user quotas (delivered b114) – access-based enumeration – snapshot reference counting – dynamic LUN expansion (delivering b117?) ● Others – mirror to smaller disk (delivered b117) June 13, 2009 © 2009 Richard Elling 159
  • Its a wrap! Thank You! Questions? Richard.Elling@RichardElling.com June 13, 2009 © 2009 Richard Elling 160