ZFS Tutorial USENIX LISA09 Conference

This presentation is from the ZFS Tutorial presented at the USENIX LISA09 Conference at Baltimore, Maryland in November 2009.

Later versions are available on slideshare.net, too.



Usage Rights

© All Rights Reserved

ZFS Tutorial USENIX LISA09 Conference: Presentation Transcript

  • USENIX LISA09 ZFS Tutorial Richard.Elling@RichardElling.com
  • Agenda Overview Foundations Pooled Storage Layer Transactional Object Layer Commands zpool zfs Sharing Properties Performance Troubleshooting Wrap 2
  • Ground Rules No religious discussion No licensing discussion No “future of <company>” discussion No zones/containers/jails discussion No “when is it going to be in Solaris 10” discussion... ok maybe a few... 3
  • History Announced September 14, 2004 Integration history SXCE b27 (November 2005) FreeBSD (April 2007) Mac OSX Leopard Preview shown, but removed from Snow Leopard Disappointed community reforming as the zfs-macos google group (Oct 2009) OpenSolaris 2008.05 Solaris 10 6/06 (June 2006) Linux FUSE (summer 2006) greenBytes ZFS+ (September 2008) More than 45 patents, contributed to the CDDL Patents Common 4
  • Brief List of Features (slide 5)
    - Features: Future-proof; Cutting-edge data integrity; High performance; Simplified administration; Eliminates need for volume managers; Reduced costs; Compatibility with POSIX file system & block devices; Self-healing
    - Marketing quotes: “No silent data corruption ever”; “Mind-boggling scalability”; “Breathtaking speed”; “Near zero administration”; “Radical new architecture”; “Greatly simplifies support issues”; “RAIDZ saves money”
    - Marketing: 2 drink minimum
  • ZFS Design Goals Figure out why storage has gotten so complicated Blow away 20+ years of obsolete assumptions Gotta replace UFS Design an integrated system from scratch End the suffering 6
  • Limits (slide 7)
    - 2^48: Number of entries in any individual directory
    - 2^56: Number of attributes of a file [1]
    - 2^56: Number of files in a directory [1]
    - 16 EiB (2^64 bytes): Maximum size of a file system
    - 16 EiB: Maximum size of a single file
    - 16 EiB: Maximum size of any attribute
    - 2^64: Number of devices in any pool
    - 2^64: Number of pools in a system
    - 2^64: Number of file systems in a pool
    - 2^64: Number of snapshots of any file system
    - 256 ZiB (2^78 bytes): Maximum size of any pool
    [1] actually constrained to 2^48 for the number of files in a ZFS file system
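The size limits on this slide are powers of two: 16 EiB corresponds to 2^64 bytes and 256 ZiB to 2^78 bytes. The short check below is only an illustration of that arithmetic; the helper name `pow2_bytes_in_unit` is made up for this sketch and is not part of any ZFS tool.

```python
# Binary-prefix exponents: 1 KiB = 2**10 bytes, 1 EiB = 2**60 bytes, and so on.
BINARY_UNITS = {"KiB": 10, "MiB": 20, "GiB": 30, "TiB": 40,
                "PiB": 50, "EiB": 60, "ZiB": 70}

def pow2_bytes_in_unit(exponent: int, unit: str) -> int:
    """Return how many `unit`s there are in 2**exponent bytes."""
    return 2 ** (exponent - BINARY_UNITS[unit])

max_file_system_eib = pow2_bytes_in_unit(64, "EiB")  # 2^64 bytes, in EiB
max_pool_zib = pow2_bytes_in_unit(78, "ZiB")         # 2^78 bytes, in ZiB
print(max_file_system_eib, "EiB;", max_pool_zib, "ZiB")
```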
  • Sidetrack: Understanding Builds Build is often referenced when speaking of feature/bug integration Short-hand notation: b# OpenSolaris and SXCE are based on NV SXCE will soon end OpenSolaris carries forward ZFS development done for NV Bi-weekly build cycle Schedule at http://opensolaris.org/os/community/on/schedule/ ZFS is ported to Solaris 10 and other OSes 8
  • Foundations 9
  • Overhead View of a Pool Pool File System Configuration Information Volume File System Volume Dataset 10
  • Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 11
  • Source Code Structure (slide 12)
    [Figure: layered block diagram of the ZFS source]
    - User: File system Consumer, Device Consumer, GUI, Mgmt, JNI, libzfs
    - Kernel, Interface Layer: ZPL, ZVol, /dev/zfs
    - Kernel, Transactional Object Layer: ZIL, ZAP, Traversal, DMU, DSL
    - Kernel, Pooled Storage Layer: ARC, ZIO, VDEV, Configuration
  • Acronyms (slide 13)
    - ARC – Adaptive Replacement Cache
    - DMU – Data Management Unit
    - DSL – Dataset and Snapshot Layer
    - JNI – Java Native Interface
    - ZPL – ZFS POSIX Layer (traditional file system interface)
    - VDEV – Virtual Device layer
    - ZAP – ZFS Attribute Processor
    - ZIL – ZFS Intent Log
    - ZIO – ZFS I/O layer
    - Zvol – ZFS volume (raw/cooked block device interface)
  • nvlists name=value pairs libnvpair(3LIB) Allows ZFS capabilities to change without changing the physical on- disk format Data stored is XDR encoded A good thing, used often 14
  • Versioning Features can be added and identified by nvlist entries Change in pool or dataset versions do not change physical on-disk format (!) does change nvlist parameters Older-versions can be used might see warning messages, but harmless Available versions and features can be easily viewed zpool upgrade -v zfs upgrade -v Online references zpool: www.opensolaris.org/os/community/zfs/version/N zfs: www.opensolaris.org/os/community/zfs/version/zpl/N Don't confuse zpool and zfs versions 15
  • zpool versions (slide 16)
    VER  DESCRIPTION
    ---  --------------------------------------------------------
     1   Initial ZFS version
     2   Ditto blocks (replicated metadata)
     3   Hot spares and double parity RAID-Z
     4   zpool history
     5   Compression using the gzip algorithm
     6   bootfs pool property
     7   Separate intent log devices
     8   Delegated administration
     9   refquota and refreservation properties
     10  Cache devices
     11  Improved scrub performance
     12  Snapshot properties
     13  snapused property
     14  passthrough-x aclinherit support
     15  user/group space accounting
     16  stmf property support
     17  Triple-parity RAID-Z
     18  snapshot user holds
     19  Log device removal
  • zfs versions (slide 17)
    VER  DESCRIPTION
    ---  --------------------------------------------------------
     1   Initial ZFS filesystem version
     2   Enhanced directory entries
     3   Case insensitive and File system unique identifier (FUID)
     4   userquota, groupquota properties
  • Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free 18
  • COW Notes COW works on blocks, not files ZFS reserves 32 MBytes or 1/64 of pool size COWs need some free space to remove files need space for ZIL For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched Spatial distribution is good fodder for performance speculation affects HDDs moot for SSDs 19
  • Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 20
  • vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type=disk type=disk type=disk type=disk children[0] children[1] children[0] children[1] Physical or leaf vdevs 21
  • vdev Labels (slide 22)
    - vdev labels != disk labels
    - Four 256 kByte labels written to every physical vdev
    - Two-stage update process
      1. write label0 & label2, flush cache & check for errors
      2. write label1 & label3, flush cache & check for errors
    - N = 256k * floor(size / 256k), i.e. the device size rounded down to a 256k boundary
    - Device layout: label0 at 0, label1 at 256k, boot block from 512k to 4M, label2 at N-512k, label3 at N-256k, end at N
    - Label layout: blank (0 to 8k), boot header (8k to 16k), name=value pairs (16k to 128k), 128-slot uberblock array (128k to 256k)
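The label placement on this slide can be sketched as a small calculation. This is a minimal illustration, assuming N is the device size rounded down to a 256k boundary; the function name `label_offsets` and the size check are inventions of this sketch, not ZFS code.

```python
LABEL_SIZE = 256 * 1024            # each vdev label is 256 kBytes
BOOT_BLOCK_END = 4 * 1024 * 1024   # front labels plus boot block end at 4M

def label_offsets(device_size: int) -> dict:
    """Return the byte offsets of the four labels on a device of this size."""
    if device_size < BOOT_BLOCK_END + 2 * LABEL_SIZE:
        raise ValueError("device too small for labels and boot block")
    # N: device size rounded down to a multiple of the label size (assumption).
    n = (device_size // LABEL_SIZE) * LABEL_SIZE
    return {
        "label0": 0,
        "label1": LABEL_SIZE,
        "label2": n - 2 * LABEL_SIZE,
        "label3": n - LABEL_SIZE,
    }

# Hypothetical device: 10 MiB plus 1000 stray bytes that do not fill a label.
offsets = label_offsets(10 * 1024 * 1024 + 1000)
```

Two labels sit at the front and two at the back, which is why the two-stage update (label0 and label2 first, then label1 and label3) always leaves one intact label at each end of the device.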
  • Observing Labels (slide 23)
    # zdb -l /dev/rdsk/c0t0d0s0
    --------------------------------------------
    LABEL 0
    --------------------------------------------
        version=14
        name='rpool'
        state=0
        txg=13152
        pool_guid=17111649328928073943
        hostid=8781271
        hostname=''
        top_guid=11960061581853893368
        guid=11960061581853893368
        vdev_tree
            type='disk'
            id=0
            guid=11960061581853893368
            path='/dev/dsk/c0t0d0s0'
            devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a'
            phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a'
            whole_disk=0
            metaslab_array=24
            metaslab_shift=30
            ashift=9
            asize=157945167872
            is_log=0
  • To fsck or not to fsck fsck was created to fix known inconsistencies in file system metadata UFS is not transactional metadata inconsistencies must be reconciled does NOT repair data – how could it? ZFS doesn't need fsck, as-is all on-disk changes are transactional COW means previously existing, consistent metadata is not overwritten ZFS can repair itself metadata is at least dual-redundant data can also be redundant Reality check – this does not mean that ZFS is not susceptible to corruption nor is any other file system 24
  • VDEV 25
  • Dynamic Striping (slide 26)
    - RAID-0
      - SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern
    - Dynamic Stripe
      - Data is dynamically mapped to member disks
      - No fixed-length sequences
      - Allocate up to ~1 MByte/vdev before changing vdev
      - vdevs can be different size
      - Good combination of the concatenation feature with RAID-0 performance
  • Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes 27
  • Mirroring (slide 28)
    - Straightforward: put N copies of the data on N vdevs
    - Unlike RAID-1
      - No 1:1 mapping at the block level
      - vdev labels are still at beginning and end
      - vdevs can be of different size; effective space is that of smallest vdev
    - Arbitration: ZFS does not blindly trust either side of mirror
      - Most recent, correct view of data wins
      - Checksums validate data
  • Mirroring 29
  • Dynamic vdev Replacement (slide 30)
    - zpool replace poolname vdev [vdev]
    - Today, replacing vdev must be same size or larger
      - Before b117: as measured by blocks
      - After b117: as measured by metaslabs
    - Replacing all vdevs in a top-level vdev with larger vdevs results in top-level vdev resizing
    - Policy controlled by zpool autoexpand property
    [Figure: step-by-step replacement of 10G, 15G, and 20G disks in a mirror, with the mirror size growing from 10G to 20G]
  • RAIDZ (slide 31)
    - RAID-5
      - Parity check data is distributed across the RAID array's disks
      - Must read/modify/write when data is smaller than stripe width
    - RAIDZ
      - Dynamic data placement
      - Parity added as needed
      - Writes are full-stripe writes
      - No read/modify/write (write hole)
    - Arbitration: ZFS does not blindly trust any device
      - Does not rely on disk reporting read error
      - Checksums validate data
      - If checksum fails, read parity
    - Space used is dependent on how used
  • RAID-5 vs RAIDZ DiskA DiskB DiskC DiskD DiskE D0:0 D0:1 D0:2 D0:3 P0 RAID-5 P1 D1:0 D1:1 D1:2 D1:3 D2:3 P2 D2:0 D2:1 D2:2 D3:2 D3:3 P3 D3:0 D3:1 DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 RAIDZ P1 D1:0 D1:1 P2:0 D2:0 D2:1 D2:2 D2:3 P2:1 D2:4 D2:5 Gap P3 D3:0 D3:1 32
  • RAID-5 Write Hole (slide 33)
    - Occurs when data to be written is smaller than stripe size
    - Must read unallocated columns to recalculate the parity or the parity must be read/modify/write
    - Read/modify/write is risky for consistency
      - Multiple disks
      - Reading independently
      - Writing independently
      - System failure before all writes are complete to media could result in data loss
    - Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disks
  • RAIDZ2 and RAIDZ3 (slide 34)
    - RAIDZ2 = double parity RAIDZ
    - RAIDZ3 = triple parity RAIDZ
    - Sorta like RAID-6
      - Parity 1: XOR
      - Parity 2: another Reed-Solomon syndrome
      - Parity 3: yet another Reed-Solomon syndrome
    - Arbitration: ZFS does not blindly trust any device
      - Does not rely on disk reporting read error
      - Checksums validate data
      - If data not valid, read parity
      - If data still not valid, read other parity
    - Space used is dependent on how used
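The first parity level is a plain XOR across the data columns, which is enough to rebuild any single missing column. The sketch below shows only that XOR step with made-up four-byte columns; it does not model the RAIDZ on-disk layout, the checksum arbitration, or the Reed-Solomon syndromes used for the second and third parity.

```python
from functools import reduce

def xor_parity(columns):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))

data = [b"ZFS!", b"RAID", b"DATA"]   # three hypothetical data columns
parity = xor_parity(data)            # the single-parity column

# Pretend column 1 is unreadable: rebuild it from the survivors plus parity.
rebuilt = xor_parity([data[0], data[2], parity])
```

In ZFS the decision to reconstruct is driven by the block checksum failing, not by the disk reporting an error, as the arbitration bullets on the slide describe.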
  • Evaluating Data Retention (slide 35)
    - MTTDL = Mean Time To Data Loss
    - Note: MTBF is not constant in the real world, but keeps math simple
    - MTTDL[1] is a simple MTTDL model
    - No parity (single vdev, striping, RAID-0)
      - MTTDL[1] = MTBF / N
    - Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
      - MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
    - Double Parity (3-way mirror, RAIDZ2, RAID-6)
      - MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
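The three MTTDL[1] formulas on this slide translate directly into a small function. The sketch below is illustrative only: the function name `mttdl1` and the input values (1,000,000 hour MTBF, 24 hour MTTR) are assumptions chosen for the example, not figures from the tutorial.

```python
def mttdl1(mtbf: float, n: int, mttr: float, parity: int) -> float:
    """MTTDL[1], in the same time unit as mtbf and mttr (for example hours)."""
    if parity == 0:      # single vdev, striping, RAID-0
        return mtbf / n
    if parity == 1:      # mirror, RAIDZ, RAID-1, RAID-5
        return mtbf ** 2 / (n * (n - 1) * mttr)
    if parity == 2:      # 3-way mirror, RAIDZ2, RAID-6
        return mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)
    raise ValueError("parity must be 0, 1, or 2")

HOURS_PER_YEAR = 24 * 365

# Hypothetical inputs: 1,000,000 hour MTBF and a 24 hour repair time.
stripe_hours = mttdl1(1e6, 4, 24, parity=0)
mirror_years = mttdl1(1e6, 2, 24, parity=1) / HOURS_PER_YEAR
```

Each extra level of parity multiplies the result by roughly MTBF / MTTR, which is why shortening the repair time matters as much as buying more reliable disks.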
  • Another MTTDL Model (slide 36)
    - MTTDL[1] model doesn't take into account unrecoverable reads
    - But unrecoverable reads (UER) are becoming the dominant failure mode
      - UER specified as errors per bits read
      - More bits = higher probability of loss per vdev
    - MTTDL[2] model considers UER
  • Why Worry about UER? (slide 37)
    - Richard's study
      - 3,684 hosts with 12,204 LUNs
      - 11.5% of all LUNs reported read errors
    - Bairavasundaram et al., FAST08, www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
      - 1.53M LUNs over 41 months
      - RAID reconstruction discovers 8% of checksum mismatches
      - 4% of disks studied developed checksum errors over 17 months
  • MTTDL[2] Model (slide 38)
    - Probability that a reconstruction will fail
      - Precon_fail = (N-1) * size / UER
    - Model doesn't work for non-parity schemes (single vdev, striping, RAID-0)
    - Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
      - MTTDL[2] = MTBF / (N * Precon_fail)
    - Double Parity (3-way mirror, RAIDZ2, RAID-6)
      - MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
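The MTTDL[2] model can be sketched the same way. For the formula to produce a probability, this sketch assumes the disk size is given in bits and UER is expressed as bits read per unrecoverable error (for example 1e14); that unit convention, the function names, and the sample inputs are assumptions of this example.

```python
def p_recon_fail(n: int, size_bits: float, bits_per_error: float) -> float:
    """Precon_fail = (N-1) * size / UER, with UER as bits read per error."""
    return (n - 1) * size_bits / bits_per_error

def mttdl2(mtbf, n, mttr, size_bits, bits_per_error, parity):
    """MTTDL[2], in the same time unit as mtbf and mttr."""
    p = p_recon_fail(n, size_bits, bits_per_error)
    if parity == 1:      # mirror, RAIDZ, RAID-1, RAID-5
        return mtbf / (n * p)
    if parity == 2:      # 3-way mirror, RAIDZ2, RAID-6
        return mtbf ** 2 / (n * (n - 1) * mttr * p)
    raise ValueError("model covers single and double parity only")

# Hypothetical inputs: 1 TByte disks (8e12 bits), one error per 1e14 bits read.
p_fail = p_recon_fail(2, 8e12, 1e14)
mirror_hours = mttdl2(1e6, 2, 24, 8e12, 1e14, parity=1)
```

With these inputs the two-way mirror result is several thousand times smaller than the MTTDL[1] estimate for the same disks, which is the point of the slide: read errors during reconstruction dominate.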
  • Practical View of MTTDL[1] 39
  • MTTDL Models: Mirror 40
  • MTTDL Models: RAIDZ2 41
  • Ditto Blocks (slide 42)
    - Recall that each blkptr_t contains 3 DVAs
    - Dataset property used to indicate how many copies (aka ditto blocks) of data are desired
    - Write all copies
    - Read any copy
    - Recover corrupted read from a copy
    - Not a replacement for mirroring
    - Easier to describe in pictures...
    copies parameter      Data copies   Metadata copies
    copies=1 (default)    1             2
    copies=2              2             3
    copies=3              3             3
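The copies table on this slide is consistent with a simple rule: metadata gets one more copy than data, capped at the 3 DVAs a blkptr_t can hold. That rule is inferred from the table rather than stated in the tutorial, and the function name below is made up for the sketch.

```python
MAX_DVAS = 3  # each blkptr_t contains 3 DVAs

def metadata_copies(data_copies: int) -> int:
    """Metadata copies implied by the copies property, per the slide's table."""
    if data_copies not in (1, 2, 3):
        raise ValueError("copies must be 1, 2, or 3")
    return min(data_copies + 1, MAX_DVAS)

table = {copies: metadata_copies(copies) for copies in (1, 2, 3)}
```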
  • Copies in Pictures 43
  • Copies in Pictures 44
  • ZIO – ZFS I/O Layer 45
  • ZIO Framework All physical disk I/O goes through ZIO Framework Translates DVAs into Logical Block Address (LBA) on leaf vdevs Keeps free space maps (spacemap) If contiguous space is not available: Allocate smaller blocks (the gang) Allocate gang block, pointing to the gang Implemented as multi-stage pipeline Allows extensions to be added fairly easily Handles I/O errors 46
  • SpaceMap from Space 47
  • ZIO Write Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open compress if savings > 12.5% encrypt generate allocate start start start done done done assess assess assess done Gang activity elided, for clarity 48
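The compression stage keeps the compressed block only when it saves more than 12.5% of the original size. A minimal sketch of that decision follows, using Python's zlib as a stand-in for the ZFS compression algorithms; the function name `maybe_compress` and the sample blocks are assumptions of this example.

```python
import os
import zlib

def maybe_compress(block: bytes, min_savings: float = 0.125):
    """Return (payload, was_compressed); keep compression only if it pays off."""
    candidate = zlib.compress(block)
    saved = len(block) - len(candidate)
    if saved > min_savings * len(block):
        return candidate, True
    return block, False          # not worth it: store the block as-is

text_block = b"zfs " * 1024      # highly repetitive, compresses well
random_block = os.urandom(4096)  # effectively incompressible

text_payload, text_compressed = maybe_compress(text_block)
random_payload, random_compressed = maybe_compress(random_block)
```

Storing incompressible blocks uncompressed avoids paying decompression cost on every read for little or no space gain.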
  • ZIO Read Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open start start start done done done assess assess assess verify decrypt decompress done Gang activity elided, for clarity 49
  • VDEV – Virtual Device Subsystem (slide 50)
    - Where mirrors, RAIDZ, and RAIDZ2 are implemented
      - Surprisingly few lines of code needed to implement RAID
    - Leaf vdev (physical device) I/O management
      - Number of outstanding iops
      - Read-ahead cache
      - Priority scheduling
    Name          Priority
    NOW           0
    SYNC_READ     0
    SYNC_WRITE    0
    FREE          0
    CACHE_FILL    0
    LOG_WRITE     0
    ASYNC_READ    4
    ASYNC_WRITE   4
    RESILVER      10
    SCRUB         20
  • ARC – Adaptive Replacement Cache 51
  • Object Cache UFS uses page cache managed by the virtual memory system ZFS does not use the page cache, except for mmap'ed files ZFS uses an Adaptive Replacement Cache (ARC) ARC used by DMU to cache DVA data objects Only one ARC per system, but caching policy can be changed on a per-dataset basis Seems to work much better than page cache ever did for UFS 52
  • Traditional Cache (slide 53)
    - Works well when data being accessed was recently added
    - Doesn't work so well when frequently accessed data is evicted
    - Dynamic caches can change size by either not evicting or aggressively evicting
    [Figure: a single cache list. Misses cause insert at the MRU end; the oldest entry is evicted from the LRU end]
  • ARC – Adaptive Replacement Cache (slide 54)
    - Evictions and dynamic resizing need to choose the best cache to evict (shrink)
    [Figure: two lists. Recent Cache: a miss inserts at the MRU end, the oldest single-use entry is evicted from the LRU end. Frequent Cache: a hit moves the entry to the MRU end, the oldest multiple accessed entry is evicted from the LRU end]
  • ZFS ARC – Adaptive Replacement Cache with Locked Pages (slide 55)
    - Cannot evict locked pages!
    - ZFS ARC handles mixed-size pages
    [Figure: the same Recent Cache and Frequent Cache lists as the previous slide, with the hit arrow annotated “If hit occurs within 62 ms”]
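The two-list idea can be illustrated with a toy cache: a miss inserts at the MRU end of a "recent" list, a second access promotes the entry to a "frequent" list, and each list evicts from its own LRU end. This is a deliberately simplified sketch with fixed list sizes; the adaptive resizing that the slide mentions, and everything ZFS-specific, is left out. The class name and sizes are made up for the example.

```python
from collections import OrderedDict

class TwoListCache:
    """Toy cache with 'recent' and 'frequent' LRU lists (not the real ARC)."""

    def __init__(self, recent_size: int, frequent_size: int):
        self.recent = OrderedDict()    # entries seen once; last item is MRU
        self.frequent = OrderedDict()  # entries accessed more than once
        self.recent_size = recent_size
        self.frequent_size = frequent_size

    def get(self, key, load):
        """Return the cached value, calling load(key) on a miss."""
        if key in self.frequent:                   # hit: refresh MRU position
            self.frequent.move_to_end(key)
            return self.frequent[key]
        if key in self.recent:                     # second access: promote
            value = self.recent.pop(key)
            self.frequent[key] = value
            if len(self.frequent) > self.frequent_size:
                self.frequent.popitem(last=False)  # oldest multiple accessed entry
            return value
        value = load(key)                          # miss: insert at recent MRU
        self.recent[key] = value
        if len(self.recent) > self.recent_size:
            self.recent.popitem(last=False)        # oldest single-use entry
        return value

cache = TwoListCache(recent_size=2, frequent_size=2)
for key in ["a", "a", "b", "c", "d"]:
    cache.get(key, load=str.upper)
```

After this sequence the one-time scan of b, c, d has pushed "b" out of the recent list, while "a", which was accessed twice, survives in the frequent list. A single LRU list of the same total size would treat all of them alike.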
  • ARC Directory Each ARC directory entry contains arc_buf_hdr structs Info about the entry Pointer to the entry Directory entries have size, ~200 bytes ZFS block size is dynamic, 512 bytes – 128 kBytes Disks are large Suppose we use a Seagate LP 2 TByte disk for the L2ARC Disk has 3,907,029,168 512 byte sectors, guaranteed Workload uses 8 kByte fixed record size RAM needed for arc_buf_hdr entries Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes Don't underestimate the RAM needed for large L2ARCs 56
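The RAM estimate on this slide can be reproduced step by step. The constants below are the slide's own figures (about 200 bytes per directory entry, 9,232 sectors subtracted from the disk, 8 kByte records); only the function and variable names are inventions of this sketch.

```python
SECTOR_BYTES = 512
HEADER_BYTES = 200         # approximate arc_buf_hdr directory entry size
RESERVED_SECTORS = 9_232   # sectors the slide subtracts from the disk

def l2arc_header_ram(disk_sectors: int, record_bytes: int) -> int:
    """Bytes of RAM needed for the directory entries of a full L2ARC device."""
    sectors_per_record = record_bytes // SECTOR_BYTES
    records = (disk_sectors - RESERVED_SECTORS) // sectors_per_record
    return records * HEADER_BYTES

need = l2arc_header_ram(3_907_029_168, 8 * 1024)
print(f"{need / 1e9:.1f} GB")
```

The estimate scales inversely with record size: the same device filled with 128 kByte records needs one sixteenth of the RAM.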
  • L2ARC – Level 2 ARC (slide 57)
    - ARC evictions are sent to cache vdev
    - ARC directory remains in memory
    - Works well when cache vdev is optimized for fast reads
      - lower latency than pool disks
      - inexpensive way to “increase memory”
    - Content considered volatile, no ZFS data protection allowed
    - Monitor usage with zpool iostat
    [Figure: evicted data flows from the ARC to three “cache” vdevs]
  • ARC Tips In general, it seems to work well for most workloads ARC size will vary, based on usage Default max is 3/4 of memory or memory - 1 GByte Min is 64 MB Metadata capped at 1/4 of max ARC size Internals tracked by kstats in Solaris Use memory_throttle_count to observe pressure to evict Can limit at boot time Solaris – set zfs:zfs_arc_max in /etc/system Performance Prior to b107, L2ARC fill rate was limited to 8 MBytes/s L2ARC keeps its directory in kernel memory 58
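The default sizing rules on this slide can be written down as a small calculation. The sketch assumes that "3/4 of memory or memory - 1 GByte" means the larger of the two values; that reading, along with the function name, is an assumption of this example rather than something the slide states.

```python
GIB = 1 << 30
MIB = 1 << 20

def default_arc_limits(physical_memory: int) -> dict:
    """Default ARC limits in bytes for a machine with this much RAM."""
    # Assumption: the default max is whichever of the two values is larger.
    arc_max = max(physical_memory * 3 // 4, physical_memory - GIB)
    return {
        "arc_max": arc_max,
        "arc_min": 64 * MIB,            # Min is 64 MB
        "metadata_limit": arc_max // 4, # Metadata capped at 1/4 of max
    }

small = default_arc_limits(2 * GIB)
large = default_arc_limits(8 * GIB)
```

Under that assumption the 3/4 rule applies to machines with less than 4 GBytes of RAM and the "all but 1 GByte" rule to larger ones.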
  • Transactional Object Layer 59
  • flash Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration 60
  • DMU – Data Management Unit Datasets issue transactions to the DMU Transactional based object model Transactions are Atomic Grouped (txg = transaction group) Responsible for on-disk data ZFS Attribute Processor (ZAP) Dataset and Snapshot Layer (DSL) ZFS Intent Log (ZIL) 61
  • Transaction Engine Manages physical I/O Transactions grouped into transaction group (txg) txg updates All-or-nothing Commit interval Older versions: 5 seconds Now: 30 seconds max, dynamically scale based on time required to commit txg Delay committing data to physical storage Improves performance A bad thing for sync workloads – hence the ZFS Intent Log (ZIL) 30 second delay can impact failure detection time 62
  • ZIL – ZFS Intent Log DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers NFS Databases ZIL recordsize inflation can occur for some workloads May cause larger than expected actual I/O for sync workloads Oracle redo logs Can tune zfs_immediate_write_sz, but after b122 use logbias property instead Never read, except at import (eg reboot), when transactions may need to be rolled forward 63
  • Separate Logs (slogs) ZIL competes with pool for iops Applications will wait for sync writes to be on nonvolatile media Very noticeable on HDD JBODs Put ZIL on separate vdev, outside of pool ZIL writes tend to be sequential No competition with pool for IOPS Downside: slog device required to be operational at import b125 adds slog device removal support Size of separate log < than size of RAM (duh) 10x or more performance improvements possible Use write-optimized SSD or non-volatile write cache on RAID array Use zilstat to observe ZIL activity 64
  • Synchronous Write Destination (slide 65)
    Without separate log
    Sync I/O size > zfs_immediate_write_sz ?   ZIL Destination
    no                                         ZIL log
    yes                                        bypass to pool
    With separate log
    Sync I/O size > zfs_immediate_write_sz ?   logbias?                  ZIL Destination
    no                                                                   log device
    yes                                        prior to logbias (b122)   log device
    yes                                        latency (default)         log device
    yes                                        throughput                bypass to pool
    Default zfs_immediate_write_sz = 32 kBytes
  • Disabling the ZIL (slide 66)
    - Rule 0: Don’t disable the ZIL
    - If you love your data, do not disable the ZIL
    - You can find references to this as a way to speed up ZFS
      - NFS workloads
      - “tar -x” benchmarks
    - Golden Rule: Don’t disable the ZIL
    - Can set via mdb, but need to remount the file system under test
    - Friends don’t let friends disable the ZIL
    - Solaris - can set in /etc/system
        *** TEMPORARY disable ZIL for non-production use
        *** disabled by <your name> on <date>
        set zfs:zil_disable=1
    - Nostradamus wrote, “disabling the ZIL will lead to the apocalypse”
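The decision table on this slide can be expressed as a short function. This is a sketch of the table as read from the slide, not of the ZFS code path; the function name, the returned strings, and the use of `None` for "prior to logbias" are choices made for this example.

```python
IMMEDIATE_WRITE_SZ = 32 * 1024   # default zfs_immediate_write_sz (32 kBytes)

def zil_destination(io_size, has_slog, logbias="latency",
                    threshold=IMMEDIATE_WRITE_SZ):
    """Where a synchronous write lands, following the slide's table.

    logbias is "latency", "throughput", or None for builds before b122.
    """
    large = io_size > threshold
    if not has_slog:
        return "bypass to pool" if large else "ZIL log"
    if large and logbias == "throughput":
        return "bypass to pool"
    return "log device"

small_no_slog = zil_destination(8 * 1024, has_slog=False)
large_no_slog = zil_destination(128 * 1024, has_slog=False)
large_latency = zil_destination(128 * 1024, has_slog=True)
large_throughput = zil_destination(128 * 1024, has_slog=True,
                                   logbias="throughput")
```

The practical reading: with a separate log, only large writes on datasets set to logbias=throughput skip the log device.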
  • DSL – Dataset and Snapshot Layer 67
  • flash Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free 68
  • zfs snapshot Create a read-only, point-in-time window into the dataset (file system or Zvol) Computationally free, because of COW architecture Very handy feature Patching/upgrades Basis for Time Slider 69
  • Snapshot Current tree root Snapshot tree root Create a snapshot by not free'ing COWed blocks Snapshot creation is fast and easy Number of snapshots determined by use – no hardwired limit Recursive snapshots also possible 70
  • Clones Snapshots are read-only Clones are read-write based upon a snapshot Child depends on parent Cannot destroy parent without destroying all children Can promote children to be parents Good ideas OS upgrades Change control Replication zones virtual disks 71
  • zfs clone Create a read-write file system from a read-only snapshot Used extensively for OpenSolaris upgrades OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 snapshot snapshot snapshot OS rev1 upgrade OS rev2 clone boot manager Origin snapshot cannot be destroyed, if clone exists 72
  • zfs rollback OS b104 OS b104 rpool/ROOT/b104 rpool/ROOT/b104 OS b104 OS b104 snapshot rollback snapshot rpool/ROOT/b104@today rpool/ROOT/b104@today 73
  • Commands 74
  • zpool(1m) raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 75
  • Dataset & Snapshot Layer Object Allocated storage dnode describes collection of Dataset Directory blocks Dataset Object Set Object Set Childmap Group of related objects Dataset Object Object Object Properties Snapmap: snapshot relationships Snapmap Space usage Dataset directory Childmap: dataset relationships Properties 76
  • zpool create (slide 77)
    - zpool create poolname vdev-configuration
    - vdev-configuration examples
        mirror c0t0d0 c3t6d0
        mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
        mirror disk1s0 disk2s0 cache disk4s0 log disk5
        raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
    - Solaris
      - Additional checks to see if disk/slice overlaps or is currently in use
      - Whole disks are given EFI labels
    - Can set initial pool or dataset properties
    - By default, creates a file system with the same name
      - poolname pool → /poolname file system
    - People get confused by a file system with same name as the pool
  • zpool add Adds a device to the pool as a top-level vdev zpool add poolname vdev-configuration vdev-configuration can be any combination also used for zpool create Complains if the added vdev-configuration would cause a different data protection scheme than is already in use – use “-f” to override Good idea: try with “-n” flag first – will show final configuration without actually performing the add Do not add a device which is in use as a quorum device 78
  • zpool remove Remove a top-level vdev from the pool zpool remove poolname vdev Today, you can only remove the following vdevs: cache hot spare separate log (b124) An RFE is open to allow removal of other top-level vdevs Don't confuse “remove” with “detach” 79
  • zpool attach (slide 80)
    - Attach a vdev as a mirror to an existing vdev
      - zpool attach poolname existing-vdev vdev
    - Attaching vdev must be the same size or larger than the existing vdev
    - Note: today, not available for RAIDZ, RAIDZ2, or RAIDZ3 vdevs
    vdev Configurations
    ok   simple vdev → mirror
    ok   mirror
    ok   log → mirrored log
    no   RAIDZ
    no   RAIDZ2
    no   RAIDZ3
    - “Same size” literally means the same number of blocks until b117. Beware that many “same size” disks have a different number of available blocks.
  • zpool import Import a pool and mount all mountable datasets Import a specific pool zpool import poolname zpool import GUID Scan LUNs for pools which may be imported zpool import Can set options, such as alternate root directory or other properties Beware of zpool.cache interactions Beware of artifacts, especially partial artifacts 81
  • zpool history (slide 82)
    - Show history of changes made to the pool
    # zpool history rpool
    History for 'rpool':
    2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
    2009-03-04.07:29:47 zfs set canmount=noauto rpool
    2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
    2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
    2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
    2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
    2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
    2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
    2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
    2009-03-04.07:29:51 zfs set canmount=on rpool
    2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
    2009-03-04.07:29:51 zfs create rpool/export/home
    2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
    2009-03-04.00:21:42 zpool export rpool
    2009-03-04.08:47:08 zpool set bootfs=rpool rpool
    2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
    2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
    2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108
    ...
  • zpool status (slide 83)
    - Shows the status of the current pools, including their configuration
    - Important troubleshooting step
    # zpool status
    …
      pool: zwimming
     state: ONLINE
    status: The pool is formatted using an older on-disk format. The pool can
            still be used, but some features are unavailable.
    action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
            pool will no longer be accessible on older software versions.
     scrub: none requested
    config:
            NAME          STATE     READ WRITE CKSUM
            zwimming      ONLINE       0     0     0
              mirror      ONLINE       0     0     0
                c0t2d0s0  ONLINE       0     0     0
                c0t0d0s7  ONLINE       0     0     0
    errors: No known data errors
    - Understanding status output error messages can be tricky
  • zpool iostat (slide 84)
    - Show pool physical I/O activity, in an iostat-like manner
    - Solaris: fsstat will show I/O activity looking into a ZFS file system
    - Especially useful for showing slog activity
    # zpool iostat -v
                     capacity     operations    bandwidth
    pool           used  avail   read  write   read  write
    ------------  -----  -----  -----  -----  -----  -----
    rpool         16.5G   131G      0      0  1.16K  2.80K
      c0t0d0s0    16.5G   131G      0      0  1.16K  2.80K
    ------------  -----  -----  -----  -----  -----  -----
    zwimming       135G  14.4G      0      5  2.09K  27.3K
      mirror       135G  14.4G      0      5  2.09K  27.3K
        c0t2d0s0      -      -      0      3  1.25K  27.5K
        c0t0d0s7      -      -      0      2  1.27K  27.5K
    ------------  -----  -----  -----  -----  -----  -----
    - Unlike iostat, does not show latency
  • zfs(1m) raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 85
  • zfs create, destroy By default, a file system with the same name as the pool is created by zpool create Name format is: pool/name[/name ...] File system zfs create fs-name zfs destroy fs-name Zvol zfs create -V size vol-name zfs destroy vol-name Parameters can be set at create time 86
  • zfs list List mounted datasets Old versions: listed everything After b108: do not list snapshots See zpool listsnapshots property Examples zfs list zfs list -t snapshot zfs list -H -o name 87
  • zfs send, receive (slide 88)
    - Send
      - send a snapshot to stdout
      - data is decompressed
    - Receive
      - receive a snapshot from stdin
      - receiving file system parameters apply (compression, et al.)
    - Can incrementally send snapshots in time order
    - Handy way to replicate dataset snapshots
    - Only method for replicating dataset properties, except quotas
    - NOT a replacement for traditional backup solutions
      - All-or-nothing design per snapshot
      - In general, does not send files (!)
      - Send streams from b35 (or older) no longer supported after b89
  • Sharing 89
  • Sharing zfs share dataset Type of sharing set by parameters shareiscsi = [on | off] sharenfs = [on | off | options] sharesmb = [on | off | options] Shortcut to manage sharing Uses external services (nfsd, iscsi target, smbshare, etc) Importing pool will also share May vary by OS 90
  • NFS ZFS file systems work as expected use ACLs based on NFSv4 ACLs Parallel NFS, aka pNFS, aka NFSv4.1 Still a work-in-progress http://opensolaris.org/os/project/nfsv41/ zfs create -t pnfsdata mypnfsdata pNFS Client pNFS Data Server pNFS Data Server pnfsdata pnfsdata pNFS dataset dataset Metadata Server pool pool 91
  • CIFS UID mapping casesensitivity parameter Good idea, set when file system is created zfs create -o casesensitivity=insensitive mypool/Shared Shadow Copies for Shared Folders (VSS) supported CIFS clients cannot create shadow remotely (yet) CIFS features vary by OS, Samba, etc. 92
  • iSCSI SCSI over IP Block-level protocol Uses Zvols as storage Solaris has 2 iSCSI target implementations shareiscsi enables old, user-land iSCSI target To use COMSTAR, enable using itadm(1m) b116 more closely integrates COMSTAR (zpool version 16) iSCSI performance hiccup Prior to b107, iSCSI over Zvols didn’t properly handle sync writes b107-b113, iSCSI over Zvols made all writes sync (read: slow) Workaround: enable write cache enable in the iSCSI target, see CR6770534 OpenSolaris 2009.06 is b111 b114, write cache enable works automatically iSCSI over Zvol 93
  • Properties 94
  • Properties Properties are stored in an nvlist By default, are inherited Some properties are common to all datasets, but a specific dataset type may have additional properties Easily set or retrieved via scripts In general, properties affect future file system activity zpool get doesn't script as nicely as zfs get 95
  • User-defined Properties Names Must include colon ':' Can contain lower case alphanumerics or “+” “.” “_” Max length = 256 characters By convention, module:property com.sun:auto-snapshot Values Max length = 1024 characters Examples com.sun:auto-snapshot=true com.richardelling:important_files=true 96
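The naming rules on this slide are easy to check in a script before calling zfs set. The sketch below is a hypothetical validator, not part of any ZFS tool. One assumption is labeled in the code: the slide lists “+”, “.”, and “_” as allowed punctuation, but its own example (com.sun:auto-snapshot) contains a dash, so the pattern also accepts “-”.

```python
import re

MAX_NAME_LEN = 256
MAX_VALUE_LEN = 1024

# Lower-case alphanumerics plus '+', '.', '_' and the mandatory ':'.
# Assumption: '-' is also accepted, because the slide's own example uses it.
_NAME_RE = re.compile(r"[a-z0-9+._:-]+")

def is_valid_user_property(name: str, value: str) -> bool:
    """Check a user-defined property against the rules on the slide."""
    return (
        ":" in name
        and len(name) <= MAX_NAME_LEN
        and len(value) <= MAX_VALUE_LEN
        and _NAME_RE.fullmatch(name) is not None
    )

ok_sun = is_valid_user_property("com.sun:auto-snapshot", "true")
ok_custom = is_valid_user_property("com.richardelling:important_files", "true")
```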
  • set & get properties Set zfs set compression=on export/home/relling Get zfs get compression export/home/relling Reset to inherited value zfs inherit compression export/home/relling Clear user-defined parameter zfs inherit com.sun:auto-snapshot export/home/ relling 97
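Since properties are easy to retrieve from scripts, a thin wrapper around zfs get is often all that is needed. The sketch below assumes the -H (no header, script-friendly) and -o value (print only the value column) options of zfs get; the function names are made up, and nothing here runs unless `zfs_get` is called on a system with ZFS installed.

```python
import subprocess

def zfs_get_argv(prop: str, dataset: str) -> list:
    """Build the command line for a script-friendly property lookup."""
    return ["zfs", "get", "-H", "-o", "value", prop, dataset]

def zfs_get(prop: str, dataset: str) -> str:
    """Run zfs get and return the bare property value (requires ZFS)."""
    result = subprocess.run(zfs_get_argv(prop, dataset),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

argv = zfs_get_argv("compression", "export/home/relling")
```

Passing the arguments as a list avoids shell quoting problems with dataset names.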
  • Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoexpand Policy for expanding when vdev size changes autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/ zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policy 98
  • More Pool Properties Property Change? Brief Description guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk version 99
  • Property Change? Brief Description available readonly Space available to dataset & children checksum Checksum algorithm compression Compression algorithm compressratio readonly Compression ratio – logical size:referenced physical copies Number of copies of user data creation readonly Dataset creation time logbias Separate log write policy origin readonly For clones, origin snapshot primarycache ARC caching policy readonly Is dataset in readonly mode? referenced readonly Size of data accessible by this dataset 100
  • More Common Dataset Properties Property Change? Brief Description refreservation Max space guaranteed to a dataset, excluding descendants (snapshots & clones) reservation Minimum space guaranteed to dataset, including descendants secondarycache L2ARC caching policy type readonly Type of dataset (filesystem, snapshot, volume) 101
  • More Common Dataset Properties
      Property              Change?     Brief Description
      used                  readonly    Sum of usedby* (see below)
      usedbychildren        readonly    Space used by descendants
      usedbydataset         readonly    Space used by dataset
      usedbyrefreservation  readonly    Space used by a refreservation for this dataset
      usedbysnapshots       readonly    Space used by all snapshots of this dataset
      zoned                 readonly    Is dataset added to non-global zone (Solaris)
      102
  • Volume Dataset Properties
      Property        Change?     Brief Description
      shareiscsi                  iSCSI service (not COMSTAR)
      volblocksize    creation    fixed block size
      volsize                     Implicit quota
      zoned           readonly    Set if dataset delegated to non-global zone (Solaris)
      103
  • File System Properties
      Property         Change?     Brief Description
      aclinherit                   ACL inheritance policy, when files or directories are created
      aclmode                      ACL modification policy, when chmod is used
      atime                        Disable access time metadata updates
      canmount                     Mount policy
      casesensitivity  creation    Filename matching algorithm (CIFS client feature)
      devices                      Device opening policy for dataset
      exec                         File execution policy for dataset
      mounted          readonly    Is file system currently mounted?
      104
  • More File System Properties
      Property         Change?         Brief Description
      nbmand           export/import   File system should be mounted with non-blocking mandatory locks (CIFS client feature)
      normalization    creation        Unicode normalization of file names for matching
      quota                            Max space dataset and descendants can consume
      recordsize                       Suggested maximum block size for files
      refquota                         Max space dataset can consume, not including descendants
      setuid                           setuid mode policy
      sharenfs                         NFS sharing options
      sharesmb                         File system shared with CIFS
      105
  • File System Properties
      Property         Change?     Brief Description
      snapdir                      Controls whether .zfs directory is hidden
      utf8only         creation    UTF-8 character file name policy
      vscan                        Virus scan enabled
      xattr                        Extended attributes policy
      106
  • More Goodies... 107
  • Dataset Space Accounting
      used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation
      Lazy updates, may not be correct until txg commits
      ls and du will show size of allocated files which includes all copies of a file
      Shorthand report available
      $ zfs list -o space
      NAME                  AVAIL  USED   USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
      rpool                 126G   18.3G  0         35.5K   0              18.3G
      rpool/ROOT            126G   15.3G  0         18K     0              15.3G
      rpool/ROOT/snv_106    126G   86.1M  0         86.1M   0              0
      rpool/ROOT/snv_b108   126G   15.2G  5.89G     9.28G   0              0
      rpool/dump            126G   1.00G  0         1.00G   0              0
      rpool/export          126G   37K    0         19K     0              18K
      rpool/export/home     126G   18K    0         18K     0              0
      rpool/swap            128G   2G     0         193M    1.81G          0
      108
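The accounting identity on this slide can be checked against the listing. Below is a minimal Python sketch; the function name `used_space` is hypothetical, and the figures are the rpool/ROOT/snv_b108 row taken from the zfs list -o space output, expressed in GBytes.

```python
def used_space(usedbydataset, usedbychildren, usedbysnapshots, usedbyrefreservation):
    """Sum the four usedby* components; all arguments must share one unit."""
    return usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation


# rpool/ROOT/snv_b108 from the listing: USEDDS=9.28G, USEDSNAP=5.89G, others 0
total = used_space(usedbydataset=9.28, usedbychildren=0.0,
                   usedbysnapshots=5.89, usedbyrefreservation=0.0)
print(f"{total:.1f}G")  # prints 15.2G, matching the USED column
```

Because updates are lazy, a live system may briefly show components that do not sum exactly to used until the transaction group commits.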
  • zfs vs zpool Space Accounting zfs list != zpool list zfs list shows space used by the dataset plus space for internal accounting zpool list shows physical space available to the pool For simple pools and mirrors, they are nearly the same For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space available for parity Users will be confused about reported space available 109
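The gap between the two views for RAID-Z can be approximated with simple arithmetic. This is a rough Python sketch under stated assumptions: a single RAID-Z vdev of equal-sized disks, with padding, metadata, and internal accounting ignored. The function name `raidz_space` is hypothetical; the raw figure is roughly what zpool list reports, and the usable figure is only an upper bound on what zfs list can show.

```python
def raidz_space(n_disks, disk_size, parity):
    """Return (raw, approx_usable) for one RAID-Z vdev, in the units of disk_size."""
    if not 1 <= parity <= 3 or n_disks <= parity:
        raise ValueError("need parity in 1..3 and more disks than parity")
    raw = n_disks * disk_size                 # includes space that will hold parity
    usable = (n_disks - parity) * disk_size   # approximation: parity space removed
    return raw, usable
```

For example, six 1000 GByte disks in RAIDZ2 give about 6000 GBytes raw and at most about 4000 GBytes usable, which is the kind of difference that confuses users.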
  • Accessing Snapshots
      By default, snapshots are accessible in .zfs directory
      Visibility of .zfs directory is tunable via snapdir property
      Don't really want find to find the .zfs directory
      Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS)
      # zfs snapshot rpool/export/home/relling@20090415
      # ls -a /export/home/relling
      … .Xsession .xsession-errors
      # ls /export/home/relling/.zfs
      shares snapshot
      # ls /export/home/relling/.zfs/snapshot
      20090415
      # ls /export/home/relling/.zfs/snapshot/20090415
      Desktop Documents Downloads Public
      110
  • Time-based Resilvering Block pointers contain birth txg number Resilvering begins with oldest blocks first Interrupted resilver will still result in a valid file system view [Figure: block tree with nodes labeled by birth txg; legend: Birth txg = 27, Birth txg = 68, Birth txg = 73] 111
  • Time Slider - Automatic Snapshots
      Underpinnings for Solaris feature similar to OSX's Time Machine
      SMF service for managing snapshots
      SMF properties used to specify policies: frequency (interval) and number to keep
      Creates cron jobs
      GUI tool makes it easy to select individual file systems
      Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion
      Service Name              Interval (default)   Keep (default)
      auto-snapshot:frequent    15 minutes           4
      auto-snapshot:hourly      1 hour               24
      auto-snapshot:daily       1 day                31
      auto-snapshot:weekly      7 days               4
      auto-snapshot:monthly     1 month              12
      112
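The "number to keep" policy explains why milestone snapshots should be taken separately: once a schedule exceeds its keep count, the oldest automatic snapshots are removed. The Python sketch below only illustrates that retention idea; it is not the SMF service's actual logic, and the function name `snapshots_to_destroy` is hypothetical.

```python
def snapshots_to_destroy(snapshot_names, keep):
    """Given names sorted oldest to newest, return those beyond the newest `keep`."""
    if keep <= 0:
        return list(snapshot_names)
    # Everything except the last `keep` entries is eligible for deletion.
    return list(snapshot_names[:-keep])
```

With the default keep of 4 for the frequent schedule, a fifth 15-minute snapshot makes the oldest one eligible for deletion, so anything worth preserving longer needs its own manual snapshot.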
  • Nautilus File system views which can go back in time 113
  • ACL – Access Control List Based on NFSv4 ACLs Similar to Windows NT ACLs Works well with CIFS services Supports ACL inheritance Change using chmod View using ls 114
  • Checksums for Data Block pointer contains 256 bits for checksum Checksum is in the parent, not in the block itself Types none fletcher2: truncated 2nd order Fletcher-like algorithm (default prior to b114) fletcher4: 4th order Fletcher-like algorithm (default, starting b114) SHA-256 There are open proposals for better algorithms 115
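To make "4th order Fletcher-like" concrete, here is a minimal Python sketch of a fletcher4-style checksum: four 64-bit running sums over 32-bit words, which together fill the 256-bit checksum field. The function name and the little-endian word order are assumptions made for illustration; this is not the ZFS source, and it is far slower than a native implementation.

```python
import struct

MASK64 = (1 << 64) - 1  # keep each accumulator within 64 bits


def fletcher4(data: bytes):
    """Return the four 64-bit running sums over little-endian 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64  # 1st order: sum of the words
        b = (b + a) & MASK64     # 2nd order: sum of the sums
        c = (c + b) & MASK64     # 3rd order
        d = (d + c) & MASK64     # 4th order
    return a, b, c, d
```

Because the higher-order sums depend on word position, swapping two words changes the result, which a plain sum would not detect.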
  • Checksum Use
      Pool          Algorithm             Notes
      Uberblock     SHA-256               self-checksummed
      Metadata      fletcher4
      Labels        SHA-256
      Gang block    SHA-256               self-checksummed
      Dataset       Algorithm             Notes
      Metadata      fletcher4
      Data          fletcher4 (default)   zfs checksum parameter
      ZIL log       fletcher2             self-checksummed
      Send stream   fletcher4
      Note: fletcher2 was the default for data prior to b114
      Note: ZIL log has additional checking beyond the checksum
      116
  • Compression Builtin lzjb, Lempel-Ziv by Jeff Bonwick gzip, levels 1-9 Extensible new compressors can be added backwards compatibility issues Uses taskqs to take advantage of multi-processor systems Do you have a better compressor in mind? http://richardelling.blogspot.com/2009/08/justifying-new-compression-algorithms.html Cannot boot from gzip compressed root (RFE is open) 117
  • Encryption Placeholder – details TBD http://opensolaris.org/os/project/zfs-crypto Complicated by: Block pointer rewrites Deduplication 118
  • Quotas File system quotas quota includes descendants (snapshots, clones) refquota does not include descendants User and group quotas b114, Solaris 10 10/09 (patch 141444-03 or 141445-03) Works like refquota, descendants don't count Not inherited zfs userspace and groupspace subcommands show quotas Users can only see their own and group quota, but can delegate Managed like properties [user|group]quota@[UID|username|SID name|SID number] not visible via zfs get all 119
  • zpool.cache Old way mount / read /etc/[v]fstab mount file systems ZFS import pool(s) find mountable datasets and mount them /etc/zfs/zpool.cache is a cache of pools to be imported at boot time No scanning of all available LUNs for pools to import Binary: dump contents with zdb -C cachefile property permits selecting an alternate zpool.cache Useful for OS installers Useful for clusters, where you don't want a booting node to automatically import a pool Not persistent (!) 120
  • Mounting ZFS File Systems By default, mountable file systems are mounted when the pool is imported Controlled by canmount policy (not inherited) on – (default) file system is mountable off – file system is not mountable if you want children to be mountable, but not the parent noauto – file system must be explicitly mounted (boot environment) Can zfs set mountpoint=legacy to use /etc/vfstab By default, cannot mount on top of non-empty directory Can override explicitly using zfs mount -O or legacy mountpoint Mount properties are persistent, use zfs mount -o for temporary changes Imports are done in parallel, beware of mountpoint races prior to b104 121
  • recordsize Dynamic Max 128 kBytes Min 512 Bytes Power of 2 For most workloads, don't worry about it For fixed size workloads, can set to match workloads Databases iSCSI Zvols serving NTFS or ext3 (use 4 KB) File systems or Zvols zfs set recordsize=8k dataset 122
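The constraints on this slide (power of 2, between 512 bytes and 128 kBytes) can be expressed as a short check. This is a minimal Python sketch; the function name `is_valid_recordsize` is hypothetical and the limits are the ones stated on the slide.

```python
MIN_RECORDSIZE = 512          # bytes, from the slide
MAX_RECORDSIZE = 128 * 1024   # bytes, from the slide


def is_valid_recordsize(size: int) -> bool:
    """Return True if size is a power of 2 within the slide's limits."""
    in_range = MIN_RECORDSIZE <= size <= MAX_RECORDSIZE
    power_of_two = size > 0 and (size & (size - 1)) == 0
    return in_range and power_of_two
```

Under these rules 8 KB (a common database block size) and 4 KB are accepted, while 12 KB is rejected because it is not a power of 2.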
  • Delegated Administration Fine grain control users or groups of users subcommands, parameters, or sets Similar to Solaris' Role Based Access Control (RBAC) Enable/disable at the pool level zpool set delegation=on mypool (default) Allow/unallow at the dataset level zfs allow relling snapshot mypool/relling zfs allow @backupusers snapshot,send mypool/sw zfs allow mypool/relling 123
  • Delegatable Subcommands allow receive clone rename create rollback destroy send groupquota share groupused snapshot mount userquota promote userused 124
  • Delegatable Parameters aclinherit nbmand sharesmb aclmode normalization snapdir atime quota userprop canmount readonly utf8only casesensitivity recordsize version checksum refquota volsize compression refreservation vscan copies reservation xattr devices setuid zoned exec shareiscsi mountpoint sharenfs 125
  • Browser User Interface Solaris 10 – WebConsole Nexenta OpenStorage 126
  • Solaris WebConsole 127
  • Solaris WebConsole 128
  • Nexenta www.nexenta.com/corp/images/stories/pdfs/nexentastor%20briefing%206%2030%20final%20june%2029%2009.pdf 129
  • OpenStorage 130
  • Solaris Swap and Dump Swap Solaris does not have automatic swap resizing Swap as a separate dataset Swap device is raw, with a refreservation Blocksize matched to pagesize: 8 kB SPARC, 4 kB x86 Don't really need or want snapshots or clones Can resize while online, manually Dump Only used during crash dump Preallocated No refreservation Checksum off Compression off (dumps are already compressed) 131
  • Performance 132
  • General Comments In general, performs well out of the box Standard performance improvement techniques apply Lots of DTrace knowledge available Typical areas of concern: ZIL check with zilstat, improve with slogs COW “fragmentation” check iostat, improve with L2ARC Memory consumption check with arcstat set primarycache property can be capped can compete with large page aware apps Compression, or lack thereof 133
  • ZIL Performance : NFS Big performance increases demonstrated especially with SSDs for RAID arrays with nonvolatile RAM cache, not so much NFS servers 32kByte threshold (zfs_immediate_write_sz) also corresponds to NFSv3 write size May cause more work than needed See CR6686887 134
  • ZIL Performance : Databases The logbias property can be set on a dataset to control threshold for writing to pool when a slog is used logbias=latency (default) all writes go to slog logbias=throughput, writes > zfs_immediate_write_sz go to pool Settable on-the-fly Consider changing policy during database loads Can have different sync policies for logs and data Oracle, separate latency-sensitive redo log traffic from Redo logs: logbias=latency Indexes: logbias=latency Data files: logbias=throughput MySQL with InnoDB logbias=latency 135
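The logbias policy described on this slide can be modeled as a simple decision. The Python sketch below is a simplified reading of the slide, not ZIL code: the function name `sync_write_target` is hypothetical, it assumes a slog is present, and the 32 kByte value for zfs_immediate_write_sz is the threshold quoted on the NFS slide.

```python
ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024  # bytes; threshold quoted on the NFS slide


def sync_write_target(size_bytes, logbias="latency"):
    """Return 'slog' or 'pool' for a synchronous write, per the slide's description."""
    if logbias == "latency":
        return "slog"  # default: all writes go to the separate log
    if logbias == "throughput":
        # Only writes larger than the threshold go to the main pool.
        return "pool" if size_bytes > ZFS_IMMEDIATE_WRITE_SZ else "slog"
    raise ValueError("logbias must be 'latency' or 'throughput'")
```

In this model a 128 KB data file write with logbias=throughput lands in the pool, while redo log writes with logbias=latency always land on the slog.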
  • More ZIL Performance : Databases
      I/O size inflation
      Once a file grows to use a block size, it will keep that block size
      Block size is capped by recordsize
      recordsize is a power of 2: 512 bytes, 1 KB, 2 KB, 4 KB, ... 128 KB
      Can be inefficient if the workload is sync and writes variable sized data
      Oracle performance work: Roch reports 40% improvement for JBOD (HDD) + separate log (SSD) with:
      File system or Zvol Role   recordsize         logbias
      data files                 8 KB               throughput
      redo logs                  128 KB (default)   latency (default)
      indices                    8-32 KB?           latency (default)
      136
  • vdev Cache
      vdev cache occurs at the SPA level
      readahead
      10 MBytes per vdev
      only caches metadata (b70 or later)
      Stats collected as Solaris kstats
      # kstat -n vdev_cache_stats
      module: zfs                instance: 0
      name:   vdev_cache_stats   class:    misc
              crtime             38.83342625
              delegations        14030
              hits               105169
              misses             59452
              snaptime           4564628.18130739
      Hit rate = 59%, not bad...
      137
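The slide does not state how the 59% figure was derived. The Python sketch below assumes delegations are counted in the denominator, because that reproduces the slide's number from the kstat values shown; counting only hits and misses would give about 64% instead. The function name `vdev_cache_hit_rate` is hypothetical.

```python
def vdev_cache_hit_rate(hits, misses, delegations):
    """Hit fraction, assuming delegations count as lookups (an assumption)."""
    total = hits + misses + delegations
    return hits / total if total else 0.0


# Values from the kstat output on the slide
rate = vdev_cache_hit_rate(hits=105169, misses=59452, delegations=14030)
print(f"{rate:.0%}")  # prints 59%
```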
  • Intelligent Prefetching Intelligent file-level prefetching occurs at the DMU level Feeds the ARC In a nutshell, prefetch hits cause more prefetching Read a block, prefetch a block If we used the prefetched block, read 2 more blocks Up to 256 blocks Recognizes strided reads 2 sequential reads of same length and a fixed distance will be coalesced Fetches backwards Seems to work pretty well, as-is, for most workloads 138
  • Unintelligent Prefetch? Some workloads don't do so well with intelligent prefetch CR6859997, zfs caching performance problem, fixed in NV b124 Look for time spent in zfetch_* functions using lockstat lockstat -I sleep 10 Easy to disable in mdb for testing on Solaris echo zfs_prefetch_disable/W0t1 | mdb -kw Re-enable with echo zfs_prefetch_disable/W0t0 | mdb -kw Set via /etc/system set zfs:zfs_prefetch_disable = 1 139
  • I/O Queues By default, for devices which can support multiple I/Os, up to 35 I/Os are queued to each vdev Tunable with zfs_vdev_max_pending, set to 10 with: echo zfs_vdev_max_pending/W0t10 | mdb -kw Implies that more vdevs is better Consider avoiding RAID array with a single, large LUN ZFS I/O scheduler loses control once iops are queued CR6471212 proposes reserved slots for high-priority iops May need to match queues for the entire data path zfs_vdev_max_pending Fibre channel, SCSI, SAS, SATA driver RAID array controller Fast disks → small queues, slow disks → larger queues 140
  • COW Penalty COW can negatively affect workloads which have updates and sequential reads Initial writes will be sequential Updates (writes) will cause seeks to read data Lots of people seem to worry a lot about this Only affects HDDs Very difficult to speculate about the impact on real-world apps Large sequential scans of random data hurt anyway Reads are cached in many places in the data path Databases can COW, too Sysbench benchmark used to test on MySQL w/InnoDB engine One hour read/write test select count(*) repeat, for a week 141
  • COW Penalty Performance seems to level at about 25% penalty Results compliments of Allan Packer & Neelakanth Nadgir http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf 142
  • About Disks...
      Disks still the most important performance bottleneck
      Modern processors are multi-core
      Default checksums and compression are computationally efficient
      Disk      Size   RPM      Max Size (GBytes)   Average Rotational Latency (ms)   Average Seek (ms)
      HDD       2.5”   5,400    500                 5.5                               11
      HDD       3.5”   5,900    2,000               5.1                               16
      HDD       3.5”   7,200    1,500               4.2                               8 - 8.5
      HDD       2.5”   10,000   300                 3                                 4.2 - 4.6
      HDD       2.5”   15,000   146                 2                                 3.2 - 3.5
      SSD (w)   2.5”   N/A      73                  0                                 0.02 - 0.15
      SSD (r)   2.5”   N/A      500                 0                                 0.02 - 0.15
      143
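The rotational latency column follows from the spindle speed: on average the head waits half a revolution. The Python sketch below shows that arithmetic, plus a common back-of-the-envelope estimate of random IOPS for an HDD (one seek plus half a rotation per I/O). Both function names are hypothetical, and the IOPS model is an approximation that ignores transfer time, queuing, and caching.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait of half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def approx_random_iops(avg_seek_ms, rpm):
    """Rough random IOPS: one average seek plus half a rotation per I/O."""
    return 1000.0 / (avg_seek_ms + avg_rotational_latency_ms(rpm))
```

For a 7,200 RPM disk this gives about 4.2 ms of rotational latency, matching the table within rounding, and roughly 80 random IOPS at an 8 ms average seek, which is why disks remain the bottleneck.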
  • DirectIO
      UFS forcedirectio option brought the early 1980s design of UFS up to the 1990s
      ZFS designed to run on modern multiprocessors
      Databases or applications which manage their data cache may benefit by disabling file system caching
      Expect L2ARC to improve random reads (secondarycache)
      Prefetch disabled by primarycache=none|metadata
      UFS DirectIO                   ZFS
      Unbuffered I/O                 primarycache=metadata, primarycache=none
      Concurrency                    Available at inception
      Improved Async I/O code path   Available at inception
      144
  • Hybrid Storage Pool
      SPA
                      separate log device            Main Pool      L2ARC cache device
      Device          Write optimized device (SSD)   HDD HDD HDD    Read optimized device (SSD)
      Size (GBytes)   < 1 GByte                      large          big
      Cost            write iops/$                   size/$         size/$
      Performance     low-latency writes             -              low-latency reads
      145
  • RAID-Z Bandwidth Traditional RAID-Z had a “mind the gap” feature Impacts possible bandwidth Mirrors could show higher bandwidth Now RAID-Z shows better bandwidth, when channel bandwidth is the constrained resource Implementation caused spurious errors for b118-b123 146
  • Troubleshooting 147
  • Checking Status zpool status zpool status -v Solaris fmadm faulty fmdump fmdump -ev or fmdump -eV format or rmformat 148
  • flash Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free What if the uberblock is updated prior to leaves? 149
  • What if flush is ignored? Some devices ignore cache flush commands (!) Virtualization default=ignore flush: VirtualBox, others? Some USB/Firewire to IDE/SATA converters Problem: uberblock could be updated before leaves Symptom: can’t import pool, uberblock points to random data Affected systems Many OSes and file systems Laptops - rarely because of battery Enterprise-class systems - rarely because of power redundancy and solid design Desktops - more frequently Solution (pending further automation) Check integrity of recent transaction groups If damaged, rollback to older uberblock Today, can do this by hand, but process is tedious 150
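The recovery idea on this slide (check recent transaction groups, then fall back to an older uberblock) can be sketched as a selection loop. This Python illustration is not how ZFS or zdb implements it: the dict layout with a "txg" key, the function name `pick_uberblock`, and the caller-supplied `is_intact` check are all hypothetical.

```python
def pick_uberblock(uberblocks, is_intact):
    """Return the highest-txg uberblock whose tree passes is_intact, else None."""
    for ub in sorted(uberblocks, key=lambda u: u["txg"], reverse=True):
        if is_intact(ub):  # e.g. verify checksums of the blocks it points to
            return ub
    return None


# Example: the newest uberblock (txg 73) points at data that never reached disk
candidates = [{"txg": 71}, {"txg": 73}, {"txg": 72}]
chosen = pick_uberblock(candidates, lambda ub: ub["txg"] != 73)
```

In the example the loop rejects txg 73 and settles on txg 72, giving up the most recent transaction group in exchange for a pool that can be imported.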
  • Can't Import Pool? Check device paths with zpool import Be aware of /etc/zfs/zpool.cache May need zpool import -d directory option “phantom paths”? Check for 4 labels zdb -l /dev/dsk/c0t0d0s0 Beware of device short names: c0d0 != c0d0s0 151
  • Slow Pool Import? Case: zvols with snapshots Symptom: reboot or zpool import is really slllooooowwwwwww... Cause: inefficient incrementing over all zvols creating entries in /dev/zvol/dsk Cure: CR6761786 integrated in b125 152
  • File System Mounts B0rken? Prevention Avoid complex hierarchies (KISS) Be aware of legacy mounts Be aware of alternate boot environments (Solaris) Check mountpoint properties zfs list -o name,mountpoint Shared file systems Be aware of inherited shares Some clients do not mirror mount (Linux) NFS version differences? Check name services 153
  • Can't Boot? Check if BIOS/OBP supports booting from device Make sure LUN has SMI label, not EFI Common mistake when mirroring root OK: zpool attach rpool c0t0d0s0 c0t1d0s7 Not OK: zpool attach rpool c0t0d0s0 c0t1d0 installboot? grub issues Boot environments usually handled by grub Check grub menu.lst Know how to do a failsafe boot Be aware of LiveCD import Be aware of zpool.cache interactions 154
  • Future Plans Announced enhancements in the pipeline from Kernel Conference Australia, July 15-17 2009 Encryption Deduplication Block pointer rewrite Shadow migration More performance tweaks New block allocator Pipeline improvements Raw scrub Scrub prefetch Just in time decompression or decryption Native iSCSI (COMSTAR) Zero copy I/O Parallel device open 155
  • More Future Plans Snapshot holds (b124) Access-based enumeration (b125) Multiple mount protection Separate log offlining (b125) (removal later) 156
  • Now you know... ZFS structure: pools, datasets Data redundancy: mirrors, RAIDZ, copies Data verification: checksums Data replication: snapshots, clones, send, receive Hybrid storage: separate logs, cache devices, ARC Security: allow, unallow, encryption Resource management: quotas, references, I/O scheduler Performance: latency, COW, zilstat, arcstat, logbias, recordsize Troubleshooting: FMA, zdb, importance of cache flushes 157
  • It's a wrap! Thank You! Questions? Richard.Elling@RichardElling.com 158