USENIX LISA11 Tutorial: ZFS
  1. 1. ZFS: A File System for Modern (USENIX LISA'11 Conference, December 2011)
  2. 2. Agenda • Overview • Foundations • Pooled Storage Layer • Transactional Object Layer • ZFS commands • Sharing • Properties • Other goodies • Performance • Troubleshooting
  3. 3. ZFS History • Announced September 14, 2004 • Integration history ✦ SXCE b27 (November 2005) ✦ FreeBSD (April 2007) ✦ Mac OS X Leopard ✤ Preview shown, but removed from Snow Leopard ✤ Disappointed community reformed as the zfs-macos Google group (Oct 2009) ✦ OpenSolaris 2008.05 ✦ Solaris 10 6/06 (June 2006) ✦ Linux FUSE (summer 2006) ✦ greenBytes ZFS+ (September 2008) ✦ Linux native port funded by the US DOE (2010) • More than 45 patents, contributed to the CDDL Patents Common
  4. 4. ZFS Design Goals • Figure out why storage has gotten so complicated • Blow away 20+ years of obsolete assumptions • Gotta replace UFS • Design an integrated system from scratch • End the suffering
  5. 5. Limits • 2^48 — Number of entries in any individual directory • 2^56 — Number of attributes of a file [*] • 2^56 — Number of files in a directory [*] • 16 EiB (2^64 bytes) — Maximum size of a file system • 16 EiB — Maximum size of a single file • 16 EiB — Maximum size of any attribute • 2^64 — Number of devices in any pool • 2^64 — Number of pools in a system • 2^64 — Number of file systems in a pool • 2^64 — Number of snapshots of any file system • 256 ZiB (2^78 bytes) — Maximum size of any pool [*] actually constrained to 2^48 for the number of files in a ZFS file system
  6. 6. Understanding Builds • Build is often referenced when speaking of feature/bug integration • Short-hand notation: b### • Distributions derived from Solaris NV (Nevada) ✦ NexentaStor ✦ Nexenta Core Platform ✦ SmartOS ✦ Solaris 11 (née OpenSolaris) ✦ OpenIndiana ✦ StormOS ✦ BelleniX ✦ SchilliX ✦ MilaX • OpenSolaris builds ✦ Binary builds died at b134 ✦ Source releases continued through b147 • illumos stepping up to fill the void left by OpenSolaris' demise
  7. 7. Community Links • Community links • ZFS Community • IRC channel #zfs
  8. 8. ZFS Foundations 8
  9. 9. Overhead View of a Pool Pool File System Configuration Information Volume File System Volume DatasetZFS Tutorial USENIX LISA’11 9
  10. 10. Hybrid Storage Pool • Adaptive Replacement Cache (ARC) in main memory, plus three classes of pool devices:
      Separate intent log: write-optimized device (SSD); size: 1 - 10 GByte; cost: write iops/$; use: sync writes; optimization: low-latency writes; need more speed? stripe more, faster devices
      Main pool: HDDs; size: large; cost: size/$; use: persistent storage; need more speed? stripe
      Level 2 ARC: read-optimized device (SSD); size: big; cost: size/$; use: read cache; optimization: low-latency reads
  11. 11. Layer View [diagram] Consumers (raw, swap, dump, iSCSI) sit atop the ZFS Volume Emulator (Zvol); consumers (NFS, CIFS) sit atop the ZFS POSIX Layer (ZPL); pNFS and Lustre are other potential consumers; all rest on the Transactional Object Layer, the Pooled Storage Layer, and the Block Device Driver (HDD, SSD, iSCSI)
  12. 12. Source Code Structure [diagram] Interface Layer: ZPL, ZVol, and /dev/zfs (file system, device, and management consumers via libzfs); Transactional Object Layer: ZIL, ZAP, Traversal, DMU, DSL; Pooled Storage Layer: ARC, ZIO, VDEV, Configuration
  13. 13. Acronyms • ARC – Adaptive Replacement Cache • DMU – Data Management Unit • DSL – Dataset and Snapshot Layer • JNI – Java Native Interface • ZPL – ZFS POSIX Layer (traditional file system interface) • VDEV – Virtual Device • ZAP – ZFS Attribute Processor • ZIL – ZFS Intent Log • ZIO – ZFS I/O layer • Zvol – ZFS volume (raw/cooked block device interface)ZFS Tutorial USENIX LISA’11 13
  14. 14. NexentaStor Rosetta Stone NexentaStor OpenSolaris/ZFS Volume Storage pool ZVol Volume Folder File systemZFS Tutorial USENIX LISA’11 14
  15. 15. nvlists • name=value pairs • libnvpair(3LIB) • Allows ZFS capabilities to change without changing the physical on-disk format • Data stored is XDR encoded • A good thing, used oftenZFS Tutorial USENIX LISA’11 15
  16. 16. Versioning • Features can be added and identified by nvlist entries • Changes in pool or dataset versions do not change the physical on-disk format (!) ✦ does change nvlist parameters • Older versions can be used ✦ might see warning messages, but harmless • Available versions and features can be easily viewed ✦ zpool upgrade -v ✦ zfs upgrade -v • Online references for zpool and zfs versions • Don't confuse zpool and zfs versions
  17. 17. zpool Versions
      VER  DESCRIPTION
        1  Initial ZFS version
        2  Ditto blocks (replicated metadata)
        3  Hot spares and double parity RAID-Z
        4  zpool history
        5  Compression using the gzip algorithm
        6  bootfs pool property
        7  Separate intent log devices
        8  Delegated administration
        9  refquota and refreservation properties
       10  Cache devices
       11  Improved scrub performance
       12  Snapshot properties
       13  snapused property
       14  passthrough-x aclinherit support
      Continued...
  18. 18. More zpool Versions
      VER  DESCRIPTION
       15  user/group space accounting
       16  stmf property support
       17  Triple-parity RAID-Z
       18  snapshot user holds
       19  Log device removal
       20  Compression using zle (zero-length encoding)
       21  Deduplication
       22  Received properties
       23  Slim ZIL
       24  System attributes
       25  Improved scrub stats
       26  Improved snapshot deletion performance
       27  Improved snapshot creation performance
       28  Multiple vdev replacements
      For Solaris 10, version 21 is “reserved”
  19. 19. zfs Versions
      VER  DESCRIPTION
        1  Initial ZFS filesystem version
        2  Enhanced directory entries
        3  Case insensitive and File system unique identifier (FUID)
        4  userquota, groupquota properties
        5  System attributes
  20. 20. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update uberblocks & free
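The four COW steps on the slide can be sketched in a few lines. This is an illustrative model only (a toy block tree, not ZFS source): an update never overwrites a block in place; new blocks are written and the parent chain is copied up to the root, so the old tree remains intact and shareable with snapshots.

```python
# Toy model of copy-on-write block-tree updates (illustrative, not ZFS code).

class Block:
    def __init__(self, data=None, children=()):
        self.data = data
        self.children = list(children)

def cow_update(root, path, new_data):
    """Return a NEW root; blocks on `path` are copied, the rest are shared."""
    if not path:
        return Block(data=new_data, children=root.children)
    i = path[0]
    new_children = list(root.children)        # share untouched children
    new_children[i] = cow_update(root.children[i], path[1:], new_data)
    return Block(data=root.data, children=new_children)

old_root = Block(children=[Block(data="a"), Block(data="b")])
new_root = cow_update(old_root, [0], "a2")

assert old_root.children[0].data == "a"               # old tree untouched
assert new_root.children[0].data == "a2"              # new tree sees the update
assert new_root.children[1] is old_root.children[1]   # unchanged block is shared
```

Because the old root stays valid until the uberblock is switched, step 4 ("update uberblocks & free") is what atomically publishes the new tree.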
  21. 21. COW Notes• COW works on blocks, not files• ZFS reserves 32 MBytes or 1/64 of pool size ✦ COWs need some free space to remove files ✦ need space for ZIL• For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched• Spatial distribution is good fodder for performance speculation ✦ affects HDDs ✦ moot for SSDs ZFS Tutorial USENIX LISA’11 21
  22. 22. To fsck or not to fsck • fsck was created to fix known inconsistencies in file system metadata ✦ UFS is not transactional ✦ metadata inconsistencies must be reconciled ✦ does NOT repair data – how could it? • ZFS doesn't need fsck, as-is ✦ all on-disk changes are transactional ✦ COW means previously existing, consistent metadata is not overwritten ✦ ZFS can repair itself ✤ metadata is at least dual-redundant ✤ data can also be redundant • Reality check – this does not mean that ZFS is not susceptible to corruption ✦ nor is any other file system
  23. 23. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ??November 8, 2010 USENIX LISA’10 23
  24. 24. vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type = disk type = disk type = disk type = disk children[0] children[0] children[0] children[0] Physical or leaf vdevsZFS Tutorial USENIX LISA’11 24
  25. 25. vdev Labels • vdev labels != disk labels • Four 256 kByte labels written to every physical vdev • Two-stage update process ✦ write label0 & label2 ✦ flush cache & check for errors ✦ write label1 & label3 ✦ flush cache & check for errors • [diagram] Device layout: label0 at offset 0, label1 at 256k, boot block from 512k to 4M, label2 at N-512k and label3 at N-256k, where N = device size rounded down to a multiple of 256k • Label layout: blank header (0-8k), boot header (8k-16k), name=value pairs (16k-128k), M-slot uberblock array (128k-256k), where M = 128k / MAX(1k, sector size)
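The label placement described above can be computed directly. A small sketch, assuming the layout on the slide (two 256 KiB labels at the front, two at the back, back pair aligned to a 256 KiB boundary); offsets here are derived from the slide, not from the on-disk format specification:

```python
# Sketch: offsets of the four vdev labels for a device of a given size.
K = 1024
LABEL = 256 * K

def label_offsets(dev_size):
    # usable size rounded down to a 256 KiB multiple (the "N" in the diagram)
    n = dev_size - (dev_size % LABEL)
    return [0, LABEL, n - 2 * LABEL, n - LABEL]

offs = label_offsets(500 * 1024 * K)   # a hypothetical ~500 MiB device
assert offs[0] == 0 and offs[1] == 256 * K
assert offs[3] - offs[2] == 256 * K    # label2 and label3 are adjacent at the end
```

Placing copies at both ends is why a truncated or partially overwritten device can still be identified from its surviving labels.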
  26. 26. Observing Labels
      # zdb -l /dev/rdsk/c0t0d0s0
      --------------------------------------------
      LABEL 0
      --------------------------------------------
          version=14
          name=rpool
          state=0
          txg=13152
          pool_guid=17111649328928073943
          hostid=8781271
          hostname=
          top_guid=11960061581853893368
          guid=11960061581853893368
          vdev_tree
              type=disk
              id=0
              guid=11960061581853893368
              path=/dev/dsk/c0t0d0s0
              devid=id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a
              phys_path=/pci@0,0/pci1458,b002@11/disk@0,0:a
              whole_disk=0
              metaslab_array=24
              metaslab_shift=30
              ashift=9
              asize=157945167872
              is_log=0
  27. 27. Uberblocks • Sized based on minimum device block size • Stored in 128-entry circular queue • Only one uberblock is active at any time ✦ highest transaction group number ✦ correct SHA-256 checksum • Stored in the machine's native format ✦ a magic number is used to determine endian format when imported • Contains pointer to the Meta Object Set (MOS)
      Device Block Size   Uberblock Size   Queue Entries
      512 bytes, 1 KB     1 KB             128
      2 KB                2 KB             64
      4 KB                4 KB             32
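The active-uberblock rule above (highest txg with a correct checksum) is easy to sketch. The structures here are hypothetical stand-ins, not the on-disk format; only the selection logic mirrors the slide:

```python
# Sketch: pick the active uberblock from the ring of slots.
import hashlib

def checksum(payload: bytes) -> str:
    # ZFS also uses SHA-256 for uberblock validation
    return hashlib.sha256(payload).hexdigest()

def active_uberblock(slots):
    """Highest-txg slot whose checksum verifies; None if nothing is valid."""
    valid = [s for s in slots
             if s is not None and s["cksum"] == checksum(s["payload"])]
    return max(valid, key=lambda s: s["txg"], default=None)

def make(txg, corrupt=False):
    payload = f"uberblock txg={txg}".encode()
    ck = "0" * 64 if corrupt else checksum(payload)
    return {"txg": txg, "payload": payload, "cksum": ck}

slots = [make(10), make(12, corrupt=True), make(11), None]
assert active_uberblock(slots)["txg"] == 11   # txg 12 fails its checksum
```

This is also why a torn write to the newest uberblock is harmless: the previous valid entry in the ring simply wins.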
  28. 28. About Sizes • Sizes are dynamic • LSIZE = logical size • PSIZE = physical size after compression • ASIZE = allocated size including: ✦ physical size ✦ raidz parity ✦ gang blocks Old notions of size reporting confuse peopleZFS Tutorial USENIX LISA’11 28
  29. 29. VDEVZFS Tutorial USENIX LISA’11 29
  30. 30. Dynamic Striping • RAID-0 ✦ SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern • Dynamic Stripe ✦ Data is dynamically mapped to member disks ✦ No fixed-length sequences ✦ Allocate up to ~1 MByte/vdev before changing vdev ✦ vdevs can be different size ✦ Good combination of the concatenation feature with RAID-0 performanceZFS Tutorial USENIX LISA’11 30
  31. 31. Dynamic Striping [diagram comparing RAID-0 (column size = 128 kBytes, stripe width = 384 kBytes) with a ZFS dynamic stripe (recordsize = 128 kBytes); total write size = 2816 kBytes]
  32. 32. Mirroring • Straightforward: put N copies of the data on N vdevs • Unlike RAID-1 ✦ No 1:1 mapping at the block level ✦ vdev labels are still at beginning and end ✦ vdevs can be of different size ✤ effective space is that of smallest vdev • Arbitration: ZFS does not blindly trust either side of mirror ✦ Most recent, correct view of data wins ✦ Checksums validate dataZFS Tutorial USENIX LISA’11 32
  33. 33. Dynamic vdev Replacement • zpool replace poolname vdev [vdev] • Today, the replacing vdev must be the same size or larger ✦ NexentaStor 2 ‒ as measured by blocks ✦ NexentaStor 3 ‒ as measured by metaslabs • Replacing all vdevs in a top-level vdev with larger vdevs results in top-level vdev resizing • Expansion policy controlled by: ✦ NexentaStor 2 ‒ resize on import ✦ NexentaStor 3 ‒ zpool autoexpand property [diagram: a 10G mirror whose sides are replaced by 15G and 20G vdevs grows from a 10G mirror to a 15G mirror to a 20G mirror]
  34. 34. RAIDZ • RAID-5 ✦ Parity check data is distributed across the RAID array's disks ✦ Must read/modify/write when data is smaller than stripe width • RAIDZ ✦ Dynamic data placement ✦ Parity added as needed ✦ Writes are full-stripe writes ✦ No read/modify/write (no write hole) • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If checksum fails, read parity • Space used is dependent on how it is used
  35. 35. RAID-5 vs RAIDZ
      RAID-5 (DiskA..DiskE):
        D0:0  D0:1  D0:2  D0:3  P0
        P1    D1:0  D1:1  D1:2  D1:3
        D2:3  P2    D2:0  D2:1  D2:2
        D3:2  D3:3  P3    D3:0  D3:1
      RAIDZ (DiskA..DiskE):
        P0    D0:0  D0:1  D0:2  D0:3
        P1    D1:0  D1:1  P2:0  D2:0
        D2:1  D2:2  D2:3  P2:1  D2:4
        D2:5  Gap   P3    D3:0
  36. 36. RAIDZ and Block Size • If block size >> N * sector size, space consumption is like RAID-5 • If block size = sector size, space consumption is like mirroring • [diagram, sector size = 512 bytes: PSIZE=2KB allocates ASIZE=2.5KB; PSIZE=1KB allocates ASIZE=1.5KB; PSIZE=512 bytes allocates ASIZE=1KB; PSIZE=3KB allocates ASIZE=4KB + gap] • Sector size can impact space savings
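The PSIZE-to-ASIZE examples on this slide can be checked with a rough model: one parity sector per row of data sectors, with rows spanning the non-parity columns. This is an approximation of the allocator's behavior (padding/gap handling is simplified), not the exact ZFS implementation:

```python
# Rough model of single/multi-parity RAIDZ space consumption.
import math

def raidz_asize(psize, ndisks, nparity=1, sector=512):
    data = math.ceil(psize / sector)              # data sectors needed
    rows = math.ceil(data / (ndisks - nparity))   # stripe rows needed
    return (data + rows * nparity) * sector       # data plus parity sectors

# The slide's 5-disk, single-parity, 512-byte-sector examples:
assert raidz_asize(2048, 5) == 2560    # PSIZE=2 KB  -> ASIZE=2.5 KB
assert raidz_asize(1024, 5) == 1536    # PSIZE=1 KB  -> ASIZE=1.5 KB
assert raidz_asize(512, 5) == 1024     # PSIZE=512 B -> ASIZE=1 KB (mirror-like)
```

The 512-byte case shows the slide's point: at block size = sector size, every data sector carries a full parity sector, so consumption matches mirroring rather than RAID-5.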
  37. 37. RAID-5 Write Hole • Occurs when data to be written is smaller than stripe size • Must read unallocated columns to recalculate the parity or the parity must be read/modify/write • Read/modify/write is risky for consistency ✦ Multiple disks ✦ Reading independently ✦ Writing independently ✦ System failure before all writes are complete to media could result in data loss • Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disksZFS Tutorial USENIX LISA’11 37
  38. 38. RAIDZ2 and RAIDZ3 • RAIDZ2 = double parity RAIDZ • RAIDZ3 = triple parity RAIDZ • Sorta like RAID-6 ✦ Parity 1: XOR ✦ Parity 2: another Reed-Solomon syndrome ✦ Parity 3: yet another Reed-Solomon syndrome • Arbitration: ZFS does not blindly trust any device ✦ Does not rely on disk reporting read error ✦ Checksums validate data ✦ If data not valid, read parity ✦ If data still not valid, read other parity • Space used is dependent on how it is used
  39. 39. Evaluating Data Retention • MTTDL = Mean Time To Data Loss • Note: MTBF is not constant in the real world, but keeps the math simple • MTTDL[1] is a simple MTTDL model • No parity (single vdev, striping, RAID-0) ✦ MTTDL[1] = MTBF / N • Single parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) • Double parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2) • Triple parity (4-way mirror, RAIDZ3) ✦ MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
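The MTTDL[1] formulas above transcribe directly into code. The example inputs (MTBF, MTTR, disk count) are hypothetical; units are hours in, hours out:

```python
# MTTDL[1] for 0..3 parity, per the formulas on the slide.

def mttdl1(mtbf, n, mttr=0.0, parity=0):
    if parity == 0:
        return mtbf / n
    denom = 1.0
    for k in range(parity + 1):      # N * (N-1) * ... * (N-parity)
        denom *= (n - k)
    return mtbf ** (parity + 1) / (denom * mttr ** parity)

# Hypothetical example: 8 disks, 1M-hour MTBF, 24-hour MTTR
assert mttdl1(1e6, 8) == 1e6 / 8
assert mttdl1(1e6, 8, 24, parity=1) == 1e6**2 / (8 * 7 * 24)
assert mttdl1(1e6, 8, 24, parity=2) == 1e6**3 / (8 * 7 * 6 * 24**2)
```

Note how each extra parity level multiplies the numerator by MTBF but the denominator only by (N-k) * MTTR, which is why added parity helps so dramatically when MTTR is small.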
  40. 40. Another MTTDL Model • The MTTDL[1] model doesn't take into account unrecoverable reads • But unrecoverable reads (UER) are becoming the dominant failure mode ✦ UER specified as errors per bits read ✦ More bits = higher probability of loss per vdev • The MTTDL[2] model considers UER
  41. 41. Why Worry about UER? • Richard's study ✦ 3,684 hosts with 12,204 LUNs ✦ 11.5% of all LUNs reported read errors • Bairavasundaram FAST08 ✦ 1.53M LUNs over 41 months ✦ RAID reconstruction discovers 8% of checksum mismatches ✦ “For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined”
  42. 42. Why Worry about UER? • RAID array studyZFS Tutorial USENIX LISA’11 42
  43. 43. Why Worry about UER? • RAID array study [chart: unrecoverable reads vs. disks that disappeared during “disk pull” tests] • “Disk pull” tests aren't very useful
  44. 44. MTTDL[2] Model • Probability that a reconstruction will fail ✦ Precon_fail = (N-1) * size / UER • Model doesn't work for non-parity schemes ✦ single vdev, striping, RAID-0 • Single parity (mirror, RAIDZ, RAID-1, RAID-5) ✦ MTTDL[2] = MTBF / (N * Precon_fail) • Double parity (3-way mirror, RAIDZ2, RAID-6) ✦ MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail) • Triple parity (4-way mirror, RAIDZ3) ✦ MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
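The MTTDL[2] formulas can likewise be coded up. The inputs below are hypothetical (8 x 2 TB disks, a 1-in-1e15-bits UER spec, 24-hour MTTR); `size_bits` is the data read per surviving disk during reconstruction, expressed in bits to match the UER units:

```python
# MTTDL[2] per the formulas on the slide: MTTDL[1] structure, with one MTTR
# factor replaced by the reconstruction-failure probability Precon_fail.

def mttdl2(mtbf, n, mttr, size_bits, uer, parity=1):
    precon_fail = (n - 1) * size_bits / uer
    denom = n * precon_fail
    for k in range(1, parity):           # extra (N-k) * MTTR factors
        denom *= (n - k) * mttr
    return mtbf ** parity / denom

size_bits = 2e12 * 8                     # 2 TB per disk, in bits
single = mttdl2(1e6, 8, 24, size_bits, 1e15, parity=1)
double = mttdl2(1e6, 8, 24, size_bits, 1e15, parity=2)
assert double > single                   # added parity still wins
```

Unlike MTTDL[1], bigger disks directly hurt here: `size_bits` grows, Precon_fail grows, and MTTDL[2] falls even if MTBF is unchanged.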
  45. 45. Practical View of MTTDL[1]ZFS Tutorial USENIX LISA’11 45
  46. 46. MTTDL[1] ComparisonZFS Tutorial USENIX LISA’11 46
  47. 47. MTTDL Models: Mirror Spares are not always better...ZFS Tutorial USENIX LISA’11 47
  48. 48. MTTDL Models: RAIDZ2ZFS Tutorial USENIX LISA’11 48
  49. 49. Space, Dependability, and PerformanceZFS Tutorial USENIX LISA’11 49
  50. 50. Dependability Use Case • Customer has 15+ TB of read-mostly data • 16-slot, 3.5” drive chassis • 2 TB HDDs • Option 1: one raidz2 set ✦ 24 TB available space ✤ 12 data ✤ 2 parity ✤ 2 hot spares, 48 hour disk replacement time ✦ MTTDL[1] = 1,790,000 years • Option 2: two raidz2 sets ✦ 24 TB available space total ✤ 6 data + 2 parity in each set ✤ no hot spares ✦ MTTDL[1] = 7,450,000 years
  51. 51. Ditto Blocks • Recall that each blkptr_t contains 3 DVAs • Dataset property used to indicate how many copies (aka ditto blocks) of data are desired ✦ Write all copies ✦ Read any copy ✦ Recover corrupted read from a copy • Not a replacement for mirroring ✦ For a single disk, can handle data loss in approximately 1/8 of the contiguous space • Easier to describe in pictures...
      copies parameter     Data copies   Metadata copies
      copies=1 (default)   1             2
      copies=2             2             3
      copies=3             3             3
  52. 52. Copies in PicturesNovember 8, 2010 USENIX LISA’10 52
  53. 53. Copies in PicturesZFS Tutorial USENIX LISA’11 53
  54. 54. When Good Data Goes Bad [diagram: a traditional file system does a bad read; if it is a metadata block the FS panics, or we get back bad data; the file system cannot tell, and a disk rebuild propagates the bad data]
  55. 55. Checksum Verification ZFS verifies checksums for every read Repairs data when possible (mirror, raidz, copies>1) Read bad data Read good data Repair bad dataZFS Tutorial USENIX LISA’11 55
  56. 56. ZIO - ZFS I/O Layer 56
  57. 57. ZIO Framework • All physical disk I/O goes through ZIO Framework • Translates DVAs into Logical Block Address (LBA) on leaf vdevs ✦ Keeps free space maps (spacemap) ✦ If contiguous space is not available: ✤ Allocate smaller blocks (the gang) ✤ Allocate gang block, pointing to the gang • Implemented as multi-stage pipeline ✦ Allows extensions to be added fairly easily • Handles I/O errorsZFS Tutorial USENIX LISA’11 57
  58. 58. ZIO Write Pipeline • Stages (ZIO state / compression / checksum / DVA / vdev I/O): open → compress if savings > 12.5% → generate checksum → allocate DVA → vdev I/O (start, done, assess) → done • Gang and deduplication activity elided, for clarity
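The "compress if savings > 12.5%" decision in the write pipeline is worth pinning down: the compressed copy is kept only if it beats the original by more than 1/8. A minimal sketch, using zlib as a stand-in for ZFS's lzjb/gzip and a hypothetical stage function:

```python
# Sketch of the write pipeline's compression decision (illustrative only).
import zlib

def compress_stage(block: bytes):
    """Return (data, compressed?) the way a pipeline stage might."""
    packed = zlib.compress(block)            # stand-in compressor
    if len(packed) <= len(block) * 7 // 8:   # keep only if savings > 12.5%
        return packed, True
    return block, False

_, did = compress_stage(b"A" * 4096)         # highly compressible
assert did
_, did = compress_stage(bytes(range(256)))   # incompressible: keep original
assert not did
```

The threshold avoids paying decompression cost on reads for blocks that barely shrank.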
  59. 59. ZIO Read Pipeline • Stages (ZIO state / compression / checksum / vdev I/O): open → vdev I/O (start, done, assess) → verify checksum → decompress → done • Gang and deduplication activity elided, for clarity
  60. 60. VDEV – Virtual Device Subsystem • Where mirrors, RAIDZ, and RAIDZ2 are implemented ✦ Surprisingly few lines of code needed to implement RAID • Leaf vdev (physical device) I/O management ✦ Number of outstanding iops ✦ Read-ahead cache • Priority scheduling:
      Name         Priority
      NOW          0
      SYNC_READ    0
      SYNC_WRITE   0
      FREE         0
      CACHE_FILL   0
      LOG_WRITE    0
      ASYNC_READ   4
      ASYNC_WRITE  4
      RESILVER     10
      SCRUB        20
  61. 61. ARC - AdaptiveReplacement Cache 61
  62. 62. Object Cache • UFS uses a page cache managed by the virtual memory system • ZFS does not use the page cache, except for mmap'ed files • ZFS uses an Adaptive Replacement Cache (ARC) • ARC used by DMU to cache DVA data objects • Only one ARC per system, but caching policy can be changed on a per-dataset basis • Seems to work much better than the page cache ever did for UFS
  63. 63. Traditional Cache • Works well when data being accessed was recently added • Doesn't work so well when frequently accessed data is evicted [diagram: misses cause inserts at the MRU end, the oldest entry is evicted at the LRU end; dynamic caches can change size by either not evicting or aggressively evicting]
  64. 64. ARC – Adaptive Replacement Cache Evict the oldest single-use entry LRU Recent Cache Miss MRU Evictions and dynamic MFU size resizing needs to choose best Hit cache to evict (shrink) Frequent Cache LFU Evict the oldest multiple accessed entryZFS Tutorial USENIX LISA’11 64
  65. 65. ARC with Locked Pages Evict the oldest single-use entry Cannot evict LRU locked pages! Recent Cache Miss MRU MFU size Hit Frequent If hit occurs Cache within 62 ms LFU Evict the oldest multiple accessed entry ZFS ARC handles mixed-size pagesZFS Tutorial USENIX LISA’11 65
  66. 66. L2ARC – Level 2 ARC • Data soon to be evicted from the ARC is added to a queue to be sent to cache vdev ✦ Another thread sends queue to cache vdev ARC ✦ Data is copied to the cache vdev with a throttle data soon to to limit bandwidth consumption be evicted ✦ Under heavy memory pressure, not all evictions will arrive in the cache vdev • ARC directory remains in memory • Good idea - optimize cache vdev for fast reads ✦ lower latency than pool disks ✦ inexpensive way to “increase memory” cache • Content considered volatile, no raid needed • Monitor usage with zpool iostat and ARC kstatsZFS Tutorial USENIX LISA’11 66
  67. 67. ARC Directory • Each ARC directory entry contains arc_buf_hdr structs ✦ Info about the entry ✦ Pointer to the entry • Directory entries have size, ~200 bytes • ZFS block size is dynamic, sector size to 128 kBytes • Disks are large • Suppose we use a Seagate LP 2 TByte disk for the L2ARC ✦ Disk has 3,907,029,168 512 byte sectors, guaranteed ✦ Workload uses 8 kByte fixed record size ✦ RAM needed for arc_buf_hdr entries ✤ Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes • Don't underestimate the RAM needed for large L2ARCs
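The slide's arithmetic generalizes to a small helper: divide device capacity by the record size to get a header count, then multiply by the ~200-byte header estimate. The function name and the label-overhead subtraction being ignored are simplifications of the slide's worked example:

```python
# RAM estimate for ARC directory entries covering an L2ARC device,
# assuming ~200 bytes per arc_buf_hdr (per the slide).

def l2arc_header_ram(sectors, sector=512, recordsize=8192, hdr=200):
    records = sectors * sector // recordsize   # headers needed, one per record
    return records * hdr                       # bytes of RAM

ram = l2arc_header_ram(3_907_029_168)          # 2 TB disk, 8 KB records
assert 48e9 < ram < 49e9                       # ~48 GBytes, matching the slide
```

Halving the record size doubles the header count, so small-record workloads make large L2ARC devices disproportionately RAM-hungry.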
  68. 68. ARC Tips • In general, it seems to work well for most workloads • ARC size will vary, based on usage ✦ Default target max is 7/8 of physical memory or (memory - 1 GByte) ✦ Target min is 64 MB ✦ Metadata capped at 1/4 of max ARC size • Dynamic size can be reduced when: ✦ page scanner is running ✤ freemem < lotsfree + needfree + desfree ✦ swapfs does not have enough space so that anonymous reservations can succeed ✤ availrmem < swapfs_minfree + swapfs_reserve + desfree ✦ [x86 only] kernel heap space more than 75% full • Can limit at boot timeZFS Tutorial USENIX LISA’11 68
  69. 69. Observing ARC • ARC statistics stored in kstats • kstat -n arcstats • Interesting statistics: ✦ size = current ARC size ✦ p = size of MFU cache ✦ c = target ARC size ✦ c_max = maximum target ARC size ✦ c_min = minimum target ARC size ✦ l2_hdr_size = space used in ARC by L2ARC ✦ l2_size = size of data in L2ARCZFS Tutorial USENIX LISA’11 69
  70. 70. General Status - ARCZFS Tutorial USENIX LISA’11 70
  71. 71. More ARC Tips • Performance ✦ Prior to b107, L2ARC fill rate was limited to 8 MB/sec ✦ After b107, cold L2ARC fill rate increases to 16 MB/sec • Internals tracked by kstats in Solaris ✦ Use memory_throttle_count to observe pressure to evict • Dedup Table (DDT) also uses ARC ✦ lots of dedup objects need lots of RAM ✦ field reports that L2ARC can help with dedup L2ARC keeps its directory in kernel memoryZFS Tutorial USENIX LISA’11 71
  72. 72. TransactionalObject Layer 72
  73. 73. Source Code Structure [diagram repeated from slide 12: Interface Layer (ZPL, ZVol, /dev/zfs, libzfs); Transactional Object Layer (ZIL, ZAP, Traversal, DMU, DSL); Pooled Storage Layer (ARC, ZIO, VDEV, Configuration)]
  74. 74. Transaction Engine • Manages physical I/O • Transactions grouped into transaction group (txg) ✦ txg updates ✦ All-or-nothing ✦ Commit interval ✤ Older versions: 5 seconds ✤ Less old versions: 30 seconds ✤ b143 and later: 5 seconds • Delay committing data to physical storage ✦ Improves performance ✦ A bad thing for sync workload performance – hence the ZFS Intent Log (ZIL) 30 second delay can impact failure detection timeZFS Tutorial USENIX LISA’11 74
  75. 75. ZIL – ZFS Intent Log • DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers ✦ NFS ✦ Databases • ZIL recordsize inflation can occur for some workloads ✦ May cause larger than expected actual I/O for sync workloads ✦ Oracle redo logs ✦ No slog: can tune zfs_immediate_write_sz, zvol_immediate_write_sz ✦ With slog: use logbias property instead • Never read, except at import (eg reboot), when transactions may need to be rolled forwardZFS Tutorial USENIX LISA’11 75
  76. 76. Separate Logs (slogs) • ZIL competes with pool for IOPS ✦ Applications wait for sync writes to be on nonvolatile media ✦ Very noticeable on HDD JBODs • Put ZIL on separate vdev, outside of pool ✦ ZIL writes tend to be sequential ✦ No competition with pool for IOPS ✦ Downside: slog device required to be operational at import ✦ NexentaStor 3 allows slog device removal ✦ Size of separate log is less than the size of RAM (duh) • 10x or more performance improvements possible ✦ Nonvolatile RAM card ✦ Write-optimized SSD ✦ Nonvolatile write cache on RAID array
  77. 77. zilstat • zilstat • Integrated into NexentaStor 3.0.3 ✦ nmc: show performance zilZFS Tutorial USENIX LISA’11 77
  78. 78. Synchronous Write Destination
      Without separate log: is sync I/O size > zfs_immediate_write_sz? no: write to ZIL log; yes: bypass to pool
      With separate log: logbias=latency (default): write to log device; logbias=throughput: bypass to pool
      Default zfs_immediate_write_sz = 32 kBytes
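The routing flowchart above reduces to one function. Names here are illustrative (the return strings are not ZFS terminology), but the decision structure and the 32 KB default follow the slide:

```python
# Sketch of synchronous-write routing per the flowchart above.
ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024     # default threshold, 32 kBytes

def zil_destination(size, has_slog, logbias="latency"):
    if not has_slog:
        # large sync writes bypass the in-pool ZIL and go straight to the pool
        return "pool" if size > ZFS_IMMEDIATE_WRITE_SZ else "zil-log"
    # with a separate log device, the logbias property decides
    return "slog" if logbias == "latency" else "pool"

assert zil_destination(8 * 1024, has_slog=False) == "zil-log"
assert zil_destination(64 * 1024, has_slog=False) == "pool"
assert zil_destination(64 * 1024, has_slog=True) == "slog"
assert zil_destination(64 * 1024, has_slog=True, logbias="throughput") == "pool"
```

This also shows why logbias=throughput exists: it keeps large streaming sync writes (e.g. Oracle redo logs, as noted earlier) from flooding a small slog device.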
  79. 79. ZIL Synchronicity Project • All-or-nothing policies don't work well, in general • ZIL Synchronicity project proposed by Robert Milkowski • Adds new sync property to datasets • Arrived in b140
      sync parameter       Behaviour
      standard (default)   Policy follows previous design: write size and separate logs
      always               All writes become synchronous (slow)
      disabled             Synchronous write requests are ignored
  80. 80. Disabling the ZIL • Preferred method: change dataset sync property • Rule 0: Don’t disable the ZIL • If you love your data, do not disable the ZIL • You can find references to this as a way to speed up ZFS ✦ NFS workloads ✦ “tar -x” benchmarks • Golden Rule: Don’t disable the ZIL • Can set via mdb, but need to remount the file system • Friends don’t let friends disable the ZIL • Older Solaris - can set in /etc/system • NexentaStor has checkbox for disabling ZIL • Nostradamus wrote, “disabling the ZIL will lead to the apocalypse”ZFS Tutorial USENIX LISA’11 80
  81. 81. DSL - Dataset and Snapshot Layer 81
  82. 82. Dataset & Snapshot Layer • Object ✦ Allocated storage ✦ dnode describes collection of blocks • Object Set Dataset Directory ✦ Group of related objects Dataset • Dataset Object Set Childmap ✦ Snapmap: snapshot relationships Object Object ✦ Space usage Object Properties • Dataset directory Snapmap ✦ Childmap: dataset relationships ✦ PropertiesZFS Tutorial USENIX LISA’11 82
  83. 83. Copy on Write [repeated from slide 20] 1. Initial block tree 2. COW some data 3. COW metadata 4. Update uberblocks & free
  84. 84. zfs snapshot • Create a read-only, point-in-time window into the dataset (file system or Zvol) • Computationally free, because of COW architecture • Very handy feature ✦ Patching/upgrades • Basis for time-related snapshot interfaces ✦ Solaris Time Slider ✦ NexentaStor Delorean Plugin ✦ NexentaStor Virtual Machine Data CenterZFS Tutorial USENIX LISA’11 84
  85. 85. Snapshot • Create a snapshot by not freeing COWed blocks • Snapshot creation is fast and easy • Number of snapshots determined by use – no hardwired limit • Recursive snapshots also possible Snapshot tree Current tree root rootZFS Tutorial USENIX LISA’11 85
  86. 86. auto-snap serviceZFS Tutorial USENIX LISA’11 86
  87. 87. Clones • Snapshots are read-only • Clones are read-write based upon a snapshot • Child depends on parent ✦ Cannot destroy parent without destroying all children ✦ Can promote children to be parents • Good ideas ✦ OS upgrades ✦ Change control ✦ Replication ✤ zones ✤ virtual disksZFS Tutorial USENIX LISA’11 87
  88. 88. zfs clone • Create a read-write file system from a read-only snapshot • Solaris boot environment administration [diagram: install → checkpoint → clone → checkpoint; OS rev1 boot environments rootfs-nmu-001 and rootfs-nmu-002, with patch/upgrade applied to a clone and selected via the GRUB boot manager] • Origin snapshot cannot be destroyed if a clone exists
  89. 89. Deduplication 89
  90. 90. What is Deduplication? • A $2.1 Billion feature • 2009 buzzword of the year • Technique for improving storage space efficiency ✦ Trades big I/Os for small I/Os ✦ Does not eliminate I/O • Implementation styles ✦ offline or post processing ✤ data written to nonvolatile storage ✤ process comes along later and dedupes data ✤ example: tape archive dedup ✦ inline ✤ data is deduped as it is being allocated to nonvolatile storage ✤ example: ZFSZFS Tutorial USENIX LISA’11 90
  91. 91. Dedup how-to • Given a bunch of data • Find data that is duplicated • Build a lookup table of references to data • Replace duplicate data with a pointer to the entry in the lookup table • Granularity ✦ file ✦ block ✦ byte
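The steps above can be sketched as a minimal block-level dedup table: key blocks by checksum, store each unique block once, and let duplicate writes just bump a reference count. A toy model, not the ZFS DDT implementation, though ZFS likewise keys its table on the block checksum:

```python
# Toy block-level deduplication table: checksum -> (block, refcount).
import hashlib

class DedupTable:
    def __init__(self):
        self.table = {}                  # checksum -> [block, refcount]

    def write(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        entry = self.table.setdefault(key, [block, 0])
        entry[1] += 1                    # a duplicate just bumps the refcount
        return key                       # caller stores the reference

ddt = DedupTable()
a = ddt.write(b"x" * 4096)
b = ddt.write(b"x" * 4096)               # duplicate block
c = ddt.write(b"y" * 4096)
assert a == b and a != c
assert ddt.table[a][1] == 2              # two references, one stored copy
```

The refcount matters on free as well as write: the stored block can only be released when its last referrer goes away, which is exactly the risk the "Reference Counts" slides below address.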
  92. 92. Dedup in ZFS • Leverage block-level checksums ✦ Identify blocks which might be duplicates ✦ Variable block size is ok • Synchronous implementation ✦ Data is deduped as it is being written • Scalable design ✦ No reference count limits • Works with existing features ✦ compression ✦ copies ✦ scrub ✦ resilver • Implemented in ZIO pipelineZFS Tutorial USENIX LISA’11 92
  93. 93. Deduplication Table (DDT) • Internal implementation ✦ Adelson-Velskii, Landis (AVL) tree ✦ Typical table entry ~270 bytes ✤ checksum ✤ logical size ✤ physical size ✤ references ✦ Table entry size increases as the number of references increasesZFS Tutorial USENIX LISA’11 93
  94. 94. Reference Counts Eggs courtesy of Richard’s chickensZFS Tutorial USENIX LISA’11 94
  95. 95. Reference Counts • Problem: loss of the referenced data affects all referrers • Solution: make additional copies of referred data based upon a threshold count of referrers ✦ leverage copies (ditto blocks) ✦ pool-level threshold for automatically adding ditto copies ✤ set via dedupditto pool property # zpool set dedupditto=50 zwimming ✤ add 2nd copy when dedupditto references (50) reached ✤ add 3rd copy when dedupditto^2 references (2500) reached
  96. 96. Verification write() compress checksum DDT entry lookup yes no DDT verify? match? no yes read data data yes match? add reference no new entryZFS Tutorial USENIX LISA’11 96
  97. 97. Enabling Dedup • Set dedup property for each dataset to be deduped • Remember: properties are inherited • Remember: only applies to newly written data
      dedup           checksum   verify?
      on              SHA256     no
      sha256          SHA256     no
      on,verify       SHA256     yes
      sha256,verify   SHA256     yes
      Fletcher is considered too weak, without verify
  98. 98. Dedup Accounting • ...and you thought compression accounting was hard... • Remember: dedup works at the pool level ✦ dataset-level accounting doesn't see other datasets ✦ pool-level accounting is always correct
      zfs list
      NAME       USED  AVAIL  REFER  MOUNTPOINT
      bar       7.56G   449G    22K  /bar
      bar/ws    7.56G   449G  7.56G  /bar/ws
      dozer     7.60G   455G    22K  /dozer
      dozer/ws  7.56G   455G  7.56G  /dozer/ws
      tank      4.31G   456G    22K  /tank
      tank/ws   4.27G   456G  4.27G  /tank/ws
      zpool list
      NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
      bar    464G  7.56G   456G   1%  1.00x  ONLINE  -
      dozer  464G  1.43G   463G   0%  5.92x  ONLINE  -
      tank   464G   957M   463G   0%  5.39x  ONLINE  -
      Data courtesy of the ZFS team
  99. 99. DDT Histogram
      # zdb -DD tank
      DDT-sha256-zap-duplicate: 110173 entries, size 295 on disk, 153 in core
      DDT-sha256-zap-unique: 302 entries, size 42194 on disk, 52827 in core
      DDT histogram (aggregated over all DDTs):
      bucket          allocated                      referenced
      refcnt  blocks  LSIZE  PSIZE  DSIZE   blocks  LSIZE  PSIZE  DSIZE
      ------  ------  -----  -----  -----   ------  -----  -----  -----
           1     302  7.26M  4.24M  4.24M      302  7.26M  4.24M  4.24M
           2    103K  1.12G   712M   712M     216K  2.64G  1.62G  1.62G
           4   3.11K  30.0M  17.1M  17.1M    14.5K   168M  95.2M  95.2M
           8     503  11.6M  6.16M  6.16M    4.83K   129M  68.9M  68.9M
          16     100  4.22M  1.92M  1.92M    2.14K   101M  45.8M  45.8M
      Data courtesy of the ZFS team
  100. 100. DDT Histogram$ zdb -DD zwimmingDDT-sha256-zap-duplicate: 19725 entries, size 270 on disk, 153 in coreDDT-sha256-zap-unique: 52369639 entries, size 284 on disk, 159 in coreDDT histogram (aggregated over all DDTs):bucket allocated referenced______ ______________________________ ______________________________refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE------ ------ ----- ----- ----- ------ ----- ----- ----- 1 49.9M 25.0G 25.0G 25.0G 49.9M 25.0G 25.0G 25.0G 2 16.7K 8.33M 8.33M 8.33M 33.5K 16.7M 16.7M 16.7M 4 610 305K 305K 305K 3.33K 1.66M 1.66M 1.66M 8 661 330K 330K 330K 6.67K 3.34M 3.34M 3.34M 16 242 121K 121K 121K 5.34K 2.67M 2.67M 2.67M 32 131 65.5K 65.5K 65.5K 5.54K 2.77M 2.77M 2.77M 64 897 448K 448K 448K 84K 42M 42M 42M 128 125 62.5K 62.5K 62.5K 18.0K 8.99M 8.99M 8.99M 8K 1 512 512 512 12.5K 6.27M 6.27M 6.27M Total 50.0M 25.0G 25.0G 25.0G 50.1M 25.1G 25.1G 25.1Gdedup = 1.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.00 ZFS Tutorial USENIX LISA’11 100
101. 101. Over-the-wire Dedup • Dedup is also possible over the send/receive pipe ✦ Blocks with the same checksum are considered duplicates (no verify option) ✦ First copy sent as usual ✦ Subsequent copies sent by reference • Independent of dedup status of originating pool ✦ Receiving pool knows about blocks which have already arrived • Can be a win for dedupable data, especially over slow wires • Remember: send/receive version rules still apply # zfs send -DR zwimming/stuff@snapZFS Tutorial USENIX LISA’11 101
102. 102. Dedup Performance • Dedup can save space and bandwidth • Dedup increases latency ✦ Caching data improves latency ✦ More memory → more data cached ✦ Cache performance hierarchy ✤ RAM: fastest ✤ L2ARC on SSD: slower ✤ Pool HDD: dreadfully slow • ARC is currently not deduped • Difficult to predict ✦ Dependent variable: number of blocks ✦ Estimate 270 bytes per unique block ✦ Example: ✤ 50M blocks * 270 bytes/block = 13.5 GBytesZFS Tutorial USENIX LISA’11 102
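The sizing example above is simple arithmetic; a sketch of the estimate (the 270 bytes/entry figure is the rule of thumb from the slide, not an exact number):

```shell
#!/bin/sh
# Estimate DDT memory footprint: unique blocks x ~270 bytes per entry.
# 50M unique blocks at 128K recordsize is roughly 6 TB of unique data.
awk 'BEGIN {
    blocks = 50e6
    bytes_per_entry = 270
    printf "%.1f GB\n", blocks * bytes_per_entry / 1e9
}'
```

If a DDT of that size does not fit in RAM (or at least L2ARC), every dedup write can incur pool I/O for the table lookup.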
  103. 103. Deduplication Use Cases Data type Dedupe Compression Home directories ✔✔ ✔✔ Internet content ✔ ✔ Media and video ✔✔ ✔ Life sciences ✘ ✔✔ Oil and Gas (seismic) ✘ ✔✔ Virtual machines ✔✔ ✘ Archive ✔✔✔✔ ✔ZFS Tutorial USENIX LISA’11 103
  104. 104. zpool Command 104
105. 105. Pooled Storage Layer • (diagram) ZFS stack: consumers — raw/swap/dump/iSCSI via the ZFS Volume Emulator (Zvol); NFS/CIFS via the ZFS POSIX Layer (ZPL); pNFS/Lustre/others — sit atop the Transactional Object Layer, which sits atop the Pooled Storage Layer and the block device drivers (HDD, SSD, iSCSI, ...)November 8, 2010 USENIX LISA’10 105
106. 106. zpool create • zpool create poolname vdev-configuration • nmc: setup volume create ✦ vdev-configuration examples ✤ mirror c0t0d0 c3t6d0 ✤ mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 ✤ mirror disk1s0 disk2s0 cache disk4s0 log disk5 ✤ raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 • Solaris ✦ Additional checks for disk/slice overlaps or devices in use ✦ Whole disks are given EFI labels • Can set initial pool or dataset properties • By default, creates a file system with the same name ✦ poolname pool → /poolname file system People get confused by a file system with the same name as the poolZFS Tutorial USENIX LISA’11 106
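A hedged sketch of the create workflow; the pool name and device names below are hypothetical:

```shell
# Preview the layout first with -n (nothing is written), then create.
zpool create -n tank mirror c0t0d0 c1t0d0 log c2t0d0 cache c3t0d0

# -o sets initial pool properties, -O sets properties on the
# root file system created with the pool
zpool create -o autoreplace=on -O compression=on \
    tank mirror c0t0d0 c1t0d0 log c2t0d0 cache c3t0d0
```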
107. 107. zpool destroy • Destroy the pool and all datasets therein • zpool destroy poolname ✦ Can (try to) force with “-f” ✦ There is no “are you sure?” prompt – if you weren't sure, you would not have typed “destroy” • nmc: destroy volume volumename ✦ nmc prompts for confirmation, by default zpool destroy is destructive... really! Use with caution!ZFS Tutorial USENIX LISA’11 107
108. 108. zpool add • Adds a device to the pool as a top-level vdev • Does NOT add columns to a raidz set • Does NOT attach a mirror – use zpool attach instead • zpool add poolname vdev-configuration ✦ vdev-configuration can be any combination also used for zpool create ✦ Complains if the added vdev-configuration would cause a different data protection scheme than is already in use ✤ use “-f” to override ✦ Good idea: try with “-n” flag first ✤ will show final configuration without actually performing the add • nmc: setup volume volumename grow Do not add a device which is in use as a cluster quorum deviceZFS Tutorial USENIX LISA’11 108
  109. 109. zpool remove • Remove a top-level vdev from the pool • zpool remove poolname vdev • nmc: setup volume volumename remove-lun • Today, you can only remove the following vdevs: ✦ cache ✦ hot spare ✦ separate log (b124, NexentaStor 3.0) Dont confuse “remove” with “detach”ZFS Tutorial USENIX LISA’11 109
  110. 110. zpool attach • Attach a vdev as a mirror to an existing vdev • zpool attach poolname existing-vdev vdev • nmc: setup volume volumename attach-lun • Attaching vdev must be the same size or larger than the existing vdev vdev Configurations ok simple vdev → mirror ok mirror ok log → mirrored log no RAIDZ no RAIDZ2 no RAIDZ3ZFS Tutorial USENIX LISA’11 110
  111. 111. zpool detach • Detach a vdev from a mirror • zpool detach poolname vdev • nmc: setup volume volumename detach-lun • A resilvering vdev will wait until resilvering is completeZFS Tutorial USENIX LISA’11 111
  112. 112. zpool replace • Replaces an existing vdev with a new vdev • zpool replace poolname existing-vdev vdev • nmc: setup volume volumename replace-lun • Effectively, a shorthand for “zpool attach” followed by “zpool detach” • Attaching vdev must be the same size or larger than the existing vdev • Works for any top-level vdev-configuration, including RAIDZ “Same size” literally means the same number of blocks until b117. Many “same size” disks have different number of available blocks.ZFS Tutorial USENIX LISA’11 112
  113. 113. zpool import • Import a pool and mount all mountable datasets • Import a specific pool ✦ zpool import poolname ✦ zpool import GUID ✦ nmc: setup volume import • Scan LUNs for pools which may be imported ✦ zpool import • Can set options, such as alternate root directory or other properties ✦ alternate root directory important for rpool or syspool Beware of zpool.cache interactions Beware of artifacts, especially partial artifactsZFS Tutorial USENIX LISA’11 113
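Typical import variants can be sketched as below; the pool names are hypothetical, and the GUID is the one shown on the zpool history slide:

```shell
# Scan attached LUNs for pools which may be imported
zpool import

# Import under an alternate root so mountpoints land under /a --
# important when importing a root pool (rpool/syspool) for repair
zpool import -R /a rpool

# Import by GUID, renaming the pool to avoid a name collision
zpool import 17111649328928073943 rpool2
```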
  114. 114. zpool export • Unmount datasets and export the pool • zpool export poolname • nmc: setup volume volumename export • Removes pool entry from zpool.cache ✦ useful when unimported pools remain in zpool.cacheZFS Tutorial USENIX LISA’11 114
115. 115. zpool upgrade• Display current versions ✦ zpool upgrade• View available upgrade versions, with features, but don't actually upgrade ✦ zpool upgrade -v• Upgrade pool to latest version ✦ zpool upgrade poolname ✦ nmc: setup volume volumename version-upgrade• Upgrade pool to specific version ✦ zpool upgrade -V version poolname Once you upgrade, there is no downgrade Beware of grub and rollback issuesZFS Tutorial USENIX LISA’11 115
  116. 116. zpool history • Show history of changes made to the pool • nmc and Solaris use same command# zpool history rpoolHistory for rpool:2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -ocachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s02009-03-04.07:29:47 zfs set canmount=noauto rpool2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_1062009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_1062009-03-04.07:29:51 zfs set canmount=on rpool2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export2009-03-04.07:29:51 zfs create rpool/export/home2009-03-04.00:21:42 zpool import -f -R /a 171116493289280739432009-03-04.00:21:42 zpool export rpool2009-03-04.08:47:08 zpool set bootfs=rpool rpool2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b1082009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108...ZFS Tutorial USENIX LISA’11 116
  117. 117. zpool status • Shows the status of the current pools, including their configuration • Important troubleshooting step • nmc and Solaris use same command # zpool status … pool: zwimming state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using zpool upgrade. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM zwimming ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be trickyZFS Tutorial USENIX LISA’11 117
  118. 118. zpool clear • Clears device errors • Clears device error counters • Starts any resilvering, as needed • Improves sysadmin sanity and reduces sweating • zpool clear poolname • nmc: setup volume volumename clear-errorsZFS Tutorial USENIX LISA’11 118
  119. 119. zpool iostat • Show pool physical I/O activity, in an iostat-like manner • Solaris: fsstat will show I/O activity looking into a ZFS file system • Especially useful for showing slog activity • nmc and Solaris use same command # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- zwimming 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latencyZFS Tutorial USENIX LISA’11 119
  120. 120. zpool scrub • Manually starts scrub ✦ zpool scrub poolname • Scrubbing performed in background • Use zpool status to track scrub progress • Stop scrub ✦ zpool scrub -s poolname • How often to scrub? ✦ Depends on level of paranoia ✦ Once per month seems reasonable ✦ After a repair or recovery procedure • NexentaStor auto-scrub features easily manages scrubs and schedules Estimated scrub completion time improves over timeZFS Tutorial USENIX LISA’11 120
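On systems without NexentaStor's auto-scrub service, a monthly scrub can be driven by a small script like this sketch (pool name hypothetical):

```shell
#!/bin/sh
# Start a scrub and poll until it finishes; suitable for a monthly
# cron job. The "scrub in progress" text matches older zpool status
# output and may differ by release.
pool=tank
zpool scrub "$pool"
while zpool status "$pool" | grep -q 'scrub in progress'; do
    sleep 300
done
zpool status -v "$pool"
```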
  121. 121. auto-scrub serviceZFS Tutorial USENIX LISA’11 121
  122. 122. zfs Command 122
123. 123. Dataset Management • (diagram) ZFS stack: raw/swap/dump/iSCSI consumers via the ZFS Volume Emulator (Zvol); NFS/CIFS via the ZFS POSIX Layer (ZPL); pNFS/Lustre/others — atop the Transactional Object Layer, the Pooled Storage Layer, and the block device drivers (HDD, SSD, iSCSI, ...)November 8, 2010 USENIX LISA’10 123
  124. 124. zfs create, destroy • By default, a file system with the same name as the pool is created by zpool create • Dataset name format is: pool/name[/name ...] • File system / folder ✦ zfs create dataset-name ✦ nmc: create folder ✦ zfs destroy dataset-name ✦ nmc: destroy folder • Zvol ✦ zfs create -V size dataset-name ✦ nmc: create zvol ✦ zfs destroy dataset-name ✦ nmc: destroy zvolZFS Tutorial USENIX LISA’11 124
  125. 125. zfs mount, unmount • Note: mount point is a file system parameter ✦ zfs get mountpoint fs-name • Rarely used subcommand (!) • Display mounted file systems ✦ zfs mount • Mount a file system ✦ zfs mount fs-name ✦ zfs mount -a • Unmount (not umount) ✦ zfs unmount fs-name ✦ zfs unmount -aZFS Tutorial USENIX LISA’11 125
126. 126. zfs list • List mounted datasets • NexentaStor 2: listed everything • NexentaStor 3: does not list snapshots by default ✦ See zpool listsnapshots property • Examples ✦ zfs list ✦ zfs list -t snapshot ✦ zfs list -H -o nameZFS Tutorial USENIX LISA’11 126
127. 127. Replication Services • (chart) Recovery Point Objective (days → seconds) vs. system I/O performance (slower → faster) ✦ Days: traditional backup, NDMP ✦ Hours: auto-tier (rsync), auto-sync (ZFS send/receive) ✦ Seconds: auto-CDP (AVS/SNDR), application-level mirror replicationZFS Tutorial USENIX LISA’11 127
128. 128. zfs send, receive • Send ✦ send a snapshot to stdout ✦ data is decompressed • Receive ✦ receive a snapshot from stdin ✦ receiving file system parameters apply (compression, etc.) • Can incrementally send snapshots in time order • Handy way to replicate dataset snapshots • NexentaStor ✦ simplifies management ✦ manages snapshots and send/receive to remote systems • Only method for replicating dataset properties, except quotas • NOT a replacement for traditional backup solutionsZFS Tutorial USENIX LISA’11 128
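The incremental replication cycle above can be sketched as follows; hostnames, dataset names, and snapshot names are hypothetical:

```shell
# Day 1: full send of the first snapshot
zfs snapshot zwimming/stuff@mon
zfs send zwimming/stuff@mon | ssh backuphost zfs receive -d backup

# Day 2: send only the blocks changed between @mon and @tue
zfs snapshot zwimming/stuff@tue
zfs send -i @mon zwimming/stuff@tue | ssh backuphost zfs receive -d backup
```

The receive -d flag preserves the sent dataset's path under the target pool, so repeated incrementals land in the same place.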
  129. 129. auto-sync ServiceZFS Tutorial USENIX LISA’11 129
130. 130. zfs upgrade• Display current versions ✦ zfs upgrade• View available upgrade versions, with features, but don't actually upgrade ✦ zfs upgrade -v• Upgrade dataset to latest version ✦ zfs upgrade dataset• Upgrade dataset to specific version ✦ zfs upgrade -V version dataset• NexentaStor: not needed until 3.0 Once you upgrade, there is no downgrade Beware of grub and rollback issuesZFS Tutorial USENIX LISA’11 130
  131. 131. Sharing 131
  132. 132. Sharing • zfs share dataset • Type of sharing set by parameters ✦ shareiscsi = [on | off] ✦ sharenfs = [on | off | options] ✦ sharesmb = [on | off | options] • Shortcut to manage sharing ✦ Uses external services (nfsd, iscsi target, smbshare, etc) ✦ Importing pool will also share ✦ Implementation is OS-specific ✤ sharesmb uses in-kernel SMB server for Solaris-derived OSes ✤ sharesmb uses Samba for FreeBSDZFS Tutorial USENIX LISA’11 132
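Since sharing is property-driven, enabling it is a zfs set away. A sketch; the dataset name, subnet, admin host, and share name are hypothetical:

```shell
# NFS: read/write for one subnet, root access for one admin host
zfs set sharenfs='rw=@192.168.1.0/24,root=adminhost' tank/export

# SMB/CIFS: share under the name "export" (served by the in-kernel
# SMB server on Solaris-derived OSes)
zfs set sharesmb=name=export tank/export
```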
  133. 133. Properties 133
134. 134. Properties • Properties are stored in an nvlist • By default, properties are inherited • Some properties are common to all datasets, but a specific dataset type may have additional properties • Easily set or retrieved via scripts • In general, properties affect future file system activity zpool get doesn't script as nicely as zfs getZFS Tutorial USENIX LISA’11 134
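The scripting note above is mostly about output format: zfs get -H drops the header and emits one dataset per line (tab-separated name, property, value, source), so it pipes cleanly into awk. A sketch using canned output in place of a live pool:

```shell
#!/bin/sh
# List datasets where compression is not "off". The printf stands in
# for `zfs get -H compression -r tank`; real output is tab-separated,
# but awk's default field splitting handles both.
printf '%s\n' \
    'tank compression off default' \
    'tank/ws compression on local' \
    'tank/logs compression gzip local' |
awk '$3 != "off" { print $1 }'
```

For the canned input this prints tank/ws and tank/logs only.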
  135. 135. Getting Properties• zpool get all poolname• nmc: show volume volumename property propertyname• zpool get propertyname poolname• zfs get all dataset-name• nmc: show folder foldername property• nmc: show zvol zvolname propertyZFS Tutorial USENIX LISA’11 135
  136. 136. Setting Properties• zpool set propertyname=value poolname• nmc: setup volume volumename property propertyname• zfs set propertyname=value dataset-name• nmc: setup folder foldername property propertynameZFS Tutorial USENIX LISA’11 136
  137. 137. User-defined Properties • Names ✦ Must include colon : ✦ Can contain lower case alphanumerics or “+” “.” “_” ✦ Max length = 256 characters ✦ By convention, module:property ✤ com.sun:auto-snapshot • Values ✦ Max length = 1024 characters • Examples ✦ com.sun:auto-snapshot=true ✦ com.richardelling:important_files=trueZFS Tutorial USENIX LISA’11 137
138. 138. Clearing Properties • Reset to inherited value ✦ zfs inherit compression export/home/relling • Clear user-defined parameter ✦ zfs inherit com.sun:auto-snapshot export/home/relling • NexentaStor doesn’t offer a method in nmcZFS Tutorial USENIX LISA’11 138
139. 139. Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoexpand Policy for expanding when vdev size changes autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policyZFS Tutorial USENIX LISA’11 139
  140. 140. More Pool Properties Property Change? Brief Description guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk versionZFS Tutorial USENIX LISA’11 140
141. 141. Common Dataset Properties Property Change? Brief Description available readonly Space available to dataset & children checksum Checksum algorithm compression Compression algorithm compressratio readonly Compression ratio – logical size:referenced physical size copies Number of copies of user data creation readonly Dataset creation time dedup Deduplication policy logbias Separate log write policy mlslabel Multilayer security label origin readonly For clones, origin snapshotZFS Tutorial USENIX LISA’11 141
142. 142. More Common Dataset Properties Property Change? Brief Description primarycache ARC caching policy readonly Is dataset in readonly mode? referenced readonly Size of data accessible by this dataset refreservation Minimum space guaranteed to a dataset, excluding descendants (snapshots & clones) reservation Minimum space guaranteed to dataset, including descendants secondarycache L2ARC caching policy sync Synchronous write policy type readonly Type of dataset (filesystem, snapshot, volume)ZFS Tutorial USENIX LISA’11 142
143. 143. More Common Dataset Properties Property Change? Brief Description used readonly Sum of usedby* (see below) usedbychildren readonly Space used by descendants usedbydataset readonly Space used by dataset usedbyrefreservation readonly Space used by a refreservation for this dataset usedbysnapshots readonly Space used by all snapshots of this dataset zoned readonly Is dataset added to non-global zone (Solaris)ZFS Tutorial USENIX LISA’11 143
144. 144. Volume Dataset Properties Property Change? Brief Description shareiscsi iSCSI service (not COMSTAR) volblocksize creation fixed block size volsize Implicit quota zoned readonly Set if dataset delegated to non-global zone (Solaris)ZFS Tutorial USENIX LISA’11 144
145. 145. File System Properties Property Change? Brief Description aclinherit ACL inheritance policy, when files or directories are created aclmode ACL modification policy, when chmod is used atime Disable access time metadata updates canmount Mount policy casesensitivity creation Filename matching algorithm (CIFS client feature) devices Device opening policy for dataset exec File execution policy for dataset mounted readonly Is file system currently mounted?ZFS Tutorial USENIX LISA’11 145
146. 146. More File System Properties Property Change? Brief Description nbmand export/import File system should be mounted with non-blocking mandatory locks (CIFS client feature) normalization creation Unicode normalization of file names for matching quota Max space dataset and descendants can consume recordsize Suggested maximum block size for files refquota Max space dataset can consume, not including descendants setuid setuid mode policy sharenfs NFS sharing options sharesmb File system shared with CIFSZFS Tutorial USENIX LISA’11 146
  147. 147. File System Properties Property Change? Brief Description snapdir Controls whether .zfs directory is hidden utf8only creation UTF-8 character file name policy vscan Virus scan enabled xattr Extended attributes policyZFS Tutorial USENIX LISA’11 147
148. 148. Forking Properties Pool Properties Release Property Brief Description illumos comment Human-readable comment field Dataset Properties Release Property Brief Description Solaris 11 encryption Dataset encryption Delphix/illumos clones Clone descendants Delphix/illumos refratio Compression ratio for references Solaris 11 share Combines sharenfs & sharesmb Solaris 11 shadow Shadow copy NexentaOS/illumos worm WORM feature Delphix/illumos written Amount of data written since last snapshotZFS Tutorial USENIX LISA’11 148
  149. 149. More Goodies 149
  150. 150. Dataset Space Accounting • used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation • Lazy updates, may not be correct until txg commits • ls and du will show size of allocated files which includes all copies of a file • Shorthand report available$ zfs list -o spaceNAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILDrpool 126G 18.3G 0 35.5K 0 18.3Grpool/ROOT 126G 15.3G 0 18K 0 15.3Grpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0rpool/dump 126G 1.00G 0 1.00G 0 0rpool/export 126G 37K 0 19K 0 18Krpool/export/home 126G 18K 0 18K 0 0rpool/swap 128G 2G 0 193M 1.81G 0 ZFS Tutorial USENIX LISA’11 150
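The identity above can be sanity-checked against the rpool/ROOT/snv_b108 row of the listing; the report rounds to GiB, so the sum is slightly off from the reported 15.2G:

```shell
#!/bin/sh
# used ?= usedbysnapshots + usedbydataset + usedbyrefreservation
#         + usedbychildren
# Row rpool/ROOT/snv_b108: USEDSNAP=5.89G, USEDDS=9.28G, others 0
awk 'BEGIN {
    sum = 5.89 + 9.28 + 0 + 0
    printf "%.2fG computed vs 15.2G reported\n", sum
}'
```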
  151. 151. Pool Space Accounting • Pool space accounting changed in b128, along with deduplication • Compression, deduplication, and raidz complicate pool accounting (the numbers are correct, the interpretation is suspect) • Capacity planning for remaining free space can be challenging $ zpool list zwimming NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT zwimming 100G 43.9G 56.1G 43% 1.00x ONLINE -ZFS Tutorial USENIX LISA’11 151
  152. 152. zfs vs zpool Space Accounting • zfs list != zpool list • zfs list shows space used by the dataset plus space for internal accounting • zpool list shows physical space available to the pool • For simple pools and mirrors, they are nearly the same • For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space available for parity Users will be confused about reported space availableZFS Tutorial USENIX LISA’11 152
  153. 153. NexentaStor Snapshot ServicesZFS Tutorial USENIX LISA’11 153
154. 154. Accessing Snapshots • By default, snapshots are accessible in the .zfs directory • Visibility of .zfs directory is tunable via snapdir property ✦ Don't really want find to find the .zfs directory • Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS) # zfs snapshot rpool/export/home/relling@20090415 # ls -a /export/home/relling … .Xsession .xsession-errors # ls /export/home/relling/.zfs shares snapshot # ls /export/home/relling/.zfs/snapshot 20090415 # ls /export/home/relling/.zfs/snapshot/20090415 Desktop Documents Downloads PublicZFS Tutorial USENIX LISA’11 154
155. 155. Time Slider - Automatic Snapshots • Solaris feature similar to OS X's Time Machine • SMF service for managing snapshots • SMF properties used to specify policies: frequency (interval) and number to keep • Creates cron jobs • GUI tool makes it easy to select individual file systems • Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion Service Name Interval (default) Keep (default) auto-snapshot:frequent 15 minutes 4 auto-snapshot:hourly 1 hour 24 auto-snapshot:daily 1 day 31 auto-snapshot:weekly 7 days 4 auto-snapshot:monthly 1 month 12ZFS Tutorial USENIX LISA’11 155
  156. 156. Nautilus • File system views which can go back in timeZFS Tutorial USENIX LISA’11 156
  157. 157. Resilver & Scrub • Can be read IOPS bound • Resilver can also be bandwidth bound to the resilvering device • Both work at lower I/O scheduling priority than normal work, but that may not matter for read IOPS bound devices • Dueling RFEs: ✦ Resilver should go faster ✦ Resilver should go slower ✤ Integrated in b140ZFS Tutorial USENIX LISA’11 157