USENIX LISA09


   ZFS Tutorial
Richard.Elling@RichardElling.com
This presentation is from the ZFS Tutorial presented at the USENIX LISA09 Conference at Baltimore, Maryland in November 2009.

Later versions are available on slideshare.net, too.

    1. 1. USENIX LISA09 ZFS Tutorial Richard.Elling@RichardElling.com
    2. 2. Agenda Overview Foundations Pooled Storage Layer Transactional Object Layer Commands zpool zfs Sharing Properties Performance Troubleshooting Wrap 2
    3. 3. Ground Rules No religious discussion No licensing discussion No “future of <company>” discussion No zones/containers/jails discussion No “when is it going to be in Solaris 10” discussion... ok maybe a few... 3
    4. 4. History Announced September 14, 2004 Integration history SXCE b27 (November 2005) FreeBSD (April 2007) Mac OSX Leopard Preview shown, but removed from Snow Leopard Disappointed community reforming as the zfs-macos google group (Oct 2009) OpenSolaris 2008.05 Solaris 10 6/06 (June 2006) Linux FUSE (summer 2006) greenBytes ZFS+ (September 2008) More than 45 patents, contributed to the CDDL Patents Common 4
    5. 5. Brief List of Features Future-proof “No silent data corruption ever” Cutting-edge data integrity “Mind-boggling scalability” High performance “Breathtaking speed” Simplified administration “Near zero administration” Eliminates need for volume managers “Radical new architecture” Reduced costs “Greatly simplifies support issues” Compatibility with POSIX file system & block devices “RAIDZ saves money” Self-healing Marketing: 2 drink minimum 5
    6. 6. ZFS Design Goals Figure out why storage has gotten so complicated Blow away 20+ years of obsolete assumptions Gotta replace UFS Design an integrated system from scratch End the suffering 6
    7. 7. Limits 2^48 — Number of entries in any individual directory 2^56 — Number of attributes of a file [1] 2^56 — Number of files in a directory [1] 16 EiB (2^64 bytes) — Maximum size of a file system 16 EiB — Maximum size of a single file 16 EiB — Maximum size of any attribute 2^64 — Number of devices in any pool 2^64 — Number of pools in a system 2^64 — Number of file systems in a pool 2^64 — Number of snapshots of any file system 256 ZiB (2^78 bytes) — Maximum size of any pool [1] actually constrained to 2^48 for the number of files in a ZFS file system 7
    8. 8. Sidetrack: Understanding Builds Build is often referenced when speaking of feature/bug integration Short-hand notation: b# OpenSolaris and SXCE are based on NV SXCE will soon end OpenSolaris carries forward ZFS development done for NV Bi-weekly build cycle Schedule at http://opensolaris.org/os/community/on/schedule/ ZFS is ported to Solaris 10 and other OSes 8
    9. 9. Foundations 9
    10. 10. Overhead View of a Pool Pool File System Configuration Information Volume File System Volume Dataset 10
    11. 11. Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 11
    12. 12. Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration 12
    13. 13. Acronyms ARC – Adaptive Replacement Cache DMU – Data Management Unit DSL – Dataset and Snapshot Layer JNI – Java Native Interface ZPL – ZFS POSIX Layer (traditional file system interface) VDEV – Virtual Device layer ZAP – ZFS Attribute Processor ZIL – ZFS Intent Log ZIO – ZFS I/O layer Zvol – ZFS volume (raw/cooked block device interface) 13
    14. 14. nvlists name=value pairs libnvpair(3LIB) Allows ZFS capabilities to change without changing the physical on- disk format Data stored is XDR encoded A good thing, used often 14
    15. 15. Versioning Features can be added and identified by nvlist entries Change in pool or dataset versions do not change physical on-disk format (!) does change nvlist parameters Older-versions can be used might see warning messages, but harmless Available versions and features can be easily viewed zpool upgrade -v zfs upgrade -v Online references zpool: www.opensolaris.org/os/community/zfs/version/N zfs: www.opensolaris.org/os/community/zfs/version/zpl/N Don't confuse zpool and zfs versions 15
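    To check what is running versus what the software supports, the commands referenced above can be used directly; a quick sketch (the pool and dataset names are hypothetical):
    # zpool upgrade -v              show zpool versions supported by this software
    # zfs upgrade -v                show zfs (ZPL) versions supported
    # zpool get version tank        show the on-disk version of pool 'tank'
    # zfs get version tank/home     show the ZPL version of a dataset
    # zpool upgrade tank            upgrade the pool (one-way; older software can no longer import it)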
    16. 16. zpool versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit support 15 user/group space accounting 16 stmf property support 17 Triple-parity RAID-Z 18 snapshot user holds 19 Log device removal 16
    17. 17. zfs versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS filesystem version 2 Enhanced directory entries 3 Case insensitive and File system unique identifier (FUID) 4 userquota, groupquota properties 17
    18. 18. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free 18
    19. 19. COW Notes COW works on blocks, not files ZFS reserves 32 MBytes or 1/64 of pool size COWs need some free space to remove files need space for ZIL For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched Spatial distribution is good fodder for performance speculation affects HDDs moot for SSDs 19
    20. 20. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 20
    21. 21. vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type=disk type=disk type=disk type=disk children[0] children[1] children[0] children[1] Physical or leaf vdevs 21
    22. 22. vdev Labels vdev labels != disk labels Four 256 kByte labels written to every physical vdev Two-stage update process write label0 & label2 flush cache & check for errors write label1 & label3 flush cache & check for errors On-disk layout (N = 256k * floor(size / 256k)): label0 at 0, label1 at 256k, boot block from 512k to 4M, label2 at N-512k, label3 at N-256k Inside each label: blank space (0-8k), boot header (8k-16k), Name=Value pairs (16k-128k), 128-slot Uberblock Array (128k-256k) 22
    23. 23. Observing Labels # zdb -l /dev/rdsk/c0t0d0s0 -------------------------------------------- LABEL 0 -------------------------------------------- version=14 name='rpool' state=0 txg=13152 pool_guid=17111649328928073943 hostid=8781271 hostname='' top_guid=11960061581853893368 guid=11960061581853893368 vdev_tree type='disk' id=0 guid=11960061581853893368 path='/dev/dsk/c0t0d0s0' devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a' phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a' whole_disk=0 metaslab_array=24 metaslab_shift=30 ashift=9 asize=157945167872 is_log=0 23
    24. 24. To fsck or not to fsck fsck was created to fix known inconsistencies in file system metadata UFS is not transactional metadata inconsistencies must be reconciled does NOT repair data – how could it? ZFS doesn't need fsck, as-is all on-disk changes are transactional COW means previously existing, consistent metadata is not overwritten ZFS can repair itself metadata is at least dual-redundant data can also be redundant Reality check – this does not mean that ZFS is not susceptible to corruption nor is any other file system 24
    25. 25. VDEV 25
    26. 26. Dynamic Striping  RAID-0 − SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern  Dynamic Stripe − Data is dynamically mapped to member disks − No fixed-length sequences − Allocate up to ~1 MByte/vdev before changing vdev − vdevs can be different size − Good combination of the concatenation feature with RAID-0 performance 26
    27. 27. Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes 27
    28. 28. Mirroring  Straightforward: put N copies of the data on N vdevs  Unlike RAID-1 − No 1:1 mapping at the block level − vdev labels are still at beginning and end − vdevs can be of different size  effective space is that of smallest vdev  Arbitration: ZFS does not blindly trust either side of mirror − Most recent, correct view of data wins − Checksums validate data 28
    29. 29. Mirroring 29
    30. 30. Dynamic vdev Replacement  zpool replace poolname vdev [vdev]  Today, replacing vdev must be same size or larger − Before b117: as measured by blocks − After b117: as measured by metaslabs  Replacing all vdevs in a top-level vdev with larger vdevs results in top-level vdev resizing  Policy controlled by zpool autoexpand property 15G 10G 10G 15G 10G 20G 15G 20G 10G 15G 20G 20G 20G 20G 10G 10G Mirror 10G Mirror 15G Mirror 15G Mirror 20G Mirror 30
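    A sketch of growing a mirror in place, as described above, assuming hypothetical pool and device names where the new disks are larger than the old ones:
    # zpool replace tank c1t0d0 c2t0d0     replace one side, wait for the resilver to finish
    # zpool replace tank c1t1d0 c2t1d0     replace the other side
    # zpool set autoexpand=on tank         allow the top-level vdev to use the new space (post-b117 property)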
    31. 31. RAIDZ  RAID-5 − Parity check data is distributed across the RAID array's disks − Must read/modify/write when data is smaller than stripe width  RAIDZ − Dynamic data placement − Parity added as needed − Writes are full-stripe writes − No read/modify/write (write hole)  Arbitration: ZFS does not blindly trust any device − Does not rely on disk reporting read error − Checksums validate data − If checksum fails, read parity Space used is dependent on how used 31
    32. 32. RAID-5 vs RAIDZ DiskA DiskB DiskC DiskD DiskE D0:0 D0:1 D0:2 D0:3 P0 RAID-5 P1 D1:0 D1:1 D1:2 D1:3 D2:3 P2 D2:0 D2:1 D2:2 D3:2 D3:3 P3 D3:0 D3:1 DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 RAIDZ P1 D1:0 D1:1 P2:0 D2:0 D2:1 D2:2 D2:3 P2:1 D2:4 D2:5 Gap P3 D3:0 D3:1 32
    33. 33. RAID-5 Write Hole  Occurs when data to be written is smaller than stripe size  Must read unallocated columns to recalculate the parity or the parity must be read/modify/write  Read/modify/write is risky for consistency − Multiple disks − Reading independently − Writing independently − System failure before all writes are complete to media could result in data loss  Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disks 33
    34. 34. RAIDZ2 and RAIDZ3 RAIDZ2 = double parity RAIDZ RAIDZ3 = triple parity RAIDZ Sorta like RAID-6 − Parity 1: XOR − Parity 2: another Reed-Solomon syndrome − Parity 3: yet another Reed-Solomon syndrome Arbitration: ZFS does not blindly trust any device − Does not rely on disk reporting read error − Checksums validate data − If data not valid, read parity − If data still not valid, read other parity Space used is dependent on how used 34
    35. 35. Evaluating Data Retention MTTDL = Mean Time To Data Loss Note: MTBF is not constant in the real world, but keeps math simple MTTDL[1] is a simple MTTDL model No parity (single vdev, striping, RAID-0) − MTTDL[1] = MTBF / N Single Parity (mirror, RAIDZ, RAID-1, RAID-5) − MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) Double Parity (3-way mirror, RAIDZ2, RAID-6) − MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2) 35
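    To get a feel for the numbers, a hypothetical worked example (assume MTBF = 1,000,000 hours and MTTR = 24 hours): a 2-disk stripe gives MTTDL[1] = MTBF / N = 10^6 / 2 = 5 x 10^5 hours, while a 2-disk mirror gives MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) = (10^6)^2 / (2 * 1 * 24) ≈ 2.1 x 10^10 hours.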
    36. 36. Another MTTDL Model MTTDL[1] model doesn't take unrecoverable reads into account But unrecoverable reads (UER) are becoming the dominant failure mode − UER specified as errors per bits read − More bits = higher probability of loss per vdev MTTDL[2] model considers UER 36
    37. 37. Why Worry about UER? Richard's study − 3,684 hosts with 12,204 LUNs − 11.5% of all LUNs reported read errors Bairavasundaram et al. FAST08 www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf − 1.53M LUNs over 41 months − RAID reconstruction discovers 8% of checksum mismatches − 4% of disks studied developed checksum errors over 17 months 37
    38. 38. MTTDL[2] Model Probability that a reconstruction will fail − Precon_fail = (N-1) * size / UER Model doesn't work for non-parity schemes (single vdev, striping, RAID-0) Single Parity (mirror, RAIDZ, RAID-1, RAID-5) − MTTDL[2] = MTBF / (N * Precon_fail) Double Parity (3-way mirror, RAIDZ2, RAID-6) − MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail) 38
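    Continuing the hypothetical numbers above for a 2-disk mirror of 1 TByte disks with UER = 1 error per 10^14 bits read: size ≈ 8 x 10^12 bits, so Precon_fail = (N-1) * size / UER = 8 x 10^12 / 10^14 = 0.08, and MTTDL[2] = MTBF / (N * Precon_fail) = 10^6 / (2 * 0.08) ≈ 6.3 x 10^6 hours — several orders of magnitude lower than the MTTDL[1] estimate, which is why unrecoverable reads matter.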
    39. 39. Practical View of MTTDL[1] 39
    40. 40. MTTDL Models: Mirror 40
    41. 41. MTTDL Models: RAIDZ2 41
    42. 42. Ditto Blocks Recall that each blkptr_t contains 3 DVAs Dataset property used to indicate how many copies (aka ditto blocks) of data is desired Write all copies Read any copy Recover corrupted read from a copy Not a replacement for mirroring Easier to describe in pictures... copies parameter: copies=1 (default) → 1 data copy, 2 metadata copies; copies=2 → 2 data copies, 3 metadata copies; copies=3 → 3 data copies, 3 metadata copies 42
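    Setting the property is a one-line change and only affects data written afterward; a sketch with a hypothetical dataset name:
    # zfs set copies=2 tank/important
    # zfs get copies tank/important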
    43. 43. Copies in Pictures 43
    44. 44. Copies in Pictures 44
    45. 45. ZIO – ZFS I/O Layer 45
    46. 46. ZIO Framework All physical disk I/O goes through ZIO Framework Translates DVAs into Logical Block Address (LBA) on leaf vdevs Keeps free space maps (spacemap) If contiguous space is not available: Allocate smaller blocks (the gang) Allocate gang block, pointing to the gang Implemented as multi-stage pipeline Allows extensions to be added fairly easily Handles I/O errors 46
    47. 47. SpaceMap from Space 47
    48. 48. ZIO Write Pipeline Stages (ZIO State, Compression, Crypto, Checksum, DVA, vdev I/O): open → compress if savings > 12.5% → encrypt → generate checksum → allocate DVA → vdev I/O start/done/assess → done Gang activity elided, for clarity 48
    49. 49. ZIO Read Pipeline Stages (ZIO State, Compression, Crypto, Checksum, DVA, vdev I/O): open → vdev I/O start/done/assess → verify checksum → decrypt → decompress → done Gang activity elided, for clarity 49
    50. 50. VDEV – Virtual Device Subsystem Where mirrors, RAIDZ, and RAIDZ2 are implemented Surprisingly few lines of code needed to implement RAID Leaf vdev (physical device) I/O management Number of outstanding iops Read-ahead cache Priority scheduling I/O priorities (Name, Priority): NOW 0, SYNC_READ 0, SYNC_WRITE 0, FREE 0, CACHE_FILL 0, LOG_WRITE 0, ASYNC_READ 4, ASYNC_WRITE 4, RESILVER 10, SCRUB 20 50
    51. 51. ARC – Adaptive Replacement Cache 51
    52. 52. Object Cache UFS uses page cache managed by the virtual memory system ZFS does not use the page cache, except for mmap'ed files ZFS uses an Adaptive Replacement Cache (ARC) ARC used by DMU to cache DVA data objects Only one ARC per system, but caching policy can be changed on a per-dataset basis Seems to work much better than page cache ever did for UFS 52
    53. 53. Traditional Cache Works well when data being accessed was recently added Doesn't work so well when frequently accessed data is evicted Misses cause insert at the MRU end Dynamic caches can change size by either not evicting or aggressively evicting Evict the oldest, at the LRU end 53
    54. 54. ARC – Adaptive Replacement Cache Two lists: a Recent Cache and a Frequent Cache A miss inserts at the MRU end of the Recent Cache; a hit moves the entry to the MRU end of the Frequent Cache Evict the oldest single-use entry from the Recent Cache; evict the oldest multiple-accessed entry from the Frequent Cache Evictions and dynamic resizing need to choose the best cache to evict from (shrink) 54
    55. 55. ZFS ARC – Adaptive Replacement Cache with Locked Pages Same Recent Cache / Frequent Cache structure, but cannot evict locked pages! Evict the oldest single-use entry from the Recent Cache; evict the oldest multiple-accessed entry from the Frequent Cache Promotion on a hit is conditioned on whether the hit occurs within 62 ms ZFS ARC handles mixed-size pages 55
    56. 56. ARC Directory Each ARC directory entry contains arc_buf_hdr structs Info about the entry Pointer to the entry Directory entries have size, ~200 bytes ZFS block size is dynamic, 512 bytes – 128 kBytes Disks are large Suppose we use a Seagate LP 2 TByte disk for the L2ARC Disk has 3,907,029,168 512 byte sectors, guaranteed Workload uses 8 kByte fixed record size RAM needed for arc_buf_hdr entries Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes Don't underestimate the RAM needed for large L2ARCs 56
    57. 57. L2ARC – Level 2 ARC ARC evictions are sent to cache vdev ARC directory remains in memory Works well when cache vdev is optimized for fast reads lower latency than pool disks inexpensive way to “increase memory” Content considered volatile, no ZFS data protection allowed Monitor usage with zpool iostat (diagram: evicted data flows from the ARC to the “cache” vdevs) 57
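    Adding a cache device uses the same zpool add pattern as other vdevs; a sketch with hypothetical names:
    # zpool add tank cache c4t0d0
    # zpool iostat -v tank            the cache device and its activity appear in the listing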
    58. 58. ARC Tips In general, it seems to work well for most workloads ARC size will vary, based on usage Default max is 3/4 of memory or memory - 1 GByte Min is 64 MB Metadata capped at 1/4 of max ARC size Internals tracked by kstats in Solaris Use memory_throttle_count to observe pressure to evict Can limit at boot time Solaris – set zfs:zfs_arc_max in /etc/system Performance Prior to b107, L2ARC fill rate was limited to 8 MBytes/s L2ARC keeps its directory in kernel memory 58
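    A minimal sketch of capping the ARC on Solaris, assuming a 4 GByte limit is wanted (value in bytes, takes effect at boot):
    * in /etc/system
    set zfs:zfs_arc_max=4294967296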
    59. 59. Transactional Object Layer 59
    60. 60. flash Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration 60
    61. 61. DMU – Data Management Layer Datasets issue transactions to the DMU Transactional based object model Transactions are Atomic Grouped (txg = transaction group) Responsible for on-disk data ZFS Attribute Processor (ZAP) Dataset and Snapshot Layer (DSL) ZFS Intent Log (ZIL) 61
    62. 62. Transaction Engine Manages physical I/O Transactions grouped into transaction group (txg) txg updates All-or-nothing Commit interval Older versions: 5 seconds Now: 30 seconds max, dynamically scale based on time required to commit txg Delay committing data to physical storage Improves performance A bad thing for sync workloads – hence the ZFS Intent Log (ZIL) 30 second delay can impact failure detection time 62
    63. 63. ZIL – ZFS Intent Log DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers NFS Databases ZIL recordsize inflation can occur for some workloads May cause larger than expected actual I/O for sync workloads Oracle redo logs Can tune zfs_immediate_write_sz, but after b122 use logbias property instead Never read, except at import (eg reboot), when transactions may need to be rolled forward 63
    64. 64. Separate Logs (slogs) ZIL competes with pool for iops Applications will wait for sync writes to be on nonvolatile media Very noticeable on HDD JBODs Put ZIL on separate vdev, outside of pool ZIL writes tend to be sequential No competition with pool for IOPS Downside: slog device required to be operational at import b125 adds slog device removal support Size of separate log < than size of RAM (duh) 10x or more performance improvements possible Use write-optimized SSD or non-volatile write cache on RAID array Use zilstat to observe ZIL activity 64
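    A sketch of adding a mirrored slog built from two write-optimized SSDs (pool and device names hypothetical):
    # zpool add tank log mirror c2t0d0 c3t0d0
    # zpool iostat -v tank            the log shows up as its own vdev; watch its write activity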
    65. 65. Synchronous Write Destination Without separate log: sync I/O size > zfs_immediate_write_sz? no → ZIL log, yes → bypass to pool With separate log: sync I/O size > zfs_immediate_write_sz? no → log device; yes → log device (prior to logbias, b122), log device (logbias=latency, the default), or bypass to pool (logbias=throughput) Default zfs_immediate_write_sz = 32 kBytes 65
    66. 66. Disabling the ZIL Rule 0: Don’t disable the ZIL If you love your data, do not disable the ZIL You can find references to this as a way to speed up ZFS NFS workloads “tar -x” benchmarks Golden Rule: Don’t disable the ZIL Can set via mdb, but need to remount the file system under test Friends don’t let friends disable the ZIL Solaris - can set in /etc/system *** TEMPORARY disable ZIL for non-production use *** disabled by <your name> on <date> set zfs:zil_disable=1 Nostradamus wrote, “disabling the ZIL will lead to the apocalypse” 66
    67. 67. DSL – Dataset and Snapshot Layer 67
    68. 68. flash Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free 68
    69. 69. zfs snapshot Create a read-only, point-in-time window into the dataset (file system or Zvol) Computationally free, because of COW architecture Very handy feature Patching/upgrades Basis for Time Slider 69
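    Typical usage, with hypothetical dataset names:
    # zfs snapshot tank/home@before-patch
    # zfs snapshot -r tank@nightly        recursive: snapshot tank and all descendant datasets
    # zfs destroy tank/home@before-patch  remove a snapshot that is no longer needed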
    70. 70. Snapshot Current tree root Snapshot tree root Create a snapshot by not free'ing COWed blocks Snapshot creation is fast and easy Number of snapshots determined by use – no hardwired limit Recursive snapshots also possible 70
    71. 71. Clones Snapshots are read-only Clones are read-write based upon a snapshot Child depends on parent Cannot destroy parent without destroying all children Can promote children to be parents Good ideas OS upgrades Change control Replication zones virtual disks 71
    72. 72. zfs clone Create a read-write file system from a read-only snapshot Used extensively for OpenSolaris upgrades OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 snapshot snapshot snapshot OS rev1 upgrade OS rev2 clone boot manager Origin snapshot cannot be destroyed, if clone exists 72
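    A minimal sketch of the snapshot → clone → promote cycle mentioned above (names hypothetical):
    # zfs snapshot tank/ws@base
    # zfs clone tank/ws@base tank/ws-experiment
    # zfs promote tank/ws-experiment      swaps the parent/child relationship so the old origin can later be destroyed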
    73. 73. zfs rollback (diagram: rpool/ROOT/b104 with snapshot rpool/ROOT/b104@today; rollback returns the file system to the state captured in the snapshot) 73
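    Using the names from the diagram, the command form is simply:
    # zfs rollback rpool/ROOT/b104@today
    More recent intermediate snapshots, if any, must be destroyed to roll back past them (zfs rollback -r).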
    74. 74. Commands 74
    75. 75. zpool(1m) raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 75
    76. 76. Dataset & Snapshot Layer Object: allocated storage, a dnode describes the collection of blocks Object Set: group of related objects Dataset: an Object Set plus a Snapmap (snapshot relationships), properties, and space usage Dataset Directory: a Childmap (dataset relationships) plus properties 76
    77. 77. zpool create zpool create poolname vdev-configuration vdev-configuration examples mirror c0t0d0 c3t6d0 mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 mirror disk1s0 disk2s0 cache disk4s0 log disk5 raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 Solaris Additional checks to see if disk/slice overlaps or is currently in use Whole disks are given EFI labels Can set initial pool or dataset properties By default, creates a file system with the same name poolname pool → /poolname file system People get confused by a file system with same name as the pool 77
    78. 78. zpool add Adds a device to the pool as a top-level vdev zpool add poolname vdev-configuration vdev-configuration can be any combination also used for zpool create Complains if the added vdev-configuration would cause a different data protection scheme than is already in use – use “-f” to override Good idea: try with “-n” flag first – will show final configuration without actually performing the add Do not add a device which is in use as a quorum device 78
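    A hedged example of the dry-run check recommended above (pool and devices hypothetical):
    # zpool add -n tank mirror c2t0d0 c3t0d0    show the configuration that would result, change nothing
    # zpool add tank mirror c2t0d0 c3t0d0       actually add the new top-level mirror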
    79. 79. zpool remove Remove a top-level vdev from the pool zpool remove poolname vdev Today, you can only remove the following vdevs: cache hot spare separate log (b124) An RFE is open to allow removal of other top-level vdevs Don't confuse “remove” with “detach” 79
    80. 80. zpool attach Attach a vdev as a mirror to an existing vdev zpool attach poolname existing-vdev vdev Attaching vdev must be the same size or larger than the existing vdev Note: today, not available for RAIDZ, RAIDZ2, or RAIDZ3 vdevs vdev Configurations ok simple vdev → mirror ok mirror ok log → mirrored log no RAIDZ no RAIDZ2 no RAIDZ3 “Same size” literally means the same number of blocks until b117. Beware that many “same size” disks have different number of available blocks. 80
    81. 81. zpool import Import a pool and mount all mountable datasets Import a specific pool zpool import poolname zpool import GUID Scan LUNs for pools which may be imported zpool import Can set options, such as alternate root directory or other properties Beware of zpool.cache interactions Beware of artifacts, especially partial artifacts 81
    82. 82. zpool history Show history of changes made to the pool # zpool history rpool History for 'rpool': 2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0 2009-03-04.07:29:47 zfs set canmount=noauto rpool 2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool 2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT 2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap 2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump 2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106 2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106 2009-03-04.07:29:51 zfs set canmount=on rpool 2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export 2009-03-04.07:29:51 zfs create rpool/export/home 2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943 2009-03-04.00:21:42 zpool export rpool 2009-03-04.08:47:08 zpool set bootfs=rpool rpool 2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108 2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/ snv_b108 ... 82
    83. 83. zpool status Shows the status of the current pools, including their configuration Important troubleshooting step # zpool status … pool: zwimming state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM zwimming ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be tricky 83
    84. 84. zpool iostat Show pool physical I/O activity, in an iostat-like manner Solaris: fsstat will show I/O activity looking into a ZFS file system Especially useful for showing slog activity # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- zwimming 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latency 84
    85. 85. zfs(1m) raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? 85
    86. 86. zfs create, destroy By default, a file system with the same name as the pool is created by zpool create Name format is: pool/name[/name ...] File system zfs create fs-name zfs destroy fs-name Zvol zfs create -V size vol-name zfs destroy vol-name Parameters can be set at create time 86
    87. 87. zfs list List mounted datasets Old versions: listed everything After b108: do not list snapshots See zpool listsnapshots property Examples zfs list zfs list -t snapshot zfs list -H -o name 87
    88. 88. zfs send, receive Send send a snapshot to stdout data is decompressed Receive receive a snapshot from stdin receiving file system parameters apply (compression, etc.) Can incrementally send snapshots in time order Handy way to replicate dataset snapshots Only method for replicating dataset properties, except quotas NOT a replacement for traditional backup solutions All-or-nothing design per snapshot In general, does not send files (!) Send streams from b35 (or older) no longer supported after b89 88
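    A common replication pattern, sketched with hypothetical hosts, datasets, and snapshot names:
    # zfs send tank/home@snap1 | ssh backuphost zfs receive backup/home
    # zfs send -i tank/home@snap1 tank/home@snap2 | ssh backuphost zfs receive backup/home
    The second form sends only the incremental difference between the two snapshots; the receiving side must already have @snap1.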
    89. 89. Sharing 89
    90. 90. Sharing zfs share dataset Type of sharing set by parameters shareiscsi = [on | off] sharenfs = [on | off | options] sharesmb = [on | off | options] Shortcut to manage sharing Uses external services (nfsd, iscsi target, smbshare, etc) Importing pool will also share May vary by OS 90
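    In practice sharing is usually driven by setting the properties rather than calling zfs share directly; a sketch with hypothetical datasets:
    # zfs set sharenfs=on tank/home
    # zfs set sharesmb=on tank/home
    # zfs share -a                    share everything that is marked shareable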
    91. 91. NFS ZFS file systems work as expected use ACLs based on NFSv4 ACLs Parallel NFS, aka pNFS, aka NFSv4.1 Still a work-in-progress http://opensolaris.org/os/project/nfsv41/ zfs create -t pnfsdata mypnfsdata (diagram: pNFS client, pNFS metadata server, and pNFS data servers, each data server holding a pnfsdata dataset in its pool) 91
    92. 92. CIFS UID mapping casesensitivity parameter Good idea, set when file system is created zfs create -o casesensitivity=insensitive mypool/Shared Shadow Copies for Shared Folders (VSS) supported CIFS clients cannot create shadow remotely (yet) CIFS features vary by OS, Samba, etc. 92
    93. 93. iSCSI SCSI over IP Block-level protocol Uses Zvols as storage Solaris has 2 iSCSI target implementations shareiscsi enables old, user-land iSCSI target To use COMSTAR, enable using itadm(1m) b116 more closely integrates COMSTAR (zpool version 16) iSCSI performance hiccup Prior to b107, iSCSI over Zvols didn’t properly handle sync writes b107-b113, iSCSI over Zvols made all writes sync (read: slow) Workaround: enable write cache enable in the iSCSI target, see CR6770534 OpenSolaris 2009.06 is b111 b114, write cache enable works automatically iSCSI over Zvol 93
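    A sketch of the older shareiscsi path described above (names and sizes hypothetical):
    # zfs create -V 100G tank/vol1        create a 100 GByte Zvol
    # zfs set shareiscsi=on tank/vol1     export it via the user-land iSCSI target
    # zfs get shareiscsi tank/vol1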
    94. 94. Properties 94
    95. 95. Properties Properties are stored in an nvlist By default, are inherited Some properties are common to all datasets, but a specific dataset type may have additional properties Easily set or retrieved via scripts In general, properties affect future file system activity zpool get doesn't script as nicely as zfs get 95
    96. 96. User-defined Properties Names Must include colon ':' Can contain lower case alphanumerics or “+” “.” “_” Max length = 256 characters By convention, module:property com.sun:auto-snapshot Values Max length = 1024 characters Examples com.sun:auto-snapshot=true com.richardelling:important_files=true 96
    97. 97. set & get properties Set zfs set compression=on export/home/relling Get zfs get compression export/home/relling Reset to inherited value zfs inherit compression export/home/relling Clear user-defined parameter zfs inherit com.sun:auto-snapshot export/home/ relling 97
    98. 98. Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoexpand Policy for expanding when vdev size changes autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/ zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policy 98
    99. 99. More Pool Properties Property Change? Brief Description guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk version 99
    100. 100. Common Dataset Properties Property Change? Brief Description available readonly Space available to dataset & children checksum Checksum algorithm compression Compression algorithm compressratio readonly Compression ratio – logical size:referenced physical copies Number of copies of user data creation readonly Dataset creation time logbias Separate log write policy origin readonly For clones, origin snapshot primarycache ARC caching policy readonly Is dataset in readonly mode? referenced readonly Size of data accessible by this dataset 100
    101. 101. More Common Dataset Properties Property Change? Brief Description refreservation Max space guaranteed to a dataset, excluding descendants (snapshots & clones) reservation Minimum space guaranteed to dataset, including descendants secondarycache L2ARC caching policy type readonly Type of dataset (filesystem, snapshot, volume) 101
    102. 102. More Common Dataset Properties Property Change? Brief Description used readonly Sum of usedby* (see below) usedbychildren readonly Space used by descendants usedbydataset readonly Space used by dataset usedbyrefreservation readonly Space used by a refreservation for this dataset usedbysnapshots readonly Space used by all snapshots of this dataset zoned readonly Is dataset added to non-global zone (Solaris) 102
    103. 103. Volume Dataset Properties Property Change? Brief Description shareiscsi iSCSI service (not COMSTAR) volblocksize creation fixed block size volsize Implicit quota zoned readonly Set if dataset delegated to non-global zone (Solaris) 103
    104. 104. File System Properties Property Change? Brief Description aclinherit ACL inheritance policy, when files or directories are created aclmode ACL modification policy, when chmod is used atime Disable access time metadata updates canmount Mount policy casesensitivity creation Filename matching algorithm (CIFS client feature) devices Device opening policy for dataset exec File execution policy for dataset mounted readonly Is file system currently mounted? 104
    105. 105. More File System Properties Property Change? Brief Description nbmand export/ File system should be mounted with non- import blocking mandatory locks (CIFS client feature) normalization creation Unicode normalization of file names for matching quota Max space dataset and descendants can consume recordsize Suggested maximum block size for files refquota Max space dataset can consume, not including descendants setuid setuid mode policy sharenfs NFS sharing options sharesmb Files system shared with CIFS 105
    106. 106. File System Properties Property Change? Brief Description snapdir Controls whether .zfs directory is hidden utf8only creation UTF-8 character file name policy vscan Virus scan enabled xattr Extended attributes policy 106
    107. 107. More Goodies... 107
    108. 108. Dataset Space Accounting used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation Lazy updates, may not be correct until txg commits ls and du will show size of allocated files which includes all copies of a file Shorthand report available $ zfs list -o space NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD rpool 126G 18.3G 0 35.5K 0 18.3G rpool/ROOT 126G 15.3G 0 18K 0 15.3G rpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0 rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0 rpool/dump 126G 1.00G 0 1.00G 0 0 rpool/export 126G 37K 0 19K 0 18K rpool/export/home 126G 18K 0 18K 0 0 rpool/swap 128G 2G 0 193M 1.81G 0 108
    109. 109. zfs vs zpool Space Accounting zfs list != zpool list zfs list shows space used by the dataset plus space for internal accounting zpool list shows physical space available to the pool For simple pools and mirrors, they are nearly the same For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space available for parity Users will be confused about reported space available 109
    110. 110. Accessing Snapshots By default, snapshots are accessible in .zfs directory Visibility of .zfs directory is tunable via snapdir property Don't really want find to find the .zfs directory Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS) # zfs snapshot rpool/export/home/relling@20090415 # ls -a /export/home/relling … .Xsession .xsession-errors # ls /export/home/relling/.zfs shares snapshot # ls /export/home/relling/.zfs/snapshot 20090415 # ls /export/home/relling/.zfs/snapshot/20090415 Desktop Documents Downloads Public 110
    111. 111. Time-based Resilvering Block pointers contain birth txg number Resilvering begins with oldest blocks first Interrupted resilver will still result in a valid file system view (diagram: block tree with blocks born in txg 27, 68, and 73) 111
    112. 112. Time Slider - Automatic Snapshots Underpinnings for Solaris feature similar to OSX's Time Machine SMF service for managing snapshots SMF properties used to specify policies: frequency (interval) and number to keep Creates cron jobs GUI tool makes it easy to select individual file systems Tip: take additional snapshots for important milestones to avoid automatic snapshot deletion Service Name Interval (default) Keep (default) auto-snapshot:frequent 15 minutes 4 auto-snapshot:hourly 1 hour 24 auto-snapshot:daily 1 day 31 auto-snapshot:weekly 7 days 4 auto-snapshot:monthly 1 month 12 112
    113. 113. Nautilus File system views which can go back in time 113
    114. 114. ACL – Access Control List Based on NFSv4 ACLs Similar to Windows NT ACLs Works well with CIFS services Supports ACL inheritance Change using chmod View using ls 114
    115. 115. Checksums for Data DVA contains 256 bits for checksum Checksum is in the parent, not in the block itself Types none fletcher2: truncated 2nd order Fletcher-like algorithm (default prior to b114) fletcher4: 4th order Fletcher-like algorithm (default, starting b114) SHA-256 There are open proposals for better algorithms 115
    116. 116. Checksum Use Pool Algorithm Notes Uberblock SHA-256 self-checksummed Metadata fletcher4 Labels SHA-256 Gang block SHA-256 self-checksummed Dataset Algorithm Notes Metadata fletcher4 Data fletcher4 (default) zfs checksum parameter ZIL log fletcher2 self-checksummed Send stream fletcher4 Note: fletcher2 was the default for data prior to b114 Note: ZIL log has additional checking beyond the checksum 116
    117. 117. Compression Builtin lzjb, Lempel-Ziv by Jeff Bonwick gzip, levels 1-9 Extensible new compressors can be added backwards compatibility issues Uses taskqs to take advantage of multi-processor systems Do you have a better compressor in mind? http://richardelling.blogspot.com/2009/08/justifying-new- compression-algorithms.html Cannot boot from gzip compressed root (RFE is open) 117
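    Compression is per-dataset and only affects blocks written after it is enabled; a hedged example with hypothetical names:
    # zfs set compression=on tank/home         lzjb
    # zfs set compression=gzip-6 tank/archive  gzip at level 6
    # zfs get compressratio tank/archive       observe the achieved ratio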
    118. 118. Encryption Placeholder – details TBD http://opensolaris.org/os/project/zfs-crypto Complicated by: Block pointer rewrites Deduplication 118
    119. 119. Quotas File system quotas quota includes descendants (snapshots, clones) refquota does not include descendants User and group quotas b114, Solaris 10 10/09 (patch 141444-03 or 141445-03) Works like refquota, descendants don't count Not inherited zfs userspace and groupspace subcommands show quotas Users can only see their own and group quota, but can delegate Managed like properties [user|group]quota@[UID|username|SID name|SID number] not visible via zfs get all 119
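    A sketch of the three quota flavors mentioned above (dataset and user names hypothetical):
    # zfs set quota=100g tank/home            dataset and all descendants
    # zfs set refquota=80g tank/home          dataset only; snapshots and clones don't count
    # zfs set userquota@alice=10g tank/home   per-user quota (b114 or later)
    # zfs userspace tank/home                 report per-user space and quotas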
    120. 120. zpool.cache Old way mount / read /etc/[v]fstab mount file systems ZFS import pool(s) find mountable datasets and mount them /etc/zfs/zpool.cache is a cache of pools to be imported at boot time No scanning of all available LUNs for pools to import Binary: dump contents with zdb -C cachefile property permits selecting an alternate zpool.cache Useful for OS installers Useful for clusters, where you don't want a booting node to automatically import a pool Not persistent (!) 120
Mounting ZFS File Systems
By default, mountable file systems are mounted when the pool is imported
Controlled by the canmount policy (not inherited)
   on – (default) file system is mountable
   off – file system is not mountable; useful if you want children to be mountable, but not the parent
   noauto – file system must be explicitly mounted (boot environments)
Can zfs set mountpoint=legacy to use /etc/vfstab
By default, cannot mount on top of a non-empty directory
   Can override explicitly using zfs mount -O or a legacy mountpoint
Mount properties are persistent; use zfs mount -o for temporary changes (example below)
Imports are done in parallel; beware of mountpoint races prior to b104
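A short sketch (dataset names hypothetical):

   zfs set canmount=off mypool/export
   zfs set mountpoint=/export/projects mypool/export/projects
   zfs mount -o ro mypool/export/projects
   zfs mount -O mypool/export/projects

The first two lines leave the parent unmounted while the child mounts normally; -o ro is a temporary, non-persistent mount option, and -O performs an overlay mount on a non-empty directory.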
recordsize
Dynamic
   Max 128 kBytes
   Min 512 Bytes
   Power of 2
For most workloads, don't worry about it
For fixed-size workloads, can set to match the workload
   Databases
   iSCSI Zvols serving NTFS or ext3 (use 4 KB)
File systems or Zvols
   zfs set recordsize=8k dataset
Delegated Administration
Fine-grained control
   users or groups of users
   subcommands, parameters, or sets
Similar to Solaris' Role Based Access Control (RBAC)
Enable/disable at the pool level
   zpool set delegation=on mypool (default)
Allow/unallow at the dataset level
   zfs allow relling snapshot mypool/relling
   zfs allow @backupusers snapshot,send mypool/sw
   zfs allow mypool/relling
Delegatable Subcommands
allow          receive
clone          rename
create         rollback
destroy        send
groupquota     share
groupused      snapshot
mount          userquota
promote        userused
Delegatable Parameters
aclinherit        nbmand            sharesmb
aclmode           normalization     snapdir
atime             quota             userprop
canmount          readonly          utf8only
casesensitivity   recordsize        version
checksum          refquota          volsize
compression       refreservation    vscan
copies            reservation       xattr
devices           setuid            zoned
exec              shareiscsi
mountpoint        sharenfs
Browser User Interface
Solaris 10 – WebConsole
Nexenta
OpenStorage
Solaris WebConsole
Solaris WebConsole
Nexenta
www.nexenta.com/corp/images/stories/pdfs/nexentastor%20briefing%206%2030%20final%20june%2029%2009.pdf
OpenStorage
Solaris Swap and Dump
Swap
   Solaris does not have automatic swap resizing
   Swap as a separate dataset
   Swap device is raw, with a refreservation
   Block size matched to pagesize: 8 kB SPARC, 4 kB x86
   Don't really need or want snapshots or clones
   Can resize while online, manually (see the sketch below)
Dump
   Only used during crash dump
   Preallocated
   No refreservation
   Checksum off
   Compression off (dumps are already compressed)
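A sketch of adding a second swap zvol online and pointing the dump subsystem at a dump zvol (names and the 4 GByte size are only examples; 8 kB block size shown for SPARC):

   zfs create -V 4G -b 8k rpool/swap2
   swap -a /dev/zvol/dsk/rpool/swap2
   swap -l
   dumpadm -d /dev/zvol/dsk/rpool/dump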
Performance
General Comments
In general, performs well out of the box
Standard performance improvement techniques apply
Lots of DTrace knowledge available
Typical areas of concern:
   ZIL
      check with zilstat, improve with slogs
   COW “fragmentation”
      check iostat, improve with L2ARC
   Memory consumption
      check with arcstat
      set primarycache property
      can be capped (see the sketch below)
      can compete with large page aware apps
   Compression, or lack thereof
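One commonly used knob, shown as a hedged sketch: capping the ARC via /etc/system (the 4 GByte value is only an example; takes effect after reboot):

   set zfs:zfs_arc_max = 4294967296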
ZIL Performance : NFS
Big performance increases demonstrated
   especially with SSDs
   for RAID arrays with nonvolatile RAM cache, not so much
NFS servers
   32 kByte threshold (zfs_immediate_write_sz) also corresponds to the NFSv3 write size
   May cause more work than needed
   See CR6686887
ZIL Performance : Databases
The logbias property can be set on a dataset to control the threshold for writing to the pool when a slog is used
   logbias=latency (default): all writes go to the slog
   logbias=throughput: writes > zfs_immediate_write_sz go to the pool
   Settable on-the-fly
Consider changing the policy during database loads
Can have different sync policies for logs and data
   Oracle: separate latency-sensitive redo log traffic from data
      Redo logs: logbias=latency
      Indexes: logbias=latency
      Data files: logbias=throughput
   MySQL with InnoDB: logbias=latency
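A brief sketch of setting different policies per dataset (pool and dataset names hypothetical):

   zfs set logbias=throughput dbpool/oradata
   zfs set logbias=latency dbpool/redo
   zfs get logbias dbpool/oradata dbpool/redo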
More ZIL Performance : Databases
I/O size inflation
   Once a file grows to use a block size, it will keep that block size
   Block size is capped by recordsize
   recordsize is a power of 2: 512 bytes, 1 KB, 2 KB, 4 KB, ... 128 KB
   Can be inefficient if the workload is sync and writes variable-sized data
Oracle performance work: Roch reports 40% improvement for JBOD (HDD) + separate log (SSD) with:
   File system or Zvol Role   recordsize           logbias
   data files                 8 KB                 throughput
   redo logs                  128 KB (default)     latency (default)
   indices                    8-32 KB?             latency (default)
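As a sketch, how those settings might be applied when the datasets are created (pool and dataset names hypothetical; the redo log dataset simply keeps the defaults):

   zfs create -o recordsize=8k -o logbias=throughput dbpool/datafiles
   zfs create dbpool/redologs
   zfs create -o recordsize=16k dbpool/indices

Setting recordsize at creation time matters most: blocks written before a later change keep their old size.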
vdev Cache
vdev cache occurs at the SPA level
   readahead
   10 MBytes per vdev
   only caches metadata (b70 or later)
Stats collected as Solaris kstats
   # kstat -n vdev_cache_stats
   module: zfs                        instance: 0
   name:   vdev_cache_stats           class:    misc
           crtime                     38.83342625
           delegations                14030
           hits                       105169
           misses                     59452
           snaptime                   4564628.18130739
Hit rate = 59%, not bad...
Intelligent Prefetching
Intelligent file-level prefetching occurs at the DMU level
   Feeds the ARC
In a nutshell, prefetch hits cause more prefetching
   Read a block, prefetch a block
   If we used the prefetched block, read 2 more blocks
   Up to 256 blocks
Recognizes strided reads
   2 sequential reads of the same length and a fixed distance will be coalesced
Fetches backwards
Seems to work pretty well, as-is, for most workloads
Unintelligent Prefetch?
Some workloads don't do so well with intelligent prefetch
   CR6859997, zfs caching performance problem, fixed in NV b124
Look for time spent in zfetch_* functions using lockstat
   lockstat -I sleep 10
Easy to disable in mdb for testing on Solaris
   echo zfs_prefetch_disable/W0t1 | mdb -kw
Re-enable with
   echo zfs_prefetch_disable/W0t0 | mdb -kw
Set via /etc/system
   set zfs:zfs_prefetch_disable = 1
I/O Queues
By default, for devices which can support multiple I/Os, up to 35 I/Os are queued to each vdev
Tunable with zfs_vdev_max_pending, set to 10 with:
   echo zfs_vdev_max_pending/W0t10 | mdb -kw
Implies that more vdevs is better
   Consider avoiding a RAID array with a single, large LUN
ZFS I/O scheduler loses control once iops are queued
   CR6471212 proposes reserved slots for high-priority iops
May need to match queues for the entire data path
   zfs_vdev_max_pending
   Fibre channel, SCSI, SAS, SATA driver
   RAID array controller
Fast disks → small queues, slow disks → larger queues
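The mdb change above is lost at reboot; to make it persistent, the same tunable can go in /etc/system (a sketch, 10 is only an example value):

   set zfs:zfs_vdev_max_pending = 10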
COW Penalty
COW can negatively affect workloads which have updates and sequential reads
   Initial writes will be sequential
   Updates (writes) will cause seeks to read data
Lots of people seem to worry a lot about this
   Only affects HDDs
   Very difficult to speculate about the impact on real-world apps
   Large sequential scans of random data hurt anyway
   Reads are cached in many places in the data path
   Databases can COW, too
Sysbench benchmark used to test on MySQL w/ InnoDB engine
   One-hour read/write test
   select count(*)
   repeat, for a week
COW Penalty
Performance seems to level off at about a 25% penalty
Results compliments of Allan Packer & Neelakanth Nadgir
http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf
About Disks...
Disks are still the most important performance bottleneck
   Modern processors are multi-core
   Default checksums and compression are computationally efficient

Disk                    Size   RPM      Max Size   Rotational     Average Seek
                                        (GBytes)   Latency (ms)   (ms)
HDD                     2.5”   5,400    500        5.5            11
HDD                     3.5”   5,900    2,000      5.1            16
HDD                     3.5”   7,200    1,500      4.2            8 - 8.5
HDD                     2.5”   10,000   300        3              4.2 - 4.6
HDD                     2.5”   15,000   146        2              3.2 - 3.5
SSD (write optimized)   2.5”   N/A      73         0              0.02 - 0.15
SSD (read optimized)    2.5”   N/A      500        0              0.02 - 0.15
DirectIO
UFS forcedirectio option brought the early-1980s design of UFS up to the 1990s
ZFS was designed to run on modern multiprocessors
Databases or applications which manage their own data cache may benefit by disabling file system caching
   Expect L2ARC to improve random reads (secondarycache)
   Prefetch disabled by primarycache=none|metadata

UFS DirectIO feature           ZFS equivalent
Unbuffered I/O                 primarycache=metadata or primarycache=none
Concurrency                    available at inception
Improved async I/O code path   available at inception
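A hedged sketch of limiting ZFS caching for a database that manages its own buffer cache (dataset name hypothetical):

   zfs set primarycache=metadata dbpool/datafiles
   zfs set secondarycache=all dbpool/datafiles

Here the ARC caches only metadata for that dataset, while the L2ARC is still allowed to help random reads.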
Hybrid Storage Pool
The SPA manages all three classes of devices:

                 separate log device            Main Pool   L2ARC cache device
Device           write optimized device (SSD)   HDD         read optimized device (SSD)
Size (GBytes)    < 1 GByte                      large       big
Cost             write iops/$                   size/$      size/$
Performance      low-latency writes             -           low-latency reads
RAID-Z Bandwidth
Traditional RAID-Z had a “mind the gap” feature
   Impacts possible bandwidth
   Mirrors could show higher bandwidth
Now RAID-Z shows better bandwidth, when channel bandwidth is the constrained resource
Implementation caused spurious errors for b118-b123
Troubleshooting
Checking Status
zpool status
   zpool status -v
Solaris
   fmadm faulty
   fmdump
   fmdump -ev or fmdump -eV
   format or rmformat
Copy on Write
1. Initial block tree   2. COW some data   3. COW metadata   4. Update uberblocks & free
What if the uberblock is updated prior to the leaves?
What if flush is ignored?
Some devices ignore cache flush commands (!)
   Virtualization default=ignore flush: VirtualBox, others?
   Some USB/Firewire to IDE/SATA converters
Problem: uberblock could be updated before the leaves
Symptom: can’t import pool, uberblock points to random data
Affected systems
   Many OSes and file systems
   Laptops – rarely, because of battery
   Enterprise-class systems – rarely, because of power redundancy and solid design
   Desktops – more frequently
Solution (pending further automation)
   Check integrity of recent transaction groups
   If damaged, roll back to an older uberblock
   Today, can do this by hand, but the process is tedious
Can't Import Pool?
Check device paths with zpool import (example below)
   Be aware of /etc/zfs/zpool.cache
   May need the zpool import -d directory option
   “phantom paths”?
Check for 4 labels
   zdb -l /dev/dsk/c0t0d0s0
   Beware of device short names: c0d0 != c0d0s0
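For example (directory and pool name hypothetical), searching an alternate device directory:

   zpool import -d /dev/dsk
   zpool import -d /dev/dsk mypool

The first form lists importable pools found under that directory; the second imports a specific pool from there.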
Slow Pool Import?
Case: zvols with snapshots
Symptom: reboot or zpool import is really slllooooowwwwwww...
Cause: inefficient iteration over all zvols when creating entries in /dev/zvol/dsk
Cure: CR6761786, integrated in b125
File System Mounts B0rken?
Prevention
   Avoid complex hierarchies (KISS)
   Be aware of legacy mounts
   Be aware of alternate boot environments (Solaris)
Check mountpoint properties
   zfs list -o name,mountpoint
Shared file systems
   Be aware of inherited shares
   Some clients do not mirror mount (Linux)
   NFS version differences?
   Check name services
Can't Boot?
Check if BIOS/OBP supports booting from the device
Make sure the LUN has an SMI label, not EFI
   Common mistake when mirroring root
   OK: zpool attach rpool c0t0d0s0 c0t1d0s7
   Not OK: zpool attach rpool c0t0d0s0 c0t1d0
installboot? (see the sketch below)
grub issues
   Boot environments usually handled by grub
   Check grub menu.lst
   Know how to do a failsafe boot
Be aware of LiveCD import
Be aware of zpool.cache interactions
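A sketch of installing boot blocks on a newly attached mirror half (device names hypothetical; zpool attach does not do this for you, so check the platform documentation):

   On SPARC:
   installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t1d0s0

   On x86:
   installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0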
Future Plans
Announced enhancements in the pipeline, from Kernel Conference Australia, July 15-17, 2009
   Encryption
   Deduplication
   Block pointer rewrite
   Shadow migration
More performance tweaks
   New block allocator
   Pipeline improvements
   Raw scrub
   Scrub prefetch
   Just-in-time decompression or decryption
   Native iSCSI (COMSTAR)
   Zero-copy I/O
   Parallel device open
More Future Plans
Snapshot holds (b124)
Access-based enumeration (b125)
Multiple mount protection
Separate log offlining (b125) (removal later)
Now you know...
ZFS structure: pools, datasets
Data redundancy: mirrors, RAIDZ, copies
Data verification: checksums
Data replication: snapshots, clones, send, receive
Hybrid storage: separate logs, cache devices, ARC
Security: allow, deny, encryption
Resource management: quotas, references, I/O scheduler
Performance: latency, COW, zilstat, arcstat, logbias, recordsize
Troubleshooting: FMA, zdb, importance of cache flushes
It's a wrap!
Thank You!
Questions?
Richard.Elling@RichardElling.com
