USENIX 2009

   ZFS Tutorial
Richard.Elling@RichardElling.com
Agenda
 ●   Overview
 ●   Foundations
 ●   Pooled Storage Layer
 ●   Transactional Object Layer
 ●   Commands
        –   zpool
        –   zfs
 ●   Sharing
 ●   Properties
 ●   More goodies
 ●   Performance
 ●   Wrap
History

●   Announced September 14, 2004
●   Integration history
     –   SXCE b27 (November 2005)
     –   FreeBSD (April 2007)
     –   Mac OSX Leopard (~June 2007)
     –   OpenSolaris 2008.05
     –   Solaris 10 6/06 (June 2006)
     –   Linux FUSE (summer 2006)
     –   greenBytes ZFS+ (September 2008)
●   More than 45 patents, contributed to the CDDL Patents Common
Brief List of Features
●   Future-proof                                 ●   “No silent data corruption ever”
●   Cutting-edge data integrity                  ●   “Mind-boggling scalability”
●   High performance                             ●   “Breathtaking speed”
●   Simplified administration                    ●   “Near zero administration”
●   Eliminates need for volume managers          ●   “Radical new architecture”
●   Reduced costs                                ●   “Greatly simplifies support issues”
●   Compatibility with POSIX file system         ●   “RAIDZ saves money”
    & block devices
●   Self-healing
                                                 Marketing: 2 drink minimum
ZFS Design Goals

●   Figure out why storage has gotten so complicated
●   Blow away 20+ years of obsolete assumptions
●   Gotta replace UFS
●   Design an integrated system from scratch
●   End the suffering
Limits

   2^48 — Number of entries in any individual directory
   2^56 — Number of attributes of a file [1]
   2^56 — Number of files in a directory [1]
   16 EiB (2^64 bytes) — Maximum size of a file system
   16 EiB — Maximum size of a single file
   16 EiB — Maximum size of any attribute
   2^64 — Number of devices in any pool
   2^64 — Number of pools in a system
   2^64 — Number of file systems in a pool
   2^64 — Number of snapshots of any file system
   256 ZiB (2^78 bytes) — Maximum size of any pool

   [1] actually constrained to 2^48 for the number of files in a ZFS file system
Sidetrack: Understanding Builds

●   Build is often referenced when speaking of feature/bug integration
●   Short-hand notation: b#
●   OpenSolaris and SXCE are based on NV
●   ZFS development done for NV
     –   Bi-weekly build cycle
     –   Schedule at http://opensolaris.org/os/community/on/schedule/
●   ZFS is ported to Solaris 10 and other OSes
Foundations


Overhead View of a Pool

   [Figure: a pool contains configuration information and datasets –
    file systems and volumes]
Layer View

   raw, swap, dump, iSCSI, ??      |   ZFS, NFS, CIFS, ??
   ZFS Volume Emulator (Zvol)      |   ZFS POSIX Layer (ZPL)   |   pNFS, Lustre, ??
                      Transactional Object Layer
                        Pooled Storage Layer
                         Block Device Driver
                        HDD, SSD, iSCSI, ??
Source Code Structure
   [Figure: consumers (file system, device, GUI via JNI, management via
    libzfs) sit above the kernel interface layer (ZPL, ZVol, /dev/zfs),
    which sits above the transactional object layer (ZIL, ZAP, traversal,
    DMU, DSL) and the pooled storage layer (ARC, ZIO, VDEV, configuration)]
Acronyms
 ●   ARC – Adaptive Replacement Cache
 ●   DMU – Data Management Unit
 ●   DSL – Dataset and Snapshot Layer
 ●   JNI – Java Native Interface
 ●   VDEV – Virtual Device layer
 ●   ZAP – ZFS Attribute Processor
 ●   ZIL – ZFS Intent Log
 ●   ZIO – ZFS I/O layer
 ●   ZPL – ZFS POSIX Layer (traditional file system interface)
 ●   Zvol – ZFS volume (raw/cooked block device interface)
nvlists

●   name=value pairs
●   libnvpair(3LIB)
●   Allows ZFS capabilities to change without changing the physical
    on-disk format
●   Data stored is XDR encoded
●   A good thing, used often
Versioning
 ●   Features can be added and identified by nvlist entries
 ●   Changes in pool or dataset versions do not change the physical on-disk
     format (!)
      –   does change nvlist parameters
 ●   Older versions can be used
      –   might see warning messages, but harmless
 ●   Available versions and features can be easily viewed
      –   zpool upgrade -v
      –   zfs upgrade -v
 ●   Online references
      –   zpool: www.opensolaris.org/os/community/zfs/version/N
      –   zfs: www.opensolaris.org/os/community/zfs/version/zpl/N

 Don't confuse zpool and zfs versions
zpool versions
VER    DESCRIPTION
---    --------------------------------------------------------
 1     Initial ZFS version
 2     Ditto blocks (replicated metadata)
 3     Hot spares and double parity RAID-Z
 4     zpool history
 5     Compression using the gzip algorithm
 6     bootfs pool property
 7     Separate intent log devices
 8     Delegated administration
 9     refquota and refreservation properties
 10    Cache devices
 11    Improved scrub performance
 12    Snapshot properties
 13    snapused property
 14    passthrough-x aclinherit support
 15    user and group quotas
 16    COMSTAR support
zfs versions
VER    DESCRIPTION
---    --------------------------------------------------------
 1     Initial ZFS filesystem version
 2     Enhanced directory entries
 3     Case insensitive and File system unique identifier (FUID)
 4     user and group quotas
Copy on Write
   [Figure: four stages of copy on write –
    1. Initial block tree    2. COW some data
    3. COW metadata          4. Update uberblocks & free]
COW Notes
 ●   COW works on blocks, not files
 ●   ZFS reserves 32 MBytes or
     1/64 of pool size
        –   COWs need some free space to remove files
        –   need space for ZIL
 ●   For fixed-record size workloads, “fragmentation” and “poor
     performance” can occur if the recordsize is not matched
 ●   Spatial distribution is good fodder for performance speculation
        –   affects HDDs
        –   moot for SSDs
Pooled Storage Layer

   [Layer view figure, repeated – the Pooled Storage Layer sits between the
    Transactional Object Layer and the block device drivers]
vdevs – Virtual Devices
   [Figure: a tree of logical vdevs – the root vdev has top-level vdev
    children (e.g. mirrors), and each top-level vdev has physical or leaf
    vdev children (type=disk)]
vdev Labels

●   vdev labels != disk labels
●   4 labels written to every physical vdev
●   Label size = 256kBytes
●   Two-stage update process
     –   write label0 & label2
     –   check for errors
     –   write label1 & label3

    0        256k     512k        4M                    N-512k   N-256k   N
    label0   label1   Boot Block   ...                  label2   label3
vdev Label Contents

    0        256k     512k        4M                    N-512k   N-256k   N
    label0   label1   Boot Block   ...                  label2   label3

    Label layout:
    0 – 8k        Blank
    8k – 16k      Boot Header
    16k – 128k    Name=Value Pairs
    128k – 256k   128-slot Uberblock Array
Observing Labels
# zdb -l /dev/rdsk/c0t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    name='rpool'
    state=0
    txg=13152
    pool_guid=17111649328928073943
    hostid=8781271
    hostname=''
    top_guid=11960061581853893368
    guid=11960061581853893368
    vdev_tree
        type='disk'
        id=0
        guid=11960061581853893368
        path='/dev/dsk/c0t0d0s0'
        devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a'
        phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a'
        whole_disk=0
        metaslab_array=24
        metaslab_shift=30
        ashift=9
        asize=157945167872
        is_log=0
Uberblocks

●   1 kByte
●   Stored in 128-entry circular queue
●   Only one uberblock is active at any time
     –   highest transaction group number
     –   correct SHA-256 checksum
●   Stored in machine's native format
     –   A magic number is used to determine endian format when imported
●   Contains pointer to MOS
MOS – Meta Object Set

●   Only one MOS per pool
●   Contains object directory pointers
     –   root_dataset – references all top-level datasets in the pool
     –   config – nvlist describing the pool configuration
     –   sync_bplist – list of block pointers which need to be freed during
         the next transaction
Block Pointers
 ●   blkptr_t structure
 ●   128 bytes
 ●   contents:
        –   3x data virtual address (DVA)
        –   endianness
        –   level of indirection
        –   DMU object type
        –   checksum function
        –   compression function
        –   physical size
        –   logical size
        –   birth txg
        –   fill count
        –   checksum (256 bits)
DVA – Data Virtual Address

●   Contains
     –   vdev id
     –   offset in sectors
     –   grid (future)
     –   allocated size
     –   gang block indicator
●   Physical block address = (offset << 9) + 4 MBytes
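
 Worked example (illustrative numbers, not from the slides): a DVA offset of
 8,192 sectors maps to byte address (8192 << 9) + 4 MBytes = 4,194,304 +
 4,194,304 = 8,388,608 – offsets are counted from the end of the 4 MByte
 label/boot region at the front of the vdev.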
Gang Blocks

●   Gang blocks contain block pointers
●   Used when space requested is not available in a contiguous block
●   512 bytes
●   self checksummed
●   contains 3 block pointers
To fsck or not to fsck
●   fsck was created to fix known inconsistencies in file system metadata
      –   UFS is not transactional – metadata inconsistencies must be reconciled
      –   does NOT repair data – how could it?
●   ZFS doesn't need fsck, as-is
      –   all on-disk changes are transactional
      –   COW means previously existing, consistent metadata is not overwritten
      –   ZFS can repair itself
           ●   metadata is at least dual-redundant
           ●   data can also be redundant
●   Reality check
      –   this does not mean that ZFS is not susceptible to corruption
      –   nor is any other file system
VDEV


Dynamic Striping
 ●   RAID-0
        –   SNIA definition: fixed-length sequences of virtual disk data
            addresses are mapped to sequences of member disk addresses in a
            regular rotating pattern
 ●   Dynamic Stripe
        –   Data is dynamically mapped to member disks
        –   No fixed-length sequences
        –   Allocate up to ~1 MByte/vdev before changing vdev
        –   vdevs can be different size
        –   Good combination of the concatenation feature with RAID-0
            performance
Dynamic Striping

    [Figure comparing allocation patterns:
     RAID-0 – column size = 128 kBytes, stripe width = 384 kBytes
     ZFS Dynamic Stripe – recordsize = 128 kBytes
     Total write size = 2816 kBytes]
Mirroring
 ●   Straightforward: put N copies of the data on N vdevs
 ●   Unlike RAID-1
        –   No 1:1 mapping at the block level
        –   vdev labels are still at beginning and end
        –   vdevs can be of different size
             ●   effective space is that of smallest vdev
 ●   Arbitration: ZFS does not blindly trust either side of mirror
        –   Most recent, correct view of data wins
        –   Checksums validate data
Dynamic vdev Replacement

●    zpool replace poolname vdev [vdev]
●    Today, replacing vdev must be same size or larger (as measured by blocks)
●    Replacing all vdevs in a top-level vdev with larger vdevs results in
     (automatic?) top-level vdev resizing

     [Figure: a 10G + 15G mirror replaced step-by-step with 20G disks grows
      from a 10G mirror to a 20G mirror]
RAIDZ
 ●   RAID-5
        –   Parity check data is distributed across the RAID array's disks
        –   Must read/modify/write when data is smaller than stripe width
 ●   RAIDZ
        –   Dynamic data placement
        –   Parity added as needed
        –   Writes are full-stripe writes
        –   No read/modify/write (no write hole)
 ●   Arbitration: ZFS does not blindly trust any device
        –   Does not rely on disk reporting read error
        –   Checksums validate data
        –   If checksum fails, read parity

 Space used is dependent on how used
RAID-5 vs RAIDZ

                DiskA   DiskB   DiskC   DiskD   DiskE
                 D0:0    D0:1    D0:2    D0:3     P0
    RAID-5        P1     D1:0    D1:1    D1:2    D1:3
                 D2:3     P2     D2:0    D2:1    D2:2
                 D3:2    D3:3     P3     D3:0    D3:1

                DiskA   DiskB   DiskC   DiskD   DiskE
                  P0     D0:0    D0:1    D0:2    D0:3
    RAIDZ         P1     D1:0    D1:1    P2:0    D2:0
                 D2:1    D2:2    D2:3    P2:1    D2:4
                 D2:5
RAID-5 Write Hole
 ●   Occurs when data to be written is smaller than stripe size
 ●   Must read unallocated columns to recalculate the parity, or the parity
     must be read/modify/written
 ●   Read/modify/write is risky for consistency
        –   Multiple disks
        –   Reading independently
        –   Writing independently
        –   System failure before all writes are complete to media could
            result in data loss
 ●   Effects can be hidden from the host using a RAID array with nonvolatile
     write cache, but the extra I/O cannot be hidden from the disks
RAIDZ2
 ●   RAIDZ2 = double parity RAIDZ
        –   Can recover data if any 2 leaf vdevs fail
 ●   Sorta like RAID-6
        –   Parity 1: XOR
        –   Parity 2: another Reed-Solomon syndrome
 ●   More computationally expensive than RAIDZ
 ●   Arbitration: ZFS does not blindly trust any device
        –   Does not rely on disk reporting read error
        –   Checksums validate data
        –   If data not valid, read parity
        –   If data still not valid, read other parity

 Space used is dependent on how used
Evaluating Data Retention

●   MTTDL = Mean Time To Data Loss
●   Note: MTBF is not constant in the real world, but it keeps the math simple
●   MTTDL[1] is a simple MTTDL model
●   No parity (single vdev, striping, RAID-0)
     –   MTTDL[1] = MTBF / N
●   Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
     –   MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
●   Double Parity (3-way mirror, RAIDZ2, RAID-6)
     –   MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
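
 Worked example (illustrative numbers, not from the slides): a 2-way mirror
 with MTBF = 1,000,000 hours and MTTR = 24 hours gives
 MTTDL[1] = 1,000,000^2 / (2 * 1 * 24) ≈ 2.1 x 10^10 hours, while the same
 two disks striped give only MTBF / N = 500,000 hours.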
Another MTTDL Model
 ●   MTTDL[1] model doesn't take into account unrecoverable read
 ●   But unrecoverable reads (UER) are becoming the dominant failure mode
        –   UER specified as errors per bits read
        –   More bits = higher probability of loss per vdev
 ●   MTTDL[2] model considers UER
Why Worry about UER?

●   Richard's study
     –   3,684 hosts with 12,204 LUNs
     –   11.5% of all LUNs reported read errors
●   Bairavasundaram et al., FAST08
    www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
     –   1.53M LUNs over 41 months
     –   RAID reconstruction discovers 8% of checksum mismatches
     –   4% of disks studied developed checksum errors over 17 months
Why Worry about UER?

●   RAID array study

   [Figure: RAID array study results]
MTTDL[2] Model

●   Probability that a reconstruction will fail
     –   Precon_fail = (N-1) * size / UER
●   Model doesn't work for non-parity schemes (single vdev, striping, RAID-0)
●   Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
     –   MTTDL[2] = MTBF / (N * Precon_fail)
●   Double Parity (3-way mirror, RAIDZ2, RAID-6)
     –   MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
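
 Worked example (illustrative numbers, not from the slides): a 2-way mirror
 of 1 TByte disks with UER = 1 error per 10^15 bits read and MTBF =
 1,000,000 hours gives Precon_fail = (2-1) * 8 x 10^12 / 10^15 = 0.008, so
 MTTDL[2] = 1,000,000 / (2 * 0.008) = 6.25 x 10^7 hours – far below the
 MTTDL[1] estimate, because a rebuild can trip over an unrecoverable read.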
Practical View of MTTDL[1]

   [Figure: MTTDL[1] for practical configurations]
MTTDL Models: Mirror

   [Figure: MTTDL model results for mirrored configurations]
MTTDL Models: RAIDZ2

   [Figure: MTTDL model results for RAIDZ2 configurations]
Ditto Blocks

●   Recall that each blkptr_t contains 3 DVAs
●   Allows up to 3 physical copies of the data



       ZFS copies parameter    Data copies    Metadata copies
       default                 1              2
       copies=2                2              3
       copies=3                3              3
Copies

●   Dataset property used to indicate how many copies (aka ditto blocks)
    of data are desired
     –   Write all copies
     –   Read any copy
     –   Recover corrupted read from a copy
●   By default
     –   data copies = 1
     –   metadata copies = data copies + 1, max 3
●   Not a replacement for mirroring
●   Easier to describe in pictures...
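
 A minimal CLI sketch (dataset name is hypothetical; copies affects only
 future writes):
   # keep two copies of user data in this dataset
   zfs set copies=2 tank/important
   zfs get copies tank/important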
Copies in Pictures

   [Figure: ditto block placement across the pool's vdevs]
ZIO – ZFS I/O Layer


ZIO Framework
 ●   All physical disk I/O goes through ZIO Framework
 ●   Translates DVAs into Logical Block Addresses (LBAs) on leaf vdevs
        –   Keeps free space maps (spacemap)
        –   If contiguous space is not available:
             ●   Allocate smaller blocks (the gang)
             ●   Allocate gang block, pointing to the gang
 ●   Implemented as multi-stage pipeline
        –   Allows extensions to be added fairly easily
 ●   Handles I/O errors
SpaceMap from Space

   [Figure: visualization of a pool's spacemaps]
ZIO Write Pipeline
   Stages (columns: ZIO state, compression, crypto, checksum, DVA, vdev I/O):
      open
      compress (if savings > 12.5%)
      encrypt
      generate checksum
      allocate DVA
      vdev I/O: start → done → assess
      done

   Gang activity elided, for clarity
ZIO Read Pipeline
   Stages (columns: ZIO state, compression, crypto, checksum, DVA, vdev I/O):
      open
      vdev I/O: start → done → assess
      verify checksum
      decrypt
      decompress
      done

   Gang activity elided, for clarity
VDEV – Virtual Device Subsystem
 ●   Where mirrors, RAIDZ, and RAIDZ2 are implemented
        –   Surprisingly few lines of code needed to implement RAID
 ●   Leaf vdev (physical device) I/O management
        –   Number of outstanding iops
        –   Read-ahead cache
 ●   Priority scheduling

        Name          Priority
        NOW           0
        SYNC_READ     0
        SYNC_WRITE    0
        FREE          0
        CACHE_FILL    0
        LOG_WRITE     0
        ASYNC_READ    4
        ASYNC_WRITE   4
        RESILVER      10
        SCRUB         20
ARC – Adaptive
                Replacement Cache


Object Cache
 ●   UFS uses page cache managed by the virtual memory system
 ●   ZFS does not use the page cache, except for mmap'ed files
 ●   ZFS uses an Adaptive Replacement Cache (ARC)
 ●   ARC used by DMU to cache DVA data objects
 ●   Only one ARC per system, but caching policy can be changed on a
     per-dataset basis
 ●   Seems to work much better than the page cache ever did for UFS
Traditional Cache

●   Works well when data being accessed was recently added
●   Doesn't work so well when frequently accessed data is evicted

   [Figure: a simple MRU→LRU cache – misses cause inserts at the MRU end,
    the oldest entry is evicted from the LRU end; dynamic caches change
    size by either not evicting or aggressively evicting]
ARC – Adaptive Replacement Cache

   [Figure: the ARC keeps a “recent” list for single-use entries and a
    “frequent” list for multiply-accessed entries; a miss inserts at the
    MRU end of the recent list, a hit promotes the entry to the frequent
    list, and evictions/dynamic resizing choose the best list to shrink]
ZFS ARC – Adaptive Replacement Cache with Locked Pages

   [Figure: same structure as the ARC, except locked pages cannot be
    evicted; a hit within 62 ms promotes an entry to the frequent list;
    the ZFS ARC handles mixed-size pages]
ARC Directory
 ●   Each ARC directory entry contains arc_buf_hdr structs
        –   Info about the entry
        –   Pointer to the entry
 ●   Directory entries have size, ~200 bytes
 ●   ZFS block size is dynamic: 512 bytes – 128 kBytes
 ●   Disks are large
 ●   Suppose we use a Seagate LP 2 TByte disk for the L2ARC
        –   Disk has 3,907,029,168 512-byte sectors, guaranteed
        –   Workload uses 8 kByte fixed record size
        –   RAM needed for arc_buf_hdr entries:
            3,907,029,168 * 200 / 16 = 45 GBytes
 ●   Don't underestimate the RAM needed for large L2ARCs
L2ARC – Level 2 ARC

●   ARC evictions are sent to cache vdev
●   ARC directory remains in memory
●   Works well when cache vdev is optimized for fast reads
     –   lower latency than pool disks
     –   inexpensive way to “increase memory”
●   Content considered volatile, no ZFS data protection allowed
●   Monitor usage with zpool iostat

   [Figure: evicted ARC data flows to one or more “cache” vdevs]
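
 A minimal CLI sketch of adding a cache device (pool and device names are
 hypothetical):
   # add an SSD as an L2ARC cache vdev, then watch its usage
   zpool add tank cache c4t0d0
   zpool iostat -v tank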
ARC Tips
 ●   In general, it seems to work well for most workloads
 ●   ARC size will vary, based on usage
 ●   Internals tracked by kstats in Solaris
        –   Use memory_throttle_count to observe pressure to evict
 ●   Can limit at boot time
        –   Solaris: set zfs:zfs_arc_max in /etc/system
 ●   Performance
        –   Prior to b107, L2ARC fill rate was limited to 8 MBytes/s

 L2ARC keeps its directory in kernel memory
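
 A sketch of capping the ARC at boot on Solaris (the 4 GByte value is
 illustrative):
   * /etc/system: limit the ARC to 4 GBytes (value in bytes)
   set zfs:zfs_arc_max = 4294967296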
Transactional Object
                      Layer


Source Code Structure (repeated)

   [Source code structure figure, repeated – this section covers the
    transactional object layer: ZIL, ZAP, traversal, DMU, DSL]
ZAP – ZFS Attribute Processor
●   Module sits on top of DMU
●   Important component for managing everything
●   Operates on ZAP objects
     –   Contain name=value pairs
●   FatZAP
     –   Flexible architecture for storing large numbers of attributes
●   MicroZAP
     –   Lightweight version of fatzap
     –   Uses 1 block
     –   All name=value pairs must fit in block
     –   Names <= 50 chars (including NULL terminator)
     –   Values are type uint64_t
DMU – Data Management Layer

●   Datasets issue transactions to the DMU
●   Transaction-based object model
●   Transactions are
     –   Atomic
     –   Grouped (txg = transaction group)
●   Responsible for on-disk data
●   ZFS Attribute Processor (ZAP)
●   Dataset and Snapshot Layer (DSL)
●   ZFS Intent Log (ZIL)
Transaction Engine
 ●   Manages physical I/O
 ●   Transactions grouped into transaction group (txg)
        –   txg updates
        –   All-or-nothing
        –   Commit interval
             ●   Older versions: 5 seconds (zfs_…)
             ●   Now: 30 seconds max, dynamically scaled based on the time
                 required to commit a txg
 ●   Delay committing data to physical storage
        –   Improves performance
        –   A bad thing for sync workloads – hence the ZFS Intent Log (ZIL)

 30 second delay could impact failure detection time
ZIL – ZFS Intent Log

●   DMU is transactional, and likes to group I/O into transactions for later
    commits, but still needs to handle the “write it now” desire of sync writers
     –   NFS
     –   Databases
●   If I/O < 32 kBytes
     –   write it (now) to the ZIL (allocated from the pool)
     –   write it later as part of the txg commit
●   If I/O > 32 kBytes, write it to the pool now
     –   Should be faster for large, sequential writes
●   Never read, except at import (e.g. reboot), when transactions may need
    to be rolled forward
Separate Logs (slogs)

●   ZIL competes with pool for iops
     –   Applications will wait for sync writes to be on nonvolatile media
     –   Very noticeable on HDD JBODs
●   Put ZIL on a separate vdev, outside of the pool
     –   ZIL writes tend to be sequential
     –   No competition with the pool for iops
     –   Downside: slog device required to be operational at import
●   10x or more performance improvements possible
     –   Better if using write-optimized SSD or non-volatile write cache
         on a RAID array
●   Use zilstat to observe ZIL activity
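
 A minimal CLI sketch of adding a slog (pool and device names hypothetical):
   # add a mirrored pair of write-optimized SSDs as a separate intent log
   zpool add tank log mirror c4t1d0 c4t2d0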
DSL – Dataset and
                 Snapshot Layer


Copy on Write (repeated)

   [Figure: four stages of copy on write –
    1. Initial block tree    2. COW some data
    3. COW metadata          4. Update uberblocks & free]
zfs snapshot
 ●   Create a read-only, point-in-time window into the dataset (file system
     or Zvol)
 ●   Computationally free, because of the COW architecture
 ●   Very handy feature
        –   Patching/upgrades
        –   Basis for Time Slider
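
 A minimal CLI sketch (dataset and snapshot names hypothetical):
   # snapshot one file system, then snapshot a whole tree recursively
   zfs snapshot tank/home@monday
   zfs snapshot -r tank@backup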
Snapshot

   [Figure: the snapshot tree root and the current tree root share all
    unmodified blocks]

 ●   Create a snapshot by not freeing COWed blocks
 ●   Snapshot creation is fast and easy
 ●   Number of snapshots determined by use – no hardwired limit
 ●   Recursive snapshots also possible
Clones

●   Snapshots are read-only
●   Clones are read-write based upon a snapshot
●   Child depends on parent
     –   Cannot destroy parent without destroying all children
     –   Can promote children to be parents
●   Good ideas
     –   OS upgrades
     –   Change control
     –   Replication
          ●   zones
          ●   virtual disks
zfs clone
 ●   Create a read-write file system from a read-only snapshot
 ●   Used extensively for OpenSolaris upgrades


   [Figure: an OS rev1 file system is snapshotted, cloned, and the clone
    upgraded to OS rev2; the boot manager can boot either]

 Origin snapshot cannot be destroyed, if clone exists
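
 A minimal CLI sketch (names hypothetical):
   # create a writable clone from a read-only snapshot
   zfs clone tank/home@monday tank/home_test
   # later, reverse the parent/child relationship
   zfs promote tank/home_test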
zfs promote


   [Figure: rpool/ROOT/b105 is a clone of rpool/ROOT/b104@today; after
    zfs promote rpool/ROOT/b105, the parent/child relationship is reversed]
zfs rollback


   [Figure: zfs rollback discards changes made to rpool/ROOT/b104 since
    the rpool/ROOT/b104@today snapshot]
Commands


zpool(1m)

   [Layer view figure, repeated – zpool(1m) manages pools at the pooled
    storage layer]
Dataset & Snapshot Layer
 ●   Object
        –   Allocated storage
        –   dnode describes collection of blocks
 ●   Object Set
        –   Group of related objects
 ●   Dataset
        –   Snapmap: snapshot relationships
        –   Space usage
 ●   Dataset directory
        –   Childmap: dataset relationships
        –   Properties
zpool create
 ●   zpool create poolname vdev-configuration
        –    vdev-configuration examples
             ●   mirror c0t0d0 c3t6d0
             ●   mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
             ●   mirror disk1s0 disk2s0 cache disk4s0 log disk5
             ●   raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
 ●   Solaris
        –   Additional checks to see if disk/slice overlaps or is currently
            in use
        –   Whole disks are given EFI labels
 ●   Can set initial pool or dataset properties
 ●   By default, creates a file system with the same name
        –   poolname pool → /poolname file system

 People get confused by a file system with the same name as the pool
zpool destroy
 ●   Destroy the pool and all datasets therein
 ●   zpool destroy poolname
 ●   Can (try to) force with “-f”
 ●   There is no “are you sure?” prompt – if you weren't sure, you would
     not have typed “destroy”

 zpool destroy is destructive... really! Use with caution!
zpool add
 ●   Adds a device to the pool as a top-level vdev
 ●   zpool add poolname vdev-configuration
 ●   vdev-configuration can be any combination also used for zpool create
 ●   Complains if the added vdev-configuration would cause a different
     data protection scheme than is already in use
        –   use “-f” to override
 ●   Good idea: try with “-n” flag first
        –   will show final configuration without actually performing the add

 Do not add a device which is in use as a quorum device
zpool remove
 ●   Remove a top-level vdev from the pool
 ●   zpool remove poolname vdev
 ●   Today, you can only remove the following vdevs:
        –   cache
        –   hot spare
 ●   An RFE is open to allow removal of other top-level vdevs

 Don't confuse “remove” with “detach”
zpool attach
 ●   Attach a vdev as a mirror to an existing vdev
 ●   zpool attach poolname existing-vdev vdev
 ●   Attaching vdev must be the same size or larger than the existing vdev
 ●   Note: today this is not available for RAIDZ or RAIDZ2 vdevs

        Configuration          Allowed?
        simple vdev → mirror   ok
        mirror                 ok
        log → mirrored log     ok
        RAIDZ                  no
        RAIDZ2                 no

 “Same size” literally means the same number of blocks. Beware that many
 “same size” disks have different numbers of available blocks.
zpool detach
 ●   Detach a vdev from a mirror
 ●   zpool detach poolname vdev
 ●   A resilvering vdev will wait until resilvering is complete
zpool replace
 ●   Replaces an existing vdev with a new vdev
 ●   zpool replace poolname existing-vdev vdev
 ●   Effectively, a shorthand for “zpool attach” followed by “zpool detach”
 ●   Attaching vdev must be the same size or larger than the existing vdev
 ●   Works for any top-level vdev-configuration: simple vdev, mirror, log,
     RAIDZ, and RAIDZ2

 “Same size” literally means the same number of blocks. Beware that many
 “same size” disks have different numbers of available blocks.
zpool import
 ●   Import a pool and mount all mountable datasets
 ●   Import a specific pool
       – zpool import poolname
       – zpool import GUID
 ●   Scan LUNs for pools which may be imported
       – zpool import
 ●   Can set options, such as alternate root directory or other properties

 Beware of zpool.cache interactions
 Beware of artifacts, especially partial artifacts
zpool export
 ●   Unmount datasets and export the pool
 ●   zpool export poolname
 ●   Removes pool entry from zpool.cache
zpool upgrade
 ●   Display current versions
      – zpool upgrade
 ●   View available upgrade versions, with features, but don't actually
     upgrade
       – zpool upgrade -v
 ●   Upgrade pool to latest version
       – zpool upgrade poolname
 ●   Upgrade pool to specific version
       – zpool upgrade -V version poolname

 Once you upgrade, there is no downgrade
zpool history
 ●   Show history of changes made to the pool

  # zpool history rpool
  History for 'rpool':
  2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy
      -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
  2009-03-04.07:29:47 zfs set canmount=noauto rpool
  2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
  2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
  2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
  2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
  2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
  2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
  2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
  2009-03-04.07:29:51 zfs set canmount=on rpool
  2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
  2009-03-04.07:29:51 zfs create rpool/export/home
  2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
  2009-03-04.00:21:42 zpool export rpool
  2009-03-04.08:47:08 zpool set bootfs=rpool rpool
  2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
  2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
  2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108
  ...
zpool status
 ●   Shows the status of the current pools, including their configuration
 ●   Important troubleshooting step

  # zpool status
  ...
    pool: stuff
   state: ONLINE
  status: The pool is formatted using an older on-disk format. The pool can
          still be used, but some features are unavailable.
  action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
          pool will no longer be accessible on older software versions.
   scrub: none requested
  config:

          NAME          STATE   READ WRITE CKSUM
          stuff         ONLINE     0     0     0
            mirror      ONLINE     0     0     0
              c0t2d0s0  ONLINE     0     0     0
              c0t0d0s7  ONLINE     0     0     0

  errors: No known data errors

 Understanding status output error messages can be tricky
zpool clear
 ●   Clears device errors
 ●   Clears device error counters
 ●   Improves sysadmin sanity and reduces sweating
zpool iostat
 ●   Show pool physical I/O activity, in an iostat-like manner
 ●   Solaris: fsstat will show I/O activity looking into a ZFS file system
 ●   Especially useful for showing slog activity

  # zpool iostat -v
                 capacity     operations     bandwidth
  pool         used  avail   read  write    read  write
  rpool        16.5G  131G      0      0   1.16K  2.80K
    c0t0d0s0   16.5G  131G      0      0   1.16K  2.80K
  stuff         135G 14.4G      0      5   2.09K  27.3K
    mirror      135G 14.4G      0      5   2.09K  27.3K
      c0t2d0s0     -     -      0      3   1.25K  27.5K
      c0t0d0s7     -     -      0      2   1.27K  27.5K

 Unlike iostat, does not show latency
zpool scrub
 ●   Manually starts scrub
      – zpool scrub poolname
 ●   Scrubbing performed in background
 ●   Use zpool status to track scrub progress
 ●   Stop scrub
       – zpool scrub -s poolname

 Estimated scrub completion time improves over time
zfs(1m)
 ●    Manages file systems (ZPL) and Zvols
 ●    Can proxy to other, related commands
        –    iSCSI, NFS, CIFS

   [Layer view figure, repeated – zfs(1m) manages the ZPL and Zvol layers]
zfs create, destroy

●   By default, a file system with the same name as the pool is created by
    zpool create
●   Name format is: pool/name[/name ...]
●   File system
     –   zfs create fs-name
     –   zfs destroy fs-name
●   Zvol
     –   zfs create -V size vol-name
     –   zfs destroy vol-name
●   Parameters can be set at create time
zfs mount, unmount

●   Note: mount point is a file system parameter
     –   zfs get mountpoint fs-name
●   Rarely used subcommand (!)
●   Display mounted file systems
     –   zfs mount
●   Mount a file system
     –   zfs mount fs-name
     –   zfs mount -a
●   Unmount
     –   zfs unmount fs-name
     –   zfs unmount -a
zfs list
 ●   List mounted datasets
 ●   Old versions: listed everything
 ●   New versions: do not list snapshots
 ●   Examples
      –   zfs list
      –   zfs list -t snapshot
      –   zfs list -H -o name
zfs send, receive
 ●   Send
        –   send a snapshot to stdout
        –   data is decompressed
 ●   Receive
        –   receive a snapshot from stdin
        –   receiving file system parameters apply (compression, et al.)
 ●   Can incrementally send snapshots in time order
 ●   Handy way to replicate dataset snapshots
 ●   NOT a replacement for traditional backup solutions
        –   All-or-nothing design per snapshot
        –   In general, does not send files (!)
        –   Today, no per-file management

 Send streams from b35 (or older) no longer supported after b89
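
 A minimal CLI sketch (pool and snapshot names hypothetical):
   # full send of a snapshot into another pool
   zfs send tank/home@monday | zfs receive backup/home
   # incremental send of only the changes between two snapshots
   zfs send -i tank/home@monday tank/home@tuesday | zfs receive backup/home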
zfs rename
 ●   Renames a file system, volume, or snapshot
        –   zfs rename export/home/relling export/home/richard

zfs upgrade
 ●   Display current versions
      – zfs upgrade
 ●   View available upgrade versions, with features, but don't actually
     upgrade
       – zfs upgrade -v
 ●   Upgrade dataset to latest version
       – zfs upgrade dataset
 ●   Upgrade dataset to specific version
       – zfs upgrade -V version dataset

 Once you upgrade, there is no downgrade
Sharing


Sharing

●   zfs share dataset
●   Type of sharing set by parameters
     –   shareiscsi = [on | off]
     –   sharenfs = [on | off | options]
     –   sharesmb = [on | off | options]
●   Shortcut to manage sharing
     –   Uses external services (nfsd, COMSTAR, etc.)
     –   Importing pool will also share
     –   May vary by OS
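
 A minimal CLI sketch (dataset names hypothetical):
   # share one file system over NFS read-write, another over CIFS
   zfs set sharenfs=rw tank/export/home
   zfs set sharesmb=on tank/Shared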
NFS
 ●   ZFS file systems work as expected
        –   use ACLs based on NFSv4 ACLs
 ●   Parallel NFS, aka pNFS, aka NFSv4.1
        –   Still a work-in-progress
        –   http://opensolaris.org/os/project/nfsv41/
        –   zfs create -t pnfsdata mypnfsdata

   [Figure: a pNFS client talks to a pNFS metadata server and to pNFS data
    servers, each holding a pnfsdata dataset in its pool]
CIFS

●   UID mapping
●   casesensitivity parameter
     –   Good idea, set when file system is created
     –   zfs create -o casesensitivity=insensitive mypool/Shared
●   Shadow Copies for Shared Folders (VSS) supported
     –   CIFS clients cannot create shadow copies remotely (yet)

 CIFS features vary by OS, Samba, etc.
iSCSI
 ●   SCSI over IP
 ●   Block-level protocol
 ●   Uses Zvols as storage
 ●   Solaris has 2 iSCSI target implementations
        –   shareiscsi enables the old, klunky iSCSI target
        –   To use COMSTAR, enable using itadm(1m)
        –   b116 adds COMSTAR support (zpool version 16)
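
 A minimal CLI sketch using the old target (Zvol name hypothetical):
   # create a 10 GByte Zvol and share it as an iSCSI LUN
   zfs create -V 10g tank/vol1
   zfs set shareiscsi=on tank/vol1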
Properties


Properties
 ●   Properties are stored in an nvlist
 ●   By default, are inherited
 ●   Some properties are common to all datasets, but a specific dataset
     type may have additional properties
 ●   Easily set or retrieved via scripts
 ●   In general, properties affect future file system activity

 zpool get doesn't script as nicely as zfs get
User-defined Properties
 ●   Names
        –   Must include colon ':'
        –   Can contain lower case alphanumerics or “+” “.” “_”
        –   Max length = 256 characters
        –   By convention, module:property
             ●   com.sun:auto-snapshot
 ●   Values
        –   Max length = 1024 characters
 ●   Examples
        –   com.sun:auto-snapshot=true
        –   com.richardelling:important_files=true
set & get properties
 ●   Set
        –   zfs set compression=on export/home/relling
 ●   Get
        –   zfs get compression export/home/relling
 ●   Reset to inherited value
        –   zfs inherit compression export/home/relling
 ●   Clear user-defined parameter
        –   zfs inherit com.sun:auto-snapshot export/home/relling
Pool Properties
   Property        Change?    Brief Description
   altroot                    Alternate root directory (ala chroot)
   autoreplace                vdev replacement policy
   available       readonly   Available storage space
   bootfs                     Default bootable dataset for root pool
   cachefile                  Cache file to use other than
                              /etc/zfs/zpool.cache
   capacity        readonly   Percent of pool space used
   delegation                 Master pool delegation switch
   failmode                   Catastrophic pool failure policy
   guid            readonly   Unique identifier
   health          readonly   Current health of the pool
   listsnapshots              zfs list policy
   size            readonly   Total size of pool
   used            readonly   Amount of space used
   version         readonly   Current on-disk version
Common Dataset Properties
   Property         Change?    Brief Description
   available        readonly   Space available to dataset & children
   checksum                    Checksum algorithm
   compression                 Compression algorithm
   compressratio    readonly   Compression ratio – logical
                               size:referenced physical
   copies                      Number of copies of user data
   creation         readonly   Dataset creation time
   origin           readonly   For clones, origin snapshot
   primarycache                ARC caching policy
   readonly                    Is dataset in readonly mode?
   referenced       readonly   Size of data accessible by this dataset
   refreservation              Max space guaranteed to a dataset, not
                               including descendants (snapshots & clones)
   reservation                 Minimum space guaranteed to dataset,
                               including descendants
Common Dataset Properties
   Property               Change?    Brief Description
   secondarycache                    L2ARC caching policy
   type                   readonly   Type of dataset (filesystem,
                                     snapshot, volume)
   used                   readonly   Sum of usedby* (see below)
   usedbychildren         readonly   Space used by descendants
   usedbydataset          readonly   Space used by dataset
   usedbyrefreservation   readonly   Space used by a refreservation for
                                     this dataset
   usedbysnapshots        readonly   Space used by all snapshots of this
                                     dataset
   zoned                  readonly   Is dataset added to non-global zone
                                     (Solaris)
File System Dataset Properties
   Property          Change?    Brief Description
   aclinherit                   ACL inheritance policy, when files or
                                directories are created
   aclmode                      ACL modification policy, when chmod is used
   atime                        Disable access time metadata updates
   canmount                     Mount policy
   casesensitivity   creation   Filename matching algorithm
   devices                      Device opening policy for dataset
   exec                         File execution policy for dataset
   mounted           readonly   Is file system currently mounted?
   nbmand            export/    File system should be mounted with
                     import     non-blocking mandatory locks (CIFS client
                                feature)
   normalization     creation   Unicode normalization of file names for
                                matching
File System Dataset Properties
   Property     Change?    Brief Description
   quota                   Max space dataset and descendants can consume
   recordsize              Suggested maximum block size for files
   refquota                Max space dataset can consume, not including
                           descendants
   setuid                  setuid mode policy
   sharenfs                NFS sharing options
   sharesmb                CIFS sharing options
   snapdir                 Controls whether .zfs directory is hidden
   utf8only     creation   UTF-8 character file name policy
   vscan                   Virus scan enabled
   xattr                   Extended attributes policy
More Goodies...


Dataset Space Accounting
 ●   used = usedbydataset + usedbychildren + usedbysnapshots +
     usedbyrefreservation
 ●   Lazy updates, may not be correct until txg commits
 ●   ls and du will show the size of allocated files, which includes all
     copies of a file
 ●   Shorthand report available

  $ zfs list -o space
  NAME                 AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
  rpool                 126G 18.3G         0   35.5K              0      18.3G
  rpool/ROOT            126G 15.3G         0     18K              0      15.3G
  rpool/ROOT/snv_106    126G 86.1M         0   86.1M              0          0
  rpool/ROOT/snv_b108   126G 15.2G     5.89G   9.28G              0          0
  rpool/dump            126G 1.00G         0   1.00G              0          0
  rpool/export          126G   37K         0     19K              0        18K
  rpool/export/home     126G   18K         0     18K              0          0
  rpool/swap            128G    2G         0    193M          1.81G          0
zfs vs zpool Space Accounting
 ●   zfs list != zpool list
 ●   zfs list shows space used by the dataset plus space for internal
     accounting
 ●   zpool list shows physical space available to the pool
 ●   For simple pools and mirrors, they are nearly the same
 ●   For RAIDZ or RAIDZ2, zpool list will show space available for parity

 Users will be confused about reported space available
Testing

●   ztest
●   fstest




Accessing Snapshots
 ●   By default, snapshots are accessible in .zfs directory
 ●   Visibility of the .zfs directory is tunable via the snapdir property
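
 A minimal CLI sketch (dataset and snapshot names hypothetical):
   # make the .zfs directory visible, then browse a snapshot
   zfs set snapdir=visible tank/home
   ls /tank/home/.zfs/snapshot/monday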
Resilver & Scrub
 ●   Can be read iops bound
 ●   Resilver can also be bandwidth bound to the resilvering device
 ●   Both...
Time-based Resilvering
 ●   Block pointers contain birth txg number
 ●   Resilvering begins with oldest blocks f...
Time Slider – Automatic Snapshots
●   Underpinnings for Solaris feature similar to OSX's Time Machine
●   SMF service for ...
Nautilus

●   File system views which can go back in time




ACL – Access Control List
 ●   Based on NFSv4 ACLs
 ●   Similar to Windows NT ACLs
 ●   Works well with CIFS services
 ●  ...
Checksums
●   Block pointer contains 256 bits for checksum
●   Checksum is in the parent, not in the block itself
●   Types
      – ...
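
 A minimal CLI sketch (dataset name hypothetical; affects future writes):
   # use SHA-256 instead of the default checksum
   zfs set checksum=sha256 tank/important
   zfs get checksum tank/important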
Checksum Use


          Use          Algorithm    Notes
          Uberblock    SHA-256      ...
Checksum Performance
 ●   Metadata – you won't notice
 ●   Data
        –   LZJB is barely noticeable
        –   gzip-9 c...
Compression

●   Builtin
     –   lzjb, Lempel-Ziv by Jeff Bonwick
     –   gzip, levels 1-9
●   Extensible
     –   new c...
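
 A minimal CLI sketch (dataset name hypothetical; affects future writes):
   # enable heavyweight gzip compression, then check the achieved ratio
   zfs set compression=gzip-9 tank/logs
   zfs get compressratio tank/logs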
Encryption
 ●   Placeholder – details TBD
 ●   http://opensolaris.org/os/project/zfs-crypto
 ●   Complicated by:
        –...
Impedance Matching
 ●   RAID arrays & columns
 ●   Label offsets
        –   Older Solaris starting block = 34
        –  ...
Quotas

●   File system quotas
     –   quota includes descendants (snapshots, clones)
     –   refquota does not include descendants
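
 A minimal CLI sketch (names hypothetical; user quotas need zpool version 15):
   # cap the dataset, descendants included, at 10 GBytes
   zfs set quota=10g tank/home/relling
   # cap the dataset alone, not including descendants, at 8 GBytes
   zfs set refquota=8g tank/home/relling
   # per-user quota inside a shared file system
   zfs set userquota@relling=1g tank/home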
zpool.cache
●   Old way
      –   mount /
      –   read /etc/[v]fstab
      –   mount file systems
●   ZFS
      –   impo...
Mounting ZFS File Systems
 ●   By default, mountable file systems are mounted when the pool is
     imported
        –    ...
recordsize

●   Dynamic
     –   Max 128 kBytes
     –   Min 512 Bytes
     –   Power of 2
●   For most workloads, don't worry about it
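
 A minimal CLI sketch for a fixed-record workload (dataset name
 hypothetical), matching the advice in the COW notes:
   # match the dataset recordsize to a database's 8 kByte page size
   zfs set recordsize=8k tank/db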
Delegated Administration

●   Fine grain control
     –   users or groups of users
     –   subcommands, parameters, or sets
Delegation Inheritance
     Beware of inheritance

 ●   Local
        –    zfs allow -l relling snapshot mypool
 ●   Local...
Delegatable Subcommands
 ●   allow                          ●   receive
 ●   clone                          ●   rename
 ● ...
Delegatable Parameters

●   aclinherit        ●   nbmand                     ●   sharenfs
●   aclmode           ●   normal...
Browser User Interface
 ●   Solaris – WebConsole
 ●   Nexenta -
 ●   OSX -
 ●   OpenStorage -




Solaris WebConsole

   [Screenshot: Solaris WebConsole ZFS administration]
Solaris Swap and Dump
 ●   Swap
        –   Solaris does not have automatic swap resizing
        –   Swap as a separate d...
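
 A minimal CLI sketch (size and block size illustrative), consistent with
 the rpool/swap creation shown in the zpool history output:
   # create a 2 GByte Zvol with 4 kByte blocks and add it as swap
   zfs create -b 4096 -V 2048m rpool/swap2
   swap -a /dev/zvol/dsk/rpool/swap2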
Performance



General Comments
 ●   In general, performs well out of the box
 ●   Standard performance improvement techniques apply
 ●  ...
ZIL Performance
 ●   Big performance increases demonstrated, especially with SSDs
 ●   NFS servers
        –   32kByte thr...
vdev Cache
 ●   vdev cache occurs at the SPA level
        –   readahead
        –   10 MBytes per vdev
        –   only c...
Intelligent Prefetching
 ●   Intelligent file-level prefetching occurs at the DMU level
 ●   Feeds the ARC
 ●   In a nutsh...
I/O Queues
 ●   By default, for devices which can support it, 35 iops are queued to
     each vdev
        –   Tunable wit...
COW Penalty
 ●   COW can negatively affect workloads which have updates and
     sequential reads
        –   Initial writ...
COW Penalty




                Performance seems to level at about 25% penalty

 Results compliments of Allan Packer & Ne...
About Disks...
 ●    Disks still the most important performance bottleneck
        –    Modern processors are multi-core
 ...
DirectIO
 ●   UFS forcedirectio option brought the early 1980s design of UFS up to
     the 1990s
 ●   ZFS designed to run...
Hybrid Storage Pool

   [Figure: hybrid storage pool – the SPA combines a separate log device,
    the main pool of disks, and cache devices]
Future Plans
 ●   Announced enhancements
     OpenSolaris Town Hall 2009.06
        –   de-duplication (see also GreenByte...
It's a wrap!



                      Thank You!
                       Questions?
                Richard.Elling@RichardElling.com
Upcoming SlideShare
Loading in...5
×

ZFS Tutorial USENIX June 2009

5,366

Published on

ZFS presentation delivered as a tutorial at the 2009 USENIX technical conference by Richard Elling

Published in: Technology
0 Comments
19 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
5,366
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
19
Embeds 0
No embeds

No notes for slide

Transcript of "ZFS Tutorial USENIX June 2009"

  1. 1. USENIX 2009 ZFS Tutorial Richard.Elling@RichardElling.com
  2. 2. Agenda ● Overview ● Foundations ● Pooled Storage Layer ● Transactional Object Layer ● Commands – zpool – zfs ● Sharing ● Properties ● More goodies ● Performance ● Wrap June 13, 2009 © 2009 Richard Elling 2
  3. 3. History ● Announced September 14, 2004 ● Integration history – SXCE b27 (November 2005) – FreeBSD (April 2007) – Mac OSX Leopard (~ June 2007) – OpenSolaris 2008.05 – Solaris 10 6/06 (June 2006) – Linux FUSE (summer 2006) – greenBytes ZFS+ (September 2008) ● More than 45 patents, contributed to the CDDL Patents Common June 13, 2009 © 2009 Richard Elling 3
  4. 4. Brief List of Features ● Future-proof ● “No silent data corruption ever” ● Cutting-edge data integrity ● “Mind-boggling scalability” ● High performance ● “Breathtaking speed” ● Simplified administration ● “Near zero administration” ● Eliminates need for volume ● “Radical new architecture” managers ● “Greatly simplifies support ● Reduced costs issues” ● Compatibility with POSIX file ● “RAIDZ saves money” system & block devices ● Self-healing Marketing: 2 drink minimum June 13, 2009 © 2009 Richard Elling 4
  5. 5. ZFS Design Goals ● Figure out why storage has gotten so complicated ● Blow away 20+ years of obsolete assumptions ● Gotta replace UFS ● Design an integrated system from scratch ● End the suffering June 13, 2009 © 2009 Richard Elling 5
  6. 6. Limits 248 — Number of entries in any individual directory 256 — Number of attributes of a f le [1] i 256 — Number of f les in a directory [1] i 16 EiB (264 bytes) — Maximum size of a f le system i 16 EiB — Maximum size of a single file 16 EiB — Maximum size of any attribute 264 — Number of devices in any pool 264 — Number of pools in a system 264 — Number of f le systems in a pool i 264 — Number of snapshots of any f le system i 256 ZiB (278 bytes) — Maximum size of any pool [1] actually constrained to 248 for the number of f les in a ZFS f le system i i June 13, 2009 © 2009 Richard Elling 6
  7. 7. Sidetrack: Understanding Builds ● Build is often referenced when speaking of feature/bug integration ● Short-hand notation: b# ● OpenSolaris and SXCE are based on NV ● ZFS development done for NV – Bi-weekly build cycle – Schedule at http://opensolaris.org/os/community/on/schedule/ ● ZFS is ported to Solaris 10 and other OSes June 13, 2009 © 2009 Richard Elling 7
  8. 8. Foundations June 13, 2009 © 2009 Richard Elling 8
  9. 9. Overhead View of a Pool Pool File System Configuration Information Volume File System Volume Dataset June 13, 2009 © 2009 Richard Elling 9
  10. 10. Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 10
  11. 11. Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration June 13, 2009 © 2009 Richard Elling 11
  12. 12. Acronyms ● ARC – Adaptive Replacement Cache ● DMU – Data Management Unit ● DSL – Dataset and Snapshot Layer ● JNI – Java Native InterfaceZPL – ZFS POSIX Layer (traditional file system interface) ● VDEV – Virtual Device layer ● ZAP – ZFS Attribute Processor ● ZIL – ZFS Intent Log ● ZIO – ZFS I/O layer ● Zvol – ZFS volume (raw/cooked block device interface) June 13, 2009 © 2009 Richard Elling 12
  13. 13. nvlists ● name=value pairs ● libnvpair(3LIB) ● Allows ZFS capabilities to change without changing the physical on- disk format ● Data stored is XDR encoded ● A good thing, used often June 13, 2009 © 2009 Richard Elling 13
  14. 14. Versioning ● Features can be added and identified by nvlist entries ● Change in pool or dataset versions do not change physical on-disk format (!) – does change nvlist parameters ● Older-versions can be used – might see warning messages, but harmless ● Available versions and features can be easily viewed – zpool upgrade -v – zfs upgrade -v ● Online references – zpool: www.opensolaris.org/os/community/zfs/version/N – zfs: www.opensolaris.org/os/community/zfs/version/zpl/N Don't confuse zpool and zfs versions June 13, 2009 © 2009 Richard Elling 14
  15. 15. zpool versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit support 15 user and group quotas 16 COMSTAR support June 13, 2009 © 2009 Richard Elling 15
  16. 16. zfs versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS filesystem version 2 Enhanced directory entries 3 Case insensitive and File system unique identifier (FUID) 4 user and group quotas June 13, 2009 © 2009 Richard Elling 16
  17. 17. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free June 13, 2009 © 2009 Richard Elling 17
  18. 18. COW Notes ● COW works on blocks, not files ● ZFS reserves 32 MBytes or 1/64 of pool size – COWs need some free space to remove files – need space for ZIL ● For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched ● Spatial distribution is good fodder for performance speculation – affects HDDs – moot for SSDs June 13, 2009 © 2009 Richard Elling 18
  19. 19. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 19
  20. 20. vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type=disk type=disk type=disk type=disk children[0] children[1] children[0] children[1] Physical or leaf vdevs June 13, 2009 © 2009 Richard Elling 20
  21. 21. vdev Labels ● vdev labels != disk labels ● 4 labels written to every physical vdev ● Label size = 256kBytes ● Two-stage update process – write label0 & label2 – check for errors – write label1 & label3 0 256k 512k 4M N-512k N-256k N Boot label0 label1 label2 label3 Block June 13, 2009 © 2009 Richard Elling 21
  22. 22. vdev Label Contents 0 256k 512k 4M N-512k N-256k N Boot label0 label1 label2 label3 Block Boot Name=Value Blank Header Pairs 128-slot Uberblock Array 0 8k 16k 128k 256k June 13, 2009 © 2009 Richard Elling 22
  23. 23. Observing Labels # zdb -l /dev/rdsk/c0t0d0s0 -------------------------------------------- LABEL 0 -------------------------------------------- version=14 name='rpool' state=0 txg=13152 pool_guid=17111649328928073943 hostid=8781271 hostname='' top_guid=11960061581853893368 guid=11960061581853893368 vdev_tree type='disk' id=0 guid=11960061581853893368 path='/dev/dsk/c0t0d0s0' devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a' phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a' whole_disk=0 metaslab_array=24 metaslab_shift=30 ashift=9 asize=157945167872 is_log=0 June 13, 2009 © 2009 Richard Elling 23
  24. 24. Uberblocks ● 1 kByte ● Stored in 128-entry circular queue ● Only one uberblock is active at any time – highest transaction group number – correct SHA-256 checksum ● Stored in machine's native format – A magic number is used to determine endian format when imported ● Contains pointer to MOS June 13, 2009 © 2009 Richard Elling 24
  25. 25. MOS – Meta Object Set ● Only one MOS per pool ● Contains object directory pointers – root_dataset – references all top-level datasets in the pool – config – nvlist describing the pool configuration – sync_bplist – list of block pointers which need to be freed during the next transaction June 13, 2009 © 2009 Richard Elling 25
  26. 26. Block Pointers ● blkptr_t structure ● 128 bytes ● contents: – 3x data virtual address (DVA) – endianess – level of indirection – DMU object type – checksum function – compression function – physical size – logical size – birth txg – fill count – checksum (256 bits) June 13, 2009 © 2009 Richard Elling 26
  27. 27. DVA – Data Virtual Address ● Contains – vdev id – offset in sectors – grid (future) – allocated size – gang block indicator ● Physical block address = (offset << 9) + 4 MBytes June 13, 2009 © 2009 Richard Elling 27
  28. 28. Gang Blocks ● Gang blocks contain block pointers ● Used when space requested is not available in a contiguous block ● 512 bytes ● self checksummed ● contains 3 block pointers June 13, 2009 © 2009 Richard Elling 28
  29. 29. To fsck or not to fsck ● fsck was created to fix known inconsistencies in file system metadata – UFS is not transactional – metadata inconsistencies must be reconciled – does NOT repair data – how could it? ● ZFS doesn't need fsck, as-is – all on-disk changes are transactional – COW means previously existing, consistent metadata is not overwritten – ZFS can repair itself ● metadata is at least dual-redundant ● data can also be redundant ● Reality check – this does not mean that ZFS is not susceptible to corruption – nor is any other file system June 13, 2009 © 2009 Richard Elling 29
  30. 30. VDEV June 13, 2009 © 2009 Richard Elling 30
  31. 31. Dynamic Striping ● RAID-0 – SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern ● Dynamic Stripe – Data is dynamically mapped to member disks – No fixed-length sequences – Allocate up to ~1 MByte/vdev before changing vdev – vdevs can be different size – Good combination of the concatenation feature with RAID-0 performance June 13, 2009 © 2009 Richard Elling 31
  32. 32. Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes June 13, 2009 © 2009 Richard Elling 32
  33. 33. Mirroring ● Straightforward: put N copies of the data on N vdevs ● Unlike RAID-1 – No 1:1 mapping at the block level – vdev labels are still at beginning and end – vdevs can be of different size ● effective space is that of smallest vdev ● Arbitration: ZFS does not blindly trust either side of mirror – Most recent, correct view of data wins – Checksums validate data June 13, 2009 © 2009 Richard Elling 33
  34. 34. Mirroring June 13, 2009 © 2009 Richard Elling 34
  35. 35. Dynamic vdev Replacement ● zpool replace poolname vdev [vdev] ● Today, replacing vdev must be same size or larger (as measured by blocks) ● Replacing all vdevs in a top-level vdev with larger vdevs results in (automatic?) top-level vdev resizing 15G 10G 10G 15G 10G 20G 15G 20G 10G 15G 20G 20G 20G 20G 10G 10G Mirror 10G Mirror 15G Mirror 15G Mirror 20G Mirror June 13, 2009 © 2009 Richard Elling 35
  36. 36. RAIDZ ● RAID-5 – Parity check data is distributed across the RAID array's disks – Must read/modify/write when data is smaller than stripe width ● RAIDZ – Dynamic data placement – Parity added as needed – Writes are full-stripe writes – No read/modify/write (write hole) ● Arbitration: ZFS does not blindly trust any device – Does not rely on disk reporting read error – Checksums validate data – If checksum fails, read parity Space used is dependent on how used June 13, 2009 © 2009 Richard Elling 36
  37. 37. RAID-5 vs RAIDZ DiskA DiskB DiskC DiskD DiskE D0:0 D0:1 D0:2 D0:3 P0 RAID-5 P1 D1:0 D1:1 D1:2 D1:3 D2:3 P2 D2:0 D2:1 D2:2 D3.2 D3:3 P3 D3:0 D3:1 DiskA DiskB DiskC DiskD DiskE P0 D0:0 D0:1 D0:2 D0:3 RAIDZ P1 D1:0 D1:1 P2:0 D2:0 D2:1 D2:2 D2:3 P2:1 D2:4 D2:5 June 13, 2009 © 2009 Richard Elling 37
  38. 38. RAID-5 Write Hole ● Occurs when data to be written is smaller than stripe size ● Must read unallocated columns to recalculate the parity or the parity must be read/modify/write ● Read/modify/write is risky for consistency – Multiple disks – Reading independently – Writing independently – System failure before all writes are complete to media could result in data loss ● Effects can be hidden from host using RAID array with nonvolatile write cache, but extra I/O cannot be hidden from disks June 13, 2009 © 2009 Richard Elling 38
  39. 39. RAIDZ2 ● RAIDZ2 = double parity RAIDZ – Can recover data if any 2 leaf vdevs fail ● Sorta like RAID-6 – Parity 1: XOR – Parity 2: another Reed-Soloman syndrome ● More computationally expensive than RAIDZ ● Arbitration: ZFS does not blindly trust any device – Does not rely on disk reporting read error – Checksums validate data – If data not valid, read parity – If data still not valid, read other parity Space used is dependent on how used June 13, 2009 © 2009 Richard Elling 39
  40. 40. Evaluating Data Retention ● MTTDL = Mean Time To Data Loss ● Note: MTBF is not constant in the real world, but keeps math simple ● MTTDL[1] is a simple MTTDL model ● No parity (single vdev, striping, RAID-0) – MTTDL[1] = MTBF / N ● Single Parity (mirror, RAIDZ, RAID-1, RAID-5) – MTTDL[1] = MTBF2 / (N * (N-1) * MTTR) ● Double Parity (3-way mirror, RAIDZ2, RAID-6) – MTTDL[1] = MTBF3 / (N * (N-1) * (N-2) * MTTR2) June 13, 2009 © 2009 Richard Elling 40
  41. 41. Another MTTDL Model ● MTTDL[1] model doesn't take into account unrecoverable read ● But unrecoverable reads (UER) are becoming the dominant failure mode – UER specifed as errors per bits read – More bits = higher probability of loss per vdev ● MTTDL[2] model considers UER June 13, 2009 © 2009 Richard Elling 41
  42. 42. Why Worry about UER? ● Richard's study – 3,684 hosts with 12,204 LUNs – 11.5% of all LUNs reported read errors ● Bairavasundaram et.al. FAST08 www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf – 1.53M LUNs over 41 months – RAID reconstruction discovers 8% of checksum mismatches – 4% of disks studies developed checksum errors over 17 months June 13, 2009 © 2009 Richard Elling 42
  43. 43. Why Worry about UER? ● RAID array study June 13, 2009 © 2009 Richard Elling 43
  44. 44. MTTDL[2] Model ● Probability that a reconstruction will fail – Precon_fail = (N-1) * size / UER ● Model doesn't work for non-parity schemes (single vdev, striping, RAID-0) ● Single Parity (mirror, RAIDZ, RAID-1, RAID-5) – MTTDL[2] = MTBF / (N * Precon_fail) ● Double Parity (3-way mirror, RAIDZ2, RAID-6) – MTTDL[2] = MTBF2/ (N * (N-1) * MTTR * Precon_fail) June 13, 2009 © 2009 Richard Elling 44
  45. 45. Practical View of MTTDL[1] June 13, 2009 © 2009 Richard Elling 45
  46. 46. MTTDL Models: Mirror June 13, 2009 © 2009 Richard Elling 46
  47. 47. MTTDL Models: RAIDZ2 June 13, 2009 © 2009 Richard Elling 47
  48. 48. Ditto Blocks ● Recall that each blkptr_t contains 3 DVAs ● Allows up to 3 physical copies of the data ZFS copies parameter Data copies Metadata copies default 1 2 copies=2 2 3 copies=3 3 3 June 13, 2009 © 2009 Richard Elling 48
  49. 49. Copies ● Dataset property used to indicate how many copies (aka ditto blocks) of data is desired – Write all copies – Read any copy – Recover corrupted read from a copy ● By default – data copies=1 – metadata copies=data copies +1 or max=3 ● Not a replacement for mirroring ● Easier to describe in pictures... June 13, 2009 © 2009 Richard Elling 49
  50. 50. Copies in Pictures June 13, 2009 © 2009 Richard Elling 50
  51. 51. Copies in Pictures June 13, 2009 © 2009 Richard Elling 51
  52. 52. ZIO – ZFS I/O Layer June 13, 2009 © 2009 Richard Elling 52
  53. 53. ZIO Framework ● All physical disk I/O goes through ZIO Framework ● Translates DVAs into Logical Block Address (LBA) on leaf vdevs – Keeps free space maps (spacemap) – If contiguous space is not available: ● Allocate smaller blocks (the gang) ● Allocate gang block, pointing to the gang ● Implemented as multi-stage pipeline – Allows extensions to be added fairly easily ● Handles I/O errors June 13, 2009 © 2009 Richard Elling 53
  54. 54. SpaceMap from Space June 13, 2009 © 2009 Richard Elling 54
  55. 55. ZIO Write Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open compress if savings > 12.5% encrypt generate allocate start start start done done done assess assess assess done Gang activity elided, for clarity June 13, 2009 © 2009 Richard Elling 55
  56. 56. ZIO Read Pipeline ZIO State Compression Crypto Checksum DVA vdev I/O open start start start done done done assess assess assess verify decrypt decompress done Gang activity elided, for clarity June 13, 2009 © 2009 Richard Elling 56
  57. 57. VDEV – Virtual Device Subsytem ● Where mirrors, RAIDZ, and RAIDZ2 are implemented – Surprisingly few lines of code needed to implement RAID ● Leaf vdev (physical device) I/O management – Number of outstanding iops – Read-ahead cache Name Priority NOW 0 ● Priority scheduling SYNC_READ 0 SYNC_WRITE 0 FREE 0 CACHE_FILL 0 LOG_WRITE 0 ASYNC_READ 4 ASYNC_WRITE 4 RESILVER 10 SCRUB 20 June 13, 2009 © 2009 Richard Elling 57
  58. 58. ARC – Adaptive Replacement Cache June 13, 2009 © 2009 Richard Elling 58
  59. 59. Object Cache ● UFS uses page cache managed by the virtual memory system ● ZFS does not use the page cache, except for mmap'ed files ● ZFS uses a Adaptive Replacement Cache (ARC) ● ARC used by DMU to cache DVA data objects ● Only one ARC per system, but caching policy can be changed on a per-dataset basis ● Seems to work much better than page cache ever did for UFS June 13, 2009 © 2009 Richard Elling 59
  60. 60. Traditional Cache ● Works well when data being accessed was recently added ● Doesn't work so well when frequently accessed data is evicted Misses cause insert MRU Dynamic caches can change Cache size size by either not evicting or aggressively evicting LRU Evict the oldest June 13, 2009 © 2009 Richard Elling 60
  61. 61. ARC – Adaptive Replacement Cache Evict the oldest single-use entry LRU Recent Cache Miss MRU Evictions and dynamic MRU resizing needs to choose best Hit size cache to evict (shrink) Frequent Cache LRU Evict the oldest multiple accessed entry June 13, 2009 © 2009 Richard Elling 61
  62. 62. ZFS ARC – Adaptive Replacement Cache with Locked Pages Evict the oldest single-use entry Cannot evict LRU locked pages! Recent Cache Miss MRU MRU Hit size Frequent If hit occurs Cache within 62 ms LRU Evict the oldest multiple accessed entry ZFS ARC handles mixed-size pages June 13, 2009 © 2009 Richard Elling 62
  63. 63. ARC Directory ● Each ARC directory entry contains arc_buf_hdr structs – Info about the entry – Pointer to the entry ● Directory entries have size, ~200 bytes ● ZFS block size is dynamic, 512 bytes – 128 kBytes ● Disks are large ● Suppose we use a Seagate LP 2 TByte disk for the L2ARC – Disk has 3,907,029,168 512 byte sectors, guaranteed – Workload uses 8 kByte fixed record size – RAM needed for arc_buf_hdr entries ● Need = 3,907,029,168 * 200 / 16 = 45 GBytes ● Don't underestimate the RAM needed for large L2ARCs June 13, 2009 © 2009 Richard Elling 63
  64. 64. L2ARC – Level 2 ARC ● ARC evictions are sent to cache vdev ● ARC directory remains in memory ARC ● Works well when cache vdev is optimized for fast reads – lower latency than pool disks evicted – inexpensive way to “increase memory” data ● Content considered volatile, no ZFS data protection allowed ● Monitor usage with zpool iostat “cache” “cache” “cache” vdev vdev vdev June 13, 2009 © 2009 Richard Elling 64
  65. 65. ARC Tips ● In general, it seems to work well for most workloads ● ARC size will vary, based on usage ● Internals tracked by kstats in Solaris – Use memory_throttle_count to observe pressure to evict ● Can limit at boot time – Solaris – set zfs:zfs_arc_max in /etc/system ● Performance – Prior to b107, L2ARC fill rate was limited to 8 MBytes/s L2ARC keeps its directory in kernel memory June 13, 2009 © 2009 Richard Elling 65
  66. 66. Transactional Object Layer June 13, 2009 © 2009 Richard Elling 66
  67. 67. flash Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration June 13, 2009 © 2009 Richard Elling 67
  68. 68. ZAP – ZFS Attribute Processor ● Module sits on top of DMU ● Important component for managing everything ● Operates on ZAP objects – Contain name=value pairs ● FatZAP – Flexible architecture for storing large numbers of attributes ● MicroZAP – Lightweight version of fatzap – Uses 1 block – All name=value pairs must fit in block – Names <= 50 chars (including NULL terminator) – Values are type uint64_t June 13, 2009 © 2009 Richard Elling 68
  69. 69. DMU – Data Management Layer ● Datasets issue transactions to the DMU ● Transactional based object model ● Transactions are – Atomic – Grouped (txg = transaction group) ● Responsible for on-disk data ● ZFS Attribute Processor (ZAP) ● Dataset and Snapshot Layer (DSL) ● ZFS Intent Log (ZIL) June 13, 2009 © 2009 Richard Elling 69
  70. 70. Transaction Engine ● Manages physical I/O ● Transactions grouped into transaction group (txg) – txg updates – All-or-nothing – Commit interval ● Older versions: 5 seconds (zfs_ ● Now: 30 seconds max, dynamically scale based on time required to commit txg ● Delay committing data to physical storage – Improves performance – A bad thing for sync workloads – hence the ZFS Intent Log (ZIL) 30 second delay could impact failure detection time June 13, 2009 © 2009 Richard Elling 70
  71. 71. ZIL – ZFS Intent Log ● DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle “write it now” desire of sync writers – NFS – Databases ● If I/O < 32 kBytes – write it (now) to ZIL (allocated from pool) – write it later as part of the txg commit ● If I/O > 32 kBytes, write it to pool now – Should be faster for large, sequential writes ● Never read, except at import (eg reboot), when transactions may need to be rolled forward June 13, 2009 © 2009 Richard Elling 71
  72. 72. Separate Logs (slogs) ● ZIL competes with pool for iops – Applications will wait for sync writes to be on nonvolatile media – Very noticeable on HDD JBODs ● Put ZIL on separate vdev, outside of pool – ZIL writes tend to be sequential – No competition with pool for iops – Downside: slog device required to be operational at import ● 10x or more performance improvements possible – Better if using write-optimized SSD or non-volatile write cache on RAID array ● Use zilstat to observe ZIL activity June 13, 2009 © 2009 Richard Elling 72
  73. 73. DSL – Dataset and Snapshot Layer June 13, 2009 © 2009 Richard Elling 73
  74. 74. flash Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free June 13, 2009 © 2009 Richard Elling 74
  75. 75. zfs snapshot ● Create a read-only, point-in-time window into the dataset (file system or Zvol) ● Computationally free, because of COW architecture ● Very handy feature – Patching/upgrades – Basis for Time Slider June 13, 2009 © 2009 Richard Elling 75
  76. 76. Snapshot Snapshot tree root Current tree root ● Create a snapshot by not free'ing COWed blocks ● Snapshot creation is fast and easy ● Number of snapshots determined by use – no hardwired limit ● Recursive snapshots also possible June 13, 2009 © 2009 Richard Elling 76
  77. 77. Clones ● Snapshots are read-only ● Clones are read-write based upon a snapshot ● Child depends on parent – Cannot destroy parent without destroying all children – Can promote children to be parents ● Good ideas – OS upgrades – Change control – Replication ● zones ● virtual disks June 13, 2009 © 2009 Richard Elling 77
  78. 78. zfs clone ● Create a read-write file system from a read-only snapshot ● Used extensively for OpenSolaris upgrades OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 OS rev1 snapshot snapshot snapshot OS rev1 upgrade OS rev2 clone boot manager Origin snapshot cannot be destroyed, if clone exists June 13, 2009 © 2009 Richard Elling 78
  79. 79. zfs promote OS b104 OS b104 clone OS rev1 rpool/ROOT/b104 rpool/ROOT/b104 OS b104 OS b105 OS rev2 snapshot snapshot snapshot rpool/ROOT/b104@today rpool/ROOT/b105@today OS b105 OS b105 clone promote OS rev2 rpool/ROOT/b105 rpool/ROOT/b105 June 13, 2009 © 2009 Richard Elling 79
  80. 80. zfs rollback OS b104 OS b104 rpool/ROOT/b104 rpool/ROOT/b104 OS b104 OS b104 snapshot rollback snapshot rpool/ROOT/b104@today rpool/ROOT/b104@today June 13, 2009 © 2009 Richard Elling 80
  81. 81. Commands June 13, 2009 © 2009 Richard Elling 81
  82. 82. zpool(1m) raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 82
  83. 83. Dataset & Snapshot Layer ● Object – Allocated storage – dnode describes collection Dataset Directory of blocks Dataset ● Object Set Object Set Childmap – Group of related objects ● Dataset Object Object Object Properties – Snapmap: snapshot relationships Snapmap – Space usage ● Dataset directory – Childmap: dataset relationships – Properties June 13, 2009 © 2009 Richard Elling 83
  84. 84. zpool create ● zpool create poolname vdev-configuration – vdev-configuration examples ● mirror c0t0d0 c3t6d0 ● mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 ● mirror disk1s0 disk2s0 cache disk4s0 log disk5 ● raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 ● Solaris – Additional checks to see if disk/slice overlaps or is currently in use – Whole disks are given EFI labels ● Can set initial pool or dataset properties ● By default, creates a file system with the same name – poolname pool → /poolname file system People get confused by a file system with same name as the pool June 13, 2009 © 2009 Richard Elling 84
  85. 85. zpool destroy ● Destroy the pool and all datasets therein ● zpool destroy poolname ● Can (try to) force with “-f” ● There is no “are you sure?” prompt – if you weren't sure, you would not have typed “destroy” zpool destroy is destructive... really! Use with caution! June 13, 2009 © 2009 Richard Elling 85
  86. 86. zpool add ● Adds a device to the pool as a top-level vdev ● zpool add poolname vdev-configuration ● vdev-configuration can be any combination also used for zpool create ● Complains if the added vdev-configuration would cause a different data protection scheme than is already in use – use “-f” to override ● Good idea: try with “-n” flag first – will show final configuration without actually performing the add Do not add a device which is in use as a quorum device June 13, 2009 © 2009 Richard Elling 86
  87. 87. zpool remove ● Remove a top-level vdev from the pool ● zpool remove poolname vdev ● Today, you can only remove the following vdevs: – cache – hot spare ● An RFE is open to allow removal of other top-level vdevs Don't confuse “remove” with “detach” June 13, 2009 © 2009 Richard Elling 87
  88. 88. zpool attach ● Attach a vdev as a mirror to an existing vdev ● zpool attach poolname existing-vdev vdev ● Attaching vdev must be the same size or larger than the existing vdev ● Note: today this is not available for RAIDZ or RAIDZ2 vdevs vdev Configurations ok simple vdev → mirror ok mirror ok log → mirrored log no RAIDZ no RAIDZ2 “Same size” literally means the same number of blocks. Beware that many “same size” disks have different number of available blocks. June 13, 2009 © 2009 Richard Elling 88
  89. 89. zpool detach ● Detach a vdev from a mirror ● zpool detach poolname vdev ● A resilvering vdev will wait until resilvering is complete June 13, 2009 © 2009 Richard Elling 89
  90. 90. zpool replace ● Replaces an existing vdev with a new vdev ● zpool replace poolname existing-vdev vdev ● Effectively, a shorthand for “zpool attach” followed by “zpool detach” ● Attaching vdev must be the same size or larger than the existing vdev ● Works for any top-level vdev-configuration, including RAIDZ and RAIDZ2 vdev Configurations ok simple vdev ok mirror ok log ok RAIDZ ok RAIDZ2 “Same size” literally means the same number of blocks. Beware that many “same size” disks have different number of available blocks. June 13, 2009 © 2009 Richard Elling 90
  91. 91. zpool import ● Import a pool and mount all mountable datasets ● Import a specific pool – zpool import poolname – zpool import GUID ● Scan LUNs for pools which may be imported – zpool import ● Can set options, such as alternate root directory or other properties Beware of zpool.cache interactions Beware of artifacts, especially partial artifacts June 13, 2009 © 2009 Richard Elling 91
  92. 92. zpool export ● Unmount datasets and export the pool ● zpool export poolname ● Removes pool entry from zpool.cache June 13, 2009 © 2009 Richard Elling 92
  93. 93. zpool upgrade ● Display current versions – zpool upgrade ● View available upgrade versions, with features, but don't actually upgrade – zpool upgrade -v ● Upgrade pool to latest version – zpool upgrade poolname ● Upgrade pool to specific version – zpool upgrade -V version poolname Once you upgrade, there is no downgrade June 13, 2009 © 2009 Richard Elling 93
  94. 94. zpool history ● Show history of changes made to the pool # zpool history rpool History for 'rpool': 2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0 2009-03-04.07:29:47 zfs set canmount=noauto rpool 2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool 2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT 2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap 2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump 2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106 2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106 2009-03-04.07:29:51 zfs set canmount=on rpool 2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export 2009-03-04.07:29:51 zfs create rpool/export/home 2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943 2009-03-04.00:21:42 zpool export rpool 2009-03-04.08:47:08 zpool set bootfs=rpool rpool 2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108 2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108 ... June 13, 2009 © 2009 Richard Elling 94
  95. 95. zpool status ● Shows the status of the current pools, including their configuration ● Important troubleshooting step # zpool status … pool: stuff state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM stuff ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be tricky June 13, 2009 © 2009 Richard Elling 95
  96. 96. zpool clear ● Clears device errors ● Clears device error counters ● Improves sysadmin sanity and reduces sweating June 13, 2009 © 2009 Richard Elling 96
  97. 97. zpool iostat ● Show pool physical I/O activity, in an iostat-like manner ● Solaris: fsstat will show I/O activity looking into a ZFS file system ● Especially useful for showing slog activity # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- stuff 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latency June 13, 2009 © 2009 Richard Elling 97
  98. 98. zpool scrub ● Manually starts scrub – zpool scrub poolname ● Scrubbing performed in background ● Use zpool status to track scrub progress ● Stop scrub – zpool scrub -s poolname Estimated scrub completion time improves over time June 13, 2009 © 2009 Richard Elling 98
  99. 99. zfs(1m) ● Manages file systems (ZPL) and Zvols ● Can proxy to other, related commands – iSCSI, NFS, CIFS raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer June 13, 2009 © 2009 Richard Elling 99
  100. 100. zfs create, destroy ● By default, a file system with the same name as the pool is created by zpool create ● Name format is: pool/name[/name ...] ● File system – zfs create fs-name – zfs destroy fs-name ● Zvol – zfs create -V size vol-name – zfs destroy vol-name ● Parameters can be set at create time June 13, 2009 © 2009 Richard Elling 100
  101. 101. zfs mount, unmount ● Note: mount point is a file system parameter – zfs get mountpoint fs-name ● Rarely used subcommand (!) ● Display mounted file systems – zfs mount ● Mount a file system – zfs mount fs-name – zfs mount -a ● Unmount – zfs unmount fs-name – zfs unmount -a June 13, 2009 © 2009 Richard Elling 101
  102. 102. zfs list ● List mounted datasets ● Old versions: listed everything ● New versions: do not list snapshots ● Examples – zfs list – zfs list -t snapshot – zfs list -H -o name June 13, 2009 © 2009 Richard Elling 102
  103. 103. zfs send, receive ● Send – send a snapshot to stdout – data is decompressed ● Receive – receive a snapshot from stdin – receiving file system parameters apply (compression, et.al) ● Can incrementally send snapshots in time order ● Handy way to replicate dataset snapshots ● NOT a replacement for traditional backup solutions – All-or-nothing design per snapshot – In general, does not send files (!) – Today, no per-file management Send streams from b35 (or older) no longer supported after b89 June 13, 2009 © 2009 Richard Elling 103
  104. 104. zfs rename ● Renames a file system, volume,or snapshot – zfs rename export/home/relling export/home/richard June 13, 2009 © 2009 Richard Elling 104
  105. 105. zfs upgrade ● Display current versions – zfs upgrade ● View available upgrade versions, with features, but don't actually upgrade – zfs upgrade -v ● Upgrade pool to latest version – zfs upgrade dataset ● Upgrade pool to specific version – zfs upgrade -V version dataset Once you upgrade, there is no downgrade June 13, 2009 © 2009 Richard Elling 105
  106. 106. Sharing June 13, 2009 © 2009 Richard Elling 106
  107. 107. Sharing ● zfs share dataset ● Type of sharing set by parameters – shareiscsi = [on | off] – sharenfs = [on | off | options] – sharesmb = [on | off | options] ● Shortcut to manage sharing – Uses external services (nfsd, COMSTAR, etc) – Importing pool will also share ● May vary by OS June 13, 2009 © 2009 Richard Elling 107
  108. 108. NFS ● ZFS file systems work as expected – use ACLs based on NFSv4 ACLs ● Parallel NFS, aks pNFS, aka NFSv4.1 – Still a work-in-progress – http://opensolaris.org/os/project/nfsv41/ – zfs create -t pnfsdata mypnfsdata pNFS Client pNFS Data Server pNFS Data Server pnfsdata pnfsdata pNFS dataset dataset Metadata Server pool pool June 13, 2009 © 2009 Richard Elling 108
  109. 109. CIFS ● UID mapping ● casesensitivity parameter – Good idea, set when file system is created – zfs create -o casesensitivity=insensitive mypool/Shared ● Shadow Copies for Shared Folders (VSS) supported – CIFS clients cannot create shadow remotely (yet) CIFS features vary by OS, Samba, etc. June 13, 2009 © 2009 Richard Elling 109
  110. 110. iSCSI ● SCSI over IP ● Block-level protocol ● Uses Zvols as storage ● Solaris has 2 iSCSI target implementations – shareiscsi enables old, klunky iSCSI target – To use COMSTAR, enable using itadm(1m) – b116 adds COMSTAR support (zpool version 16) June 13, 2009 © 2009 Richard Elling 110
  111. 111. Properties June 13, 2009 © 2009 Richard Elling 111
  112. 112. Properties ● Properties are stored in an nvlist ● By default, are inherited ● Some properties are common to all datasets, but a specific dataset type may have additional properties ● Easily set or retrieved via scripts ● In general, properties affect future file system activity zpool get doesn't script as nicely as zfs get June 13, 2009 © 2009 Richard Elling 112
  113. 113. User-defined Properties ● Names – Must include colon ':' – Can contain lower case alphanumerics or “+” “.” “_” – Max length = 256 characters – By convention, module:property ● com.sun:auto-snapshot ● Values – Max length = 1024 characters ● Examples – com.sun:auto-snapshot=true – com.richardelling:important_files=true June 13, 2009 © 2009 Richard Elling 113
  114. 114. set & get properties ● Set – zfs set compression=on export/home/relling ● Get – zfs get compression export/home/relling ● Reset to inherited value – zfs inherit compression export/home/relling ● Clear user-defined parameter – zfs inherit com.sun:auto-snapshot export/home/relling June 13, 2009 © 2009 Richard Elling 114
  115. 115. Pool Properties Property Change? Brief Description altroot Alternate root directory (ala chroot) autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/zpool.cache capacity readonly Percent of pool space used delegation Master pool delegation switch failmode Catastrophic pool failure policy guid readonly Unique identifier health readonly Current health of the pool listsnapshots zfs list policy size readonly Total size of pool used readonly Amount of space used version readonly Current on-disk version June 13, 2009 © 2009 Richard Elling 115
Common Dataset Properties
Property        Change?    Brief Description
available       readonly   Space available to dataset & children
checksum                   Checksum algorithm
compression                Compression algorithm
compressratio   readonly   Compression ratio – logical size:referenced physical
copies                     Number of copies of user data
creation        readonly   Dataset creation time
origin          readonly   For clones, origin snapshot
primarycache               ARC caching policy
readonly                   Is dataset in readonly mode?
referenced      readonly   Size of data accessible by this dataset
refreservation             Minimum space guaranteed to a dataset, not including descendants (snapshots & clones)
reservation                Minimum space guaranteed to dataset, including descendants
June 13, 2009      © 2009 Richard Elling   116
Common Dataset Properties (continued)
Property               Change?    Brief Description
secondarycache                    L2ARC caching policy
type                   readonly   Type of dataset (filesystem, snapshot, volume)
used                   readonly   Sum of usedby* (see below)
usedbychildren         readonly   Space used by descendants
usedbydataset          readonly   Space used by dataset
usedbyrefreservation   readonly   Space used by a refreservation for this dataset
usedbysnapshots        readonly   Space used by all snapshots of this dataset
zoned                  readonly   Is dataset added to non-global zone (Solaris)
June 13, 2009      © 2009 Richard Elling   117
File System Dataset Properties
Property          Change?         Brief Description
aclinherit                        ACL inheritance policy, when files or directories are created
aclmode                           ACL modification policy, when chmod is used
atime                             Disable access time metadata updates
canmount                          Mount policy
casesensitivity   creation        Filename matching algorithm
devices                           Device opening policy for dataset
exec                              File execution policy for dataset
mounted           readonly        Is file system currently mounted?
nbmand            export/import   File system should be mounted with non-blocking mandatory locks (CIFS client feature)
normalization     creation        Unicode normalization of file names for matching
June 13, 2009      © 2009 Richard Elling   118
File System Dataset Properties (continued)
Property     Change?   Brief Description
quota                  Max space dataset and descendants can consume
recordsize             Suggested maximum block size for files
refquota               Max space dataset can consume, not including descendants
setuid                 setuid mode policy
sharenfs               NFS sharing options
sharesmb               CIFS sharing options
snapdir                Controls whether .zfs directory is hidden
utf8only               UTF-8 character file name policy
vscan                  Virus scan enabled
xattr                  Extended attributes policy
June 13, 2009      © 2009 Richard Elling   119
More Goodies...
June 13, 2009      © 2009 Richard Elling   120
Dataset Space Accounting
 ●   used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation
 ●   Lazy updates, may not be correct until txg commits
 ●   ls and du will show the size of allocated files, which includes all copies of a file
 ●   Shorthand report available
$ zfs list -o space
NAME                 AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rpool                 126G  18.3G         0   35.5K              0      18.3G
rpool/ROOT            126G  15.3G         0     18K              0      15.3G
rpool/ROOT/snv_106    126G  86.1M         0   86.1M              0          0
rpool/ROOT/snv_b108   126G  15.2G     5.89G   9.28G              0          0
rpool/dump            126G  1.00G         0   1.00G              0          0
rpool/export          126G    37K         0     19K              0        18K
rpool/export/home     126G    18K         0     18K              0          0
rpool/swap            128G     2G         0    193M          1.81G          0
June 13, 2009      © 2009 Richard Elling   121
zfs vs zpool Space Accounting
 ●   zfs list != zpool list
 ●   zfs list shows space used by the dataset plus space for internal accounting
 ●   zpool list shows physical space available to the pool
 ●   For simple pools and mirrors, they are nearly the same
 ●   For RAIDZ or RAIDZ2, zpool list will show space available for parity
Users will be confused about reported space available
June 13, 2009      © 2009 Richard Elling   122
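A quick way to see the difference on a raidz pool; the pool name "tank" is hypothetical:

# zpool list tank    (raw pool capacity, including space that parity will consume)
# zfs list tank      (usable space, after parity and internal accounting)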
Testing
 ●   ztest
 ●   fstest
June 13, 2009      © 2009 Richard Elling   123
Accessing Snapshots
 ●   By default, snapshots are accessible in the .zfs directory
 ●   Visibility of the .zfs directory is tunable via the snapdir property
      –   Don't really want find to find the .zfs directory
 ●   Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS)
# zfs snapshot rpool/export/home/relling@20090415
# ls -a /export/home/relling
…   .Xsession   .xsession-errors
# ls /export/home/relling/.zfs
shares   snapshot
# ls /export/home/relling/.zfs/snapshot
20090415
# ls /export/home/relling/.zfs/snapshot/20090415
Desktop   Documents   Downloads   Public
June 13, 2009      © 2009 Richard Elling   124
Resilver & Scrub
 ●   Can be read iops bound
 ●   Resilver can also be bandwidth bound to the resilvering device
 ●   Both work at a lower I/O scheduling priority than normal work, but that may not matter for read iops bound devices
June 13, 2009      © 2009 Richard Elling   125
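Starting and observing a scrub, with a hypothetical pool name:

# zpool scrub mypool       (kick off a scrub of every block in the pool)
# zpool status mypool      (shows scrub/resilver progress and any errors found)
# zpool scrub -s mypool    (stop a scrub that is hurting production work)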
Time-based Resilvering
 ●   Block pointers contain birth txg number
 ●   Resilvering begins with oldest blocks first
 ●   Interrupted resilver will still result in a valid file system view
[Diagram: block trees labeled by birth txg = 27, 68, and 73, showing newer blocks layered over older ones]
June 13, 2009      © 2009 Richard Elling   126
Time Slider – Automatic Snapshots
 ●   Underpinnings for a Solaris feature similar to OS X's Time Machine
 ●   SMF service for managing snapshots
 ●   SMF properties used to specify policies
      –   Frequency
      –   Number to keep
 ●   Creates cron jobs
 ●   GUI tool makes it easy to select individual file systems
Service Name              Interval (default)   Keep (default)
auto-snapshot:frequent    15 minutes           4
auto-snapshot:hourly      1 hour               24
auto-snapshot:daily       1 day                31
auto-snapshot:weekly      7 days               4
auto-snapshot:monthly     1 month              12
June 13, 2009      © 2009 Richard Elling   127
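A sketch of managing one of these services via SMF; exact FMRIs and property names may vary by release:

# svcadm enable auto-snapshot:daily          (turn on daily snapshots)
# svcprop -p zfs/keep auto-snapshot:daily    (how many daily snapshots are kept)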
Nautilus
 ●   File system views which can go back in time
June 13, 2009      © 2009 Richard Elling   128
ACL – Access Control List
 ●   Based on NFSv4 ACLs
 ●   Similar to Windows NT ACLs
 ●   Works well with CIFS services
 ●   Supports ACL inheritance
 ●   Change using chmod
 ●   View using ls
June 13, 2009      © 2009 Richard Elling   129
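A small example using the Solaris NFSv4 ACL syntax; the user and file names are hypothetical:

# chmod A+user:relling:read_data/read_attributes:allow file.txt
# ls -v file.txt    (shows each ACE on its own line)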
Checksums
 ●   Block pointer contains 256 bits for the checksum
 ●   Checksum is in the parent, not in the block itself
 ●   Types
      –   none
      –   fletcher2: truncated Fletcher algorithm
      –   fletcher4: full Fletcher algorithm
      –   SHA-256
 ●   There are open proposals for better algorithms
June 13, 2009      © 2009 Richard Elling   130
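Selecting a stronger checksum for newly written data; the dataset name is hypothetical:

# zfs set checksum=sha256 mypool/important    (existing blocks keep their old checksum)
# zfs get checksum mypool/important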
Checksum Use
Use          Algorithm             Notes
Uberblock    SHA-256               self-checksummed
Metadata     fletcher4
Labels       SHA-256
Data         fletcher2 (default)   zfs checksum parameter
ZIL log      fletcher2             self-checksummed
Gang block   SHA-256               self-checksummed
June 13, 2009      © 2009 Richard Elling   131
Checksum Performance
 ●   Metadata – you won't notice
 ●   Data
      –   LZJB is barely noticeable
      –   gzip-9 can be very noticeable
 ●   Geriatric hardware ???
June 13, 2009      © 2009 Richard Elling   132
Compression
 ●   Builtin
      –   lzjb, Lempel-Ziv by Jeff Bonwick
      –   gzip, levels 1-9
 ●   Extensible
      –   new compressors can be added
      –   backwards compatibility issues
 ●   Uses taskqs to take advantage of multi-processor systems
Cannot boot from gzip compressed root (RFE is open)
June 13, 2009      © 2009 Richard Elling   133
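Turning on stronger compression and checking what it achieves; the dataset name is hypothetical:

# zfs set compression=gzip-9 mypool/archive    (only affects newly written blocks)
# zfs get compressratio mypool/archive         (ratio actually achieved so far)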
Encryption
 ●   Placeholder – details TBD
 ●   http://opensolaris.org/os/project/zfs-crypto
 ●   Complicated by:
      –   Block pointer rewrites
      –   Deduplication
June 13, 2009      © 2009 Richard Elling   134
Impedance Matching
 ●   RAID arrays & columns
 ●   Label offsets
      –   Older Solaris starting block = 34
      –   Newer Solaris starting block = 256
June 13, 2009      © 2009 Richard Elling   135
Quotas
 ●   File system quotas
      –   quota includes descendants (snapshots, clones)
      –   refquota does not include descendants
 ●   User and group quotas (b114)
      –   Works like refquota: descendants don't count
      –   Not inherited
      –   zfs userspace and groupspace subcommands show quotas
             ●   Users can only see their own and group quota, but can delegate
      –   Managed via properties (see the sketch below)
             ●   [user|group]quota@[UID|username|SID name|SID number]
             ●   not visible via zfs get all
June 13, 2009      © 2009 Richard Elling   136
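Setting and inspecting a user quota (b114 or later); the user and dataset names are hypothetical:

# zfs set userquota@relling=10G mypool/home
# zfs get userquota@relling mypool/home    (must be named explicitly; not shown by zfs get all)
# zfs userspace mypool/home                (per-user usage and quotas)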
zpool.cache
 ●   Old way
      –   mount /
      –   read /etc/[v]fstab
      –   mount file systems
 ●   ZFS
      –   import pool(s)
      –   find mountable datasets and mount them
 ●   /etc/zfs/zpool.cache is a cache of pools to be imported at boot time
      –   No scanning of all available LUNs for pools to import
      –   cachefile property permits selecting an alternate zpool.cache
             ●   Useful for OS installers
             ●   Useful for clusters, where you don't want a booting node to automatically import a pool
             ●   Not persistent (!)
June 13, 2009      © 2009 Richard Elling   137
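For the cluster case, keep a pool out of the default cache so no node imports it automatically at boot; the pool and device names are hypothetical:

# zpool create -o cachefile=none clusterpool c2t0d0
# zpool import -o cachefile=none clusterpool    (same idea at import time)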
Mounting ZFS File Systems
 ●   By default, mountable file systems are mounted when the pool is imported
      –   Controlled by canmount policy (not inherited)
             ●   on – (default) file system is mountable
             ●   off – file system is not mountable
                    –   if you want children to be mountable, but not the parent
             ●   noauto – file system must be explicitly mounted (boot environment)
 ●   Can zfs set mountpoint=legacy to use /etc/[v]fstab
 ●   By default, cannot mount on top of a non-empty directory
      –   Can override explicitly using zfs mount -O or legacy mountpoint
 ●   Mount properties are persistent; use zfs mount -o for temporary changes
Imports are done in parallel, beware of mountpoint races
June 13, 2009      © 2009 Richard Elling   138
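Two common patterns, with hypothetical dataset names:

# zfs set canmount=off mypool/projects    (container stays unmounted, children still mount)
# zfs mount -o ro mypool/projects/src     (temporary read-only mount; properties unchanged)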
recordsize
 ●   Dynamic
      –   Max 128 kBytes
      –   Min 512 Bytes
      –   Power of 2
 ●   For most workloads, don't worry about it
 ●   For fixed size workloads, can set to match the workload
      –   Databases
             ●   File systems or Zvols
             ●   zfs set recordsize=8k dataset
June 13, 2009      © 2009 Richard Elling   139
Delegated Administration
 ●   Fine grain control
      –   users or groups of users
      –   subcommands, parameters, or sets
 ●   Similar to Solaris' Role Based Access Control (RBAC)
 ●   Enable/disable at the pool level
      –   zpool set delegation=on mypool (default)
 ●   Allow/unallow at the dataset level
      –   zfs allow relling snapshot mypool/relling
      –   zfs allow @backupusers snapshot,send mypool/relling
      –   zfs allow mypool/relling
June 13, 2009      © 2009 Richard Elling   140
Delegation Inheritance
Beware of inheritance
 ●   Local only
      –   zfs allow -l relling snapshot mypool
 ●   Descendants only
      –   zfs allow -d relling mount mypool
 ●   With neither -l nor -d, the permission applies to the dataset and its descendants
Make sure permissions are set at the correct level
June 13, 2009      © 2009 Richard Elling   141
Delegatable Subcommands
 ●   allow          ●   receive
 ●   clone          ●   rename
 ●   create         ●   rollback
 ●   destroy        ●   send
 ●   groupquota     ●   share
 ●   groupused      ●   snapshot
 ●   mount          ●   userquota
 ●   promote        ●   userused
June 13, 2009      © 2009 Richard Elling   142
Delegatable Parameters
 ●   aclinherit        ●   nbmand           ●   sharenfs
 ●   aclmode           ●   normalization    ●   sharesmb
 ●   atime             ●   quota            ●   snapdir
 ●   canmount          ●   readonly         ●   userprop
 ●   casesensitivity   ●   recordsize       ●   utf8only
 ●   checksum          ●   refquota         ●   version
 ●   compression       ●   refreservation   ●   volsize
 ●   copies            ●   reservation      ●   vscan
 ●   devices           ●   setuid           ●   xattr
 ●   exec              ●   shareiscsi       ●   zoned
 ●   mountpoint
June 13, 2009      © 2009 Richard Elling   143
Browser User Interface
 ●   Solaris – WebConsole
 ●   Nexenta –
 ●   OSX –
 ●   OpenStorage –
June 13, 2009      © 2009 Richard Elling   144
Solaris WebConsole
[Screenshot]
June 13, 2009      © 2009 Richard Elling   145
Solaris WebConsole
[Screenshot]
June 13, 2009      © 2009 Richard Elling   146
Solaris Swap and Dump
 ●   Swap
      –   Solaris does not have automatic swap resizing
      –   Swap as a separate dataset
      –   Swap device is raw, with a refreservation
      –   Blocksize matched to pagesize
      –   Don't really need or want snapshots or clones
      –   Can resize while online, manually
 ●   Dump
      –   Only used during crash dump
      –   Preallocated
      –   No refreservation
      –   Checksum off
      –   Compression off (dumps are already compressed)
June 13, 2009      © 2009 Richard Elling   147
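Adding a second swap volume by hand; the size and name are illustrative:

# zfs create -V 2G -b $(pagesize) rpool/swap2    (block size matched to the page size)
# swap -a /dev/zvol/dsk/rpool/swap2              (add it to the running system)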
Performance
June 13, 2009      © 2009 Richard Elling   148
General Comments
 ●   In general, performs well out of the box
 ●   Standard performance improvement techniques apply
 ●   Lots of DTrace knowledge available
 ●   Typical areas of concern:
      –   ZIL
             ●   check with zilstat, improve with slogs
      –   COW “fragmentation”
             ●   check iostat, improve with L2ARC
      –   Memory consumption
             ●   check with arcstat
             ●   set primarycache property
             ●   can be capped
             ●   can compete with large page aware apps
      –   Compression, or lack thereof
June 13, 2009      © 2009 Richard Elling   149
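One of the simpler memory levers, assuming a database dataset that does its own caching (name hypothetical):

# zfs set primarycache=metadata mypool/db    (keep file data out of the ARC, cache only metadata)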
ZIL Performance
 ●   Big performance increases demonstrated, especially with SSDs
 ●   NFS servers
      –   32 kByte threshold (zfs_immediate_write_sz) also corresponds to NFSv3 write size
             ●   May cause more work than needed
             ●   See CR6686887
 ●   Databases
      –   May want different sync policies for logs and data
      –   Current ZIL is pool-wide and enabled for all sync writes
      –   CR6832481 proposes a separate intent log bypass property on a per-dataset basis
June 13, 2009      © 2009 Richard Elling   150
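Adding a separate log device (slog) to absorb synchronous writes; the device name is hypothetical:

# zpool add mypool log c3t0d0    (ideally a write-optimized SSD)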
vdev Cache
 ●   vdev cache occurs at the SPA level
      –   readahead
      –   10 MBytes per vdev
      –   only caches metadata (as of b70)
 ●   Stats collected as Solaris kstats
# kstat -n vdev_cache_stats
module: zfs                    instance: 0
name:   vdev_cache_stats       class:    misc
        crtime                 38.83342625
        delegations            14030
        hits                   105169
        misses                 59452
        snaptime               4564628.18130739
Hit rate = 59%, not bad...
June 13, 2009      © 2009 Richard Elling   151
Intelligent Prefetching
 ●   Intelligent file-level prefetching occurs at the DMU level
 ●   Feeds the ARC
 ●   In a nutshell, prefetch hits cause more prefetching
      –   Read a block, prefetch a block
      –   If we used the prefetched block, read 2 more blocks
      –   Up to 256 blocks
 ●   Recognizes strided reads
      –   2 sequential reads of same length and a fixed distance will be coalesced
 ●   Fetches backwards
 ●   Seems to work pretty well, as-is, for most workloads
 ●   Easy to disable in mdb for testing on Solaris
      –   echo zfs_prefetch_disable/W0t1 | mdb -kw
June 13, 2009      © 2009 Richard Elling   152
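The mdb poke above lasts only until reboot; a persistent sketch, using the same tunable name, goes in /etc/system:

set zfs:zfs_prefetch_disable = 1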
I/O Queues
 ●   By default, for devices which can support it, 35 iops are queued to each vdev
      –   Tunable with zfs_vdev_max_pending
      –   echo zfs_vdev_max_pending/W0t10 | mdb -kw
 ●   Implies that more vdevs is better
      –   Consider avoiding a RAID array with a single, large LUN
 ●   ZFS I/O scheduler loses control once iops are queued
      –   CR6471212 proposes reserved slots for high-priority iops
 ●   May need to match queues for the entire data path
      –   zfs_vdev_max_pending
      –   Fibre channel, SCSI, SAS, SATA driver
      –   RAID array controller
 ●   Fast disks → small queues, slow disks → larger queues
June 13, 2009      © 2009 Richard Elling   153
COW Penalty
 ●   COW can negatively affect workloads which have updates and sequential reads
      –   Initial writes will be sequential
      –   Updates (writes) will cause seeks to read data
 ●   Lots of people seem to worry a lot about this
 ●   Only affects HDDs
 ●   Very difficult to speculate about the impact on real-world apps
      –   Large sequential scans of random data hurt anyway
      –   Reads are cached in many places in the data path
 ●   Sysbench benchmark used to test MySQL w/ InnoDB engine
      –   One hour read/write test
      –   select count(*)
      –   repeat, for a week
June 13, 2009      © 2009 Richard Elling   154
COW Penalty
 ●   Performance seems to level at about a 25% penalty
Results compliments of Allan Packer & Neelakanth Nadgir
http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf
June 13, 2009      © 2009 Richard Elling   155
About Disks...
 ●   Disks are still the most important performance bottleneck
      –   Modern processors are multi-core
      –   Default checksums and compression are computationally efficient
Disk      Size   RPM      Max Size   Average Rotational   Average Seek
                          (GBytes)   Latency (ms)         (ms)
HDD       2.5”   5,400    500        5.5                  11
HDD       3.5”   5,900    2,000      5.1                  16
HDD       3.5”   7,200    1,500      4.2                  8 - 8.5
HDD       2.5”   10,000   300        3                    4.2 - 4.6
HDD       2.5”   15,000   146        2                    3.2 - 3.5
SSD (w)   2.5”   N/A      73         0                    0.02 - 0.15
SSD (r)   2.5”   N/A      500        0                    0.02 - 0.15
June 13, 2009      © 2009 Richard Elling   156
DirectIO
 ●   UFS forcedirectio option brought the early 1980s design of UFS up to the 1990s
 ●   ZFS designed to run on modern multiprocessors
 ●   Databases or applications which manage their own data cache may benefit by disabling file system caching
 ●   Expect L2ARC to improve random reads
                        UFS DirectIO   ZFS
Unbuffered I/O                         primarycache=metadata / primarycache=none
Concurrency             Improved       Available at inception
Async I/O code path                    Available at inception
June 13, 2009      © 2009 Richard Elling   157
Hybrid Storage Pool (SPA)
                 Separate log device     L2ARC cache device      Main pool
Device           Write optimized (SSD)   Read optimized (SSD)    HDDs
Size (GBytes)    < 1 GByte               large                   big
Cost             write iops/$            size/$                  size/$
Performance      low-latency writes      low-latency reads       -
June 13, 2009      © 2009 Richard Elling   158
Future Plans
 ●   Announced enhancements, OpenSolaris Town Hall 2009.06
      –   de-duplication (see also GreenBytes ZFS+)
      –   user quotas (delivered b114)
      –   access-based enumeration
      –   snapshot reference counting
      –   dynamic LUN expansion (delivering b117?)
 ●   Others
      –   mirror to smaller disk (delivered b117)
June 13, 2009      © 2009 Richard Elling   159
It's a wrap!
Thank You!
Questions?
Richard.Elling@RichardElling.com
June 13, 2009      © 2009 Richard Elling   160
