S8 File Systems Tutorial USENIX LISA13

Slides from the S8 File Systems Tutorial at USENIX LISA'13 conference in Washington, DC. The topic covers ext4, btrfs, and ZFS with an emphasis on Linux implementations.


Usage Rights

CC Attribution License

    S8 File Systems Tutorial USENIX LISA13: Presentation Transcript

    • File Systems Top to Bottom and Back Richard.Elling@RichardElling.com LISA’13 Washington, DC November 3, 2013
    • Agenda: Introduction; Installation; Creation and Destruction; Backup and Restore; Migration; Settings and Options; Performance and Tuning
    • Introduction
    • File Systems (the ext4, btrfs, and ZFS markers on each slide indicate which file system is being discussed)
      - Today's discussion: emphasis on Linux
      - In scope: ext4 (with a few comments on ext3), btrfs, ZFS
      - Not in scope (maybe next year?): ReFS, HFS+
    • ext4 Highlights
      - ext3 was limited: 16 TB filesystem size (32-bit block numbers), 32k limit on subdirectories, performance limitations
      - ext4 is the natural successor: easy migration from ext3, indirect blocks replaced with extents, >16 TB filesystem size, preallocation, journal checksums
      - Now the default on many Linux distros
    • ZFS Highlights
      - Figure out why storage has become so complicated; blow away 20+ years of obsolete assumptions
      - Sun had to replace UFS: an opportunity to design an integrated system from scratch
      - Widely ported: Linux, FreeBSD, OS X
      - Built-in RAID, checksums, large scale (256 ZB)
    • btrfs Highlights
      - New copy-on-write file system
      - Pooled storage model, snapshots, checksums
      - Large scale (16 EB), built-in RAID
      - Clever in-place migration from ext3
    • Pooled Storage Model (ZFS, btrfs, ReFS)
      - Old school: 1 disk means 1 file system and 1 directory structure (directory tree)
      - File systems didn't change when virtual disks (eg RAID) arrived; ok, so we could partition them... an ugly solution
      - New school: combine storage devices into a pool, and allow many file systems per pool
    • Sysadmin's View of Pools (ZFS, btrfs) [diagram: a pool holds configuration information plus datasets, i.e. file systems and volumes]
    • Blocks and Extents (ext4, btrfs, ZFS)
      - Early file systems were block-based (ext3, UFS, FAT): data blocks are fixed sizes; difficult to scale due to indirection levels and allocation algorithms
      - Extents solve many indirection issues: an extent is a contiguous area of storage reserved for a file, and data blocks are variable sizes
      - Extent-based file systems: ext4, btrfs, ZFS, XFS, NTFS, VxFS
    • Blocks and Extents, illustrated (ext4, btrfs, ZFS) [diagram]
      - Block-based: metadata is a list of (direct) pointers to fixed-size blocks
      - Extent-based: metadata is a list of extent structures (offset + length) pointing to mixed-size blocks
    • Scalability (ext4, btrfs, ZFS)
      - Problem: what happens when we need more metadata?
      - Block-based: go with indirect blocks, which are really just pointers to pointers; gets ugly at triple indirection; a function of data size and block size
      - Extent-based: grow trees; B-trees are popular (ext4 for more than 3 levels, btrfs); ZFS uses a Merkle tree
    • Indirect Blocks (ext3, UFS) [diagram of direct, indirect, and double-indirect block pointers]
      - Problem 1: big files use lots of indirection
      - Problem 2: metadata size is fixed at creation
    • Treed Metadata (ext4, btrfs, ZFS)
      - Trees can be large, yet efficiently searched and modified
      - Enables copy-on-write (COW)
      - Lots of good computer science here!
    • Trees Allow Copy-on-Write (ZFS, btrfs) [diagram: 1. initial block tree, 2. COW some data, 3. COW metadata, 4. update uberblocks & free]
    • fsck (ext4, btrfs, ZFS)
      - Problem: how do we know the metadata is correct? Keep redundant copies... but what if the copies don't agree?
      1. File system check reconciles metadata inconsistencies: fsck (ext[234], btrfs, UFS), chkdsk (FAT), etc.; repairs problems that are known to occur (!); does not repair data (!)
      2. Build a transactional system with atomic updates: databases (MySQL, Oracle, etc.), ZFS (example below)
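      A minimal sketch of how each approach is exercised from the shell; the device and pool names follow the examples used elsewhere in this deck and are placeholders.
        # ext4: offline metadata check on an unmounted filesystem (read-only dry run)
        fsck.ext4 -n /dev/sdf
        # btrfs: offline check; newer btrfs-progs call this 'btrfs check', older releases ship it as 'btrfsck'
        btrfs check /dev/sdb
        # ZFS: no fsck; scrub the pool online and let checksums find and repair bad copies
        zpool scrub zwimming
        zpool status zwimming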
    • Installation
    • Ubuntu 12.04.3 (ext4, btrfs, ZFS)
      - ext4 = default root file system
      - btrfs version v0.19 installed by default
      - ZFS:
        1. Install python-software-properties: apt-get install python-software-properties
        2. Add the ZFSonLinux repo: apt-add-repository --yes ppa:zfs-native/stable ; apt-get update
        3. Install the ZFS package: apt-get install debootstrap ubuntu-zfs
        4. Verify: modprobe -l zfs ; dmesg | grep ZFS:
    • Fedora 19 (ext4, btrfs, ZFS)
      - ext4 = default root file system
      - btrfs version v0.20-rc1 installed by default
      - ZFS:
        1. Update to the latest package versions
        2. Add the ZFSonLinux repo (beware of word wrap): yum localinstall --nogpgcheck http://archive.zfsonlinux.org/fedora/zfs-release-1-2$(rpm -E %dist).noarch.rpm
        3. Install the ZFS package: yum install zfs
        4. Verify: modprobe -l zfs ; dmesg | grep ZFS:
    • AΩ Creation and Destruction
    • But first... a brief discussion of RAID
    • RAID Basics
      - Disks fail. Sometimes they lose data. Sometimes they completely die. Get over it.
      - RAID = Redundant Array of Inexpensive Disks; RAID = Redundant Array of Independent Disks
      - Key word: Redundant. Redundancy is good. More redundancy is better.
      - Everything else fails, too. You're over it by now, right?
    • RAID-0 or Striping (ZFS, btrfs)
      - RAID-0, SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern
      - Good for space and performance; bad for dependability
      - ZFS dynamic stripe: data is dynamically mapped to member disks, with no fixed-length sequences; allocates up to ~1 MByte per vdev before changing vdev; a good combination of the concatenation feature with RAID-0 performance
    • RAID-0 Example (ZFS, btrfs) [diagram: RAID-0 with column size = 128 kBytes and stripe width = 384 kBytes vs a ZFS dynamic stripe with recordsize = 128 kBytes; total write size = 2816 kBytes]
    • RAID-1 or Mirroring (ZFS, btrfs)
      - Straightforward: put N copies of the data on N disks
      - Good for read performance and dependability; bad for space
      - Arbitration: btrfs and ZFS do not blindly trust either side of a mirror; the most recent, correct view of the data wins, and checksums validate the data
    • Traditional Mirrors [diagram]
      - The file system does a bad read and cannot tell
      - If it's a metadata block, the FS panics and does a disk rebuild; otherwise we get back bad data
    • Checksums for Mirrors (ZFS, btrfs)
      - What if a disk is (mostly) ok, but the data became corrupted?
      - btrfs and ZFS improve dependability by using checksums for data and storing the checksums in metadata
    • RAID-5 and RAIDZ (ZFS, btrfs)
      - N+1 redundancy: good for space and dependability, bad for performance
      - RAID-5 (btrfs): parity check data is distributed across the RAID array's disks; must read/modify/write when the data is smaller than the stripe width
      - RAIDZ (ZFS): dynamic data placement, parity added as needed, writes are full-stripe writes, no read/modify/write (no write hole)
    • RAID-5 and RAIDZ Layout (ZFS, btrfs) [diagram: data and parity block placement across disks A-E, with rotating parity for RAID-5 vs variable-width stripes, per-stripe parity, and gaps for RAIDZ]
    • RAID-6, RAIDZ2, RAIDZ3 (ZFS, btrfs)
      - Adding more parity: parity 1 is XOR, parity 2 is another Reed-Solomon syndrome, parity 3 is yet another Reed-Solomon syndrome
      - Double parity (N+2): RAID-6 (btrfs), RAIDZ2 (ZFS)
      - Triple parity (N+3): RAIDZ3 (ZFS)
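      A minimal sketch of creating double- and triple-parity ZFS pools; the pool and device names are placeholders, and the btrfs RAID-5/6 profiles were still experimental in this timeframe.
        # N+2: double parity across five disks
        zpool create zwimming raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
        # N+3: triple parity, with one more disk for the extra syndrome
        zpool create zwimming raidz3 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg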
    • Dependability vs Space (ZFS, btrfs) [chart]
      - Dependability model metric: MTTDL = mean time to data loss (bigger is better)
      - For this analysis, RAIDZ1/2 and RAID-5/6 are equivalent
    • We now return you to your regularly scheduled program: AΩ
    • Create a Simple Pool (ZFS, btrfs)
      1. Determine the name of an unused disk: /dev/sd* or /dev/hd*, /dev/disk/by-id, /dev/disk/by-path, /dev/disk/by-vdev (ZFS)
      2. Create a simple pool
         - btrfs: mkfs.btrfs -m single /dev/sdb
         - ZFS: zpool create zwimming /dev/sdd (note: might need the "-f" flag to create an EFI label)
      3. Woohoo!
    • Verify Pool Status (ZFS, btrfs)
      - btrfs: btrfs filesystem show
      - ZFS: zpool status
    • Destroy Pool (ZFS, btrfs)
      - btrfs: unmount all btrfs file systems
      - ZFS: zpool destroy zwimming; unmounts file systems and volumes, exports the pool, marks the pool as destroyed
      - Walk away... until overwritten, the data is still ok and can be imported again
      - To see destroyed ZFS pools: zpool import -D
    • Create Mirrored Pool (ZFS, btrfs)
      1. Determine the names of two unused disks
      2. Create a mirrored pool
         - btrfs: mkfs.btrfs -d raid1 /dev/sdb /dev/sdc (-d specifies redundancy for data; metadata is redundant by default)
         - ZFS: zpool create zwimming mirror /dev/sdd /dev/sde
      3. Woohoo!
      4. Verify
    • Creating Filesystems
    • Create & Mount File System (ext4, btrfs, ZFS)
      - Make some mount points for this example: mkdir /mnt.ext4 ; mkdir /mnt.btrfs
      - ext4: mkfs.ext4 /dev/sdf ; mount /dev/sdf /mnt.ext4
      - btrfs: mount /dev/sdb /mnt.btrfs
      - ZFS: zpool create already made a file system and mounted it at /zwimming
      - Verify...
    • But first... a brief introduction to accounting principles
    • Verify Mounted File Systems (ext4, btrfs, ZFS)
      - df is a handy tool to verify mounted file systems:
        root@ubuntu:~# df -h
        Filesystem      Size  Used  Avail  Use%  Mounted on
        ...
        /dev/sdf        976M  1.3M   924M    1%  /mnt.ext4
        zwimming        976M     0   976M    0%  /zwimming
        /dev/sdb        1.0G   56K   894M    1%  /mnt.btrfs
      - WAT?
      - Pool space accounting isn't like traditional filesystem space accounting
      - NB: the raw disk has 1,073,741,824 bytes
    • Again! (ext4, btrfs, ZFS)
      - Try again with our mirrored pool examples:
        root@ubuntu:~# df -h
        Filesystem      Size  Used  Avail  Use%  Mounted on
        ...
        /dev/sdf        976M  1.3M   924M    1%  /mnt.ext4
        zwimming        976M     0   976M    0%  /zwimming
        /dev/sdc        2.0G   56K   1.8G    1%  /mnt.btrfs
      - WAT, WAT, WAT?
      - The accounting is correct; your understanding of the accounting might need a little bit of help
      - Adding RAID-5, compression, copies, and deduplication makes accounting very confusing
    • Accounting Sanity (ZFS, btrfs)
      - A full explanation of the accounting for pools is an opportunity for aspiring writers!
      - A more pragmatic view: the accounting is correct
      - You can tell how much space is unallocated (free), but you can't tell how much data you can put into it until you do so
    • btrfs subvolumes and ZFS filesystems
    • One Pool, Many File Systems (ZFS, btrfs) [diagram: a pool holds configuration information plus datasets, i.e. file systems and volumes]
      - Good idea: create new file systems when you want a new policy (readonly, quota, snapshots/clones, etc.); see the sketch below
      - They act like directories, but are slightly heavier
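      A minimal sketch of the per-file-system policy idea using ZFS properties; the dataset names and values are placeholders that follow the zwimming examples.
        # give a home dataset a quota, and keep an archive dataset read-only
        zfs create zwimming/home
        zfs set quota=10G zwimming/home
        zfs create zwimming/archive
        zfs set readonly=on zwimming/archive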
    • Create New File Systems (ZFS, btrfs)
      - Context: a new file system in an existing pool
      - btrfs: btrfs subvolume create /mnt.btrfs/sv1
      - ZFS: zfs create zwimming/fs1
      - Verify:
        root@ubuntu:~# df -h
        Filesystem      Size  Used  Avail  Use%  Mounted on
        ...
        /dev/sdf        976M  1.3M   924M    1%  /mnt.ext4
        zwimming        976M  128K   976M    1%  /zwimming
        /dev/sdb        1.0G   64K   894M    1%  /mnt.btrfs
        zwimming/fs1    976M  128K   976M    1%  /zwimming/fs1
        root@ubuntu:~# ls -l /mnt.btrfs
        total 0
        drwxr-xr-x 1 root root 0 Nov 2 20:30 sv1
        root@ubuntu:~# ls -l /zwimming
        total 2
        drwxr-xr-x 2 root root 2 Nov 2 20:29 fs1
        root@ubuntu:~# btrfs subvolume list /mnt.btrfs
        ID 256 top level 5 path sv1
    • Nesting (ZFS, btrfs)
      - It is tempting to create deep, nested, multiple file system structures
      - But it increases management complexity
      - Good idea: use a shallow file system hierarchy
    • Backup and Restore
    • Traditional Tools (ext4, btrfs, ZFS)
      - For file systems, the traditional tools work as you expect: cp, scp, tar, rsync, zip, ...
      - For ZFS volumes, dd
      - But those are boring; let's talk about snapshots and replication
    • Snapshots (ZFS, btrfs) [diagram: snapshot tree root vs current tree root]
      - Create a snapshot by not freeing COWed blocks
      - Snapshot creation is fast and easy
      - The number of snapshots is determined by use; there is no hardwired limit
      - Recursive snapshots are also possible in ZFS
      - Terminology: a btrfs "writable snapshot" is like a ZFS "clone"
    • Create Read-only Snapshot (ZFS, btrfs)
      - btrfs (v0.20-rc1 or later; read-only is needed for btrfs send): btrfs subvolume snapshot -r /mnt.btrfs/sv1 /mnt.btrfs/sv1_ro
      - ZFS: zfs snapshot zwimming@snapme
    • Create Writable Snapshot (ZFS, btrfs)
      - btrfs: btrfs subvolume snapshot /mnt.btrfs/sv1 /mnt.btrfs/sv1_snap
      - ZFS: zfs snapshot zwimming@snapme ; zfs clone zwimming@snapme zwimming/cloneme
        root@ubuntu:~# btrfs subvolume snapshot /mnt.btrfs/sv1 /mnt.btrfs/sv1_snap
        Create a snapshot of '/mnt.btrfs/sv1' in '/mnt.btrfs/sv1_snap'
        root@ubuntu:~# btrfs subvolume list /mnt.btrfs
        ID 256 top level 5 path sv1
        ID 257 top level 5 path sv1_snap
        root@ubuntu:~# zfs snapshot zwimming@snapme
        root@ubuntu:~# zfs list -t snapshot
        NAME              USED  AVAIL  REFER  MOUNTPOINT
        zwimming@snapme      0      -    31K  -
        root@ubuntu:~# ls -l /zwimming/.zfs/snapshot
        total 0
        dr-xr-xr-x 1 root root 0 Nov 2 21:02 snapme
        root@ubuntu:~# zfs clone zwimming@snapme zwimming/cloneme
        root@ubuntu:~# df -h
        Filesystem        Size  Used  Avail  Use%  Mounted on
        ...
        zwimming          976M     0   976M    0%  /zwimming
        zwimming/cloneme  976M     0   976M    0%  /zwimming/cloneme
    • btrfs Send and Receive
      - New feature in v0.20-rc1; operates on read-only snapshots: btrfs subvolume snapshot -r /mnt.btrfs/sv1 /mnt.btrfs/sv1_ro
      - Note: the data to be sent must be on disk, so either wait or use the sync command
      - Send to stdout, receive from stdin:
        root# btrfs subvolume snapshot -r /mnt.btrfs/sv1 /mnt.btrfs/sv1_ro
        root# sync
        root# btrfs subvolume create /mnt.btrfs/backup
        root# btrfs send /mnt.btrfs/sv1_ro | btrfs receive /mnt.btrfs/backup
        At subvol /mnt.btrfs/sv1_ro
        At subvol sv1_ro
        root# btrfs subvolume list /mnt.btrfs
        ID 256 gen 8 top level 5 path sv1
        ID 257 gen 8 top level 5 path sv1_ro
        ID 258 gen 13 top level 5 path backup
        ID 259 gen 14 top level 5 path backup/svr_ro
    • ZFS Send and Receive
      - Works the same on file systems as on volumes (datasets)
      - Send a snapshot as a stream to stdout: whole (a single snapshot) or incremental (the difference between two snapshots)
      - Receive a snapshot into a dataset: whole creates a new dataset; incremental adds to an existing, common snapshot
      - Each snapshot has a GUID and a creation time property; good idea: avoid putting the time in the snapshot name, and use the properties for automation
      - Example: zfs send zwimming@snap | zfs receive zbackup/zwimming
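      A minimal sketch of incremental replication built on the example above; the snapshot names and the zbackup pool are placeholders, and the incremental form assumes the receiving side already holds the earlier snapshot.
        # full send of a baseline snapshot
        zfs snapshot zwimming@monday
        zfs send zwimming@monday | zfs receive zbackup/zwimming
        # later, send only the changes between the two snapshots
        zfs snapshot zwimming@tuesday
        zfs send -i zwimming@monday zwimming@tuesday | zfs receive zbackup/zwimming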
    • Migration
    • Forward Migration (ext4, btrfs, ZFS)
      - But first... back up your data! And second... test your backup
      - ext3 ➯ ext4
      - ext3 or ext4 ➯ btrfs: cleverly treats the existing ext3 or ext4 data as a read-only snapshot
      - btrfs seed devices: a read-only file system as the basis of a new file system; all writes are COWed into the new file system
      - ZFS is fundamentally different: use traditional copies (cp, tar, rsync, etc.)
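      A minimal sketch of the in-place ext3/ext4 to btrfs path using the btrfs-convert tool from btrfs-progs; the device name is a placeholder, and the filesystem must be unmounted and clean before converting.
        umount /mnt.ext4
        fsck.ext4 -f /dev/sdf     # conversion expects a clean filesystem
        btrfs-convert /dev/sdf    # converts in place, keeping the original metadata as a saved image
        mount /dev/sdf /mnt.btrfs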
    • Reverting a Migration (ext4, btrfs)
      - Once you start to use ext4 features or add data to btrfs, the old ext3 filesystem doesn't see the new data; it appears to be unallocated space
      - Reverting loses the changes made after migration
      - But first... back up your data! And second... test your backup
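      A minimal sketch of rolling a conversion back with btrfs-convert's rollback option; this only works while the saved ext image created during conversion is still intact, and the device name is a placeholder.
        umount /mnt.btrfs
        btrfs-convert -r /dev/sdf   # roll back to the original ext3/ext4, discarding changes made since conversion
        mount /dev/sdf /mnt.ext4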
    • Settings and Options
    • ext4 Options
      - Extends the function set available to ext2 and ext3
      - Creation options: uninit_bg creates the file system without initializing all of the block groups; speeds filesystem creation and can speed fsck
      - Mount options of note:
        - barriers enabled by default
        - max_batch_time for coalescing synchronous writes; adjusts dynamically by observing commit time; use with caution, know your workload
        - discard/nodiscard for enabling TRIM on SSDs; is TRIM actually useful? The jury is still out...
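      A minimal sketch of the creation and mount options named above; the device, mount point, and option values are placeholders to adapt to your own workload.
        # create with uninitialized block groups
        mkfs.ext4 -O uninit_bg /dev/sdf
        # mount with TRIM enabled and a longer synchronous-write batching window (in microseconds)
        mount -o discard,max_batch_time=30000 /dev/sdf /mnt.ext4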
    • btrfs Options
      - Mount options:
        - degraded: useful when mounting redundant pools with broken or missing devices
        - compress: select the zlib, lzo, or no compression algorithms; note: by default, only data that actually compresses is written compressed
        - discard: enables TRIM (see the ext4 option)
        - fatal_errors: choose the error failure policy
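      A minimal sketch of mounting with these options; the device, mount point, and chosen values are placeholders.
        # mount with lzo compression and TRIM enabled
        mount -o compress=lzo,discard /dev/sdb /mnt.btrfs
        # mount a redundant filesystem that is missing a member device
        mount -o degraded /dev/sdb /mnt.btrfs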
    • ZFS Properties
      - Recall that ZFS doesn't use fstab or mkfs
      - Properties are stored in the metadata for the pool or dataset
      - By default, properties are inherited
      - Some properties are common to all datasets, but a specific dataset type may have additional properties
      - Easily set or retrieved via scripts
      - Can be set at creation time, or later (restrictions apply)
      - In general, properties affect future file system activity
    • Managing ZFS Properties
      - Pool properties: zpool get all poolname ; zpool get propertyname poolname ; zpool set propertyname=value poolname
      - Dataset properties: zfs get all dataset ; zfs get propertyname [dataset] ; zfs set propertyname=value dataset
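      A minimal sketch with concrete names substituted into the forms above; the pool and dataset follow the zwimming examples, and the chosen properties are only illustrations.
        # pool property: check the catastrophic-failure policy
        zpool get failmode zwimming
        # dataset properties: enable compression and check the result
        # (lz4 needs the lz4_compress pool feature; lzjb and gzip are always available)
        zfs set compression=lz4 zwimming/fs1
        zfs get compression,compressratio zwimming/fs1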
    • User-defined Properties (ZFS)
      - Useful for adding metadata to datasets
      - Limited to the description property on pools; recall that each pool has a dataset of the same name
      - Names: must include a colon ':'; can contain lower-case alphanumerics or '+', '.', '_'; max length = 256 characters; by convention, module:property (e.g. com.sun:auto-snapshot)
      - Values: max length = 1024 characters
      - Example: com.richardelling:important_files=true
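      A minimal sketch of setting and clearing a user property; the property name and dataset are placeholders following the naming convention above.
        # tag a dataset with site-specific metadata
        zfs set com.example:backup-policy=daily zwimming/fs1
        zfs get com.example:backup-policy zwimming/fs1
        # clear it again by reverting to the inherited value
        zfs inherit com.example:backup-policy zwimming/fs1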
    • ZFS Pool Properties (property, change restriction if any, description)
      - altroot: alternate root directory (ala chroot)
      - autoexpand: policy for expanding when vdev size changes
      - autoreplace: vdev replacement policy
      - available (readonly): available storage space
      - bootfs: default bootable dataset for the root pool
      - cachefile: cache file to use other than /etc/zfs/zpool.cache
      - capacity (readonly): percent of pool space used
      - dedupditto: automatic copies for deduped data
      - dedupratio (readonly): deduplication efficiency metric
      - delegation: master pool delegation switch
      - failmode: catastrophic pool failure policy
    • More ZFS Pool Properties
      - feature@async_destroy: reduce pain of the dataset destroy workload
      - feature@empty_bpobj: improves performance for lots of snapshots
      - feature@lz4_compress: lz4 compression
      - guid (readonly): unique identifier
      - health (readonly): current health of the pool
      - listsnapshots: zfs list policy
      - size (readonly): total size of the pool
      - used (readonly): amount of space used
      - version (readonly): current on-disk version
    • Common Dataset Properties (ZFS)
      - available (readonly): space available to dataset & children
      - checksum: checksum algorithm
      - compression: compression algorithm
      - compressratio (readonly): compression ratio (logical size : referenced physical)
      - copies: number of copies of user data
      - creation (readonly): dataset creation time
      - dedup: deduplication policy
      - logbias: separate log write policy
      - mlslabel: multilayer security label
      - origin (readonly): for clones, the origin snapshot
    • More Dataset Properties (ZFS)
      - primarycache: ARC caching policy
      - readonly: is the dataset in readonly mode?
      - referenced (readonly): size of data accessible by this dataset
      - refreservation: minimum space guaranteed to a dataset, excluding descendants (snapshots & clones)
      - reservation: minimum space guaranteed to a dataset, including descendants
      - secondarycache: L2ARC caching policy
      - sync: synchronous write policy
      - type (readonly): type of dataset (filesystem, snapshot, volume)
    • Still More Dataset Properties (ZFS)
      - used (readonly): sum of usedby* (see below)
      - usedbychildren (readonly): space used by descendants
      - usedbydataset (readonly): space used by the dataset
      - usedbyrefreservation (readonly): space used by a refreservation for this dataset
      - usedbysnapshots (readonly): space used by all snapshots of this dataset
      - zoned (readonly): is the dataset added to a non-global zone (Solaris)
    • ZFS Volume Properties
      - shareiscsi: iSCSI service (per-distro option)
      - volblocksize (set at creation): fixed block size
      - volsize: implicit quota
      - zoned (readonly): set if the dataset is delegated to a non-global zone (Solaris)
    • ZFS File System Properties
      - aclinherit: ACL inheritance policy, when files or directories are created
      - aclmode: ACL modification policy, when chmod is used
      - atime: disable access time metadata updates
      - canmount: mount policy
      - casesensitivity (set at creation): filename matching algorithm (CIFS client feature)
      - devices: device opening policy for the dataset
      - exec: file execution policy for the dataset
      - mounted (readonly): is the file system currently mounted?
    • ZFS File System Properties, continued
      - nbmand (export/import): file system should be mounted with non-blocking mandatory locks (CIFS client feature)
      - normalization (set at creation): Unicode normalization of file names for matching
      - quota: max space the dataset and descendants can consume
      - recordsize: suggested maximum block size for files
      - refquota: max space the dataset can consume, not including descendants
      - setuid: setuid mode policy
      - sharenfs: NFS sharing options (per-distro)
      - sharesmb: file system shared with SMB (per-distro)
    • ZFS File System Properties, continued
      - snapdir: controls whether the .zfs directory is hidden
      - utf8only (set at creation): UTF-8 character file name policy
      - vscan: virus scan enabled
      - xattr: extended attributes policy
    • ZFS Distro Properties
      - Pool properties: comment (illumos): human-readable comment field; ashift (ZFSonLinux): sets the default disk sector size
      - Dataset properties: encryption (Solaris 11): dataset encryption; clones (Delphix/illumos): clone descendants; refratio (Delphix/illumos): compression ratio for references; share (Solaris 11): combines sharenfs & sharesmb; shadow (Solaris 11): shadow copy; worm (NexentaOS/illumos): WORM feature; written (Delphix/illumos): amount of data written since the last snapshot
    • Performance and Tuning
    • About Disks (ext4, btrfs, ZFS)
      - Hard disk drives are slow. Get over it.
      - Disk / RPM / max size (GBytes) / average rotational latency (ms) / average seek (ms):
        - HDD 2.5", 5,400 rpm, 1,000 GB, 5.5 ms, 11 ms
        - HDD 3.5", 5,900 rpm, 4,000 GB, 5.1 ms, 16 ms
        - HDD 3.5", 7,200 rpm, 4,000 GB, 4.2 ms, 8 - 8.5 ms
        - HDD 2.5", 10,000 rpm, 300 GB, 3 ms, 4.2 - 4.6 ms
        - HDD 2.5", 15,000 rpm, 146 GB, 2 ms, 3.2 - 3.5 ms
        - SSD (write) 2.5", N/A, 800 GB, 0 ms, 0.02 - 0.25 ms
        - SSD (read) 2.5", N/A, 1,000 GB, 0 ms, 0.02 - 0.15 ms
    • btrfs Performance
      - Move metadata to separate devices, a common option for distributed file systems
      - Attribute-intensive workloads can benefit from faster metadata management
      - [table: metadata vs data pool configurations, from a minimal single-HDD layout through RAID-1 HDDs to RAID-1 SSD metadata with a RAID-1/RAID-10 HDD data pool]
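      A minimal sketch of choosing separate redundancy profiles for metadata and data at creation time; the device names are placeholders, and note that mkfs.btrfs selects profiles rather than pinning metadata to specific (faster) devices.
        # mirror metadata (-m) while striping+mirroring data (-d) across four disks
        mkfs.btrfs -m raid1 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde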
    • ZFS Performance [table: separate log (slog), main pool, and cache (L2ARC) device combinations, from a minimal HDD-only pool, through adding an SSD log, to mirrored SSD logs and SSD cache in front of mirrored or raidz/raidz2/raidz3 HDD pools]
    • More ZFS Performance [table: the same log, main pool, and cache combinations ranked by $/byte as well as by performance; the best-performing layouts cost the most per byte]
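      A minimal sketch of adding separate log and cache devices to an existing pool; the pool name and the SSD device names are placeholders.
        # mirrored SSDs for the intent log (slog)
        zpool add zwimming log mirror /dev/sdf /dev/sdg
        # striped SSDs for the L2ARC read cache
        zpool add zwimming cache /dev/sdh /dev/sdi
        zpool status zwimming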
    • Device Sector Optimization (ZFS)
      - Problem: not all drive sectors are equal, and read-modify-write is inefficient
      - 512 bytes: legacy and enterprise drives; 4 KB: Advanced Format (AF) consumer and high-density drives
      - ZFSonLinux: zpool create ashift option (sector size = 2^ashift); 512-byte sectors: ashift=9; 4 KB sectors: ashift=12
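      A minimal sketch of forcing 4 KB alignment at pool creation on ZFS on Linux; the pool and device names are placeholders, and ashift cannot be changed after a vdev is created.
        # 2^12 = 4096-byte sectors
        zpool create -o ashift=12 zwimming mirror /dev/sdd /dev/sde
        # one way to confirm the ashift recorded for the pool
        zdb -C zwimming | grep ashift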
    • Wounded Soldier (ext4, btrfs, ZFS) [diagram: NFS service behavior while a bad disk limps along, after the disk is offlined, and after the resilver completes]
    • Summary Woohoo!
    • Great File Systems! (ext4, btrfs, ZFS)
      - All of these file systems have great features and bright futures, and now you know how to use them better!
      - ext4 is now the default for many Linux distros
      - btrfs takes it to the next level in the Linux ecosystem
      - ZFS is widely ported to many different OSes; the OpenZFS organization recently launched to be the focal point for open-source ZFS, and we're always looking for more contributors!
    • Websites (ZFS, btrfs)
      - www.Open-ZFS.org
      - www.ZFSonLinux.org
      - github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-install-Ubuntu-to-a-Native-ZFS-Root-Filesystem
      - btrfs.wiki.kernel.org
    • Online Chats (ZFS, btrfs)
      - irc.freenode.net: #zfs (general ZFS discussions), #zfsonlinux (Linux-specific discussions), #btrfs (general btrfs discussions)
    • Thank You! Richard.Elling@RichardElling.com @richardelling