ZFS Filesystem Properties3 [ZFS]
Property | Change?  | Brief Description
snapdir  |          | Controls whether the .zfs directory is visible
utf8only | creation | ...
ZFS Distro Properties [ZFS]
Pool Properties
Release | Property | Brief Description
illumos | comment  | Human-readable comment...
Performance and Tuning

About Disks [ext4, ZFS, btrfs]
• Hard disk drives are slow. Get over it.
[Table: average seek time (ms) and average rotational latency (ms) by drive type]
btrfs Performance [btrfs]
• Move metadata to separate devices
• Common option for distributed file systems
• Attribute-int...
ZFS Performance [ZFS]
[Diagram: pool layouts ranked Minimal, Good, Better, Best; an all-HDD main pool is improved by adding an SSD log device and an SSD cache device]
More ZFS Performance [ZFS]
[Diagram: Good, Better, Best layouts using a mirrored SSD log, an HDD main pool, and an SSD cache]
Device Sector Optimization [ZFS]
• Problem: not all drive sectors are equal and read-modify-write is inefficient
• 512 by...
Wounded Soldier [ext4, ZFS, btrfs]
[Chart: NFS service over time; annotated events are "Bad Disk Offlined" and "Resilver Complete"]
Summary

Woohoo!

Great File Systems! [ext4, ZFS, btrfs]
• All of these file systems have great features and bright futures
• Now you know h...
Websites [ZFS, btrfs]
• www.Open-ZFS.org
• www.ZFSonLinux.org
• github.com/zfsonlinux/pkg-zfs/wiki/HOWTOinstall-Ubuntu-to-a...
Online Chats [ZFS, btrfs]
• irc.freenode.net
  • #zfs - general ZFS discussions
  • #zfsonlinux - Linux-specific discussions
Thank You!
Richard.Elling@RichardElling.com
@richardelling
S8 File Systems Tutorial USENIX LISA13

Slides from the S8 File Systems Tutorial at the USENIX LISA'13 conference in Washington, DC. The topic covers ext4, btrfs, and ZFS with an emphasis on Linux implementations.
Transcript of "S8 File Systems Tutorial USENIX LISA13"

  1. File Systems: Top to Bottom and Back
     Richard.Elling@RichardElling.com
     LISA’13, Washington, DC
     November 3, 2013

  2. Agenda
     • Introduction
     • Installation
     • Creation and Destruction
     • Backup and Restore
     • Migration
     • Settings and Options
     • Performance and Tuning
     November 3, 2013 File Systems: Top to Bottom and Back — USENIX LISA’13 Slide 2

  3. Introduction

  4. File Systems [ext4, ZFS, btrfs]
     (The bracketed badges show which file systems each slide discusses.)
     • Today’s discussions: emphasis on Linux
       • ext4, with a few comments on ext3
       • btrfs
       • ZFS
     • Not in scope (maybe next year?)
       • ReFS
       • HFS+
  5. ext4 Highlights [ext4]
     • ext3 was limited
       • 16 TB filesystem size (32-bit block numbers)
       • 32k limit on subdirectories
       • Performance limitations
     • ext4 is the natural successor
       • Easy migration from ext3
       • Replace indirect blocks with extents
       • > 16 TB filesystem size
       • Preallocation
       • Journal checksums
     • Now default on many Linux distros

  6. ZFS Highlights [ZFS]
     • Figure out why storage has become so complicated
     • Blow away 20+ years of obsolete assumptions
     • Sun had to replace UFS
     • Opportunity to design an integrated system from scratch
     • Widely ported: Linux, FreeBSD, OSX
     • Builtin RAID
     • Checksums
     • Large scale (256 ZB)

  7. btrfs [btrfs]
     • New copy-on-write file system
     • Pooled storage model
     • Snapshots
     • Checksums
     • Large scale (16 EB)
     • Builtin RAID
     • Clever in-place migration from ext3
  8. Pooled Storage Model [ZFS, btrfs]
     • Old school
       • 1 disk means
         • 1 file system
         • 1 directory structure (directory tree)
       • File systems didn’t change when virtual disks (e.g. RAID) arrived
         • ok, so we could partition them... ugly solution
     • New school
       • Combine storage devices into a pool
       • Allow many file systems per pool

  9. Sysadmin’s View of Pools [ZFS, btrfs]
     [Diagram: Pool → Configuration Information plus datasets: File System, File System, Volume, File System]

  10. Blocks and Extents [ext4, ZFS, btrfs]
      • Early file systems were block-based
        • ext3, UFS, FAT
        • Data blocks are fixed sizes
        • Difficult to scale due to indirection levels and allocation algorithms
      • Extents solve many indirection issues
        • An extent is a contiguous area of storage reserved for a file
        • Data blocks are variable sizes
        • ext4, btrfs, ZFS, XFS, NTFS, VxFS
  11. Blocks and Extents [ext4, ZFS, btrfs]
      [Diagram:]
      • Block-based: metadata is a list of (direct) pointers to fixed-size data blocks
      • Extent-based: metadata is a list of extent structures (offset + length) pointing to mixed-size blocks

  12. Scalability [ext4, ZFS, btrfs]
      Problem: what happens when we need more metadata?
      • Block-based: go with indirect blocks
        • Really just pointers to pointers
        • Gets ugly at triple-indirection
        • Function of data size and block size
      • Extent-based: grow trees
        • B-trees are popular
          • ext4, for more than 3 levels
          • btrfs
        • ZFS uses a Merkle tree

  13. Indirect Blocks [ext3, UFS]
      [Diagram: direct pointers, then indirect and double-indirect blocks fanning out to data blocks]
      Problem 1: big files use lots of indirection
      Problem 2: metadata size fixed at creation
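The indirection cost above can be made concrete with a little arithmetic. A minimal sketch, assuming an ext3-like geometry (4 KiB blocks, 4-byte block pointers, 12 direct pointers in the inode; these numbers are illustrative, not taken from the slides): it counts how many extra metadata blocks a 1 GiB file needs under the classic direct/indirect scheme.

```shell
# Metadata cost of the classic direct/indirect pointer scheme.
# Assumed ext3-like geometry (illustrative only): 4 KiB blocks,
# 4-byte block pointers, 12 direct pointers in the inode.
BS=4096                               # block size in bytes
PTRS=$(( BS / 4 ))                    # pointers per indirect block: 1024
DIRECT=12                             # direct pointers in the inode

FILE_BYTES=$(( 1024 * 1024 * 1024 ))  # a 1 GiB file
BLOCKS=$(( FILE_BYTES / BS ))         # data blocks needed

LEFT=$(( BLOCKS - DIRECT ))           # blocks not covered by direct pointers
META=1                                # one single-indirect block...
LEFT=$(( LEFT - PTRS ))               # ...covers the next PTRS blocks
L2=$(( (LEFT + PTRS - 1) / PTRS ))    # ceil(): second-level indirect blocks
META=$(( META + 1 + L2 ))             # plus the double-indirect top block

echo "data blocks: $BLOCKS, indirect metadata blocks: $META"
# -> data blocks: 262144, indirect metadata blocks: 257
```

Even at 1 GiB the scheme already needs a double-indirect tree; an extent list stays far smaller because one (offset + length) entry can describe a whole contiguous run.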
  14. Treed Metadata [ext4, ZFS, btrfs]
      [Diagram: a root node pointing down to data blocks]
      • Trees can be large, yet efficiently searched and modified
      • Enables copy-on-write (COW)
      • Lots of good computer science here!

  15. Trees Allow Copy-on-Write [ZFS, btrfs]
      [Diagram:]
      1. Initial block tree
      2. COW some data
      3. COW metadata
      4. Update Uberblocks & free

  16. fsck [ext4, ZFS, btrfs]
      Problem: how do we know the metadata is correct?
      • Keep redundant copies
      • But what if the copies don’t agree?
        1. File system check reconciles metadata inconsistencies
           • fsck (ext[234], btrfs, UFS), chkdsk (FAT), etc
           • Repairs problems that are known to occur (!)
           • Does not repair data (!)
        2. Build a transactional system with atomic updates
           • Databases (MySQL, Oracle, etc)
           • ZFS
  17. Installation

  18. Ubuntu 12.04.3 [ext4, ZFS, btrfs]
      • ext4 = default root file system
      • btrfs version v0.19 installed by default
      • ZFS
        1. Install python-software-properties
           apt-get install python-software-properties
        2. Add ZFSonLinux repo
           apt-add-repository --yes ppa:zfs-native/stable
           apt-get update
        3. Install ZFS package
           apt-get install debootstrap ubuntu-zfs
        4. Verify
           modprobe -l zfs
           dmesg | grep ZFS:

  19. Fedora Core F19 [ext4, ZFS, btrfs]
      • ext4 = default root file system
      • btrfs version v0.20-rc1 installed by default
      • ZFS
        1. Update to latest package versions
        2. Add ZFSonLinux repo (beware of word wrap)
           yum localinstall --nogpgcheck http://archive.zfsonlinux.org/fedora/zfs-release-1-2$(rpm -E %dist).noarch.rpm
        3. Install ZFS package
           yum install zfs
        4. Verify
           modprobe -l zfs
           dmesg | grep ZFS:
  20. AΩ
      Creation and Destruction

  21. But first... a brief discussion of RAID

  22. RAID Basics
      • Disks fail. Sometimes they lose data. Sometimes they completely die. Get over it.
      • RAID = Redundant Array of Inexpensive Disks
      • RAID = Redundant Array of Independent Disks
      • Key word: Redundant
      • Redundancy is good. More redundancy is better.
      • Everything else fails, too. You’re over it by now, right?
  23. RAID-0 or Striping [ZFS, btrfs]
      • RAID-0
        • SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern
        • Good for space and performance
        • Bad for dependability
      • ZFS Dynamic Stripe
        • Data is dynamically mapped to member disks
        • No fixed-length sequences
        • Allocate up to ~1 MByte/vdev before changing vdev
        • Good combination of the concatenation feature with RAID-0 performance

  24. RAID-0 Example [ZFS, btrfs]
      [Diagram: RAID-0 with column size = 128 kBytes and stripe width = 384 kBytes, versus a ZFS dynamic stripe with recordsize = 128 kBytes; total write size = 2816 kBytes]
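The "regular rotating pattern" in the SNIA definition can be sketched as arithmetic. The geometry below matches the example slide (128 kByte columns, 384 kByte stripe width, i.e. three member disks); the map function is an illustration of the address mapping, not how any particular RAID implementation is coded.

```shell
# Map a RAID-0 virtual byte offset to (member disk, byte offset on that
# disk). Geometry matches the example slide: column size 128 kBytes and
# three member disks, giving a 384 kByte stripe width. Illustrative only.
COL=$(( 128 * 1024 ))   # column (chunk) size in bytes
DISKS=3                 # member disks in the stripe

map() {
  off=$1
  stripe=$(( off / (COL * DISKS) ))         # which full stripe row
  disk=$(( (off / COL) % DISKS ))           # which member disk
  disk_off=$(( stripe * COL + off % COL ))  # offset within that disk
  echo "$disk $disk_off"
}

map 0                   # -> 0 0       (first byte lands on disk 0)
map $(( 128 * 1024 ))   # -> 1 0       (next column rotates to disk 1)
map $(( 384 * 1024 ))   # -> 0 131072  (wraps back to disk 0, next row)
```

A ZFS dynamic stripe has no such fixed formula: allocation picks a vdev at write time, which is what buys the concatenation-plus-striping behavior described above.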
  25. RAID-1 or Mirroring [ZFS, btrfs]
      • Straightforward: put N copies of the data on N disks
      • Good for read performance and dependability
      • Bad for space
      • Arbitration: btrfs and ZFS do not blindly trust either side of mirror
        • Most recent, correct view of data wins
        • Checksums validate data

  26. Traditional Mirrors
      [Diagram: the file system does a bad read and cannot tell; if it’s a metadata block the FS panics and does a disk rebuild, or we get back bad data]

  27. Checksums for Mirrors [ZFS, btrfs]
      • What if a disk is (mostly) ok, but the data became corrupted?
      • btrfs and ZFS improve dependability using checksums for data, and store the checksums in metadata

  28. RAID-5 and RAIDZ [ZFS, btrfs]
      • N+1 redundancy
      • Good for space and dependability
      • Bad for performance
      • RAID-5 (btrfs)
        • Parity check data is distributed across the RAID array’s disks
        • Must read/modify/write when data is smaller than stripe width
      • RAIDZ (ZFS)
        • Dynamic data placement
        • Parity added as needed
        • Writes are full-stripe writes
        • No read/modify/write (write hole)

  29. RAID-5 and RAIDZ [ZFS, btrfs]
      RAID-5 layout:
        DiskA: D0:0  P1    D2:3  D3:2
        DiskB: D0:1  D1:0  P2    D3:3
        DiskC: D0:2  D1:1  D2:0  P3
        DiskD: D0:3  D1:2  D2:1  D3:0
        DiskE: P0    D1:3  D2:2  D3:1
      RAIDZ layout:
        DiskA: P0    P1    D2:1  D2:4
        DiskB: D0:0  D1:0  D2:2  D2:5
        DiskC: D0:1  D1:1  D2:3  P3
        DiskD: D0:2  P2:0  Gap   D3:0
        DiskE: D0:3  D2:0  P2:1
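The first parity level in these layouts is plain XOR (slide 30's "Parity 1"), which is what makes single-disk reconstruction cheap. A one-byte sketch with made-up data values:

```shell
# Single-parity reconstruction via XOR, the "Parity 1" scheme used by
# RAID-5 and RAIDZ. One-byte "blocks" with made-up values on a 3+1 stripe.
D0=0x41; D1=0x42; D2=0x43     # data blocks (hypothetical contents)
P=$(( D0 ^ D1 ^ D2 ))         # parity stored on the parity disk

# The disk holding D1 dies: XOR parity with the surviving data blocks.
RECOVERED=$(( P ^ D0 ^ D2 ))
printf 'lost=0x%02X recovered=0x%02X\n' $(( D1 )) "$RECOVERED"
# -> lost=0x42 recovered=0x42
```

The same trick only survives one failure; recovering from two or three concurrent failures needs the Reed-Solomon syndromes on the next slide.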
  30. RAID-6, RAIDZ2, RAIDZ3 [ZFS, btrfs]
      • Adding more parity
        • Parity 1: XOR
        • Parity 2: another Reed-Solomon syndrome
        • Parity 3: yet another Reed-Solomon syndrome
      • Double parity: N+2
        • RAID-6 (btrfs)
        • RAIDZ2 (ZFS)
      • Triple parity: N+3
        • RAIDZ3 (ZFS)

  31. Dependability vs Space [ZFS, btrfs]
      Dependability model metric: MTTDL = Mean time to data loss (bigger is better).
      For this analysis, RAIDZ1/2 and RAID-5/6 are equivalent.

  32. We now return you to your regularly scheduled program: AΩ
  33. Create a Simple Pool [ZFS, btrfs]
      1. Determine the name of an unused disk
         • /dev/sd* or /dev/hd*
         • /dev/disk/by-id
         • /dev/disk/by-path
         • /dev/disk/by-vdev (ZFS)
      2. Create a simple pool
         • btrfs
           mkfs.btrfs -m single /dev/sdb
         • ZFS
           zpool create zwimming /dev/sdd
           Note: might need “-f” flag to create EFI label
      3. Woohoo!

  34. Verify Pool Status [ZFS, btrfs]
      • btrfs
        btrfs filesystem show
      • ZFS
        zpool status

  35. Destroy Pool [ZFS, btrfs]
      • btrfs
        • Unmount all btrfs file systems
      • ZFS
        zpool destroy zwimming
        • Unmounts file systems and volumes
        • Exports pool
        • Marks pool as destroyed
        • Walk away...
        • Until overwritten, data is still ok and can be imported again
        • To see destroyed ZFS pools
          zpool import -D

  36. Create Mirrored Pool [ZFS, btrfs]
      1. Determine the name of two unused disks
      2. Create a mirrored pool
         • btrfs
           mkfs.btrfs -d raid1 /dev/sdb /dev/sdc
           • -d specifies redundancy for data; metadata is redundant by default
         • ZFS
           zpool create zwimming mirror /dev/sdd /dev/sde
      3. Woohoo!
      4. Verify

  37. Creating Filesystems
  38. Create & Mount File System [ext4, ZFS, btrfs]
      • Make some mount points for this example
        mkdir /mnt.ext4
        mkdir /mnt.btrfs
      • ext4
        mkfs.ext4 /dev/sdf
        mount /dev/sdf /mnt.ext4
      • btrfs
        mount /dev/sdb /mnt.btrfs
      • ZFS
        • zpool create already made a file system and mounted it at /zwimming
      • Verify...

  39. But first... a brief introduction to accounting principles

  40. Verify Mounted File Systems [ext4, ZFS, btrfs]
      • df is a handy tool to verify mounted file systems

        root@ubuntu:~# df -h
        Filesystem  Size  Used  Avail  Use%  Mounted on
        ...
        /dev/sdf    976M  1.3M  924M   1%    /mnt.ext4
        zwimming    976M  0     976M   0%    /zwimming
        /dev/sdb    1.0G  56K   894M   1%    /mnt.btrfs

      • WAT?
        • Pool space accounting isn’t like traditional filesystem space accounting
        • NB: the raw disk has 1,073,741,824 bytes

  41. Again! [ext4, ZFS, btrfs]
      • Try again with our mirrored pool examples

        root@ubuntu:~# df -h
        Filesystem  Size  Used  Avail  Use%  Mounted on
        ...
        /dev/sdf    976M  1.3M  924M   1%    /mnt.ext4
        zwimming    976M  0     976M   0%    /zwimming
        /dev/sdc    2.0G  56K   1.8G   1%    /mnt.btrfs

      • WAT, WAT, WAT?
        • The accounting is correct; your understanding of the accounting might need a little bit of help
        • Adding RAID-5, compression, copies, and deduplication makes accounting very confusing

  42. Accounting Sanity [ZFS, btrfs]
      • A full explanation of the accounting for pools is an opportunity for aspiring writers!
      • A more pragmatic view:
        • The accounting is correct
        • You can tell how much space is unallocated (free), but you can’t tell how much data you can put into it, until you do so
  43. btrfs subvolumes and ZFS filesystems

  44. One Pool, Many File Systems [ZFS, btrfs]
      [Diagram: Pool → Configuration Information plus datasets: File Systems and a Volume]
      • Good idea: create new file systems when you want a new policy
        • readonly, quota, snapshots/clones, etc
      • Act like directories, but slightly heavier

  45. Create New File Systems [ZFS, btrfs]
      • Context: new file system in existing pool
      • btrfs
        btrfs subvolume create /mnt.btrfs/sv1
      • ZFS
        zfs create zwimming/fs1
      • Verify

        root@ubuntu:~# df -h
        Filesystem    Size  Used  Avail  Use%  Mounted on
        ...
        /dev/sdf      976M  1.3M  924M   1%    /mnt.ext4
        zwimming      976M  128K  976M   1%    /zwimming
        /dev/sdb      1.0G  64K   894M   1%    /mnt.btrfs
        zwimming/fs1  976M  128K  976M   1%    /zwimming/fs1
        root@ubuntu:~# ls -l /mnt.btrfs
        total 0
        drwxr-xr-x 1 root root 0 Nov 2 20:30 sv1
        root@ubuntu:~# ls -l /zwimming
        total 2
        drwxr-xr-x 2 root root 2 Nov 2 20:29 fs1
        root@ubuntu:~# btrfs subvolume list /mnt.btrfs
        ID 256 top level 5 path sv1

  46. Nesting [ZFS, btrfs]
      • It is tempting to create deep, nested multiple file system structures
      • But it increases management complexity
      • Good idea: use a shallow file system hierarchy

  47. Backup and Restore
  48. Traditional Tools [ext4, ZFS, btrfs]
      • For file systems, the traditional tools work as you expect
        • cp, scp, tar, rsync, zip, ...
      • For ZFS volumes, dd
      • But those are boring; let’s talk about snapshots and replication

  49. Snapshots [ZFS, btrfs]
      [Diagram: snapshot tree root and current tree root sharing blocks]
      • Create a snapshot by not freeing COWed blocks
      • Snapshot creation is fast and easy
      • Number of snapshots determined by use – no hardwired limit
      • Recursive snapshots also possible in ZFS
      • Terminology: btrfs “writable snapshot” is like ZFS “clone”

  50. Create Read-only Snapshot [ZFS, btrfs]
      • btrfs
        • btrfs version v0.20-rc1 or later
        • Read-only needed for btrfs send
        btrfs subvolume snapshot -r /mnt.btrfs/sv1 /mnt.btrfs/sv1_ro
      • ZFS
        zfs snapshot zwimming@snapme

  51. Create Writable Snapshot [ZFS, btrfs]
      • btrfs
        btrfs subvolume snapshot /mnt.btrfs/sv1 /mnt.btrfs/sv1_snap
      • ZFS
        zfs snapshot zwimming@snapme
        zfs clone zwimming@snapme zwimming/cloneme

        root@ubuntu:~# btrfs subvolume snapshot /mnt.btrfs/sv1 /mnt.btrfs/sv1_snap
        Create a snapshot of '/mnt.btrfs/sv1' in '/mnt.btrfs/sv1_snap'
        root@ubuntu:~# btrfs subvolume list /mnt.btrfs
        ID 256 top level 5 path sv1
        ID 257 top level 5 path sv1_snap
        root@ubuntu:~# zfs snapshot zwimming@snapme
        root@ubuntu:~# zfs list -t snapshot
        NAME             USED  AVAIL  REFER  MOUNTPOINT
        zwimming@snapme  0     -      31K    -
        root@ubuntu:~# ls -l /zwimming/.zfs/snapshot
        total 0
        dr-xr-xr-x 1 root root 0 Nov 2 21:02 snapme
        root@ubuntu:~# zfs clone zwimming@snapme zwimming/cloneme
        root@ubuntu:~# df -h
        Filesystem        Size  Used  Avail  Use%  Mounted on
        ...
        zwimming          976M  0     976M   0%    /zwimming
        zwimming/cloneme  976M  0     976M   0%    /zwimming/cloneme

  52. btrfs Send and Receive [btrfs]
      • New feature in v0.20-rc1
      • Operates on read-only snapshots
        btrfs subvolume snapshot -r /mnt.btrfs/sv1 /mnt.btrfs/sv1_ro
      • Note: send data must be on disk; either wait or use the sync command
      • Send the stream to stdout, receive from stdin

        root# btrfs subvolume snapshot -r /mnt.btrfs/sv1 /mnt.btrfs/sv1_ro
        root# sync
        root# btrfs subvolume create /mnt.btrfs/backup
        root# btrfs send /mnt.btrfs/sv1_ro | btrfs receive /mnt.btrfs/backup
        At subvol /mnt.btrfs/sv1_ro
        At subvol sv1_ro
        root# btrfs subvolume list /mnt.btrfs
        ID 256 gen 8 top level 5 path sv1
        ID 257 gen 8 top level 5 path sv1_ro
        ID 258 gen 13 top level 5 path backup
        ID 259 gen 14 top level 5 path backup/svr_ro
  53. ZFS Send and Receive [ZFS]
      • Works the same on file systems as volumes (datasets)
      • Send a snapshot as a stream to stdout
        • Whole: single snapshot
        • Incremental: difference between two snapshots
      • Receive a snapshot into a dataset
        • Whole: create a new dataset
        • Incremental: add to existing, common snapshot
      • Each snapshot has a GUID and creation time property
        • Good idea: avoid putting time in snapshot name; use the properties for automation
      • Example
        zfs send zwimming@snap | zfs receive zbackup/zwimming
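The whole-then-incremental cycle described above can be sketched as a script. Dataset and snapshot names are hypothetical, and the script only prints the commands (a dry run) so the sequence can be reviewed before running it for real:

```shell
# Whole-then-incremental ZFS replication, printed as a dry run. Dataset
# and snapshot names are hypothetical; swap echo for eval to execute.
SRC=zwimming
DST=zbackup/zwimming
run() { echo "$@"; }

# 1. Seed the destination with a whole send of the first snapshot
run "zfs snapshot ${SRC}@base"
run "zfs send ${SRC}@base | zfs receive ${DST}"

# 2. Later passes ship only the delta between the common snapshot
#    and a new one (-i = incremental)
run "zfs snapshot ${SRC}@next"
run "zfs send -i ${SRC}@base ${SRC}@next | zfs receive ${DST}"
```

The incremental step only works while source and destination still share the common snapshot, which is why rotation scripts keep the last successfully received snapshot around.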
  54. Migration

  55. Forward Migration [ext4, ZFS, btrfs]
      • But first... backup your data!
      • And second... test your backup
      • ext3 ➯ ext4
      • ext3 or ext4 ➯ btrfs
        • Cleverly treats existing ext3 or ext4 data as a readonly snapshot
        • btrfs seed devices
          • Read-only file system as basis of new file system
          • All writes are COWed into new file system
      • ZFS is fundamentally different
        • Use traditional copies: cp, tar, rsync, etc
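The ext3/ext4 ➯ btrfs path is driven by the btrfs-convert tool. A dry-run sketch of the sequence (the device name is a placeholder, and the fsck step is a common precaution rather than something the slide mandates):

```shell
# In-place ext3/ext4 -> btrfs conversion with btrfs-convert, printed as
# a dry run. /dev/sdX is a placeholder device name; swap echo for eval
# to execute for real, after backing up and testing the backup.
DEV=/dev/sdX
run() { echo "$@"; }

run "umount $DEV"            # the filesystem must not be mounted
run "fsck.ext4 -f $DEV"      # start from a known-clean filesystem
run "btrfs-convert $DEV"     # convert; old fs is kept as a read-only image
run "btrfs-convert -r $DEV"  # rollback, losing post-conversion changes
```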
  56. Reverting Migration [ext4, btrfs]
      • Once you start to use ext4 features or add data to btrfs, the old ext3 filesystem doesn’t see the new data
        • Seems to be unallocated space
        • Reverting loses the changes made after migration
      • But first... backup your data!
      • And second... test your backup

  57. Settings and Options
  58. ext4 Options [ext4]
      • Extends the function set available to ext2 and ext3
      • Creation options
        • uninit_bg creates the file system without initializing all of the block groups
          • speeds filesystem creation
          • can speed fsck
      • Mount options of note
        • barriers enabled by default
        • max_batch_time for coalescing synchronous writes
          • Adjusts dynamically by observing commit time
          • Use with caution; know your workload
        • discard/nodiscard for enabling TRIM for SSDs
          • Is TRIM actually useful? The jury is still out...

  59. btrfs Options [btrfs]
      • Mount options
        • degraded: useful when mounting redundant pools with broken or missing devices
        • compress: select zlib, lzo, or no compression algorithms
          • Note: by default, only compressible data is written compressed
        • discard: enables TRIM (see ext4 option)
        • fatal_errors: choose error fail policy

  60. ZFS Properties [ZFS]
      • Recall that ZFS doesn’t use fstab or mkfs
      • Properties are stored in metadata for the pool or dataset
      • By default, properties are inherited
      • Some properties are common to all datasets, but a specific dataset type may have additional properties
      • Easily set or retrieved via scripts
      • Can set at creation time, or later (restrictions apply)
      • In general, properties affect future file system activity
  61. Managing ZFS Properties [ZFS]
      • Pool properties
        zpool get all poolname
        zpool get propertyname poolname
        zpool set propertyname=value poolname
      • Dataset properties
        zfs get all dataset
        zfs get propertyname [dataset]
        zfs set propertyname=value dataset
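The "easily set or retrieved via scripts" point is easiest with machine-readable output: `zfs get -H` suppresses the header and separates the name/property/value/source fields with tabs. Since this sketch cannot assume a live pool, it parses a captured (hypothetical) line of that output:

```shell
# `zfs get -H` is the script-friendly form: no header line, and
# tab-separated name/property/value/source fields. The sample line is
# hypothetical stand-in output, since no live pool is assumed here.
SAMPLE="$(printf 'zwimming/fs1\tcompression\tlz4\tlocal')"

prop_value() {
  # $1 = one line of `zfs get -H` output; print just the value field
  printf '%s\n' "$1" | awk -F'\t' '{ print $3 }'
}

prop_value "$SAMPLE"    # -> lz4
```

On a real system the pipeline would be `zfs get -H compression zwimming/fs1 | awk -F'\t' '{print $3}'`, or simply `zfs get -H -o value compression zwimming/fs1`.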
  62. User-defined Properties [ZFS]
      • Useful for adding metadata to datasets
      • Limited to the description property on pools
      • Recall each pool has a dataset of the same name
      • Names
        • Must include colon ':'
        • Can contain lower case alphanumerics or “+” “.” “_”
        • Max length = 256 characters
        • By convention, module:property
          • com.sun:auto-snapshot
      • Values
        • Max length = 1024 characters
      • Examples
        • com.richardelling:important_files=true
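The naming rules above are mechanical enough to check in a script before calling `zfs set`. A sketch (the com.sun:auto-snapshot example implies '-' is also accepted, so the character class below includes it; zfs itself remains the authority):

```shell
# Check a user-property name against the rules on this slide: it must
# contain a colon, stay within 256 characters, and use lower-case
# alphanumerics or '+' '.' '_' (plus '-', which the com.sun:auto-snapshot
# example implies is accepted). A sketch only, not zfs's own validation.
valid_prop_name() {
  name=$1
  case $name in
    *:*) ;;                          # must include a colon
    *)   return 1 ;;
  esac
  [ ${#name} -le 256 ] || return 1   # max length = 256 characters
  case $name in
    *[!a-z0-9:+._-]*) return 1 ;;    # reject disallowed characters
  esac
  return 0
}

valid_prop_name "com.sun:auto-snapshot" && echo accepted   # -> accepted
valid_prop_name "NoColonHere" || echo rejected             # -> rejected
```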
  63. 63. ZFS Pool Properties ZFS Property altroot Change? Brief Description Alternate root directory (ala chroot) autoexpand Policy for expanding when vdev size changes autoreplace vdev replacement policy available readonly Available storage space bootfs Default bootable dataset for root pool cachefile Cache file to use other than /etc/zfs/ zpool.cache capacity dedupditto readonly Percent of pool space used Automatic copies for deduped data dedupratio readonly Deduplication efficiency metric delegation Master pool delegation switch failmode Catastrophic pool failure policy November 3, 2013 File Systems: Top to Bottom and Back — USENIX LISA’13 Slide 63
  64. 64. More ZFS Pool Properties ZFS Property feature@async_destroy Change? Brief Description Reduce pain of dataset destroy workload feature@empty_bpobj Improves performance for lots of snapshots feature@lz4_compress lz4 compression guid readonly Unique identifier health listsnapshots readonly Current health of the pool zfs list policy size used readonly Amount of space used version November 3, 2013 readonly Total size of pool readonly Current on-disk version File Systems: Top to Bottom and Back — USENIX LISA’13 Slide 64
  65. 65. Common Dataset Properties ZFS Property Change? available readonly checksum copies creation Space available to dataset & children Checksum algorithm compression compressratio Brief Description Compression algorithm readonly Compression ratio – logical size:referenced physical Number of copies of user data readonly Dataset creation time dedup Deduplication policy logbias Separate log write policy mlslabel Multilayer security label origin November 3, 2013 readonly For clones, origin snapshot File Systems: Top to Bottom and Back — USENIX LISA’13 Slide 65
ZFS
More Dataset Properties

Property        Change?   Brief Description
primarycache              ARC caching policy
readonly                  Is dataset in readonly mode?
referenced      readonly  Size of data accessible by this dataset
refreservation            Minimum space guaranteed to a dataset, excluding descendants (snapshots & clones)
reservation               Minimum space guaranteed to dataset, including descendants
secondarycache            L2ARC caching policy
sync                      Synchronous write policy
type            readonly  Type of dataset (filesystem, snapshot, volume)
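Dataset properties follow the same get/set pattern, with zfs instead of zpool. A sketch using a hypothetical dataset 'tank/db':

```shell
# Settable dataset properties ('tank/db' is a hypothetical dataset).
zfs set compression=lz4 tank/db     # compression algorithm
zfs set sync=always tank/db         # synchronous write policy

# Readonly properties report results, not policy.
zfs get compressratio,referenced tank/db
```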
ZFS
Still More Dataset Properties

Property              Change?   Brief Description
used                  readonly  Sum of usedby* (see below)
usedbychildren        readonly  Space used by descendants
usedbydataset         readonly  Space used by dataset
usedbyrefreservation  readonly  Space used by a refreservation for this dataset
usedbysnapshots       readonly  Space used by all snapshots of this dataset
zoned                 readonly  Is dataset added to non-global zone (Solaris)
ZFS
ZFS Volume Properties

Property      Change?   Brief Description
shareiscsi              iSCSI service (per-distro option)
volblocksize  creation  Fixed block size
volsize                 Implicit quota
zoned         readonly  Set if dataset delegated to non-global zone (Solaris)
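Because volblocksize is fixed at creation time, it has to be chosen up front; volsize can be changed later. A sketch with hypothetical names:

```shell
# Create a 10 GB volume with a fixed 8 KB block size
# (e.g. to match a database page size); 'tank/vol1' is hypothetical.
zfs create -V 10G -o volblocksize=8K tank/vol1

# volsize acts as an implicit quota and may be resized later.
zfs set volsize=20G tank/vol1
```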
ZFS
ZFS File System Properties

Property         Change?   Brief Description
aclinherit                 ACL inheritance policy, when files or directories are created
aclmode                    ACL modification policy, when chmod is used
atime                      Disable access time metadata updates
canmount                   Mount policy
casesensitivity  creation  Filename matching algorithm (CIFS client feature)
devices                    Device opening policy for dataset
exec                       File execution policy for dataset
mounted          readonly  Is file system currently mounted?
ZFS
ZFS File System Properties (2)

Property       Change?        Brief Description
nbmand         export/import  File system should be mounted with non-blocking mandatory locks (CIFS client feature)
normalization  creation       Unicode normalization of file names for matching
quota                         Max space dataset and descendants can consume
recordsize                    Suggested maximum block size for files
refquota                      Max space dataset can consume, not including descendants
setuid                        setuid mode policy
sharenfs                      NFS sharing options (per-distro)
sharesmb                      File system shared with SMB (per-distro)
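A sketch of the space-management properties in action, with hypothetical dataset names:

```shell
# 'tank/home' and 'tank/db' are hypothetical datasets; requires ZFS.
zfs set quota=100G tank/home     # caps dataset plus all descendants
zfs set refquota=80G tank/home   # caps the dataset alone; snapshots excluded
zfs set recordsize=8K tank/db    # hint: match the application's I/O size
```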
ZFS
ZFS File System Properties (3)

Property  Change?   Brief Description
snapdir             Controls whether .zfs directory is hidden
utf8only  creation  UTF-8 character file name policy
vscan               Virus scan enabled
xattr               Extended attributes policy
ZFS
ZFS Distro Properties

Pool properties:
Release     Property  Brief Description
illumos     comment   Human-readable comment field
ZFSonLinux  ashift    Sets default disk sector size

Dataset properties:
Release            Property    Brief Description
Solaris 11         encryption  Dataset encryption
Delphix/illumos    clones      Clone descendants
Delphix/illumos    refratio    Compression ratio for references
Solaris 11         share       Combines sharenfs & sharesmb
Solaris 11         shadow      Shadow copy
NexentaOS/illumos  worm        WORM feature
Delphix/illumos    written     Amount of data written since last snapshot
Performance and Tuning
ext4
ZFS
btrfs
About Disks
• Hard disk drives are slow. Get over it.

Disk     Size  RPM     Max Size (GBytes)  Avg Rotational Latency (ms)  Avg Seek (ms)
HDD      2.5”  5,400   1,000              5.5                          11
HDD      3.5”  5,900   4,000              5.1                          16
HDD      3.5”  7,200   4,000              4.2                          8 - 8.5
HDD      2.5”  10,000  300                3                            4.2 - 4.6
HDD      2.5”  15,000  146                2                            3.2 - 3.5
SSD (w)  2.5”  N/A     800                0                            0.02 - 0.25
SSD (r)  2.5”  N/A     1,000              0                            0.02 - 0.15
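The rotational-latency column is just physics: on average the head waits half a revolution, so latency = 60000 / (2 × RPM) milliseconds. A quick check of the table's values:

```shell
# Average rotational latency (ms) = 60000 / (2 * RPM): half a revolution.
for rpm in 5400 7200 10000 15000; do
  awk -v r="$rpm" 'BEGIN { printf "%d rpm: %.1f ms\n", r, 60000 / (2 * r) }'
done
```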
btrfs
btrfs Performance
• Move metadata to separate devices
  • Common option for distributed file systems
• Attribute-intensive workloads can benefit from faster metadata management

          Metadata          Pool
Minimal   HDD               HDD
Good      RAID-1 (SSD SSD)  RAID-1 (HDD HDD)
Better    RAID-1 (SSD SSD)  RAID-10 (HDD HDD mirrors, striped)
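As a related sketch: btrfs lets you choose metadata and data redundancy profiles independently at mkfs time (device names below are hypothetical):

```shell
# Mirror metadata (raid1) while leaving data unreplicated (single),
# so metadata survives a device loss; devices are hypothetical.
mkfs.btrfs -m raid1 -d single /dev/sdb /dev/sdc

# Inspect how metadata and data are laid out on a mounted filesystem.
btrfs filesystem df /mnt
```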
ZFS
ZFS Performance

          Log               Main Pool                          Cache
Minimal   -                 HDD                                -
Good      SSD               raidz/raidz2/raidz3 (HDD HDD HDD)  SSD
Better    mirror (SSD SSD)  mirror (HDD HDD)                   stripe (SSD SSD)
Best      mirror (SSD SSD)  stripe of mirrors (HDD HDD pairs)  stripe (SSD SSD SSD)
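A sketch of building such a hybrid pool from the command line (pool and device names are hypothetical):

```shell
# Main pool: a stripe of HDD mirrors ('Best' tier); names hypothetical.
zpool create tank mirror sda sdb mirror sdc sdd

# Mirrored SSD log absorbs synchronous writes (ZIL).
zpool add tank log mirror sde sdf

# Cache (L2ARC) devices stripe automatically and absorb random reads.
zpool add tank cache sdg sdh
```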
ZFS
More ZFS Performance
[Diagram: hybrid pool configurations ranked Good, Better, Best — mirrored SSD logs, HDD main pools (raidz variants or striped mirrors), and striped SSD cache devices; $/Byte rises from Good to Best]
ZFS
Device Sector Optimization
• Problem: not all drive sectors are equal and read-modify-write is inefficient
  • 512 bytes - legacy and enterprise
  • 4KB - Advanced Format (AF) consumer and high-density
• ZFSonLinux
  • zpool create ashift option (size = 2^ashift)

Sector size  ashift
512 bytes    9
4KB          12
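The table follows from the size = 2^ashift relationship; the pool-creation line below is a hedged sketch with hypothetical pool and device names:

```shell
# Sector size is 2^ashift: ashift=9 -> 512 bytes, ashift=12 -> 4096 bytes.
echo "$((1 << 9)) $((1 << 12))"

# On ZFSonLinux, force 4 KB sectors at pool creation time
# ('tank' and the devices are hypothetical):
# zpool create -o ashift=12 tank mirror sda sdb
```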
ext4
ZFS
btrfs
Wounded Soldier
[Diagram: NFS service level over time with a failing disk — service degrades until the bad disk is offlined, then recovers once the resilver is complete]
Summary
Woohoo!
ext4
ZFS
btrfs
Great File Systems!
• All of these file systems have great features and bright futures
• Now you know how to use them better!
• ext4 is now the default for many Linux distros
• btrfs takes it to the next level in the Linux ecosystem
• ZFS is widely ported to many different OSes
  • OpenZFS organization recently launched to be the focal point for open-source ZFS
  • We’re always looking for more contributors!
ZFS
btrfs
Websites
• www.Open-ZFS.org
• www.ZFSonLinux.org
• github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-install-Ubuntu-to-a-Native-ZFS-Root-Filesystem
• btrfs.wiki.kernel.org
ZFS
btrfs
Online Chats
• irc.freenode.net
  • #zfs - general ZFS discussions
  • #zfsonlinux - Linux-specific discussions
  • #btrfs - general btrfs discussions
Thank You!
Richard.Elling@RichardElling.com
@richardelling