ZFS Nuts and Bolts

A look at the internals of Sun's ZFS filesystem.


  1. 1. ZFS Nuts and Bolts Eric Sproul OmniTI Computer Consulting
  2. 2. Quick Overview • More than just another filesystem: it’s a filesystem, a volume manager, and a RAID controller all in one • Production debut in Solaris 10 6/06 • 1 ZB = 1 billion TB • 128-bit • 2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem, 2^78 bytes/pool, 2^64 devices/pool, 2^64 pools/system
  3. 3. Old & Busted Traditional storage stack: filesystem (upper): filename to object (inode); filesystem (lower): object to volume LBA; volume manager: volume LBA to array LBA; RAID controller: array LBA to disk LBA • Strict separation between layers • Each layer often comes from a separate vendor • Complex, difficult to administer, hard to predict the performance of a particular combination
  4. 4. New Hotness • Telescoped stack: ZPL: filename to object; DMU: object to DVA; SPA: DVA to disk LBA • Terms: • ZPL: ZFS POSIX layer (standard syscall interface) • DMU: Data Management Unit (transactional object store) • DVA: Data Virtual Address (vdev + offset) • SPA: Storage Pool Allocator (block allocation, data transformation)
  5. 5. New Hotness • No more separate tools to manage filesystems vs. volumes vs. RAID arrays • 2 commands: zpool(1M), zfs(1M) (RFE exists to combine these) • Pooled storage means never getting stuck with too much or too little space in your filesystems • Can expose block devices as well; “zvol” blocks map directly to DVAs
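     A minimal sketch of that two-command model, assuming two spare disks (c1t0d0, c1t1d0) and a hypothetical pool named tank:
     # zpool create tank mirror c1t0d0 c1t1d0   # one mirrored pool of storage
     # zfs create tank/home                     # filesystems draw from pooled space; no sizes to pick
     # zfs create -V 10G tank/vol0              # a zvol: a block device under /dev/zvol/dsk/tank/
     # zfs list                                 # all datasets share the pool's free space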
  6. 6. ZFS Advantages • Fast • copy-on-write, pipelined I/O, dynamic striping, variable block size, intelligent resilvering • Simple management • End-to-end data integrity, self-healing • Checksum everything, all the time • Built-in goodies • block transforms • snapshots • NFS, CIFS, iSCSI sharing • Platform-neutral on-disk format
  7. 7. Getting Down to Brass Tacks How does ZFS achieve these feats?
  8. 8. ZFS I/O Life Cycle Writes 1. Translated to object transactions by the ZPL: “Make these 5 changes to these 2 objects.” 2. Transactions bundled in DMU into transaction groups (TXGs) that flush when full (1/8 of system memory) or at regular intervals (30 seconds) 3. Blocks making up a TXG are transformed (if necessary), scheduled and then issued to physical media in the SPA
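     The TXG rhythm described above can be observed from the command line; a hedged example, assuming a pool named tank on a lightly loaded system:
     # zpool iostat -v tank 5    # writes reach the vdevs in periodic bursts as transaction groups commit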
  9. 9. ZFS I/O Life Cycle Synchronous Writes • ZFS maintains a per-filesystem log called the ZFS Intent Log (ZIL). Each transaction gets a log sequence number. • When a synchronous command, such as fsync(), is issued, the ZIL commits blocks up to the current sequence number. This is a blocking operation. • The ZIL commits all necessary operations and flushes any write caches that may be enabled, ensuring that all bits have been committed to stable storage.
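     Not shown in the deck, but closely related: the ZIL can be placed on a dedicated low-latency log device so synchronous commits do not compete with the main pool disks. A sketch, assuming a spare fast disk c4t0d0:
     # zpool add tank log c4t0d0    # separate intent-log ("slog") device for synchronous writes
     # zpool status tank            # the log vdev is listed in its own section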
  10. 10. ZFS I/O Life Cycle Reads • ZFS makes heavy use of caching and prefetching • If requested blocks are not cached, issue a prioritized I/O that “cuts the line” ahead of pending writes • Writes are intelligently throttled to maintain acceptable read performance • ARC (Adaptive Replacement Cache) tracks recently and frequently used blocks in main memory • L2 ARC uses durable storage to extend the ARC
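     A hedged example of extending the ARC with an L2ARC device, assuming a spare SSD at c4t1d0:
     # zpool add tank cache c4t1d0    # cache vdev backs the in-memory ARC with durable storage
     # zpool iostat -v tank           # the cache device appears as its own line with its own read traffic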
  11. 11. Speed Is Life • Copy-on-write design means random writes can be made sequential • Pipelined I/O extracts maximum parallelism with out-of-order issue, sorting and aggregation • Dynamic striping across all underlying devices eliminates hot-spots • Variable block size = no wasted space or effort • Intelligent resilvering copies only live data, can do partial rebuild for transient outages
  12. 12. Copy-On-Write Initial block tree
  13. 13. Copy-On-Write New blocks represent changes Never modifies existing data
  14. 14. Copy-On-Write Indirect blocks also change
  15. 15. Copy-On-Write Atomically update uberblock to point at updated blocks The uberblock is special in that it does get overwritten, but 4 copies are stored as part of the vdev label and are updated in transactional pairs. Therefore, integrity on disk is maintained.
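     The redundant labels (and the uberblocks stored in them) can be inspected with zdb; a rough sketch, assuming a pool disk at /dev/dsk/c1t0d0s0 and noting that zdb is an unstable debugging tool whose flags and output vary between builds:
     # zdb -l /dev/dsk/c1t0d0s0    # dump the vdev label copies on the device
     # zdb -u tank                 # dump the pool's active uberblock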
  16. 16. Pipelined I/O (slides 16-23) Reorders writes to be as sequential as possible. App #1 and App #2 each issue writes; if left in original order, we waste a lot of time waiting for head and platter positioning: move head, spin wait, move head, move head, move head.
  24. 24. Pipelined I/O (slides 24-26) Pipelining lets us examine writes as a group and optimize their order, so the same writes complete with far fewer seeks: move head, move head.
  27. 27. Dynamic Striping • Load distribution across top-level vdevs • Factors determining block allocation include: • Capacity • Latency & bandwidth • Device health
  28. 28. Dynamic Striping # zpool create tank mirror c1t0d0 c1t1d0 mirror c2t0d0 c2t1d0: writes striped across both mirrors; reads occur wherever data was written. # zpool add tank mirror c3t0d0 c3t1d0: new data striped across all three mirrors; no migration of existing data; copy-on-write reallocates data over time, gradually spreading it across all three mirrors. * RFE exists for “on-demand” resilvering to explicitly re-balance
  29. 29. Variable Block Size • No single value works well with all types of files • Large blocks increase bandwidth but reduce metadata and can lead to wasted space • Small blocks save space for smaller files, but increase I/O operations on larger ones • Record-based files such as those used by databases have a fixed block size that must be matched by the filesystem to avoid extra overhead (blocks too small) or read-modify-write (blocks too large)
  30. 30. Variable Block Size • The DMU operates on units of a fixed record size; default is 128KB • Files that are less than the record size are written as a single filesystem block (FSB) of variable size in multiples of disk sectors (512B) • Files that are larger than the record size are stored in multiple FSBs equal to record size • DMU records are assembled into transaction groups and committed atomically
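     The record size is an ordinary per-dataset property; a hedged example of matching it to a database that does 8 KB I/O, using a hypothetical dataset tank/db:
     # zfs create tank/db
     # zfs set recordsize=8K tank/db    # match the database block size to avoid read-modify-write
     # zfs get recordsize tank/db       # verify; only affects newly written data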
  31. 31. Variable Block Size • FSBs are the basic unit of ZFS datasets; checksums are maintained per FSB • Handled by the SPA, which can optionally transform them (compression, ditto blocks today; encryption, de-dupe in the future) • Compression improves I/O performance, as fewer operations are needed on the underlying disk
  32. 32. Intelligent Resilver • a.k.a. rebuild, resync, reconstruct • Traditional resilvering is basically a whole-disk copy in the mirror case; RAID-5 does XOR of the other disks to rebuild • No priority given to more important blocks (top of the tree) • If you’ve copied 99% of the blocks, but the last 1% contains the top few blocks in the tree, another failure ruins everything
  33. 33. Intelligent Resilver • The ZFS way is metadata-driven • Live blocks only: just walk the block tree; unallocated blocks are ignored • Top-down: Start with the most important blocks. Every block copied increases the amount of discoverable data. • Transactional pruning: If the failure is transient, repair by identifying the missed TXGs. Resilver time is only slightly longer than the outage time.
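     A hedged example of kicking off a resilver, assuming mirror member c1t1d0 has failed and c1t5d0 is a spare disk:
     # zpool replace tank c1t1d0 c1t5d0    # attach the new disk; resilvering starts automatically
     # zpool status tank                   # reports resilver progress; only live blocks are copied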
  34. 34. Keep It Simple • Unified management model: pools and datasets • Datasets are just a group of tagged bits with certain attributes: filesystems, volumes, snapshots, clones • Properties can be set while the dataset is active • Hierarchical arrangement: children inherit properties of parent • Datasets become administration points: give every user or application their own filesystem
  35. 35. Keep It Simple • Datasets only occupy as much space as they need • Compression, quotas and reservations are built-in properties • Pools may be grown dynamically without service interruption
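     A minimal sketch of that model, assuming a hypothetical tank/home hierarchy:
     # zfs create tank/home
     # zfs set compression=on tank/home          # children inherit this property
     # zfs create tank/home/alice                # one filesystem per user
     # zfs set quota=10G tank/home/alice         # cap the space this user can consume
     # zfs set reservation=2G tank/home/alice    # guarantee a minimum
     # zfs get -r compression,quota,reservation tank/home    # shows local vs. inherited values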
  36. 36. Data Integrity • Not enough to be fast and simple; must be safe too • Silent corruption is our mortal enemy • Defects can occur anywhere: disks, firmware, cables, kernel drivers • Main memory has ECC; why shouldn’t storage have something similar? • Other types of corruption are also killers: power outages, accidental overwrites, using a data disk as swap
  37. 37. Data Integrity (slides 37-39) Traditional Method: Disk Block Checksum. The checksum is stored on disk alongside its data block. Only detects problems after data is successfully written (“bit rot”); won’t catch silent corruption caused by issues in the I/O path between disk and host.
  40. 40. Data Integrity The ZFS Way • Store data checksum in parent block pointer • Isolates faults between checksum and data • Forms a hash tree, enabling validation of the entire pool • 256-bit checksums • fletcher2 (default, simple and fast) or SHA-256 (slower, more secure) • Can be validated at any time with ‘zpool scrub’
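     A hedged example of selecting the stronger checksum and then validating the pool, assuming a dataset tank/myfs:
     # zfs set checksum=sha256 tank/myfs    # applies to newly written blocks
     # zpool scrub tank                     # walk the block tree and verify every checksum
     # zpool status -v tank                 # scrub progress, plus any data found to be damaged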
  41. 41. Data Integrity (slides 41-46, animation) An application reads a block through ZFS from a mirror; the copy returned by one device fails its checksum, so ZFS reads the other device, returns the good data to the application, and rewrites the damaged copy. Self-healing mirror!
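     Repairs like this are visible in the per-device error counters; a sketch, again assuming pool tank:
     # zpool status tank    # per-device READ/WRITE/CKSUM counters; checksum failures repaired from the mirror are counted here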
  47. 47. Goodie Bag • Block Transforms • Snapshots & Clones • Sharing (NFS, CIFS, iSCSI) • Platform-neutral on-disk format
  48. 48. Block Transforms • Handled at SPA layer, transparent to upper layers • Available today: • Compression • zfs set compression=on tank/myfs • LZJB (default) or GZIP • Multi-threaded as of snv_79 • Duplication, a.k.a. “ditto blocks” • zfs set copies=N tank/myfs • In addition to mirroring/RAID-Z: One logical block = up to 3 physical blocks • Metadata always has 2+ copies, even without ditto blocks • Copies stored on different devices, or different places on same device • Future: de-duplication, encryption
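     A hedged example combining those transforms on a hypothetical dataset tank/myfs:
     # zfs set compression=gzip tank/myfs    # or lzjb, the default algorithm when compression=on
     # zfs set copies=2 tank/myfs            # ditto blocks: two physical copies per logical block
     # zfs get compression,copies,compressratio tank/myfs    # compressratio reports the achieved savings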
  49. 49. Snapshots & Clones • zfs snapshot tank/myfs@thursday • Based on block birth time, stored in block pointer • Nearly instantaneous (<1 sec) on idle system • Communicates structure, since it is based on object changes, not just a block delta • Occupies negligible space initially, and only grows as large as the block changeset • Clone is just a read/write snapshot
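     A short sketch of that workflow, assuming the tank/myfs dataset from earlier slides:
     # zfs snapshot tank/myfs@thursday                # copy-on-write, near-instant
     # zfs list -t snapshot                           # snapshots start tiny and grow with the changeset
     # zfs clone tank/myfs@thursday tank/myfs-test    # a writable filesystem backed by the snapshot
     # zfs rollback tank/myfs@thursday                # discard changes made since the snapshot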
  50. 50. Sharing • NFSv4 • zfs set sharenfs=on tank/myfs • Automatically updates /etc/dfs/sharetab • CIFS • zfs set sharesmb=on tank/myfs • Additional properties control the share name and workgroup • Supports full NT ACLs and user mapping, not just POSIX uid • iSCSI • zfs set shareiscsi=on tank/myvol • Makes sharing block devices as easy as sharing filesystems
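     Because sharing is just a property, its state is queried the same way as any other; a sketch assuming tank/myfs and tank/myvol:
     # zfs get sharenfs,sharesmb tank/myfs    # per-filesystem sharing properties
     # zfs get shareiscsi tank/myvol          # volumes advertise iSCSI targets via the same mechanism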
  51. 51. On-Disk Format • Platform-neutral, adaptive endianness • Writes always use native endianness, recorded in a bit in the block pointer • Reads byteswap if necessary, based on comparison of host endianness to value of block pointer bit • Migrate between x86 and SPARC • No worries about device paths, fstab, mountpoints, it all just works • ‘zpool export’ on old host, move disks, ‘zpool import’ on new host • Also migrate between Solaris and non-Sun implementations, such as MacOS X and FreeBSD
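     A hedged sketch of that migration, assuming pool tank moves between two hosts:
     old-host# zpool export tank    # unmounts datasets and marks the pool exported; safe to pull the disks
     new-host# zpool import         # lists pools discovered on attached devices
     new-host# zpool import tank    # imports and mounts everything; endianness is handled per block on read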
  52. 52. Fin Further reading: ZFS Community: http://opensolaris.org/os/community/zfs ZFS Administration Guide: http://docs.sun.com/app/docs/doc/819-5461 Jeff Bonwick’s blog: http://blogs.sun.com/bonwick/en_US/category/ZFS ZFS-related blog entries: http://blogs.sun.com/main/tags/zfs
