ZFS for Databases

Transcript

  • 1. ZFS for Databases
    Delphix Agile Data Platform
    Adam Leventhal, CTO, Delphix (@ahl)
  • 2. Definition 1: ZFS Storage Appliance (ZSA)
    • Shipped by Sun in 2008
    • Originally the Sun Storage 7000
  • 3. Definition 2: Filesystem for Solaris
    • Filesystem developed in the Solaris Kernel Group
    • First shipped in 2006 as part of Solaris 10 u2
    • The engine for the ZSA
    • Always consistent on disk (no fsck)
    • End-to-end (strong) checksumming
    • Snapshots are cheap to create; no practical limit
    • Built-in replication
    • Custom RAID (RAID-Z)
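    As a sketch of how cheap snapshots, clones, and built-in replication look in practice with the standard ZFS CLI (the pool and dataset names here are hypothetical, and these commands require an existing pool):

    ```shell
    # Take a snapshot -- copy-on-write means no data is copied,
    # so this is effectively instant regardless of dataset size.
    zfs snapshot pool/db@tuesday

    # Create a writable clone from the snapshot, also near-instant.
    zfs clone pool/db@tuesday pool/db-clone

    # Built-in replication: stream the snapshot to another pool.
    zfs send pool/db@tuesday | zfs recv backup/db
    ```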
  • 4. Definition 3: OpenZFS
    • Sun open sourced ZFS in 2006
    • Oracle closed it in 2010
    • OpenZFS has continued
    • Many of the same developers
      – Many left Oracle for companies innovating around OpenZFS
    • Expanded beyond Solaris
      – Active OpenZFS ports on Linux, FreeBSD, Mac OS X
    • Significant evolution
      – Many critical bugs fixed
      – Test framework, CLI improvements, progress reporting and resumability for replication, lz4, simpler API, etc.
      – Big emphasis on data-driven performance enhancements
  • 5. This Talk
    • First, which ZFS? The filesystem one.
      – Most of this applies to both Oracle Solaris ZFS and OpenZFS
    • Benefits of ZFS
    • Practical considerations: storage pool and dataset layout
    • One highly relevant area of performance analysis
  • 6. Who am I?
    • Joined the Solaris Kernel Group in 2001
    • One of the three developers of DTrace
    • Added double- and triple-parity RAID-Z to ZFS
    • Founding member of the ZSA team (Fishworks) in 2006
    • Joined Delphix in 2010
      – Founded in 2008 using ZFS as a component
      – Virtualize the database
      – Database copies become as cheap and flexible as VMs
      – Agile data for faster projects, more efficient devs, and happier DBAs
      – Now the leader in ZFS expertise
      – Founded the OpenZFS project
      – Also: UKOUG TECH13 sponsor; check out our booth; drinks
  • 7. Why ZFS for Databases?
    • Modern – in development for over 12 years
    • Stable – in production for over 7 years
    • Strong data integrity
    • No practical limit on snapshots or clones
    • Not all good news:
      – Random writes turn into sequential writes
      – Sequential reads turn into random reads
      – (Like NetApp/WAFL)
  • 8. RAID-Z
    • Traditional RAID-5/6/7 requires NV-RAM to perform well
    • RAID-Z always writes full, variable-width stripes
    • Particularly good for cheap disks:
        "Oracle Solaris ZFS implements an improvement on RAID-5, RAID-Z3, which uses parity, striping, and atomic operations to ensure reconstruction of corrupted data even in the face of three concurrent drive failures. It is ideally suited for managing industry standard storage servers."*
    • Not strictly better
      – Individual records are split between disks
      – RAID-5/6/7: a random read translates to a single disk read
      – RAID-Z: a random read becomes many disk ops (like RAID-3)
    * www.oracle.com/us/products/servers-storage/solaris/solaris-zfs-ds-067320.pdf
  • 9. Datasets for Oracle
    • Filesystems (datasets) are cheap and easy to create in ZFS
    • Key settings
      – recordsize – atomic unit in ZFS; match the Oracle block size (8K)
      – logbias={latency,throughput} – QoS hint
      – primarycache={none,metadata,all} – caching hint

    # zfs create -o recordsize=8k -o logbias=throughput pool/datafiles
    # zfs create -o recordsize=8k -o logbias=throughput pool/temp
    # zfs create -o primarycache=metadata pool/archive
    # zfs create pool/redo
    # zfs list -o name,recordsize,logbias,primarycache
    NAME            RECSIZE  LOGBIAS     PRIMARYCACHE
    ...
    pool/archive    128K     latency     metadata
    pool/datafiles  8K       throughput  all
    pool/redo       128K     latency     all
    pool/temp       8K       throughput  all
  • 10. Inconsistent Write Latency
    [DTrace histogram: write latency in microseconds, 8,682 samples. The distribution is bimodal: most writes complete in roughly 32–512µs, but a second population takes 8ms to over a second, with outliers past 2 seconds.]
  • 11. Oracle Solaris ZFS Write Throttle
    • Basic problem: limit rate of input to rate of output
    • Originally no write throttle: consume all memory, then wait
    • ZFS composes transactions into transaction groups
    • Idea: limit the size of a transaction group
    • Figure out the backend throughput; target a few seconds
  • 12. ZFS Write Throttle Problems
    • Transaction group full? Start writing it out
    • One already being written out? Wait
    • And it can be a looooong wait
    • Solution?
      – When the transaction group is 7/8ths full, delay for 10ms
      – Didn't guess that, did you?
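    The policy above can be sketched as follows. This is a minimal illustration of the logic described on the slide, not the ZFS implementation; the constants and function name are invented for the example:

    ```python
    # Sketch of the old Oracle Solaris ZFS write-throttle policy:
    # writes are free until the open transaction group reaches 7/8 of
    # its size limit, then each write is delayed by a fixed 10 ms; once
    # the group is full, writers block until the previous group has
    # finished syncing -- which is where the "looooong wait" comes from.

    TXG_LIMIT = 8 * 1024 * 1024   # hypothetical txg size limit (bytes)
    DELAY_MS = 10                 # the fixed 10 ms delay from the slide

    def write_delay_ms(txg_bytes, write_size, sync_in_progress, sync_remaining_ms):
        """Artificial delay (ms) applied to a single incoming write."""
        if txg_bytes + write_size > TXG_LIMIT:
            # Group is full: wait out the previous group's sync, if any.
            return sync_remaining_ms if sync_in_progress else 0
        if txg_bytes + write_size >= TXG_LIMIT * 7 // 8:
            return DELAY_MS       # throttle zone: fixed delay per write
        return 0                  # plenty of room: no delay
    ```

    The step function (0ms, then 10ms, then an open-ended sync wait) is exactly what produces the bimodal latency in the histograms.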
  • 13. Let’s Look Again
    [The same DTrace histogram as slide 10: bimodal write latency in microseconds, 8,682 samples, with modes around 32–512µs and 8ms–1s.]
  • 14. Write Amplification
    [Paired DTrace histograms (microseconds): NFS write latency alongside disk I/O write latency. Summary:]
        NFS write: avg latency 13,231µs at 292 IOPS
        IO write:  avg latency  8,559µs at 622 IOPS
    Roughly two disk writes are issued for every incoming NFS write.
  • 15. Oracle Solaris ZFS Tuning
    • IO queue depth: zfs_vdev_max_pending
      – Default of 10 – may be reasonable for spinning disks
      – ZFS on a SAN? 24–100
      – Higher for additional throughput
      – Lower for reduced latency
    • Transaction group duration: zfs_txg_synctime
      – Default of 5 seconds
      – Higher for more metadata amortization
      – Lower for a smaller window for data loss with non-synced writes
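    On Oracle Solaris, kernel tunables like these are typically set in /etc/system and take effect after a reboot. A sketch of what that fragment looks like (the values are illustrative, not recommendations):

    ```
    * /etc/system fragment -- ZFS tuning (illustrative values)
    set zfs:zfs_vdev_max_pending = 24
    set zfs:zfs_txg_synctime = 2
    ```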
  • 16. Back to the ZFS Write Throttle
    • The measure of IO throughput swings wildly:

    # dtrace -n 'BEGIN{ start = timestamp; }
        fbt::dsl_pool_sync:entry
        /stringof(args[0]->dp_spa->spa_name) == "domain0"/
        { @[(timestamp - start) / 1000000000] =
            min(args[0]->dp_write_limit / 1000000); }' -xaggsortkey
    dtrace: description 'BEGIN' matched 2 probes
    ...
    14  487
    15  515
    16  515
    17  557
    18  581
    19  581
    20  617
    21  617
    22  635
    23  663
    24  663
    ...
    (elapsed seconds, write limit in MB)

    • Many factors impact the measured IO throughput
    • The wrong guess can lead to massive delays
  • 17. OpenZFS I/O Scheduler
    • Throw out the ZFS write throttle and IO queue
    • Queue depth and throttle based on quantity of modified data
    [Chart: queue depth and per-operation delay plotted against the percentage of the dirty-data limit in use, 0–100%.]
    • Result: smooth, single-moded write latency
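    The shape of that curve can be sketched as follows: no delay until dirty data passes a threshold, then a per-operation delay that grows smoothly and steeply as dirty data approaches the cap. This is an illustration of the idea, not the OpenZFS source; the 60% threshold and the function name are assumptions for the example:

    ```python
    # Sketch of the OpenZFS dirty-data delay curve. Constants mirror the
    # tunables on the next slide; the 60% start-of-delay threshold is an
    # assumed value for illustration.

    ZFS_DIRTY_DATA_MAX = 4 << 30   # 4 GB cap on modified (dirty) data
    ZFS_DELAY_SCALE_US = 500       # microseconds; sets curve steepness
    DELAY_THRESHOLD = 0.6          # start delaying at 60% dirty (assumed)

    def write_delay_us(dirty_bytes):
        """Per-operation delay (µs) for a given amount of dirty data."""
        threshold = int(ZFS_DIRTY_DATA_MAX * DELAY_THRESHOLD)
        if dirty_bytes <= threshold:
            return 0
        # Delay grows hyperbolically as dirty data approaches the cap,
        # so backpressure ramps up smoothly instead of stalling writers
        # all at once the way the old 7/8-full throttle did.
        return (ZFS_DELAY_SCALE_US * (dirty_bytes - threshold)
                // (ZFS_DIRTY_DATA_MAX - dirty_bytes))
    ```

    Because the delay rises continuously with dirty data rather than switching between "free", "10ms", and "blocked", write latency ends up single-moded.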
  • 18. OpenZFS I/O Scheduler Tuning
    • Tunables that are easier to reason about
      – zfs_vdev_async_write_max_active (default: 10)
      – zfs_dirty_data_max (default: min(memory/10, 4GB))
      – zfs_delay_max_ns (default: 100µs)
      – zfs_delay_scale (delay curve; default: 500µs/op)
  • 19. Summing Up
    • ZFS is great for databases
      – Storage Appliance, Oracle Solaris, OpenZFS
    • Important best practices
    • Beware the false RAID-Z idol
    • Measure, measure, measure
      – DTrace is your friend (Wednesday 11:00am, Exchange 1)
  • 20. Further Reading
    • Oracle Solaris ZFS “Evil” Tuning Guide
      – www.solaris-cookbook.com/solaris/solaris-10-zfs-evil-tuningguide/
    • OpenZFS
      – www.open-zfs.org
    • Oracle’s tuning guide
      – docs.oracle.com/cd/E26505_01/html/E37386/chapterzfs-db1.html