OpenZFS data-driven performance
OpenZFS data-driven performance, presented at the first OpenZFS developer conference, 11/18/2013. Lots of DTrace examples and output.

Transcript

  • 1. Data-Driven Development in OpenZFS Adam Leventhal, CTO Delphix @ahl
  • 2. ZFS Was Slow, Is Faster Adam Leventhal, CTO Delphix @ahl
  • 3. My Version of ZFS History
    • 2001-2005 The 1st age of ZFS: building the behemoth
      – Stability, reliability, features
    • 2006-2008 The 2nd age of ZFS: appliance model and open source
      – Completing the picture; making it work as advertised; still more features
    • 2008-2010 The 3rd age of ZFS: trial by fire
      – Stability in the face of real workloads
      – Performance in the face of real workloads
  • 4. The 1st Age of OpenZFS
    • All the stuff Matt talked about, yes:
      – Many platforms
      – Many companies
      – Many contributors
    • Performance analysis on real and varied customer workloads
  • 5. A note about the data
    • The data you are about to see is real
    • The names have been changed to protect the innocent (and guilty)
    • It was mostly collected with DTrace
    • We used some other tools as well: lockstat, mpstat
    • You might wish I had more / different data – I do too
  • 6. Writes Are Slow
  • 7. NFS Sync Writes

    sync write microseconds
    value  ------------- Distribution ------------- count
    8 | 0
    16 | 149
    32 |@@@@@@@@@@@@@@@@@@@@@ 8682
    64 |@@@@@ 2226
    128 |@@@@ 1743
    256 |@@ 658
    512 | 95
    1024 | 20
    2048 | 19
    4096 | 122
    8192 |@@ 744
    16384 |@@ 865
    32768 |@@ 625
    65536 |@ 316
    131072 | 113
    262144 | 22
    524288 | 70
    1048576 | 94
    2097152 | 16
    4194304 | 0
  • 8. IO Writes

    write microseconds
    value  ------------- Distribution ------------- count
    16 | 0
    32 | 338
    64 | 490
    128 | 720
    256 |@@@@ 15079
    512 |@@@@@ 20342
    1024 |@@@@@@@ 27807
    2048 |@@@@@@@@ 28897
    4096 |@@@@@@@@ 29910
    8192 |@@@@@ 20605
    16384 |@ 5081
    32768 | 1079
    65536 | 69
    131072 | 5
    262144 | 1
    524288 | 0
  • 9. NFS Sync Writes: Even Worse

    sync write microseconds
    value  ------------- Distribution ------------- count
    8 | 0
    16 |@ 9
    32 |@@@@@@@@@@ 84
    64 |@@@@@@@@@@ 85
    128 |@@@@ 34
    256 |@ 9
    512 | 0
    1024 | 1
    2048 | 2
    4096 |@ 7
    8192 |@@ 19
    16384 |@ 7
    32768 | 2
    65536 | 2
    131072 | 0
    262144 | 0
    524288 | 0
    1048576 |@@ 14
    2097152 |@@@@@@ 51
    4194304 |@ 7
    8388608 | 0
  • 10. First Problem: The Write Throttle
  • 11. How long is spa_sync() taking?

    #!/usr/sbin/dtrace -s

    fbt::spa_sync:entry
    /stringof(args[0]->spa_name) == "domain0"/
    {
            self->ts = timestamp;
            loads = 0;
    }

    fbt::space_map_load:entry
    /stringof(args[4]->os_spa->spa_name) == "domain0"/
    {
            loads++;
    }

    fbt::spa_sync:return
    {
            @["microseconds", loads] = quantize((timestamp - self->ts) / 1000);
            self->ts = 0;
    }
  • 12. How long is spa_sync() taking?

    # ./sync.d -c 'sleep 60'
    dtrace: script './sync.d' matched 3 probes
    dtrace: pid 20420 has exited

    microseconds 15
    value  ------------- Distribution ------------- count
    524288 | 0
    1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
    2097152 | 0

    microseconds 16
    value  ------------- Distribution ------------- count
    524288 | 0
    1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 20
    2097152 |@@@@@@@@@@ 7
    4194304 | 0
  • 13. Where is spa_sync() giving up the CPU?

    #!/usr/sbin/dtrace -s

    fbt::spa_sync:entry
    {
            self->ts = timestamp;
    }

    sched:::off-cpu
    /self->ts/
    {
            self->off = timestamp;
    }

    sched:::on-cpu
    /self->off/
    {
            @s[stack()] = quantize((timestamp - self->off) / 1000);
            self->off = 0;
    }

    fbt::spa_sync:return
    /self->ts/
    {
            @t["microseconds", probefunc] = quantize((timestamp - self->ts) / 1000);
            self->ts = 0;
            self->sync = 0;
    }
  • 14. Where is spa_sync() giving up the CPU?

    …
    genunix`cv_wait+0x61
    zfs`zio_wait+0x5d
    zfs`dsl_pool_sync+0xe1
    zfs`spa_sync+0x38d
    zfs`txg_sync_thread+0x247
    unix`thread_start+0x8

    value  ------------- Distribution ------------- count
    256 | 0
    512 |@@@@@@ 4
    1024 |@@@@@@@@@@@@ 8
    2048 | 0
    4096 | 0
    8192 | 0
    16384 | 0
    32768 | 0
    65536 | 0
    131072 | 0
    262144 | 0
    524288 |@@@@ 3
    1048576 |@@@ 2
    2097152 |@@@@@@@@@@@@@ 9
    4194304 |@ 1
    8388608 | 0
  • 15. ZFS Write Throttle
    • Keep transactions to a reasonable size – limit outstanding data
    • Target a fixed time (1-5 seconds on most systems)
    • Figure out how much we can write in that time
    • Don’t accept more than that amount of data in a txg
    • When we get to 7/8ths of the limit, insert a 10ms delay
  • 16. ZFS Write Throttle
    • Keep transactions to a reasonable size – limit outstanding data
    • Target a fixed time (1-5 seconds on most systems)
    • Figure out how much we can write in that time
    • Don’t accept more than that amount of data in a txg
    • When we get to 7/8ths of the limit, insert a 10ms delay

    WTF!?
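The throttle on this slide can be sketched as a toy model. This is a minimal Python illustration of the behavior the talk is criticizing, not the actual kernel code; the names (write_limit, dirty_bytes, accept_write) are made up for the sketch.

```python
import time

class OldWriteThrottle:
    """Toy model of the pre-2013 ZFS write throttle described above."""

    def __init__(self, write_limit):
        self.write_limit = write_limit   # bytes we believe we can sync in the target time
        self.dirty_bytes = 0             # data accepted into the open txg

    def accept_write(self, nbytes):
        # Past 7/8ths of the limit, every incoming write eats a fixed
        # 10ms delay -- the cliff behind the latency outliers in the
        # histograms on the surrounding slides.
        if self.dirty_bytes > self.write_limit * 7 // 8:
            time.sleep(0.010)
        # At the hard limit, the write must wait for the next txg entirely.
        if self.dirty_bytes + nbytes > self.write_limit:
            return False                 # caller blocks until txg sync
        self.dirty_bytes += nbytes
        return True
```

Note the step function: writes are either free, delayed exactly 10ms, or blocked outright, with nothing in between.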
  • 17. 7/8ths full delaying for 10ms

    async write microseconds
    value  ------------- Distribution ------------- count
    16 | 0
    32 |@@@@@@@@@@@@@ 1549
    64 |@@@@@@@@@@@ 1306
    128 |@@@@@@@@@ 1049
    256 |@@ 192
    512 | 34
    1024 | 23
    2048 | 47
    4096 |@ 63
    8192 |@ 153
    16384 |@ 83
    32768 | 11
    65536 | 5
    131072 | 4
    262144 | 3
    524288 |@ 102
    1048576 |@ 106
    2097152 |@ 69
    4194304 | 0
  • 18. Observing the write throttle limit (second-by-second)

    # dtrace -n 'BEGIN{ start = timestamp; }
    fbt::dsl_pool_sync:entry
    /stringof(args[0]->dp_spa->spa_name) == "domain0"/
    {
            @[(timestamp - start) / 1000000000] =
                min(args[0]->dp_write_limit / 1000000);
    }' -xaggsortkey -c 'sleep 600'
    dtrace: description 'BEGIN' matched 2 probes
    …
    9    470
    10   470
    11   487
    14   487
    15   515
    16   515
    17   557
    18   581
    19   581
    20   617
    21   617
    22   635
    23   663
    24   663
    25   673

    Saw anywhere from 100 – 800 MB!
  • 19. Second Problem: IO Queuing
  • 20. Check out IO queue times

    microseconds write sync
    value  ------------- Distribution ------------- count
    0 | 0
    1 | 2
    2 |@@@@@@@ 51
    4 |@@@@@@ 43
    8 |@ 5
    16 | 3
    32 |@ 6
    64 |@ 10
    128 |@@ 13
    256 |@@ 18
    512 |@@@@@ 38
    1024 |@@@@@@ 44
    2048 |@@@@@ 37
    4096 |@@@ 24
    8192 |@ 9
    16384 | 0
  • 21. IO times with queue depth 10 (default)

    write microseconds
    value  ------------- Distribution ------------- count
    16 | 0
    32 | 70
    64 | 170
    128 | 130
    256 |@@ 1143
    512 |@@@ 1762
    1024 |@@@@ 2417
    2048 |@@@@@@@ 4135
    4096 |@@@@@@@@ 4816
    8192 |@@@@@@@ 4132
    16384 |@@@@ 2370
    32768 |@@@ 1456
    65536 | 148
    131072 | 8
    262144 | 0
  • 22. IO times with queue depth 20

    write microseconds
    value  ------------- Distribution ------------- count
    16 | 0
    32 | 43
    64 | 137
    128 |@ 243
    256 |@@@@@ 2233
    512 |@@@@@ 2238
    1024 |@@@@ 1968
    2048 |@@@@@ 2395
    4096 |@@@@@@ 2660
    8192 |@@@@@@ 2829
    16384 |@@@@@ 2499
    32768 |@@@ 1466
    65536 |@ 296
    131072 | 0
  • 23. IO times with queue depth 30

    write microseconds
    value  ------------- Distribution ------------- count
    16 | 0
    32 | 82
    64 | 137
    128 | 230
    256 |@@@@ 2195
    512 |@@@@ 2589
    1024 |@@@@ 2416
    2048 |@@@@@ 2844
    4096 |@@@@@@ 3330
    8192 |@@@@@@ 3794
    16384 |@@@@@@ 3306
    32768 |@@@ 2008
    65536 |@ 443
    131072 | 1
    262144 | 0
  • 24. IO times with queue depth 64

    write microseconds
    value  ------------- Distribution ------------- count
    16 | 0
    32 | 345
    64 |@ 697
    128 | 169
    256 | 60
    512 | 380
    1024 |@ 1084
    2048 |@ 1562
    4096 |@ 1819
    8192 |@@@@ 4974
    16384 |@@@@@@@@@ 10683
    32768 |@@@@@@@@@@@@@ 15637
    65536 |@@@@@@@@@ 10608
    131072 |@ 1050
    262144 | 0

           avg latency  iops   throughput
    write  44557us      817/s  30300k/s
  • 25. IO times with queue depth 128

    write microseconds
    value  ------------- Distribution ------------- count
    16 | 0
    32 | 330
    64 |@ 665
    128 | 228
    256 | 203
    512 |@ 552
    1024 |@ 1135
    2048 |@ 1458
    4096 |@ 1434
    8192 |@@ 2049
    16384 |@@@@ 4070
    32768 |@@@@@@@ 7936
    65536 |@@@@@@@@@@@ 11269
    131072 |@@@@@@@@@ 9737
    262144 |@ 1282
    524288 | 0

           avg latency  iops   throughput
    write  88774us      705/s  38303k/s
  • 26. IO Problems
    • The choice of IO queue depth was crucial
      – Where did the default of 10 come from?!
      – Balance between latency and throughput
    • Shared IO queue for reads and writes
      – Maybe this makes sense for disks… maybe…
    • The wrong queue depth caused massive queuing within ZFS
      – “What do you mean my SAN is slow? It looks great to me!”
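The latency/throughput balance in the tables above can be sanity-checked with Little's law (concurrency ≈ throughput × latency). A quick sketch, using the measured numbers from the depth-64 and depth-128 runs; the helper name is ours, not from the talk:

```python
def expected_iops(queue_depth, avg_latency_s):
    """If the device queue were always full, Little's law says
    iops would be roughly queue_depth / avg_latency."""
    return queue_depth / avg_latency_s

# Depth 64, measured avg latency 44557us: ideal ~1436 iops vs 817/s measured.
# Depth 128, measured avg latency 88774us: ideal ~1442 iops vs 705/s measured.
print(round(expected_iops(64, 0.044557)))
print(round(expected_iops(128, 0.088774)))
```

The large gap between the ideal and measured figures is consistent with the slide's point: deeper queues mostly added queuing delay, not throughput.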
  • 27. New IO Scheduler
    • Choose a limit on the “dirty” (modified) data on the system
    • As more accumulates, schedule more concurrent IOs
    • Limits per IO type
    • If we still can’t keep up, start to limit the rate of incoming data
    • Chose defaults as close to the old behavior as possible
    • Much more straightforward to measure and tune
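The "schedule more concurrent IOs as dirty data accumulates" idea can be sketched as a linear ramp between a minimum and maximum active-IO count. This mirrors how the new OpenZFS IO scheduler treats async writes, but the function name, thresholds, and min/max values here are illustrative, not the actual tunables:

```python
def async_write_active(dirty_bytes, dirty_max,
                       min_active=1, max_active=10,
                       lo_pct=30, hi_pct=60):
    """How many concurrent async-write IOs to issue, given how much
    dirty data has accumulated relative to the configured limit."""
    pct = 100 * dirty_bytes // dirty_max
    if pct <= lo_pct:
        return min_active        # barely dirty: favor low latency
    if pct >= hi_pct:
        return max_active        # lots of dirty data: favor throughput
    # Linear interpolation between the two thresholds.
    return min_active + (pct - lo_pct) * (max_active - min_active) // (hi_pct - lo_pct)
```

Because concurrency rises smoothly with dirty data rather than hitting a 7/8ths cliff, this is "much more straightforward to measure and tune" than the old throttle.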
  • 28. Third Problem: Lock Contention
  • 29. Looking at lockstat(1M) (1/3)

    Count indv cuml rcnt     nsec Lock               Caller
    167980  9%   9% 0.00    61747 0xffffff0d4aaa4818 taskq_thread+0x2a8

    nsec ------ Time Distribution ------ count    Stack
    512 | 3233                                    thread_start+0x8
    1024 |@ 10651
    2048 |@@@@ 26537
    4096 |@@@@@@@@@@ 56854
    8192 |@@@@@ 29262
    16384 |@ 10577
    32768 |@ 5703
    65536 | 5053
    131072 | 3555
    262144 | 5272
    524288 | 5400
    1048576 | 4186
    2097152 | 1487
    4194304 | 163
    8388608 | 17
    16777216 | 21
    33554432 | 7
    67108864 | 2
  • 30. Looking at lockstat(1M) (2/3)

    Count indv cuml rcnt     nsec Lock               Caller
    166416  8%  17% 0.00    88424 0xffffff0d4aaa4818 cv_wait+0x69

    nsec ------ Time Distribution ------ count    Stack
    512 |@ 7775                                   taskq_thread_wait+0x84
    1024 |@@ 14577                                taskq_thread+0x308
    2048 |@@@@@ 31499                             thread_start+0x8
    4096 |@@@@@@ 36522
    8192 |@@@ 19818
    16384 |@ 11065
    32768 |@ 7302
    65536 |@ 7932
    131072 | 5537
    262144 |@ 7992
    524288 |@ 8003
    1048576 |@ 6017
    2097152 | 2086
    4194304 | 198
    8388608 | 48
    16777216 | 37
    33554432 | 7
    67108864 | 1
  • 31. Looking at lockstat(1M) (3/3)

    Count indv cuml rcnt     nsec Lock               Caller
    136877  7%  24% 0.00    19897 0xffffff0d4aaa4818 taskq_dispatch_ent+0x4a

    nsec ------ Time Distribution ------ count    Stack
    512 | 1798                                    zio_taskq_dispatch+0xb5
    1024 | 1575                                   zio_issue_async+0x19
    2048 |@ 5593                                  zio_execute+0x8d
    4096 |@@@@@@@@@@@@@ 61337
    8192 |@@@@ 19408
    16384 |@@@ 15724
    32768 |@@@ 13923
    65536 |@@ 9733
    131072 | 3564
    262144 | 3171
    524288 | 947
    1048576 | 84
    2097152 | 1
    4194304 | 0
    8388608 | 15
    16777216 | 1
    33554432 | 2
    67108864 | 1
  • 32. Name that lock!

    > 0xffffff0d4aaa4818::whatis
    ffffff0d4aaa4818 is ffffff0d4aaa47fc+20, allocated from taskq_cache
    > 0xffffff0d4aaa4818-20::taskq
    ADDR             NAME            ACT/THDS Q'ED  MAXQ INST
    ffffff0d4aaa47fc zio_write_issue    0/ 24    0 26977    -
  • 33. Lock Breakup
    • Broke up the taskq lock for write_issue
    • Added multiple taskqs, randomly assigned
    • Recently hit a similar problem for read_interrupt
    • Same solution
    • Worth investigating taskq stats
    • A dynamic taskq might be an interesting experiment
    • Other lock contention issues resolved
    • Still more need additional attention
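The fix on this slide amounts to sharding one contended queue across several queues, each with its own lock, chosen at random on dispatch. A pure-Python sketch of the idea (the real change is in the kernel taskq code; ShardedTaskq and its methods are our names):

```python
import random
import threading
from collections import deque

class ShardedTaskq:
    """N independent (lock, queue) pairs instead of one global pair,
    so dispatching threads contend on N locks rather than one."""

    def __init__(self, nshards=8):
        self.shards = [(threading.Lock(), deque()) for _ in range(nshards)]

    def dispatch(self, task):
        # Random assignment spreads contention across the shards,
        # mirroring the randomly-assigned taskqs on the slide.
        lock, q = random.choice(self.shards)
        with lock:
            q.append(task)

    def drain(self):
        # Run everything queued on every shard; order across shards
        # is not preserved, which is fine for independent zios.
        done = []
        for lock, q in self.shards:
            with lock:
                while q:
                    done.append(q.popleft()())
        return done
```

The tradeoff is that work can queue unevenly across shards; the "dynamic taskq" experiment mentioned above would adjust shard count to load.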
  • 34. Last Problem: Spacemap Shenanigans
  • 35. Where does spa_sync() spend its time?

    …
    dsl_pool_sync_done         16us   ( 0%)
    spa_config_exit            19us   ( 0%)
    zio_root                   20us   ( 0%)
    spa_config_enter           23us   ( 0%)
    spa_errlog_sync            45us   ( 0%)
    spa_update_dspace          49us   ( 0%)
    zio_wait                   53us   ( 0%)
    dmu_objset_is_dirty        66us   ( 0%)
    spa_sync_config_object     75us   ( 0%)
    spa_sync_aux_dev           79us   ( 0%)
    list_is_empty              86us   ( 0%)
    dsl_scan_sync             124us   ( 0%)
    ddt_sync                  201us   ( 0%)
    txg_list_remove           519us   ( 0%)
    vdev_config_sync         1830us   ( 0%)
    bpobj_iterate            9939us   ( 0%)
    vdev_sync               27907us   ( 1%)
    bplist_iterate          35301us   ( 1%)
    vdev_sync_done         346336us   (16%)
    dsl_pool_sync         1652050us   (79%)
    spa_sync              2077646us   (100%)
  • 36. Where does spa_sync() spend its time?

    …
    dsl_pool_sync_done         16us   ( 0%)
    spa_config_exit            19us   ( 0%)
    zio_root                   20us   ( 0%)
    spa_config_enter           23us   ( 0%)
    spa_errlog_sync            45us   ( 0%)
    spa_update_dspace          49us   ( 0%)
    zio_wait                   53us   ( 0%)
    dmu_objset_is_dirty        66us   ( 0%)
    spa_sync_config_object     75us   ( 0%)
    spa_sync_aux_dev           79us   ( 0%)
    list_is_empty              86us   ( 0%)
    dsl_scan_sync             124us   ( 0%)
    ddt_sync                  201us   ( 0%)
    txg_list_remove           519us   ( 0%)
    vdev_config_sync         1830us   ( 0%)
    bpobj_iterate            9939us   ( 0%)
    vdev_sync               27907us   ( 1%)
    bplist_iterate          35301us   ( 1%)
    vdev_sync_done         346336us   (16%)
    dsl_pool_sync         1652050us   (79%)  ← This is expected; it means we’re writing
    spa_sync              2077646us   (100%)
  • 37. Where does spa_sync() spend its time?

    …
    dsl_pool_sync_done         16us   ( 0%)
    spa_config_exit            19us   ( 0%)
    zio_root                   20us   ( 0%)
    spa_config_enter           23us   ( 0%)
    spa_errlog_sync            45us   ( 0%)
    spa_update_dspace          49us   ( 0%)
    zio_wait                   53us   ( 0%)
    dmu_objset_is_dirty        66us   ( 0%)
    spa_sync_config_object     75us   ( 0%)
    spa_sync_aux_dev           79us   ( 0%)
    list_is_empty              86us   ( 0%)
    dsl_scan_sync             124us   ( 0%)
    ddt_sync                  201us   ( 0%)
    txg_list_remove           519us   ( 0%)
    vdev_config_sync         1830us   ( 0%)
    bpobj_iterate            9939us   ( 0%)
    vdev_sync               27907us   ( 1%)
    bplist_iterate          35301us   ( 1%)
    vdev_sync_done         346336us   (16%)  ← What’s this?
    dsl_pool_sync         1652050us   (79%)
    spa_sync              2077646us   (100%)
  • 38. What’s vdev_sync_done() doing?

    txg_list_empty           0us   ( 0%)
    txg_list_remove         15us   ( 0%)
    metaslab_sync_done    8681us   (90%)
    vdev_sync_done        9563us   (100%)
  • 39. How about metaslab_sync_done()?

    vdev_dirty                3266us
    vdev_space_update         5333us
    space_map_load_wait       5758us
    space_map_vacate         30455us
    metaslab_weight          54507us
    metaslab_group_sort      68445us
    space_map_unload       1519906us
    metaslab_sync_done     1630626us
  • 40. What about all space_map_*() functions?

    space_map_truncate          33 times        6ms   ( 0%)
    space_map_load_wait       1721 times        7ms   ( 0%)
    space_map_sync            3766 times      210ms   ( 0%)
    space_map_unload           135 times     1268ms   ( 0%)
    space_map_free           21694 times     4280ms   ( 1%)
    space_map_vacate          3643 times    45891ms   (12%)
    space_map_seg_compare 13124822 times    55423ms   (14%)
    space_map_add           580809 times    79868ms   (21%)
    space_map_remove        514181 times    81682ms   (21%)
    space_map_walk            2081 times   120962ms   (32%)
    spa_sync                     1 times   374818ms   (100%)
  • 41. How about the CPU performance counters?

    # dtrace -n 'cpc:::PAPI_tlb_dm-all-10000{ @[stack()] = count(); }'
        -n END'{ trunc(@, 20); printa(@); }' -c 'sleep 100'
    …
    zfs`metaslab_segsize_compare+0x1f
    genunix`avl_find+0x52
    genunix`avl_add+0x2d
    zfs`space_map_remove+0x170
    zfs`space_map_alloc+0x47
    zfs`metaslab_group_alloc+0x310
    zfs`metaslab_alloc_dva+0x2c1
    zfs`metaslab_alloc+0x9c
    zfs`zio_dva_allocate+0x8a
    zfs`zio_execute+0x8d
    genunix`taskq_thread+0x285
    unix`thread_start+0x8
    1550

    zfs`lzjb_decompress+0x89
    zfs`zio_decompress_data+0x53
    zfs`zio_decompress+0x56
    zfs`zio_pop_transforms+0x3d
    zfs`zio_done+0x26b
    zfs`zio_execute+0x8d
    zfs`zio_notify_parent+0xa6
    zfs`zio_done+0x4ea
    zfs`zio_execute+0x8d
    zfs`zio_notify_parent+0xa6
  • 42. Spacemaps and Metaslabs
    • Two things going on here:
      – 30,000+ segments per spacemap
      – Building the perfect spacemap – close enough would work
      – Doing a bunch of work that we can clever our way out of
    • Still much to be done:
      – Why 200 metaslabs per LUN?
      – Allocations can still be very painful
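To see why per-segment work dominates at 30,000+ segments, here is a toy space map that coalesces adjacent free segments on insert. This is only an illustration of the coalescing logic, with made-up names; the real ZFS space maps use an AVL tree keyed by offset (the avl_find/avl_add frames in the profile above), so every add or remove pays a tree walk:

```python
class ToySpaceMap:
    """Toy free-space map: non-overlapping segments, merged on add."""

    def __init__(self):
        self.segs = {}            # start offset -> segment size

    def add(self, start, size):
        end = start + size
        # Scan for a segment ending exactly at our start, or beginning
        # exactly at our end, and merge with it. (A real implementation
        # finds neighbors in O(log n) via the tree; this linear scan is
        # the pedagogical shortcut.)
        for s, sz in list(self.segs.items()):
            if s + sz == start:
                del self.segs[s]
                start, size = s, sz + size
            elif end == s:
                del self.segs[s]
                size += sz
        self.segs[start] = size
```

Freeing the segment between two existing free segments collapses all three into one, which is exactly the add/remove churn that space_map_add and space_map_remove were spending 40%+ of spa_sync() on.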
  • 43. The Next Age of OpenZFS
    • General purpose and purpose-built OpenZFS products
    • Used for varied and demanding uses
    • Data-driven discoveries
      – Write throttle needed rethinking
      – Metaslabs / spacemaps / allocation is fertile ground
      – Performance nose-dives around 85% of pool capacity
      – Lock contention impacts high-performance workloads
    • What’s next?
      – More workloads; more data!
      – Feedback on recent enhancements
      – Connect allocation / scrub to the new IO scheduler
      – Consider data-driven, adaptive algorithms within OpenZFS
