P99CONF — What We Need to Unlearn About Persistent Storage

System software engineers have long been taught that disks are slow and that sequential I/O is key to performance. With SSDs, I/O got much faster but not simpler. In this brave new world of rocket-speed throughput, an engineer has to distinguish sustained workloads from bursts, (still) take care with I/O buffer sizes, account for disks' internal parallelism, and study mixed I/O characteristics in advance. In this talk we will share some key performance measurements of modern hardware that we are taking at ScyllaDB, and our opinion about the implications for database and system software design.

  1. 1. What We Need to Unlearn About Persistent Storage ■ Pavel Emelyanov, Principal Engineer @ ScyllaDB
  2. 2. HDD vs SSD
  3. 3. Why HDD is hard to deal with ■ HDD has moving parts inside ● Each IOP is probably a seek ● Seek time can be milliseconds ■ Working with HDD in an efficient way: try not to move the head ● Use sequential IO ● Use larger buffers (batch) ■ DB commitlog was designed with that in mind
  4. 4. Why SSD is cool ■ SSD has RAM-like storage inside ● Each IO can be constant time ■ Working with SSD in an efficient way: just do the IO ● Spoiler: not really
  5. 5. Is your disk fast or slow? ■ SSD is usually described by 4 “speeds” ● Throughput in MB/s ● IOPS in Hz (op/s) ● Both for read and write ■ The larger the “speed” numbers are – the better the disk should be
  6. 6. So why does my IO suck? ■ SSD block overwrite problem ■ Internal caching ■ Internal parallelism ■ Bandwidth depends on buffer size ■ Mixed IO ■ Noisy neighbours (in clouds)
  7. 7. Overwriting
  8. 8. Internal structure ■ Read/Write is done in pages (e.g. 4k) ■ Erasure is done in blocks (e.g. 128 pages) ■ Overwrite is not possible ■ Disk controller has ● a mapping table to map IO offset to in-disk offset ● relocates pages in the background
  9. 9. IO sucks because ... ■ Disk is aged out ● Virtually sequential IO results in physically random one ● Background GC is taking place
  10. 10. How to make it suck faster? ■ Sequential IO with large buffers is back from the dead ■ Discard unused blocks ● Filesystems may do it for you
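
The remapping described in slides 8 and 9 can be illustrated with a toy model. This is a minimal sketch, not how any real SSD firmware works: the class `ToyFTL`, its fields, and the 4-page block size are all hypothetical, and background garbage collection is left out. It only shows why overwriting one logical page turns a logically sequential layout into a physically scattered one.

```python
# Toy flash translation layer (FTL). Illustrative only: names and sizes are
# hypothetical, and real controllers are far more sophisticated.
PAGES_PER_BLOCK = 4  # slide 8 mentions e.g. 128 pages per erase block


class ToyFTL:
    def __init__(self, num_blocks):
        # Physical page state: None = erased, otherwise "valid" or "stale".
        self.state = [None] * (num_blocks * PAGES_PER_BLOCK)
        self.mapping = {}   # logical page number -> physical page number
        self.next_free = 0  # writes always go to the next erased page

    def write(self, logical_page):
        if self.next_free >= len(self.state):
            raise RuntimeError("no erased pages left; a real disk would GC here")
        old = self.mapping.get(logical_page)
        if old is not None:
            # In-place overwrite is impossible: the old copy just becomes stale
            # until its whole block is erased.
            self.state[old] = "stale"
        self.state[self.next_free] = "valid"
        self.mapping[logical_page] = self.next_free
        self.next_free += 1

    def stale_pages(self):
        return self.state.count("stale")


ftl = ToyFTL(num_blocks=2)
for page in (0, 1, 2):
    ftl.write(page)      # logical 0, 1, 2 land on physical 0, 1, 2
ftl.write(1)             # overwrite: logical 1 moves to physical 3
print([ftl.mapping[p] for p in (0, 1, 2)])  # [0, 3, 2]
print(ftl.stale_pages())                    # 1
```

Reading logical pages 0, 1, 2 in order now touches physical pages 0, 3, 2, and the stale page still occupies space until its block is erased, which is the work the background GC has to do.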
  11. 11. Burst vs Sustain
  12. 12. More on internal structure ■ Flash cells are fronted by a faster cache ● Read-ahead ■ Parallel IO lanes ● Large (N * page size) IO may be served by several chips in parallel ● Internal indirection may hide it
  13. 13. What’s measured in ads ■ Reported numbers can show burst performance ■ Sustained IO may be, and usually is, somewhat slower
  14. 14. How to live with it? ■ Get your disk’s sustained performance
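
One rough way to see the difference between burst and sustained rates is to write for a while and record the rate per window instead of a single average. The sketch below assumes plain buffered writes followed by `fsync`; the function names `write_rate_samples` and `probe_directory` and all default sizes are hypothetical. A serious measurement would use a dedicated tool, direct IO, and a run long enough to exhaust the disk's internal cache.

```python
import os
import tempfile
import time


def write_rate_samples(path, total_bytes, chunk_bytes=1 << 20, sample_bytes=8 << 20):
    """Write total_bytes to path; return the MB/s observed in each sample window."""
    chunk = os.urandom(chunk_bytes)
    samples = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        window_bytes = 0
        window_start = time.perf_counter()
        while written < total_bytes:
            n = os.write(fd, chunk)
            written += n
            window_bytes += n
            if window_bytes >= sample_bytes or written >= total_bytes:
                os.fsync(fd)  # flush the page cache so the disk is really involved
                elapsed = max(time.perf_counter() - window_start, 1e-9)
                samples.append(window_bytes / elapsed / 1e6)
                window_bytes = 0
                window_start = time.perf_counter()
    finally:
        os.close(fd)
    return samples


def probe_directory(total_bytes, directory=None, **kwargs):
    """Run the probe on a scratch file created under `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        return write_rate_samples(os.path.join(scratch, "probe.bin"), total_bytes, **kwargs)
```

Pass a `directory` that lives on the disk under test (the default temporary directory may be RAM-backed), write much more data than the disk could plausibly cache, and compare the first few samples with the last ones.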
  15. 15. IO size matters
  16. 16. Throughput vs IOPS ■ IOPS limit is the ability to process requests ● Measured with minimally possible buffers (usually a page-size) ■ Throughput is the ability to process data ● Measured with “large” buffers (~1MB and larger)
  17. 17. What if the buffer size is in between? ■ It depends on the disk ■ Some drop down to 70% of both bandwidth and IOPS peaks
  18. 18. What’s the optimal IO size? ■ Depends on the application ■ Smaller IO size – better latency ■ Larger IO size – better throughput, but it really scales
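
The relation between the two limits from slide 16 can be written down as simple arithmetic: bandwidth equals IOPS times IO size, and an idealized disk is capped by whichever limit it hits first. The sketch below is that idealized model only; the function names and the example limits (100,000 IOPS and 1.6 GB/s) are made up for illustration. As slide 17 notes, real disks can fall noticeably below this ideal for in-between sizes.

```python
def ideal_limits(io_size, iops_limit, bw_limit):
    """Return (iops, bytes_per_second) for an idealized disk at a given IO size."""
    iops = min(iops_limit, bw_limit / io_size)
    return iops, iops * io_size


def crossover_size(iops_limit, bw_limit):
    """IO size (bytes) at which the IOPS limit and the bandwidth limit meet."""
    return bw_limit / iops_limit


# Hypothetical disk: 100,000 IOPS and 1.6 GB/s.
IOPS_LIMIT = 100_000
BW_LIMIT = 1_600_000_000

print(ideal_limits(4096, IOPS_LIMIT, BW_LIMIT))  # (100000, 409600000)
print(crossover_size(IOPS_LIMIT, BW_LIMIT))      # 16000.0
```

Below the crossover size the model is IOPS-bound and bandwidth grows with the buffer; above it the model is bandwidth-bound and IOPS shrink. Measuring where a real disk departs from this line is what tells you how it behaves "in between".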
  19. 19. Write for real
  20. 20. Is my WRITE safe? ■ Write can be cached at many levels ● Application ● Linux page cache ● In-disk cache ■ Cache means faster but less reliable writes
  21. 21. Is my WRITE safe? (cont.) ■ There are different buzzwords that refer to writing for real ● O_DIRECT – prevent Linux from caching ● O_DSYNC – prevent disk from caching ● FUA (Force Unit Access) – do write the data to non-volatile storage ■ Not all disks handle O_DSYNC at the same speed as regular writes
  22. 22. How to write the data? ■ Check if the disk is O_DSYNC-friendly ● Most cloud disks are ■ Choose between speed and safety ● It may happen that losing the last few seconds of writes is not critical
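
A quick first impression of how expensive O_DSYNC is on a given disk can be had by timing the same small writes with and without the flag. This is only a sketch under assumptions: the helper names `timed_writes` and `dsync_probe` are hypothetical, and the baseline here is ordinary buffered writes that land in the Linux page cache, whereas the comparison the slides have in mind (O_DSYNC versus regular writes at the disk level) would also need O_DIRECT with aligned buffers.

```python
import os
import tempfile
import time

# O_DSYNC exists on Linux; the fallback to O_SYNC (or no flag) is an
# assumption made purely so the sketch runs on other platforms.
DSYNC_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def timed_writes(path, extra_flags, count=64, size=4096):
    """Return the average seconds per write() of `size` bytes."""
    buf = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, buf)
        return (time.perf_counter() - start) / count
    finally:
        os.close(fd)


def dsync_probe(directory=None, count=64, size=4096):
    """Return (buffered_seconds, dsync_seconds) per write under `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        path = os.path.join(scratch, "dsync_probe.bin")
        buffered = timed_writes(path, 0, count, size)
        durable = timed_writes(path, DSYNC_FLAG, count, size)
    return buffered, durable
```

Run `dsync_probe` with a `directory` on the disk of interest. A very large gap between the two numbers is a hint to look closer with a proper benchmark before deciding between speed and safety.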
  23. 23. Read && Write
  24. 24. What’s really-really measured in ads ■ Bandwidth and IOPS of a pure IO ● Only read or only write ■ Mixed mode is incredibly worse ● Concurrency matters ● Disks often prefer writes over reads
  25. 25. What if I do read and write at the same time? ■ Not much ■ Hold on requests for better latencies
  26. 26. Pavel Emelyanov ■ xemul@scylladb.com
