System software engineers have long been taught that disks are slow and that sequential I/O is the key to performance. With SSD drives, I/O got much faster, but not simpler. In this brave new world of rocket-speed throughputs, an engineer has to distinguish sustained workloads from bursts, (still) take care over I/O buffer sizes, account for disks’ internal parallelism, and study mixed I/O characteristics in advance. In this talk we will share some key performance measurements we are taking on modern hardware at ScyllaDB, and our opinion on the implications for database and system software design.
3. Why HDD is hard to deal with
■ HDD has moving parts inside
● Each IOP is probably a seek
● Seek time can be milliseconds
■ Working with HDD in an efficient way: try not to move the head
● Use sequential IO
● Use larger buffers (batch)
■ DB commitlog was designed with that in mind
4. Why SSD is cool
■ SSD has RAM-like storage inside
● Each IO can be served in constant time
■ Working with SSD in an efficient way: just do the IO
● Spoiler: not really
5. Is your disk fast or slow?
■ SSD is usually described by 4 “speeds”
● Throughput in MB/s
● IOPS in Hz (op/s)
● Both for read and write
■ The larger the “speed” numbers are – the better the disk should be
6. Now why does my IO suck?
■ SSD block overwrite problem
■ Internal caching
■ Internal parallelism
■ Bandwidth depends on buffer size
■ Mixed IO
■ Noisy neighbours (in clouds)
8. Internal structure
■ Read/Write is done in pages (e.g. 4k)
■ Erasure is done in blocks (e.g. 128 pages)
■ Overwrite is not possible
■ Disk controller
● keeps a mapping table from IO offset to in-disk offset
● relocates pages in the background
9. IO sucks because ...
■ The disk has aged
● Virtually sequential IO results in physically random one
● Background GC is taking place
10. How to make it suck faster?
■ Sequential IO with large buffers is back from the dead
■ Discard unused blocks
● Filesystems may do it for you
12. More on internal structure
■ Flash cells are fronted by a faster cache
● Read-ahead
■ Parallel IO lanes
● Large (N * page size) IO may be served by several chips in parallel
● Internal indirection may hide it
13. What’s measured in ads
■ Reported numbers can show burst performance
■ Sustained IO may be, and usually is, somewhat slower
14. How to live with it?
■ Get your disk’s sustained performance
16. Throughput vs IOPS
■ IOPS limit is the ability to process requests
● Measured with the smallest possible buffers (usually page-sized)
■ Throughput is the ability to process data
● Measured with “large” buffers (~1MB and larger)
17. What if the buffer size is in between?
■ It depends on the disk
■ Some drop down to 70% of both bandwidth and IOPS peaks
18. What’s the optimal IO size?
■ Depends on the application
■ Smaller IO size – better latency
■ Larger IO size – better throughput, but only up to a point
20. Is my WRITE safe?
■ Write can be cached at many levels
● Application
● Linux page cache
● In-disk cache
■ Cache means faster but less reliable writes
21. Is my WRITE safe? (cont.)
■ There are different buzzwords that refer to writing for real
● O_DIRECT – prevent Linux from caching
● O_DSYNC – prevent disk from caching
● FUA – force the write onto non-volatile media
■ Not all disks handle O_DSYNC at the same speed as regular writes
22. How to write the data?
■ Check if the disk is O_DSYNC-friendly
● Most cloud disks are
■ Choose between speed and safety
● Losing the last few seconds of writes may be acceptable
24. What’s really-really measured in ads
■ Bandwidth and IOPS of a pure IO
● Only read or only write
■ Mixed mode is considerably worse
● Concurrency matters
● Disks often prefer writes over reads
25. What if I do read and write at the same time?
■ Expect much less from the disk
■ Hold back requests to keep latencies low