Brought to you by
What We Need to Unlearn
About Persistent Storage
Pavel Emelyanov
Principal Engineer @ ScyllaDB
HDD vs SSD
Why HDD is hard to deal with
■ HDD has moving parts inside
● Each IOP is probably a seek
● Seek time can be milliseconds
■ Working with HDD in an efficient way: try not to move the head
● Use sequential IO
● Use larger buffers (batch)
■ DB commitlog was designed with that in mind
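
A back-of-the-envelope sketch of why seeks dominate on HDD and why larger buffers help. The model (one seek per request, then a streaming transfer) and both constants are illustrative assumptions, not figures from the talk.

```python
def hdd_throughput_mb_s(io_size_bytes, seek_ms, media_mb_s):
    """Effective MB/s when every request pays one seek, then streams its data."""
    transfer_s = io_size_bytes / (media_mb_s * 1e6)
    total_s = seek_ms / 1e3 + transfer_s
    return io_size_bytes / total_s / 1e6


SEEK_MS = 8.0        # assumed average seek time
MEDIA_MB_S = 150.0   # assumed sequential media rate

small = hdd_throughput_mb_s(4 * 1024, SEEK_MS, MEDIA_MB_S)      # random 4 KiB requests
large = hdd_throughput_mb_s(1024 * 1024, SEEK_MS, MEDIA_MB_S)   # 1 MiB batched requests
print(f"4 KiB: {small:.2f} MB/s, 1 MiB: {large:.2f} MB/s")
```

Under these assumptions the small requests spend almost all their time seeking, while the 1 MiB requests recover a large share of the media rate.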
Why SSD is cool
■ SSD has RAM-like storage inside
● Each IO can be constant time
■ Working with SSD in an efficient way: just do the IO
● Spoiler: not really
Is your disk fast or slow?
■ SSD is usually described by 4 “speeds”
● Throughput in MB/s
● IOPS in Hz (op/s)
● Both for read and write
■ The larger the “speed” numbers are – the better the disk should be
So why does my IO suck?
■ SSD block overwrite problem
■ Internal caching
■ Internal parallelism
■ Bandwidth depends on buffer size
■ Mixed IO
■ Noisy neighbours (in clouds)
Overwriting
Internal structure
■ Read/Write is done in pages (e.g. 4k)
■ Erasure is done in blocks (e.g. 128 pages)
■ Overwrite is not possible
■ The disk controller
● keeps a mapping table from IO offset to in-disk offset
● relocates pages in the background
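
The mapping-table idea can be sketched as a toy model. `ToyFTL` and all of its fields are hypothetical names for illustration; real controllers are far more involved and also perform garbage collection, which this sketch deliberately leaves out.

```python
class ToyFTL:
    """Toy flash translation layer: pages are written once, never overwritten."""

    def __init__(self, blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.total_pages = blocks * pages_per_block
        self.flash = [None] * self.total_pages   # physical pages, None means erased
        self.mapping = {}                        # logical page -> physical page
        self.stale = set()                       # physical pages holding dead data
        self.next_free = 0                       # log-style write pointer

    def write(self, logical_page, data):
        if self.next_free >= self.total_pages:
            raise RuntimeError("no erased pages left; a real disk would GC here")
        old = self.mapping.get(logical_page)
        if old is not None:
            self.stale.add(old)                  # in-place overwrite is impossible
        self.flash[self.next_free] = data
        self.mapping[logical_page] = self.next_free
        self.next_free += 1

    def read(self, logical_page):
        return self.flash[self.mapping[logical_page]]


ftl = ToyFTL()
ftl.write(0, "a")
ftl.write(1, "b")
ftl.write(0, "c")   # "overwrite": new physical page, old one becomes stale
```

After the third write, logical page 0 points at a new physical page and the old copy stays on flash until its whole block is erased.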
IO sucks because ...
■ The disk has aged
● Logically sequential IO becomes physically random
● Background GC is taking place
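
A rough way to reason about the cost of background GC is write amplification. The sketch below assumes a simplified model that is not from the talk: every reclaimed block holds the same fraction of still-valid pages, and those pages must be copied elsewhere before the block is erased.

```python
def write_amplification(valid_fraction):
    """Physical writes per host write when reclaimed blocks are `valid_fraction` full.

    Reclaiming a block of N pages copies valid_fraction * N pages and frees
    (1 - valid_fraction) * N pages for host data, so the disk writes N pages
    for every (1 - valid_fraction) * N pages the host asked for.
    """
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)


for u in (0.0, 0.5, 0.75):
    print(f"valid fraction {u:.2f} -> write amplification {write_amplification(u):.1f}")
```

With 128-page blocks that are three-quarters valid, 96 pages are relocated to free 32, so each host write costs four physical writes in this model.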
How to make it suck faster?
■ Sequential IO with large buffers is back from the dead
■ Discard unused blocks
● Filesystems may do it for you
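
A minimal sketch of turning many small appends into fewer large sequential writes. `BatchingWriter` is a hypothetical helper and the batch size is an assumption; it writes to any object with a `write` method, so an in-memory buffer stands in for a file here.

```python
import io


class BatchingWriter:
    """Accumulate small records and emit them as large sequential writes."""

    def __init__(self, sink, batch_bytes=1024 * 1024):
        self.sink = sink
        self.batch_bytes = batch_bytes
        self.pending = bytearray()
        self.flushes = 0

    def append(self, record):
        self.pending += record
        if len(self.pending) >= self.batch_bytes:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.write(bytes(self.pending))   # one large write
            self.pending.clear()
            self.flushes += 1


sink = io.BytesIO()
writer = BatchingWriter(sink, batch_bytes=1024)
for _ in range(100):
    writer.append(b"x" * 100)
writer.flush()   # push out the tail
```

The trade-off is that records sitting in `pending` are lost if the process dies before the next flush.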
Burst vs Sustain
More on internal structure
■ Flash cells are fronted by a faster cache
● Read-ahead
■ Parallel IO lanes
● Large (N * page size) IO may be served by several chips in parallel
● Internal indirection may hide it
What’s measured in ads
■ Reported numbers can show burst performance
■ Sustained IO may be, and usually is, somewhat slower
How to live with it?
■ Get your disk’s sustained performance
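
One way to see burst versus sustained behaviour is to record throughput per window instead of a single average. This is only a sketch: the function name and window sizes are assumptions, and a meaningful run needs to write far more data than any cache in the path (dedicated tools such as fio are built for this).

```python
import os
import time


def windowed_write_mb_s(path, total_bytes, chunk_bytes=1 << 20, window_bytes=8 << 20):
    """Write `total_bytes` to `path`, returning MB/s measured per window."""
    chunk = os.urandom(chunk_bytes)
    rates = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        in_window = 0
        start = time.perf_counter()
        while written < total_bytes:
            n = os.write(fd, chunk)
            written += n
            in_window += n
            if in_window >= window_bytes:
                os.fsync(fd)   # make the window include the time to reach the disk
                elapsed = max(time.perf_counter() - start, 1e-9)
                rates.append(in_window / elapsed / 1e6)
                in_window = 0
                start = time.perf_counter()
    finally:
        os.close(fd)
    return rates
```

If the first few windows are much faster than the later ones, the later figures are closer to what the disk can sustain.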
IO size matters
Throughput vs IOPS
■ IOPS limit is the ability to process requests
● Measured with the smallest possible buffers (usually page-sized)
■ Throughput is the ability to process data
● Measured with “large” buffers (~1MB and larger)
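
The two limits combine through a simple relation: bandwidth equals IOPS times IO size. The sketch below assumes an idealized disk that hits whichever limit comes first; the limit values are made up for illustration, and real disks can fall below both at intermediate sizes.

```python
def achievable(io_size_bytes, iops_limit, bandwidth_limit_bytes_s):
    """Return (iops, bytes_per_second) for an idealized disk with two hard limits."""
    iops = min(iops_limit, bandwidth_limit_bytes_s / io_size_bytes)
    return iops, iops * io_size_bytes


IOPS_LIMIT = 100_000          # assumed
BW_LIMIT = 1_000_000_000      # assumed, bytes per second

for size in (4096, 16 * 1024, 1024 * 1024):
    iops, bw = achievable(size, IOPS_LIMIT, BW_LIMIT)
    print(f"{size:>8} B: {iops:>9.0f} op/s, {bw / 1e6:>7.1f} MB/s")
```

Small requests are capped by the IOPS limit and large ones by bandwidth; the crossover sits at bandwidth limit divided by IOPS limit.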
What if the buffer size is in between?
■ It depends on the disk
■ Some drop down to 70% of both bandwidth and IOPS peaks
What’s the optimal IO size?
■ Depends on the application
■ Smaller IO size – better latency
■ Larger IO size – better throughput, but it really scales
Write for real
Is my WRITE safe?
■ Write can be cached at many levels
● Application
● Linux page cache
● In-disk cache
■ Cache means faster but less reliable writes
Is my WRITE safe? (cont.)
■ There are different buzzwords that refer to writing for real
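● O_DIRECT – bypass the Linux page cache
● O_DSYNC – complete the write only once the disk reports the data as stored, not merely cached
● FUA (Force Unit Access) – make the disk write the data to non-volatile media before completing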
■ Not all disks handle O_DSYNC at the same speed as regular writes
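
A minimal Linux-only sketch of opening files with these flags from Python. The helper names and the 4096-byte alignment are assumptions; O_DIRECT requires aligned buffers, lengths and offsets, and some filesystems reject it outright.

```python
import mmap
import os


def write_dsync(path, data):
    """Write `data` so that each write() returns only after the data is durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        view = memoryview(data)
        written = 0
        while written < len(view):
            written += os.write(fd, view[written:])
        return written
    finally:
        os.close(fd)


def write_direct(path, payload, block=4096):
    """Write `payload` bypassing the page cache, zero-padded to a block multiple."""
    padded = -(-len(payload) // block) * block   # round up to a block multiple
    buf = mmap.mmap(-1, padded)                  # anonymous mmap is page-aligned
    buf[:len(payload)] = payload
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        return os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
```

The two flags are independent: O_DIRECT alone skips the page cache but does not by itself guarantee the disk has persisted the data, so they are often combined when both properties are needed.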
How to write the data?
■ Check if the disk is O_DSYNC-friendly
● Most cloud disks are
■ Choose between speed and safety
● It may happen that losing the last few seconds of writes is not critical
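
If losing a bounded window of recent writes is acceptable, one option is to sync on a timer instead of on every write. `IntervalSyncLog` is a hypothetical sketch; the interval and the injectable clock are assumptions made for illustration and testing.

```python
import os
import time


class IntervalSyncLog:
    """Append-only log that calls fsync at most once per `sync_interval_s`."""

    def __init__(self, path, sync_interval_s=1.0, clock=time.monotonic):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.sync_interval_s = sync_interval_s
        self.clock = clock
        self.last_sync = clock()
        self.syncs = 0

    def append(self, record):
        os.write(self.fd, record)   # lands in the page cache, fast but not yet safe
        if self.clock() - self.last_sync >= self.sync_interval_s:
            self.sync()

    def sync(self):
        os.fsync(self.fd)           # everything appended so far becomes durable
        self.last_sync = self.clock()
        self.syncs += 1

    def close(self):
        self.sync()
        os.close(self.fd)
```

Because syncing is only triggered from `append`, the window of writes at risk is bounded by the interval only while appends keep arriving; a background timer would be needed to bound it for idle periods too.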
Read && Write
What’s really-really measured in ads
■ Bandwidth and IOPS of a pure IO
● Only read or only write
■ Mixed mode is incredibly worse
● Concurrency matters
● Disks often prefer writes over reads
What if I do read and write at the same time?
■ Not much
■ Hold back requests for better latencies
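
One reading of this advice is to cap how many requests are in flight on the disk and queue the rest in the application. `InFlightLimiter` is a hypothetical sketch of that idea, not the scheduler used by any particular database.

```python
from collections import deque


class InFlightLimiter:
    """Allow at most `max_in_flight` requests on the disk; queue the rest."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.waiting = deque()

    def submit(self, request):
        """Return the request if it may be dispatched now, else queue it and return None."""
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            return request
        self.waiting.append(request)
        return None

    def complete(self):
        """Call when a dispatched request finishes; returns the next one to dispatch, if any."""
        self.in_flight -= 1
        if self.waiting:
            self.in_flight += 1
            return self.waiting.popleft()
        return None


limiter = InFlightLimiter(max_in_flight=2)
first = limiter.submit("read-1")     # dispatched
second = limiter.submit("write-1")   # dispatched
third = limiter.submit("read-2")     # held back in the application
```

Holding requests in the application keeps the choice of what goes next (for example, a latency-sensitive read ahead of a bulk write) in the application's hands rather than inside the disk's queue.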
Brought to you by
Pavel Emelyanov
xemul@scylladb.com

P99CONF — What We Need to Unlearn About Persistent Storage
