Design Tradeoffs for SSD Performance Ted Wobber Principal Researcher Microsoft Research, Silicon Valley
Rotating Disks vs. SSDs We have a good model of how rotating disks work… what about SSDs?
Rotating Disks vs. SSDs: Main take-aways. Forget everything you knew about rotating disks: SSDs are different. SSDs are complex software systems. One size doesn’t fit all.
A Brief Introduction Microsoft Research – a focus on ideas and understanding
Will SSDs Fix All Our Storage Problems? Excellent read latency; sequential bandwidth Lower $/IOPS/GB Improved power consumption No moving parts Form factor, noise, … Performance surprises?
Performance/Surprises. Latency/bandwidth: “How fast can I read or write?” Surprise: random writes can be slow. Persistence: “How soon must I replace this device?” Surprise: flash blocks wear out.
What’s in This Talk Introduction Background on NAND flash, SSDs Points of comparison with rotating disks Write-in-place vs. write-logging Moving parts vs. parallelism Failure modes Conclusion
What’s *NOT* in This Talk Windows Analysis of specific SSDs Cost Power savings
Full Disclosure. “Black box” study based on the properties of NAND flash. A trace-based simulation of an “idealized” SSD. Workloads: TPC-C, Exchange, Postmark, IOzone.
Background: NAND flash blocks. A flash block is a grid of cells: 4096 + 128 bit-lines by 64 page-lines. Erase: quantum release for all cells. Program: quantum injection for some cells. Read: NAND operation with a page selected. Can’t reset bits to 1 except with erase.
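To make the erase/program asymmetry concrete, here is a minimal Python sketch (illustrative only, not from the talk) of a flash block's write rules: programming can only clear bits from 1 to 0 at page granularity, and only a whole-block erase returns cells to 1. The 4 KB page and 64-page block sizes follow the slides; the spare-area bytes are ignored.

PAGE_BYTES = 4096        # data area per page (per-page spare area ignored)
PAGES_PER_BLOCK = 64

class FlashBlock:
    def __init__(self):
        # An erased block reads back as all 1s.
        self.pages = [bytearray(b"\xff" * PAGE_BYTES) for _ in range(PAGES_PER_BLOCK)]

    def program(self, page_no, data):
        # Programming injects charge: bits may go 1 -> 0, never 0 -> 1.
        assert len(data) == PAGE_BYTES
        page = self.pages[page_no]
        for i, byte in enumerate(data):
            if byte & (0xFF ^ page[i]):
                raise ValueError("can't reset a 0 bit to 1 without erasing the block")
            page[i] &= byte

    def erase(self):
        # Erase releases charge for every cell in the block: all bits back to 1.
        for page in self.pages:
            page[:] = b"\xff" * PAGE_BYTES

blk = FlashBlock()
blk.program(0, bytes([0b10110011]) * PAGE_BYTES)   # fine on an erased page
blk.erase()                                        # the only way to get 1s back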
Background: 4GB flash package (SLC). [Diagram: two dies, each with four planes (planes 0-3), per-plane data registers, and a shared serial output.] MLC (multiple bits per cell): slower, less durable.
Background: SSD structure. Flash Translation Layer (proprietary firmware). [Simplified block diagram of an SSD.]
Write-in-place vs. Logging (What latency can I expect?)
Write-in-Place vs. Logging. Rotating disks: constant map from LBA to on-disk location. SSDs: writes always go to new locations; superseded blocks are cleaned later.
Log-based Writes: Map granularity = 1 block. [Diagram: LBA-to-block map; rewriting page P relocates the whole flash block.] Pages are moved by read-modify-write (in the foreground): write amplification.
Log-based Writes: Map granularity = 1 page. [Diagram: LBA-to-page map; each write of P or Q goes to a fresh page and supersedes the old copy.] Blocks must be cleaned (in the background): write amplification.
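A toy page-granularity FTL, as a hedged sketch of the logging idea (not the talk's simulator): each logical write lands at the current write point, the LBA map is updated, and the previous physical page for that LBA becomes superseded garbage for later cleaning.

PAGES_PER_BLOCK = 64

class PageMapFTL:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.lba_to_page = {}                              # LBA -> (block, page)
        self.valid = [[False] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.write_point = (0, 0)                          # next free (block, page)

    def write(self, lba):
        blk, pg = self.write_point
        old = self.lba_to_page.get(lba)
        if old is not None:                                # supersede the old copy
            self.valid[old[0]][old[1]] = False
        self.lba_to_page[lba] = (blk, pg)
        self.valid[blk][pg] = True
        # Advance the log; a real FTL would trigger background cleaning
        # before the write point wraps into blocks that still hold valid pages.
        pg += 1
        if pg == PAGES_PER_BLOCK:
            blk, pg = (blk + 1) % self.num_blocks, 0
        self.write_point = (blk, pg)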
Log-based Writes: Simple simulation result. Map granularity = flash block (256KB): TPC-C average I/O latency = 20 ms. Map granularity = flash page (4KB): TPC-C average I/O latency = 0.2 ms.
Log-based Writes: Block cleaning. [Diagram: LBA-to-page map; valid pages P, Q, R are relocated before their block is erased.] Move valid pages so the block can be erased. Cleaning efficiency: choose blocks to minimize page movement.
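A greedy cleaner in the same sketch style (again an illustration, not the talk's policy, which is refined for wear-leveling later): pick the block with the fewest valid pages, rewrite those pages elsewhere, and erase the block. The number of pages moved is the cleaning cost, i.e., background write amplification.

def clean_one_block(ftl):
    # Victim = block with the fewest valid pages (best cleaning efficiency).
    victim = min(range(ftl.num_blocks), key=lambda b: sum(ftl.valid[b]))
    moved = 0
    for lba, (blk, pg) in list(ftl.lba_to_page.items()):
        if blk == victim:                       # still-valid page: relocate it
            ftl.write(lba)                      # (a real cleaner would steer the
            moved += 1                          #  write point away from the victim)
    ftl.valid[victim] = [False] * len(ftl.valid[victim])   # block erased
    return moved                                # pages moved = cleaning cost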
Over-provisioning: Putting off the work. Keep extra (unadvertised) blocks. Reduces “pressure” for cleaning. Improves foreground latency. Reduces write amplification due to cleaning.
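A rough back-of-the-envelope model (my assumption, not a result from the talk) of why spare capacity matters: under uniform random writes with an unsophisticated cleaner, a victim block is roughly rho full, where rho is the ratio of advertised to physical capacity, so cleaning rewrites about rho/(1-rho) extra pages per new page, for a write amplification of about 1/(1-rho). Greedy victim selection and delete notification do considerably better, but the trend is the same.

def write_amplification(overprovision):
    # overprovision = spare capacity as a fraction of advertised capacity,
    # e.g. 0.25 means 25% extra (unadvertised) blocks.
    rho = 1.0 / (1.0 + overprovision)           # fraction of physical space in use
    return 1.0 / (1.0 - rho)                    # physical writes per logical write

for op in (0.07, 0.25, 0.50):
    print(f"{op:.0%} spare -> write amplification ~ {write_amplification(op):.1f}")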
Delete Notification: Avoiding the work. The SSD doesn’t know which LBAs are in use, so the logical disk is always full! If the SSD knows which pages are unused, they can be treated as “superseded”: better cleaning efficiency, de-facto over-provisioning. The “Trim” API is an important step forward.
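In the same sketch, a delete notification is tiny: drop the LBA mapping and mark the physical page superseded so the cleaner never has to move it. This is the de-facto over-provisioning the slide describes; the function name trim here is just illustrative, not the actual ATA command interface.

def trim(ftl, lba):
    # Delete notification: the page's contents are no longer needed, so mark
    # it superseded instead of leaving the cleaner to treat it as valid data.
    loc = ftl.lba_to_page.pop(lba, None)
    if loc is not None:
        ftl.valid[loc[0]][loc[1]] = False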
Delete Notification: Cleaning efficiency. Postmark trace: one-third as many pages moved; cleaning efficiency improved by a factor of 3; block lifetime improved.
LBA Map Tradeoffs. Large granularity: simple; small map size; low overhead for a sequential write workload; foreground write amplification (R-M-W). Fine granularity: complex; large map size; can tolerate a random write workload; background write amplification (cleaning).
Write-in-place vs. Logging: Summary. Rotating disks: constant map from LBA to on-disk location. SSDs: dynamic LBA map; various possible strategies; best strategy deeply workload-dependent.
Moving Parts vs. Parallelism (How many IOPS can I get?)
Moving Parts vs. Parallelism. Rotating disks: minimize seek time and the impact of rotational delay. SSDs: maximize the number of operations in flight; keep the chip interconnect manageable.
Improving IOPS: Strategies. Request-queue sort by sector address. Defragmentation. Application-level block ordering. Defragmentation for cleaning efficiency is unproven: the next write might re-fragment. One request at a time per disk head. Null seek time.
Flash Chip Bandwidth. The serial interface is the performance bottleneck: reads are constrained by the serial bus at 25 ns/byte = 40 MB/s (not so great). [Diagram: two dies with per-plane registers sharing an 8-bit serial bus.]
SSD Parallelism: Strategies. Striping. Multiple “channels” to the host. Background cleaning. Operation interleaving. Ganging of flash chips.
Striping. LBAs striped across flash packages: a single request can span multiple chips; natural load balancing. What’s the right stripe size? [Diagram: controller striping LBAs 0-47 round-robin across eight flash packages.]
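A sketch of the stripe mapping, using a hypothetical 8-package layout and a stripe unit of one 4 KB page to match the slide's diagram (both parameters are assumptions, and picking the stripe size is exactly the open question the slide raises):

NUM_PACKAGES = 8
STRIPE_UNIT_PAGES = 1      # one 4 KB page per stripe unit (an assumption)

def stripe(lba_page):
    # Round-robin the logical page address across packages.
    package = (lba_page // STRIPE_UNIT_PAGES) % NUM_PACKAGES
    offset = ((lba_page // (STRIPE_UNIT_PAGES * NUM_PACKAGES)) * STRIPE_UNIT_PAGES
              + lba_page % STRIPE_UNIT_PAGES)
    return package, offset

# Logical pages 0..7 land on packages 0..7; page 8 wraps back to package 0,
# so a large sequential request naturally spans (and load-balances) all chips.
assert [stripe(p)[0] for p in range(9)] == [0, 1, 2, 3, 4, 5, 6, 7, 0]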
Operations in Parallel SSDs are akin to RAID controllers Multiple onboard parallel elements Multiple request streams are needed to achieve maximal bandwidth Cleaning on inactive flash elements Non-trivial scheduling issues Much like “Log-Structured File System”, but at a lower level of the storage stack
Interleaving. Concurrent ops on a package or die, e.g., a register-to-flash “program” on die 0 concurrent with a serial-line transfer on die 1: 25% extra throughput on reads, 100% on writes. Erase is slow and can be concurrent with other ops. [Diagram: two dies with per-plane registers.]
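A back-of-the-envelope check of those interleaving numbers, using assumed SLC timings: about 25 us for an array read, 200 us for a page program, and about 100 us to move a 4 KB page over the 25 ns/byte serial bus. The bus rate is from the earlier slide; the other figures are typical values I am assuming, not numbers from the talk.

XFER, READ, PROG = 100, 25, 200     # microseconds per 4 KB page (assumed)

# Reads: a single die alternates array read and bus transfer; with two dies
# interleaved the shared bus stays busy, so transfer time is the bottleneck.
single_read = READ + XFER                       # 125 us/page
two_die_read = max(XFER, (READ + XFER) / 2)     # 100 us/page (bus-bound)
read_gain = single_read / two_die_read - 1      # ~0.25 -> ~25% extra throughput

# Writes: one die alternates transfer and program; with two dies the program
# on one die overlaps the transfer to the other.
single_write = XFER + PROG                      # 300 us/page
two_die_write = max(2 * XFER, XFER + PROG) / 2  # 150 us/page
write_gain = single_write / two_die_write - 1   # ~1.0 -> ~100% extra throughput

print(f"read gain ~{read_gain:.0%}, write gain ~{write_gain:.0%}")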
Interleaving: Simulation. TPC-C and Exchange: no queuing, no benefit. IOzone and Postmark: the sequential I/O component results in queuing; increased throughput.
Intra-plane Copy-back. Block-to-block transfer internal to the chip, but only within the same plane! Cleaning on-chip! Optimizing for this can hurt load balance; it conflicts with striping, but data needn’t cross the serial I/O pins.
Cleaning with Copy-back: Simulation. Copy-back operation used for intra-plane transfer. TPC-C shows a 40% improvement in cleaning costs. No benefit for IOzone and Postmark, which already have perfect cleaning efficiency.
Ganging. Optimally, all flash chips are independent; in practice, too many wires! Flash packages can share a control bus, with or without separate data channels; operations run in lock-step or are coordinated. Shared-control gang. Shared-bus gang.
Shared-bus Gang: Simulation. Scaling capacity without scaling pin density. The workload (Exchange) requires 900 IOPS; a 16-gang is fast enough.
Parallelism Tradeoffs No one scheme optimal for all workloads With faster serial connect, intra-chip ops are less important
Moving Parts vs. Parallelism: Summary. Rotating disks: seek and rotational optimization; built-in assumptions everywhere. SSDs: operations in parallel are key; lots of opportunities for parallelism, but with tradeoffs.
Failure Modes (When will it wear out?)
Failure Modes: Rotating disks. Media imperfections, loose particles, vibration. Latent sector errors [Bairavasundaram 07], e.g., with uncorrectable ECC: frequency of affected disks increases linearly with time; most affected disks (80%) have < 50 errors; temporal and spatial locality; correlation with recovered errors. Disk scrubbing helps.
Failure Modes: SSDs. Types of NAND flash errors (mostly when erases > wear limit): write errors, probability varies with the number of erasures; read disturb, increases with the number of reads; data retention errors, charge leaks over time. Little spatial or temporal locality (within equally worn blocks). Better ECC can help. Errors increase with wear: need wear-leveling.
Wear-leveling: Motivation. Example: 25% over-provisioning to enhance foreground performance.
Wear-leveling: Motivation. Prematurely worn blocks = reduced over-provisioning = poorer performance.
Wear-leveling: Motivation. Over-provisioning budget consumed: writes no longer possible! Must ensure even wear.
Wear-leveling: Modified "greedy" algorithm. [Diagram: expiry meter for block A; cold content migrated from block B into block A.] If Remaining(A) < Throttle-Threshold, reduce the probability of cleaning A. If Remaining(A) < Migrate-Threshold, clean A, but migrate cold data into A. If Remaining(A) >= Migrate-Threshold, clean A.
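A sketch of that decision rule as code; the threshold values and the throttling probability are placeholders chosen for illustration, and Remaining(A) is expressed as the fraction of block A's erase budget still left.

import random

MIGRATE_THRESHOLD = 0.05     # remaining-lifetime fraction (placeholder value)
THROTTLE_THRESHOLD = 0.20    # placeholder value
THROTTLE_PROB = 0.25         # chance a throttled block is still cleaned (placeholder)

def clean_decision(remaining_fraction, rng=random):
    # remaining_fraction: share of block A's erase budget not yet consumed.
    if remaining_fraction < MIGRATE_THRESHOLD:
        return "clean A, but migrate cold data into it"
    if remaining_fraction < THROTTLE_THRESHOLD and rng.random() > THROTTLE_PROB:
        return "skip A this round (rate-limit its wear)"
    return "clean A normally"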
Wear-leveling: Results. Fewer blocks reach expiry with rate-limiting. Smaller standard deviation of remaining lifetimes with cold-content migration. Cost of migrating cold pages (~5% average latency). [Chart: block wear in IOzone.]
Failure Modes: Summary. Rotating disks: reduce media tolerances; scrubbing to deal with latent sector errors. SSDs: better ECC; wear-leveling is critical; greater density → more errors?
Rotating Disks vs. SSDs ≠ Don’t think of an SSD as just a faster rotating disk Complex firmware/hardware system with substantial tradeoffs
SSD Design Tradeoffs. Write amplification → more wear.
Call To Action Users need help in rationalizing workload-sensitive SSD performance Operation latency Bandwidth Persistence One size doesn’t fit all… manufacturers should help users determine the right fit Open the “black box” a bit Need software-visible metrics
Thanks for your attention!
Additional Resources. USENIX paper: http://research.microsoft.com/users/vijayanp/papers/ssd-usenix08.pdf SSD Simulator download: http://research.microsoft.com/downloads Related Sessions: ENT-C628: Solid State Storage in Server and Data Center Environments (2pm, 11/5)
Please Complete A Session Evaluation Form. Your input is important! Visit the WinHEC CommNet and complete a Session Evaluation for this session and be entered to win one of 150 Maxtor® BlackArmor™ 160GB External Hard Drives. 50 drives will be given away daily! http://www.winhec2008.com BlackArmor Hard Drives provided by:
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
