Design Tradeoffs for SSD Reliability
Speaker: Po-Chuan, Chen
Contents
1. Abstract
2. Introduction
3. Background
4. Design Tradeoffs for SSD Reliability
5. Evaluation of SSD Reliability
6. Holistic Reliability Management
7. Related Work
8. Conclusion
Abstract
• None of the reliability techniques in modern SSDs offers a one-size-fits-all
solution when considering the multi-dimensional requirements.
• Examining the design tradeoffs
• Proposing a holistic reliability management scheme
1. Selectively employs redundancy
2. Conditionally re-reads
3. Judiciously selects data to scrub
Introduction
• Flash memory-based SSDs have become mainstream storage devices
• But the drive for high storage density has made flash memory less
reliable and more error-prone
• The high error rates in today’s flash memory have several causes:
 wear-and-tear
 gradual charge leakage
 data disturbance
This variety means that there is no one-size-fits-all solution for data
protection and recovery:
each technique has a multi-dimensional design tradeoff that makes it necessary to
compositionally combine complementary solutions.
However, these reliability enhancements in turn cause performance degradation in the SSD.
When?
 The use of data re-read mechanisms should be managed
 In the absence of random and sporadic errors, the overheads of
intra-SSD redundancy outweigh its benefits in terms of
performance, write amplification, and reliability
 SSD-internal scrubbing reduces the error-induced long-tail latencies,
but the internal traffic it adds can negate its benefits
How?
• A holistic reliability management scheme
 Conditionally uses the data re-read mechanism to reduce the effects of
read disturbance
 Judiciously selects data to scrub so that the internal relocation
traffic is managed.
 Redundancy is applied only to infrequently accessed cold data to
reduce write amplification
 Frequently read read-hot data are selected for scrubbing based on a
cost-benefit analysis
Errors in Flash Memory
• Three major sources of flash memory errors
 Wear
 Retention loss
 Disturbance
Disturbance and retention errors are opposing error mechanisms, but
they do not necessarily cancel each other out
RBER (Raw Bit Error Rate): the raw bit error rate, which directly reflects
the initial reliability of the NAND flash
SSD Reliability Enhancement Techniques
After data is read, the stored ECC code is compared with the ECC code generated
from the read data. If there is any mismatch, the parity bits are decoded to
determine which bit is in error, and it is corrected immediately (a minimal
sketch follows).
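To make the compare-and-correct idea concrete, here is a minimal Python sketch of a single-error-correcting Hamming code. Real SSDs use far stronger BCH or LDPC codes; this only illustrates how a parity mismatch (the syndrome) pinpoints and flips the faulty bit. The function names are illustrative.

```python
def hamming_encode(data_bits):
    """Encode data bits; parity bits sit at power-of-two positions (1-indexed)."""
    r = 0
    while (1 << r) < len(data_bits) + r + 1:
        r += 1
    code = [0] * (len(data_bits) + r)
    bits = iter(data_bits)
    for i in range(1, len(code) + 1):
        if i & (i - 1):                 # not a power of two -> data position
            code[i - 1] = next(bits)
    for p in range(r):                  # set each parity bit so that the XOR
        mask = 1 << p                   # over all positions it covers is 0
        parity = 0
        for i in range(1, len(code) + 1):
            if (i & mask) and i != mask:
                parity ^= code[i - 1]
        code[mask - 1] = parity
    return code

def hamming_correct(code):
    """Recompute parity on read; a nonzero syndrome is the error position."""
    syndrome = 0
    for i, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= i
    if syndrome:                        # mismatch -> flip the faulty bit
        code[syndrome - 1] ^= 1
    return code

# Usage: encode 4 data bits, flip one bit "in flash", read and correct it.
codeword = hamming_encode([1, 0, 1, 1])
codeword[4] ^= 1                        # single bit error
assert hamming_correct(codeword) == hamming_encode([1, 0, 1, 1])
```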
Design Tradeoffs for SSD Reliability
 Error-Prone Flash Memory
 Mechanism in Flash Memory Controller
 Role of Flash Translation Layer
Error-Prone Flash Memory
Mechanism in Flash Memory Controller
• The flash memory controller not only abstracts the operational details of
flash memory, but also handles common-case error correction.
• Errors beyond the hard-decision ECC’s correction strength are
subsequently handled by the flash memory controller with data re-reads
(sketched below).
• If the data cannot be recovered within the maximum retry count, the firmware
is notified of a read error.
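A minimal, runnable sketch of this read path, assuming a simplified model in which each re-read scales the effective RBER down by a constant factor and hard-decision ECC succeeds when the drawn number of bit errors is within its correction strength. All constants are illustrative, not the paper's parameters.

```python
import random

PAGE_BITS = 16 * 1024 * 8  # 16 KiB page
ECC_STRENGTH = 72          # correctable bits per page (illustrative)
MAX_RETRIES = 8
RETRY_SCALE = 0.5          # effective RBER reduction per successive read

class ReadError(Exception):
    """Raised when the firmware must be notified of an uncorrectable read."""

def ecc_ok(rber):
    # Draw the bit-error count for one page read; hard-decision ECC
    # succeeds if it is within the correction strength.
    errors = sum(random.random() < rber for _ in range(PAGE_BITS))
    return errors <= ECC_STRENGTH

def read_page(rber):
    if ecc_ok(rber):                  # common-case correction, no retry
        return 0
    for retry in range(1, MAX_RETRIES + 1):
        rber *= RETRY_SCALE           # threshold tuning / soft decoding
        if ecc_ok(rber):
            return retry              # number of re-reads this read paid
    raise ReadError("max retry count exceeded")

print(read_page(1e-3))                # high RBER -> likely needs re-reads
```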
Solution
Model threshold voltage tuning and soft-decision decoding such that
each successive read effectively reduces the RBER of the data by a
retry scale factor.
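Written out (a reconstruction of the model just stated, with $s$ the retry scale factor and $n$ the number of re-reads):

$$\mathrm{RBER}_{\mathrm{eff}}(n) = \mathrm{RBER}_0 \cdot s^{\,n}, \qquad 0 < s < 1$$

A read succeeds once $\mathrm{RBER}_{\mathrm{eff}}(n)$ brings the expected number of bit errors within the ECC correction strength.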
Role of Flash Translation Layer
The flash translation layer (FTL) consists of a number of SSD-internal
housekeeping tasks that collectively hide the quirks of flash memory
and provide an illusion of a traditional block device
 Intra-SSD redundancy (reconstructs data when ECC fails; sketched below)
 Data scrubbing (prevents errors from accumulating by relocating
data in the background)
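A minimal sketch of one intra-SSD redundancy scheme, assuming RAID-5-style striping with one XOR parity page per stripe (the stripe layout and function names are illustrative): any single lost page can be rebuilt from the survivors, which is also exactly why correlated multi-page failures defeat it.

```python
from functools import reduce

def make_parity(pages):
    """Parity page = byte-wise XOR of all pages in the stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*pages))

def reconstruct(stripe, parity, lost):
    """Rebuild the page at index `lost` from the surviving pages + parity.
    Silently produces garbage if more than one page in the stripe is lost."""
    survivors = [p for i, p in enumerate(stripe) if i != lost]
    return make_parity(survivors + [parity])

stripe = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
parity = make_parity(stripe)
assert reconstruct(stripe, parity, 1) == stripe[1]   # ECC failed on page 1
```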
Evaluation of SSD Reliability
DiskSim environment (simulate SSD environment)
• Error Correction Code
• Intra-SSD Redundancy
• Background Scrubbing
• Retention Test
Error Correction Code
The figure shows the average response time for read requests for the three
types of SSDs at various wear states.
Write amplification
The amount of data physically written to flash is a multiple of the amount of data written by the host.
Ref : https://en.wikipedia.org/wiki/Write_amplification
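For reference, the standard definition (per the Wikipedia article above):

$$\mathrm{WA} = \frac{\text{data written to flash memory}}{\text{data written by the host}}$$

For example, if the host writes 1 GiB but garbage collection and scrubbing relocations cause 2.5 GiB of flash writes, WA = 2.5; intra-SSD redundancy raises it further by adding parity writes.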
Intra-SSD Redundancy
• It performs no better than the baseline, accelerates wear through increased
write amplification, and, worse, may not fully recover data due to
correlated failures.
Background Scrubbing
The scrubber’s performance overhead is lower than that of the redundancy
scheme, and the increase in write amplification only occurs towards
the end-of-life phase.
Retention Test
This section explores the effects of data loss due to charge leakage
by initializing a non-zero time-since-written value for each piece of data
(see the sketch below).
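A minimal sketch of how such a test might seed the simulated pages, assuming a simplified model in which retention RBER grows with time since written. The linear growth model and all constants are illustrative assumptions, not the paper's values.

```python
BASE_RBER = 1e-6         # RBER right after programming (illustrative)
RETENTION_RATE = 1e-7    # added RBER per hour of retention (illustrative)

def retention_rber(age_hours):
    # Charge leakage accumulates with time since the page was written.
    return BASE_RBER + RETENTION_RATE * age_hours

# Initialize each page with a non-zero time-since-written instead of zero.
pages = [{"addr": a, "age_hours": 24 * 30} for a in range(4)]  # 30 days old
for page in pages:
    page["rber"] = retention_rber(page["age_hours"])
```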
Discussion
• Data re-read:
Degrades performance, increasing the average response time
• Intra-SSD redundancy:
Bad for performance, write amplification, and reliability,
but the only way to recover data from a random chip or word-line failure
• Data scrubbing:
Reduces the performance degradation, but is not a cure-all
However
 The data re-read mechanism we modeled is too optimistic, as it
eventually corrects errors given enough re-reads
 The short 1-hour experiments are insufficient to show UBER
< 10^-15; I/Os on the order of petabytes are required to experimentally
show this level of reliability (see the arithmetic after this list)
 Real flash memories nevertheless exhibit random and sporadic faults
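The petabyte claim follows directly from the standard UBER definition:

$$\mathrm{UBER} = \frac{\text{uncorrectable bit errors}}{\text{total bits read}}$$

Demonstrating UBER < 10^-15 requires reading more than 10^15 bits (about 125 TB) without a single uncorrectable error, and a statistically meaningful margin pushes this to the petabyte scale, far beyond a 1-hour experiment.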
Holistic Reliability Management
• Redundancy should be selectively applied only to infrequently accessed
cold data to reduce write amplification while providing protection against
retention errors
• Frequently read read-hot data should be relocated through scrubbing to
reduce data re-reads, but the benefit of scrubbing should be weighed
against the cost of data relocation (as sketched below)
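A minimal sketch of these two HRM policies, assuming per-block access statistics are available. All thresholds, latencies, and field names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

REREAD_LATENCY = 100e-6      # extra seconds per read retry (illustrative)
PAGE_WRITE_LATENCY = 600e-6  # seconds per page relocation (illustrative)
WINDOW = 60.0                # decision horizon in seconds (illustrative)
COLD_WRITE_FREQ = 0.01       # writes/sec below which data counts as cold

@dataclass
class Block:
    read_freq: float         # reads per second hitting this block
    write_freq: float        # writes per second hitting this block
    expected_rereads: float  # average retries per read at current RBER
    valid_pages: int

def needs_redundancy(b):
    # Parity only for cold data: little extra write amplification,
    # while still protecting against retention errors.
    return b.write_freq < COLD_WRITE_FREQ

def scrub_benefit(b):
    # Re-read latency the block's readers keep paying if we do NOT scrub.
    return b.read_freq * WINDOW * b.expected_rereads * REREAD_LATENCY

def scrub_cost(b):
    # One-time relocation traffic for scrubbing the block now.
    return b.valid_pages * PAGE_WRITE_LATENCY

def select_scrub_victims(blocks):
    # Scrub only read-hot blocks where saved re-reads outweigh relocation.
    return [b for b in blocks if scrub_benefit(b) > scrub_cost(b)]
```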
Workload and Test Settings
Which scheme to use?
 ∞-bit ECC: corrects all errors (baseline)
(performance degradation only from queueing delays & garbage collection)
 ECC + re-read: the flash memory controller repeatedly re-reads until
the data is corrected
 Oracle scrub: knows where the errors are and preventively relocates data
before errors accumulate
(if it fails → falls back to the ECC + re-read approach)
 HRM: the proposed holistic reliability management scheme (performs best)
Experimental Results
Related work
 Reliability Enhancement
 Voltage prediction
 Intra-SSD redundancy
 QoS Performance
 AutoSSD
 RLGC
 ttFlash
Conclusion
• Proposing a reliability management scheme that selectively applies
appropriate techniques to different data
 The limitations of our study reveal the necessity of an integrated
SSD-level design framework
 Mathematically model the effectiveness of data re-reads and data
scrubbing