Design Tradeoffs for SSD Reliability
Speaker: Po-Chuan, Chen
Contents
1. Abstract
2. Introduction
3. Background
4. Design Tradeoffs for SSD Reliability
5. Evaluation of SSD Reliability
6. Holistic Reliability Management
7. Related Work
8. Conclusion
Abstract
• None of the reliability techniques in modern SSDs offers a one-size-fits-all
solution when considering the multi-dimensional requirements.
• Examining the design tradeoffs
• Proposing a holistic reliability management scheme
1. Selectively employs redundancy
2. Conditionally re-reads
3. Judiciously selects data to scrub
Introduction
• Flash memory-based SSDs have become mainstream storage devices
• But the drive for high storage density has made flash memory less
reliable and more error-prone
• The high error rates in today’s flash memory have several causes:
 wear-and-tear
 gradual charge leakage
 data disturbance
This variety means that there is no one-size-fits-all solution for data
protection and recovery:
each technique has a multi-dimensional design tradeoff that makes it necessary to
compositionally combine complementary solutions.
However, these reliability enhancements in turn cause performance degradation in the SSD.
When?
 The use of data re-read mechanisms should be managed
 In the absence of random and sporadic errors, the overheads of
intra-SSD redundancy outweigh its benefits in terms of
performance, write amplification, and reliability
 SSD-internal scrubbing reduces the error-induced long-tail latencies,
but the internal traffic it adds can negate its benefits
How?
• A holistic reliability management scheme
 Conditionally uses the data re-read mechanism to reduce the effects of
read disturbance
 Judiciously selects data to scrub so that the internal relocation
traffic is managed.
 Redundancy is applied only to infrequently accessed cold data to
reduce write amplification
 Frequently read read-hot data are selected for scrubbing based on a
cost-benefit analysis
Errors in Flash Memory
• Three major sources of flash memory errors
 Wear
 Retention loss
 Disturbance
Disturbance and retention errors are opposing error mechanisms, but
they do not necessarily cancel each other out
RBER (Raw Bit Error Rate): the raw bit error rate, which directly reflects
the initial reliability of the NAND flash
SSD Reliability Enhancement Techniques
After data is read, the stored ECC code is compared with the ECC code generated
from the read data. If there is any mismatch, the parity bits are decoded to
determine which bit is in error, and it is corrected immediately (a minimal
sketch follows).
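To make the compare-and-correct idea concrete, here is a minimal Python sketch of a single-error-correcting Hamming code. Real SSDs use far stronger BCH or LDPC codes; this only illustrates how a parity mismatch (the syndrome) pinpoints and flips the faulty bit. The function names are illustrative.

```python
def hamming_encode(data_bits):
    """Encode data bits; parity bits sit at power-of-two positions (1-indexed)."""
    r = 0
    while (1 << r) < len(data_bits) + r + 1:
        r += 1
    code = [0] * (len(data_bits) + r)
    bits = iter(data_bits)
    for i in range(1, len(code) + 1):
        if i & (i - 1):                 # not a power of two -> data position
            code[i - 1] = next(bits)
    for p in range(r):                  # set each parity bit so that the XOR
        mask = 1 << p                   # over all positions it covers is 0
        parity = 0
        for i in range(1, len(code) + 1):
            if (i & mask) and i != mask:
                parity ^= code[i - 1]
        code[mask - 1] = parity
    return code

def hamming_correct(code):
    """Recompute parity on read; a nonzero syndrome is the error position."""
    syndrome = 0
    for i, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= i
    if syndrome:                        # mismatch -> flip the faulty bit
        code[syndrome - 1] ^= 1
    return code

# Usage: encode 4 data bits, flip one bit "in flash", read and correct it.
codeword = hamming_encode([1, 0, 1, 1])
codeword[4] ^= 1                        # single bit error
assert hamming_correct(codeword) == hamming_encode([1, 0, 1, 1])
```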
Design Tradeoffs for SSD Reliability
 Error-Prone Flash Memory
 Mechanism in Flash Memory Controller
 Role of Flash Translation Layer
Error-Prone Flash Memory
Mechanism in Flash Memory Controller
• The flash memory controller not only abstracts the operational details of
flash memory, but also handles common-case error correction.
• Errors beyond the hard-decision ECC’s correction strength are
subsequently handled by the flash memory controller with data re-reads
(sketched below).
• If the data cannot be recovered within the maximum retry count, the firmware
is notified of a read error.
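A minimal, runnable sketch of this read path, assuming a simplified model in which each re-read scales the effective RBER down by a constant factor and hard-decision ECC succeeds when the drawn number of bit errors is within its correction strength. All constants are illustrative, not the paper's parameters.

```python
import random

PAGE_BITS = 16 * 1024 * 8  # 16 KiB page
ECC_STRENGTH = 72          # correctable bits per page (illustrative)
MAX_RETRIES = 8
RETRY_SCALE = 0.5          # effective RBER reduction per successive read

class ReadError(Exception):
    """Raised when the firmware must be notified of an uncorrectable read."""

def ecc_ok(rber):
    # Draw the bit-error count for one page read; hard-decision ECC
    # succeeds if it is within the correction strength.
    errors = sum(random.random() < rber for _ in range(PAGE_BITS))
    return errors <= ECC_STRENGTH

def read_page(rber):
    if ecc_ok(rber):                  # common-case correction, no retry
        return 0
    for retry in range(1, MAX_RETRIES + 1):
        rber *= RETRY_SCALE           # threshold tuning / soft decoding
        if ecc_ok(rber):
            return retry              # number of re-reads this read paid
    raise ReadError("max retry count exceeded")

print(read_page(1e-3))                # high RBER -> likely needs re-reads
```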
Solution
Model threshold voltage tuning and soft-decision decoding such that
each successive read effectively reduces the RBER of the data by a
retry scale factor.
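Written out (a reconstruction of the model just stated, with $s$ the retry scale factor and $n$ the number of re-reads):

$$\mathrm{RBER}_{\mathrm{eff}}(n) = \mathrm{RBER}_0 \cdot s^{\,n}, \qquad 0 < s < 1$$

A read succeeds once $\mathrm{RBER}_{\mathrm{eff}}(n)$ brings the expected number of bit errors within the ECC correction strength.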
Role of Flash Translation Layer
The flash translation layer (FTL) consists of a number of SSD-internal
housekeeping tasks that collectively hide the quirks of flash memory
and provide an illusion of a traditional block device
 Intra-SSD redundancy (reconstructs data when ECC fails; sketched below)
 Data scrubbing (prevents errors from accumulating by relocating
data in the background)
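A minimal sketch of one intra-SSD redundancy scheme, assuming RAID-5-style striping with one XOR parity page per stripe (the stripe layout and function names are illustrative): any single lost page can be rebuilt from the survivors, which is also exactly why correlated multi-page failures defeat it.

```python
from functools import reduce

def make_parity(pages):
    """Parity page = byte-wise XOR of all pages in the stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*pages))

def reconstruct(stripe, parity, lost):
    """Rebuild the page at index `lost` from the surviving pages + parity.
    Silently produces garbage if more than one page in the stripe is lost."""
    survivors = [p for i, p in enumerate(stripe) if i != lost]
    return make_parity(survivors + [parity])

stripe = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
parity = make_parity(stripe)
assert reconstruct(stripe, parity, 1) == stripe[1]   # ECC failed on page 1
```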
Evaluation of SSD Reliability
DiskSim environment (simulate SSD environment)
• Error Correction Code
• Intra-SSD Redundancy
• Background Scrubbing
• Retention Test
Error Correction Code
The figure shows the average response time for read requests for the three
types of SSDs at various wear states.
Write amplification
The amount of data physically written to flash is a multiple of the amount of data written by the host.
Ref : https://en.wikipedia.org/wiki/Write_amplification
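For reference, the standard definition (per the Wikipedia article above):

$$\mathrm{WA} = \frac{\text{data written to flash memory}}{\text{data written by the host}}$$

For example, if the host writes 1 GiB but garbage collection and scrubbing relocations cause 2.5 GiB of flash writes, WA = 2.5; intra-SSD redundancy raises it further by adding parity writes.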
Intra-SSD Redundancy
• It performs no better than the baseline, accelerates wear through increased
write amplification, and, worse, may not fully recover data due to
correlated failures.
Background Scrubbing
The scrubber’s performance overhead is lower than that of the redundancy
scheme, and the increase in write amplification only occurs towards
the end-of-life phase.
Retention Test
This section explores the effects of data loss due to charge leakage
by initializing a non-zero time-since-written value for each piece of data
(see the sketch below).
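A minimal sketch of how such a test might seed the simulated pages, assuming a simplified model in which retention RBER grows with time since written. The linear growth model and all constants are illustrative assumptions, not the paper's values.

```python
BASE_RBER = 1e-6         # RBER right after programming (illustrative)
RETENTION_RATE = 1e-7    # added RBER per hour of retention (illustrative)

def retention_rber(age_hours):
    # Charge leakage accumulates with time since the page was written.
    return BASE_RBER + RETENTION_RATE * age_hours

# Initialize each page with a non-zero time-since-written instead of zero.
pages = [{"addr": a, "age_hours": 24 * 30} for a in range(4)]  # 30 days old
for page in pages:
    page["rber"] = retention_rber(page["age_hours"])
```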
Discussion
• Data re-read:
Degrades performance, increasing the average response time
• Intra-SSD redundancy:
Bad for performance, write amplification, and reliability,
but the only way to recover data from a random chip or word-line failure
• Data scrubbing:
Reduces the performance degradation, but is not a cure-all
However
 The data re-read mechanism we modeled is too optimistic, as it
eventually corrects errors given enough re-reads
 The short 1-hour experiments are insufficient to show UBER
< 10^-15; I/Os on the order of petabytes are required to experimentally
show this level of reliability (see the arithmetic after this list)
 Real flash memories nevertheless exhibit random and sporadic faults
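The petabyte claim follows directly from the standard UBER definition:

$$\mathrm{UBER} = \frac{\text{uncorrectable bit errors}}{\text{total bits read}}$$

Demonstrating UBER < 10^-15 requires reading more than 10^15 bits (about 125 TB) without a single uncorrectable error, and a statistically meaningful margin pushes this to the petabyte scale, far beyond a 1-hour experiment.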
Holistic Reliability Management
• Redundancy should be selectively applied only to infrequently accessed
cold data to reduce write amplification while providing protection against
retention errors
• Frequently read read-hot data should be relocated through scrubbing to
reduce data re-reads, but the benefit of scrubbing should be weighed
against the cost of data relocation (as sketched below)
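A minimal sketch of these two HRM policies, assuming per-block access statistics are available. All thresholds, latencies, and field names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

REREAD_LATENCY = 100e-6      # extra seconds per read retry (illustrative)
PAGE_WRITE_LATENCY = 600e-6  # seconds per page relocation (illustrative)
WINDOW = 60.0                # decision horizon in seconds (illustrative)
COLD_WRITE_FREQ = 0.01       # writes/sec below which data counts as cold

@dataclass
class Block:
    read_freq: float         # reads per second hitting this block
    write_freq: float        # writes per second hitting this block
    expected_rereads: float  # average retries per read at current RBER
    valid_pages: int

def needs_redundancy(b):
    # Parity only for cold data: little extra write amplification,
    # while still protecting against retention errors.
    return b.write_freq < COLD_WRITE_FREQ

def scrub_benefit(b):
    # Re-read latency the block's readers keep paying if we do NOT scrub.
    return b.read_freq * WINDOW * b.expected_rereads * REREAD_LATENCY

def scrub_cost(b):
    # One-time relocation traffic for scrubbing the block now.
    return b.valid_pages * PAGE_WRITE_LATENCY

def select_scrub_victims(blocks):
    # Scrub only read-hot blocks where saved re-reads outweigh relocation.
    return [b for b in blocks if scrub_benefit(b) > scrub_cost(b)]
```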
Workload and Test Settings
Which scheme to use?
 ∞-bit ECC: corrects all errors (baseline)
(performance degradation only from queueing delays & garbage collection)
 ECC + re-read: the flash memory controller repeatedly re-reads until
the data is corrected
 Oracle scrub: knows where the errors are and preventively relocates data
before errors accumulate
(if it fails → falls back to the ECC + re-read approach)
 HRM: the proposed holistic reliability management scheme (performs best)
Experimental Results
Related work
 Reliability Enhancement
 Voltage prediction
 Intra-SSD redundancy
 QoS Performance
 AutoSSD
 RLGC
 ttFlash
Conclusion
• Proposing a reliability management scheme that selectively applies
appropriate techniques to different data
 The limitations of our study reveal the necessity of an integrated
SSD-level design framework
 Mathematically model the effectiveness of data re-reads and data
scrubbing