As high-end product requirements grow from generation to generation, and NAND cell distributions show higher variance due to process shrinking and 3D stacking, increasing reliability becomes a real need.
Learn how you can leverage overprovisioning to improve the reliability of flash.
What if we could achieve 50% more flash reliability for the same cost?
1. Employing ECCs via Overprovisioning to Improve Flash Reliability: A New, Cost Efficient Approach
Stella Achtenberg, Eran Sharon, Idan Alrod
Advanced Memory Solutions
Flash Memory Summit 2016 | Santa Clara, CA
3-8-2016
2. NAND Memory
Enterprise SSD
Client SSD
Mobile
3. Handling Random Errors
[Figure: error-handling pipeline — starting from Raw NAND with many errors, DSP techniques, Advanced ECC, and second-level error correction progressively reduce the error rate down to few errors.]
6. Handling Physical Defects
Overprovisioning and RAID trade off reliability against performance.
7. Storage Reliability Requirements
A metric for the occurrence of data errors per bits read:
UBER = number of data errors / number of bits read
Extremely low UBER requirements: < 10^-18
DPPM = Defective Parts per Million
[Table: UBER and DPPM targets for Enterprise SSD and Client SSD.]
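The definition above can be turned into a one-line check; the drive size and error count below are hypothetical numbers for illustration only:

```python
def uber(data_errors: int, bits_read: int) -> float:
    """Uncorrectable Bit Error Rate: data errors per bit read."""
    return data_errors / bits_read

# Hypothetical drive: 1 uncorrected error observed after reading 10 PB (8e16 bits).
rate = uber(1, 8 * 10**16)
print(rate)          # 1.25e-17
print(rate < 1e-18)  # False: still above the 10^-18 target
```

Even a single uncorrected error over petabytes of reads can violate the target, which is why the requirement is so demanding.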
8. Problem statement
3-Dimensional stacking and process scaling increase RBER variability.
[Figure: the RBER distribution widens (Low to High); meeting UBER < 10^-18 across the wider distribution leaves less reliability margin.]
9. Overprovisioning potential
ECC overprovisioning handles random errors; RAID overprovisioning handles physical defects.
Joint RAID & ECC:
• Lower UBER/DPPM
• Higher endurance
10. Case study: 32 Die XOR RAID
Data Page 0 (ECC 0)
Data Page 1 (ECC 1)
Data Page 30 (ECC 30)
…
Parity Page 31
11. Current solution
Decode each page using soft information (LDPC_i).
In case of failure, decode the entire RAID stripe by XORing the surviving codewords E_0, …, E_31:
– Single error → Recovery
– More than a single error → UECC (data loss)
[Diagram: a 1st failure is recovered via the XOR E_0 + … + E_31; a 2nd failure in the same stripe → UECC (data loss).]
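The stripe-level recovery described above can be sketched in a few lines (toy page size and hypothetical helper names; a real implementation operates on NAND pages and LDPC codewords):

```python
import secrets
from functools import reduce

PAGE_BYTES = 16  # toy page size; real flash pages are several KB

def xor_pages(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length pages."""
    return bytes(x ^ y for x, y in zip(a, b))

# 31 data pages plus one XOR parity page form the stripe.
data = [secrets.token_bytes(PAGE_BYTES) for _ in range(31)]
parity = reduce(xor_pages, data)

# Suppose LDPC decoding of page 7 fails (the 1st failure in the stripe):
lost = 7
survivors = [p for i, p in enumerate(data) if i != lost]
recovered = reduce(xor_pages, survivors + [parity])
assert recovered == data[lost]  # single failure: fully recovered

# A 2nd failure in the same stripe cannot be recovered this way -> UECC.
```

The XOR of the 31 surviving pages with the parity page cancels everything except the lost page, which is exactly the single-failure recovery guarantee of this scheme.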
12. New methodology
Optimal information exchange between RAID & LDPC:
– Extrinsic "soft" LDPC output into the RAID
– Updated "soft" RAID output back to the LDPC
– Iterate until convergence or timeout
Soft XOR (SXOR) update:
T_i = φ^-1( Σ_{j≠i} φ(E_j) ), where φ(x) = {sign(x), −log tanh(|x|/2)}
[Diagram: each per-page decoder LDPC_i exchanges messages (P_i, Qin_i, Qout_i, T_i, E_i) with the soft-XOR node over E_0, …, E_31.]
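A minimal sketch of the SXOR update, using the tanh form that is equivalent to the φ transform above (the LLR values are illustrative):

```python
import math

def soft_xor_extrinsic(llrs, i):
    """Check-node (soft XOR) update: T_i = phi^-1(sum_{j != i} phi(E_j)),
    computed via the equivalent tanh rule 2*atanh(prod_{j != i} tanh(E_j / 2)).
    Multiplying the signed tanh values handles the sign part of phi implicitly."""
    prod = 1.0
    for j, e in enumerate(llrs):
        if j != i:
            prod *= math.tanh(e / 2.0)
    return 2.0 * math.atanh(prod)

# Toy stripe of 4 extrinsic LLR outputs E_j from the per-page LDPC decoders:
E = [4.0, 3.5, -5.0, 0.2]
T0 = soft_xor_extrinsic(E, 0)
# The sign of T0 is the XOR of the hard decisions of E_1, E_2, E_3 (negative here),
# and its magnitude is dominated by the least reliable input (|E_3| = 0.2).
print(round(T0, 3))
```

This is the same update used at the check nodes of an LDPC decoder, so the RAID parity page behaves like one extra, very long parity check shared across all 32 codewords.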
13. Previous Work
"Error Correction Using Multiple Data Sources" – US patent application by Sharon et al. (SanDisk, 2014)
"Soft Decision Decoding of RAID Stripe for Higher Endurance of Flash Memory Based Solid State Drives" – Ravi Motwani and Chong Ong (Intel, 2015)
17. Simple joint RAID & ECC
Simple variant using standard hardware:
– XOR page as a "virtual" Soft Bit page
– Dedicated LLR table emulates LLR summation
[Diagram: NAND supplies the hard bit and SBs to the standard LDPC decoder; the standard XOR engine produces the virtual Soft Bit page, which the LLR table maps into the decoder, yielding the corrected hard bit.]
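A toy sketch of the virtual-soft-bit idea (the table entries are illustrative, not the actual LLR values used in any product): the XOR-recovered page acts as a second opinion on each bit, and a small table maps each (hard bit, agreement) pair to an input LLR, emulating LLR summation:

```python
# Hypothetical LLR table; magnitudes are illustrative only.
# Key: (hard_bit, agrees_with_virtual_soft_bit) -> LLR (positive favors bit 0).
LLR_TABLE = {
    (0, True):  +6.0,   # both sources agree the bit is 0: high confidence
    (0, False): +1.0,   # XOR-recovered bit disagrees: low confidence
    (1, True):  -6.0,
    (1, False): -1.0,
}

def llrs_from_virtual_soft_bit(hard_bits, xor_recovered_bits):
    """Derive decoder-input LLRs from the hard page and the 'virtual'
    soft-bit page recovered via the standard XOR engine."""
    return [LLR_TABLE[(h, h == x)]
            for h, x in zip(hard_bits, xor_recovered_bits)]

print(llrs_from_virtual_soft_bit([0, 1, 1, 0], [0, 1, 0, 1]))
# -> [6.0, -6.0, -1.0, 1.0]
```

Because only the table is new, the standard LDPC and XOR engines are reused unchanged, which is what keeps this variant low-cost.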
18. UBER improvement
[Figure: UBER vs. RBER (Low to High), showing ×1.5 and ×2 correction-capability gains.]
19. Existing versus New
Independent LDPC & RAID (existing):
– Per-page LDPC_i; a 1st failure is recovered via the XOR E_0 + … + E_31; a 2nd failure → data loss
– Single failure recovery
Joint LDPC & RAID (new):
– Standard HW: standard LDPC, standard XOR, and an LLR table; the XOR page is used as a virtual Soft Bit (SB) page
– Correcting up to 32 failures
– Substantially reduces UBER
21. BCH (Bose, Chaudhuri, Hocquenghem)
– Simple hardware
– Constant latency
– Cannot use soft information
– Lower correction capability
[Figure: ECC failure probability vs. RBER — LDPC using soft information offers > ×3 the correction capability of BCH.]
22. Generating Soft Information
Soft-Bit read (±Δ around the read thresholds) divides the cell population into two categories:
– A population of reliable cells, exhibiting low BER
– A population of unreliable cells, exhibiting high BER
[Figure: cell voltage distributions with "less reliable" regions around each read threshold.]
23. Joint Hard Decoding and RAID
– Codeword read from Flash: error rate RBER → BCH fails decoding
– Codeword recovered from XOR: BER_XOR = ½·(1 − (1 − 2·RBER)^k) ≈ k·RBER → BCH fails decoding
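The BER_XOR formula holds because an XOR-recovered bit is wrong exactly when an odd number of the k input bits are wrong; a quick numeric check with a hypothetical RBER:

```python
def ber_xor(rber: float, k: int) -> float:
    """Bit error rate of a page recovered by XORing k pages, each at RBER:
    an XOR bit is wrong iff an odd number of the k input bits are wrong."""
    return 0.5 * (1.0 - (1.0 - 2.0 * rber) ** k)

rber, k = 1e-3, 31          # hypothetical RBER; 31 surviving pages in the stripe
exact = ber_xor(rber, k)
approx = k * rber
print(exact)    # ~0.0301 (exact odd-parity probability)
print(approx)   # k*RBER approximation, within a few percent here
```

For small RBER the linear approximation k·RBER is accurate, which is why the XOR-recovered codeword has a moderate but not catastrophic error rate.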
24. Joint Hard Decoding and RAID
– Codeword read from Flash: RBER = α·p_high + (1 − α)·p_low, mixing a fraction α of high-RBER cells (p_high) with low-RBER cells (p_low) → BCH fails decoding
– Codeword recovered from XOR: BER_XOR = ½·(1 − (1 − 2·RBER)^k) ≈ k·RBER → BCH fails decoding
– Read the Soft Bit page, indicating the unreliable cells
25. Joint Hard Decoding and RAID
Combined codeword: take the unreliable positions from the XOR-recovered CW (moderate BER ≈ k·RBER) and the rest from the original low-RBER CW:
RBER_combined = α·BER_XOR + (1 − α)·p_low < RBER → BCH succeeds
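Plugging in hypothetical numbers (α, p_high, p_low below are illustrative) shows why the combined codeword lands back inside BCH's correction range:

```python
def ber_xor(rber, k):
    """Error rate of a page recovered by XORing k pages at the given RBER."""
    return 0.5 * (1 - (1 - 2 * rber) ** k)

# Hypothetical mixture: alpha = 2% unreliable cells at p_high, the rest at p_low.
alpha, p_high, p_low, k = 0.02, 0.20, 1e-4, 31

rber = alpha * p_high + (1 - alpha) * p_low            # original codeword
rber_combined = alpha * ber_xor(rber, k) + (1 - alpha) * p_low

print(rber)            # ~4.10e-3: dominated by the unreliable cells
print(rber_combined)   # ~2.35e-3: unreliable positions replaced by XOR bits
```

The unreliable positions go from p_high (20% here) down to BER_XOR (~11% here), so the weighted mixture drops below the original RBER even though the XOR page is itself far from clean.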
26. UBER improvement
[Figure: UBER vs. RBER, showing a ×1.5 correction-capability gain.]
27. Summary
Storage systems require very high reliability.
3-Dimensional stacking and process scaling increase RBER variability, compromising reliability.
Joint RAID & ECC enhances reliability without adding cost:
– Soft decoder: low-complexity joint RAID & LDPC
– Hard decoder: joint RAID & BCH
28. Summary
[Figure: ECC failure probability vs. raw bit error rate — the new joint BCH & RAID scheme improves BCH correction capability by ×1.5, and the new joint LDPC & RAID scheme improves LDPC by ×1.5.]
High-end products are designed to handle both random errors and gross physical defects by employing two independent protection levels, the ECC and the RAID, with dedicated overprovisioning for each.
My focus today is how to get more with less: how to combine the two existing protection levels and leverage this combination with cost-effective algorithms.
One approach is based on performing iterative soft-information exchange between the RAID and a soft-decoding ECC such as LDPC.
A second approach enables utilization of soft information and the RAID overprovisioning for a hard-decoding ECC such as BCH.
Both the soft and hard joint decoding schemes provide 50% higher resilience to random errors with the same overprovisioning.
A substantial portion of the NVM storage market is NAND storage.
Let's use this 3D building as an analogy for our flash memory: the building is a die, each floor is a wordline (WL), and each window is a cell.
NAND storage supplies solutions to many product lines, such as Mobile, Client SSD, high-end SSD, and Enterprise.
All product lines have to be resilient to random errors.
For example, suppose three windows broke in our building.
To repair the windows, the building has to be protected with an Error Correcting Code (ECC), which needs overprovisioning to correct errors.
Raw NAND exhibits a Raw Bit Error Rate, the RBER.
It is caused by the cell voltage distributions, which change over the lifetime of the device with cycling and time.
DSP techniques reduce the RBER, enabling cost-effective Error Correction Coding (ECC); less RBER means less overprovisioning.
Advanced ECC techniques ensure a low error rate.
In high-end systems, a second protection level is employed to recover from ECC failures, further reducing the error rate.
For the Client and Enterprise SSD product lines, ECC alone might not provide sufficiently high reliability.
These systems also have to handle physical defects, such as a WL failure or even an entire die failure.
To handle such failures, extra overprovisioning has to be set aside for recovery.
This type of overprovisioning is called RAID, long known from the hard disk industry.
One example of RAID is to take several physical pages and combine them into external overprovisioning computed as a function of those physical pages.
Once we get a complete failure in one of these pages, we use the other good pages and the extra overprovisioning to recover the data stored on the failing page.
RAID in our context stands for Redundant Array of Independent Dies.
RAID presents tradeoffs between performance, overprovisioning and reliability.
For example, the RAID 0 scheme employs parallel writes and reads to all the dies, providing high performance but without the ability to handle any defects.
RAID 1 consists of data mirroring.
Any read request can be serviced by any drive in the set, improving performance (latency).
The array continues to operate as long as at least one drive is functioning.
50% overprovisioning is needed, making this scheme not cost-effective.
RAID 5 consists of block-level striping with distributed parity.
RAID 6 consists of block-level striping with double distributed parity.
Just to recall the flow: Voltage cell distribution causes RBER, which causes UBER.
For such distributions, most of the time the RBER is low, and high performance and reliability are achieved. But to maintain an extremely low UBER, the rare high-RBER cases have to be handled as well.
For this example distribution, to maintain UBER below 10^-18, the ECC needs to be able to correct up to this RBER.
3-Dimensional stacking and process scaling increase RBER variability, compromising reliability.
For the new distribution, higher RBER values need to be handled.
One approach is to over-engineer and add additional overprovisioning to the ECC (which would increase the die cost).
But can we do better than that?
Can we utilize the existing overprovisioning without adding cost to the system?
This amazing photograph of the Grand Canyon cliffs stands for the two independent protection levels: the ECC and the RAID, with dedicated overprovisioning for each.
Significantly higher ECC correction capability is achieved by combining the two protection levels and leveraging the RAID overprovisioning through joint RAID and ECC information exchange.
Once the entire overprovisioning is available to the ECC, lower UBER and DPPM are achieved on one hand, and higher endurance on the other (increasing endurance includes extending the specs for more cycling, data retention, read disturb, and cross temperature).
The best way to understand these concepts is to examine a case study.
In this example we have 31 information dies and one overprovisioning die, which is the XOR of all the other dies.
On the page level, each page is protected by an LDPC ECC with 10% overprovisioning.
A single XOR page for every 31 data pages adds ~3% overprovisioning.
The bottom line is that this scheme can correct only one failure per stripe.
"Error Correction Using Multiple Data Sources" discloses how the error correction capability of a data storage device is improved by combining multiple inputs of data having different reliabilities in order to generate a combined input to the decoder of the data storage device.
For example, the data storage device may replace unreliable bits of a first logical page with bits of a second logical page to generate the combined input.
If the multiple inputs are combined prior to attempting to decode any of the multiple inputs, reliability of the inputs is taken into account prior to initiating a decoding operation.
The techniques illustrated herein may therefore enable use of reliability information in connection with a "hard" decoder, such as a hard Bose-Chaudhuri-Hocquenghem (BCH) decoder, that does not use soft bit information.
Alternatively or in addition, techniques of the present disclosure may be used to increase error correction capability of a soft decoder.
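The bit-replacement idea described above can be sketched as (toy data, hypothetical helper name):

```python
def combine(hard_bits, xor_recovered_bits, unreliable):
    """Build the combined decoder input: keep the reliable bits of the page
    read from flash, and take the unreliable positions from the
    XOR-recovered page instead."""
    return [x if u else h
            for h, x, u in zip(hard_bits, xor_recovered_bits, unreliable)]

page = [0, 1, 1, 0, 1]                          # hard read from flash
xor_page = [0, 1, 0, 0, 0]                      # recovered via RAID XOR
unreliable = [False, False, True, False, True]  # flags from the soft-bit read
print(combine(page, xor_page, unreliable))      # [0, 1, 0, 0, 0]
```

The combined word is then handed to the hard decoder as an ordinary input, which is how reliability information reaches a BCH decoder that cannot consume soft bits directly.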
The XOR RAID is based on two 32 page stripes, one across even planes and one across odd planes.
This means that for every 31 data pages, one extra XOR parity page is stored.
Each data page is protected by its own LDPC codeword C_i with ~X% redundancy, represented by a bipartite graph G_i.
So we have the bits in red, and the parity checks in blue.
The graph defines the parity check constraints that each bit has to satisfy.
Each ECC is decoded iteratively by exchanging reliability information between the bits and their parity-check constraints.
The RAID introduces an extra XOR codeword C_31 (represented by G_31), incurring an additional ~3% redundancy:
C_31 = C_0 + … + C_30
Enhanced correction capability can be achieved by iteratively decoding the unified graph G_0, …, G_31 with ~(X+3)% redundancy.
It also has a longer code length, which provides better code properties.
The suboptimal approach sacrifices the extrinsic information exchange to save complexity and RAM.
Use the “hard” RAID XOR page as a “virtual” SB page that is fed into the LDPC with special LLR tables
Pros:
– Low cost and complexity
– No additional RAM
– Hard XOR
– Low latency
Cons:
– Compromised correction capability
So far we have talked about information exchange between RAID and LDPC.
This exchange is based on soft reliability information as LDPC is a soft decoder.
But what about systems employing hard ECC decoder?
Our next challenge is finding a way to exchange information between the RAID and a hard ECC decoder, even though hard decoders cannot directly use soft reliability information.
An example of a hard decoder is BCH. BCH decoders are very fast and have simple HW implementations.
However, BCH cannot utilize soft information and hence has lower correction capability, about 3 times lower than LDPC with the same overprovisioning.
This is also an obstacle if we want to exchange information between the RAID and the ECC.
Let's recall what soft information is:
Assume that we have 4 states with 4 cell voltage distributions. Read thresholds distinguish between the different states.
Cells with voltage far from the read threshold are reliable cells and have very low BER.
Cells near the read threshold are unreliable cells exhibiting high BER.
By sensing at ±Δ around the read threshold, cells are divided into two distinct populations: the reliable cells and the unreliable cells.
The Soft-Bit page identifies error prone cells.
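The ±Δ sensing described above can be sketched as follows (the threshold and Δ values are illustrative, and the below-threshold-reads-1 convention is an assumption of this toy model):

```python
# Toy soft-bit read: cells whose voltage falls within +/-delta of the read
# threshold are flagged as unreliable. All values are illustrative.
READ_THRESHOLD = 2.0   # volts, hypothetical
DELTA = 0.15

def soft_bit_read(voltages):
    """Return (hard_bits, soft_bits): the hard bit comes from a single sense
    at the threshold; soft bit = 1 marks an unreliable (near-threshold) cell."""
    hard = [1 if v < READ_THRESHOLD else 0 for v in voltages]
    soft = [1 if abs(v - READ_THRESHOLD) < DELTA else 0 for v in voltages]
    return hard, soft

hard, soft = soft_bit_read([1.2, 1.95, 2.08, 2.7])
print(hard)  # [1, 1, 0, 0]
print(soft)  # [0, 1, 1, 0]
```

The two middle cells sit close to the threshold and get flagged, which is exactly the error-prone population the Soft-Bit page identifies.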