As high-end product requirements grow from generation to generation, and NAND cell distributions show higher variance due to process shrinking and 3D stacking, increasing reliability becomes a real need.
Learn how you can leverage overprovisioning to improve the reliability of flash.
What if we could achieve 50% more flash reliability for the same cost?
1. Employing ECCs via Overprovisioning to Improve Flash Reliability: A New, Cost Efficient Approach
Stella Achtenberg, Eran Sharon, Idan Alrod
Advanced Memory Solutions
Flash Memory Summit 2016 | Santa Clara, CA
3-8-2016
2. NAND Memory
Enterprise SSD
Client SSD
Mobile
3. Handling Random Errors
[Figure: error-handling pipeline — starting from Raw NAND with many errors, DSP techniques, Advanced ECC, and second-level error correction progressively reduce the error rate down to few errors.]
6. Handling Physical Defects
Overprovisioning and RAID trade off reliability against performance.
7. Storage Reliability Requirements
A metric for the occurrence of data errors per bits read:
UBER = number of data errors / number of bits read
Extremely low UBER requirements: < 10^-18
DPPM = Defective Parts per Million
[Table: UBER and DPPM targets for Enterprise SSD and Client SSD.]
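The definition above can be turned into a one-line check; the drive size and error count below are hypothetical numbers for illustration only:

```python
def uber(data_errors: int, bits_read: int) -> float:
    """Uncorrectable Bit Error Rate: data errors per bit read."""
    return data_errors / bits_read

# Hypothetical drive: 1 uncorrected error observed after reading 10 PB (8e16 bits).
rate = uber(1, 8 * 10**16)
print(rate)          # 1.25e-17
print(rate < 1e-18)  # False: still above the 10^-18 target
```

Even a single uncorrected error over petabytes of reads can violate the target, which is why the requirement is so demanding.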
8. Problem statement
3-Dimensional stacking and process scaling increase RBER variability.
[Figure: the RBER distribution widens (Low to High); meeting UBER < 10^-18 across the wider distribution leaves less reliability margin.]
9. Overprovisioning potential
ECC overprovisioning handles random errors; RAID overprovisioning handles physical defects.
Joint RAID & ECC:
• Lower UBER/DPPM
• Higher endurance
10. Case study: 32 Die XOR RAID
Data Page 0 (ECC 0)
Data Page 1 (ECC 1)
Data Page 30 (ECC 30)
…
Parity Page 31
11. Current solution
Decode each page using soft information (LDPC_i).
In case of failure, decode the entire RAID stripe by XORing the surviving codewords E_0, …, E_31:
– Single error → Recovery
– More than a single error → UECC (data loss)
[Diagram: a 1st failure is recovered via the XOR E_0 + … + E_31; a 2nd failure in the same stripe → UECC (data loss).]
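The stripe-level recovery described above can be sketched in a few lines (toy page size and hypothetical helper names; a real implementation operates on NAND pages and LDPC codewords):

```python
import secrets
from functools import reduce

PAGE_BYTES = 16  # toy page size; real flash pages are several KB

def xor_pages(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length pages."""
    return bytes(x ^ y for x, y in zip(a, b))

# 31 data pages plus one XOR parity page form the stripe.
data = [secrets.token_bytes(PAGE_BYTES) for _ in range(31)]
parity = reduce(xor_pages, data)

# Suppose LDPC decoding of page 7 fails (the 1st failure in the stripe):
lost = 7
survivors = [p for i, p in enumerate(data) if i != lost]
recovered = reduce(xor_pages, survivors + [parity])
assert recovered == data[lost]  # single failure: fully recovered

# A 2nd failure in the same stripe cannot be recovered this way -> UECC.
```

The XOR of the 31 surviving pages with the parity page cancels everything except the lost page, which is exactly the single-failure recovery guarantee of this scheme.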
12. New methodology
Optimal information exchange between RAID & LDPC:
– Extrinsic "soft" LDPC output into the RAID
– Updated "soft" RAID output back to the LDPC
– Iterate until convergence or timeout
Soft XOR (SXOR) update:
T_i = φ^-1( Σ_{j≠i} φ(E_j) ), where φ(x) = {sign(x), −log tanh(|x|/2)}
[Diagram: each per-page decoder LDPC_i exchanges messages (P_i, Qin_i, Qout_i, T_i, E_i) with the soft-XOR node over E_0, …, E_31.]
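A minimal sketch of the SXOR update, using the tanh form that is equivalent to the φ transform above (the LLR values are illustrative):

```python
import math

def soft_xor_extrinsic(llrs, i):
    """Check-node (soft XOR) update: T_i = phi^-1(sum_{j != i} phi(E_j)),
    computed via the equivalent tanh rule 2*atanh(prod_{j != i} tanh(E_j / 2)).
    Multiplying the signed tanh values handles the sign part of phi implicitly."""
    prod = 1.0
    for j, e in enumerate(llrs):
        if j != i:
            prod *= math.tanh(e / 2.0)
    return 2.0 * math.atanh(prod)

# Toy stripe of 4 extrinsic LLR outputs E_j from the per-page LDPC decoders:
E = [4.0, 3.5, -5.0, 0.2]
T0 = soft_xor_extrinsic(E, 0)
# The sign of T0 is the XOR of the hard decisions of E_1, E_2, E_3 (negative here),
# and its magnitude is dominated by the least reliable input (|E_3| = 0.2).
print(round(T0, 3))
```

This is the same update used at the check nodes of an LDPC decoder, so the RAID parity page behaves like one extra, very long parity check shared across all 32 codewords.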
13. Previous Work
"Error Correction Using Multiple Data Sources" – US patent application by Sharon et al. (SanDisk, 2014)
"Soft Decision Decoding of RAID Stripe for Higher Endurance of Flash Memory Based Solid State Drives" – Ravi Motwani and Chong Ong (Intel, 2015)
17. Simple joint RAID & ECC
Simple variant using standard hardware:
– XOR page as a "virtual" Soft Bit page
– Dedicated LLR table emulates LLR summation
[Diagram: NAND supplies the hard bit and SBs to the standard LDPC decoder; the standard XOR engine produces the virtual Soft Bit page, which the LLR table maps into the decoder, yielding the corrected hard bit.]
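A toy sketch of the virtual-soft-bit idea (the table entries are illustrative, not the actual LLR values used in any product): the XOR-recovered page acts as a second opinion on each bit, and a small table maps each (hard bit, agreement) pair to an input LLR, emulating LLR summation:

```python
# Hypothetical LLR table; magnitudes are illustrative only.
# Key: (hard_bit, agrees_with_virtual_soft_bit) -> LLR (positive favors bit 0).
LLR_TABLE = {
    (0, True):  +6.0,   # both sources agree the bit is 0: high confidence
    (0, False): +1.0,   # XOR-recovered bit disagrees: low confidence
    (1, True):  -6.0,
    (1, False): -1.0,
}

def llrs_from_virtual_soft_bit(hard_bits, xor_recovered_bits):
    """Derive decoder-input LLRs from the hard page and the 'virtual'
    soft-bit page recovered via the standard XOR engine."""
    return [LLR_TABLE[(h, h == x)]
            for h, x in zip(hard_bits, xor_recovered_bits)]

print(llrs_from_virtual_soft_bit([0, 1, 1, 0], [0, 1, 0, 1]))
# -> [6.0, -6.0, -1.0, 1.0]
```

Because only the table is new, the standard LDPC and XOR engines are reused unchanged, which is what keeps this variant low-cost.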
18. UBER improvement
[Figure: UBER vs. RBER (Low to High), showing ×1.5 and ×2 correction-capability gains.]
19. Existing versus New
Independent LDPC & RAID (existing):
– Per-page LDPC_i; a 1st failure is recovered via the XOR E_0 + … + E_31; a 2nd failure → data loss
– Single failure recovery
Joint LDPC & RAID (new):
– Standard HW: standard LDPC, standard XOR, and an LLR table; the XOR page is used as a virtual Soft Bit (SB) page
– Correcting up to 32 failures
– Substantially reduces UBER
21. BCH (Bose, Chaudhuri, Hocquenghem)
– Simple hardware
– Constant latency
– Cannot use soft information
– Lower correction capability
[Figure: ECC failure probability vs. RBER — LDPC using soft information offers > ×3 the correction capability of BCH.]
22. Generating Soft Information
Soft-Bit read (±Δ around the read thresholds) divides the cell population into two categories:
– A population of reliable cells, exhibiting low BER
– A population of unreliable cells, exhibiting high BER
[Figure: cell voltage distributions with "less reliable" regions around each read threshold.]
23. Joint Hard Decoding and RAID
– Codeword read from Flash: error rate RBER → BCH fails decoding
– Codeword recovered from XOR: BER_XOR = ½·(1 − (1 − 2·RBER)^k) ≈ k·RBER → BCH fails decoding
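The BER_XOR formula holds because an XOR-recovered bit is wrong exactly when an odd number of the k input bits are wrong; a quick numeric check with a hypothetical RBER:

```python
def ber_xor(rber: float, k: int) -> float:
    """Bit error rate of a page recovered by XORing k pages, each at RBER:
    an XOR bit is wrong iff an odd number of the k input bits are wrong."""
    return 0.5 * (1.0 - (1.0 - 2.0 * rber) ** k)

rber, k = 1e-3, 31          # hypothetical RBER; 31 surviving pages in the stripe
exact = ber_xor(rber, k)
approx = k * rber
print(exact)    # ~0.0301 (exact odd-parity probability)
print(approx)   # k*RBER approximation, within a few percent here
```

For small RBER the linear approximation k·RBER is accurate, which is why the XOR-recovered codeword has a moderate but not catastrophic error rate.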
24. Joint Hard Decoding and RAID
– Codeword read from Flash: RBER = α·p_high + (1 − α)·p_low, mixing a fraction α of high-RBER cells (p_high) with low-RBER cells (p_low) → BCH fails decoding
– Codeword recovered from XOR: BER_XOR = ½·(1 − (1 − 2·RBER)^k) ≈ k·RBER → BCH fails decoding
– Read the Soft Bit page, indicating the unreliable cells
25. Joint Hard Decoding and RAID
Combined codeword: take the unreliable positions from the XOR-recovered CW (moderate BER ≈ k·RBER) and the rest from the original low-RBER CW:
RBER_combined = α·BER_XOR + (1 − α)·p_low < RBER → BCH succeeds
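Plugging in hypothetical numbers (α, p_high, p_low below are illustrative) shows why the combined codeword lands back inside BCH's correction range:

```python
def ber_xor(rber, k):
    """Error rate of a page recovered by XORing k pages at the given RBER."""
    return 0.5 * (1 - (1 - 2 * rber) ** k)

# Hypothetical mixture: alpha = 2% unreliable cells at p_high, the rest at p_low.
alpha, p_high, p_low, k = 0.02, 0.20, 1e-4, 31

rber = alpha * p_high + (1 - alpha) * p_low            # original codeword
rber_combined = alpha * ber_xor(rber, k) + (1 - alpha) * p_low

print(rber)            # ~4.10e-3: dominated by the unreliable cells
print(rber_combined)   # ~2.35e-3: unreliable positions replaced by XOR bits
```

The unreliable positions go from p_high (20% here) down to BER_XOR (~11% here), so the weighted mixture drops below the original RBER even though the XOR page is itself far from clean.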
26. UBER improvement
[Figure: UBER vs. RBER, showing a ×1.5 correction-capability gain.]
27. Summary
Storage systems require very high reliability.
3-Dimensional stacking and process scaling increase RBER variability, compromising reliability.
Joint RAID & ECC enhances reliability without adding cost:
– Soft decoder: low-complexity joint RAID & LDPC
– Hard decoder: joint RAID & BCH
28. Summary
[Figure: ECC failure probability vs. raw bit error rate — the new joint BCH & RAID scheme improves BCH correction capability by ×1.5, and the new joint LDPC & RAID scheme improves LDPC by ×1.5.]
High-end products are designed to handle both random errors and gross physical defects by employing two independent protection levels, the ECC and the RAID, with dedicated overprovisioning for each.
My focus today is how to get more with less: how to combine the two existing protection levels and leverage this combination with cost-effective algorithms.
One approach is based on performing iterative soft-information exchange between the RAID and a soft-decoding ECC such as LDPC.
A second approach enables utilization of soft information and the RAID overprovisioning for a hard-decoding ECC such as BCH.
Both the soft and hard joint decoding schemes provide 50% higher resilience to random errors with the same overprovisioning.
A substantial portion of the NVM storage market is NAND storage.
Let's use this 3D building as an analogy for our flash memory: the building is a die, each floor is a wordline (WL), and each window is a cell.
NAND storage supplies solutions to many product lines, such as Mobile, Client SSD, high-end SSD, and Enterprise.
All product lines have to be resilient to random errors.
For example, suppose three windows broke in our building.
To repair the windows, the building has to be protected with an Error Correcting Code (ECC), which needs overprovisioning to correct errors.
Raw NAND exhibits a Raw Bit Error Rate, the RBER.
It is caused by the cell voltage distributions, which change over the lifetime of the device with cycling and time.
DSP techniques reduce the RBER, enabling cost-effective Error Correction Coding (ECC); less RBER means less overprovisioning.
Advanced ECC techniques ensure a low error rate.
In high-end systems, a second protection level is employed to recover from ECC failures, further reducing the error rate.
For the Client and Enterprise SSD product lines, ECC alone might not provide sufficiently high reliability.
These systems also have to handle physical defects, such as a WL failure or even an entire die failure.
To handle such failures, extra overprovisioning has to be set aside for recovery.
This type of overprovisioning is called RAID, long known from the hard disk industry.
One example of RAID is to take several physical pages and combine them into external overprovisioning computed as a function of those physical pages.
Once we get a complete failure in one of these pages, we use the other good pages and the extra overprovisioning to recover the data stored on the failing page.
RAID in our context stands for Redundant Array of Independent Dies.
RAID presents tradeoffs between performance, overprovisioning and reliability.
For example, the RAID 0 scheme employs parallel writes and reads to all the dies, providing high performance but without the ability to handle any defects.
RAID 1 consists of data mirroring.
Any read request can be serviced by any drive in the set, improving performance (latency).
The array continues to operate as long as at least one drive is functioning.
50% overprovisioning is needed, making this scheme not cost-effective.
RAID 5 consists of block-level striping with distributed parity.
RAID 6 consists of block-level striping with double distributed parity.
Just to recall the flow: Voltage cell distribution causes RBER, which causes UBER.
For such distributions, most of the time the RBER is low, and high performance and reliability are achieved. But to maintain an extremely low UBER, the rare high-RBER cases have to be handled as well.
For this example distribution, to maintain UBER below 10^-18, the ECC needs to be able to correct up to this RBER.
3-Dimensional stacking and process scaling increase RBER variability, compromising reliability.
For the new distribution, higher RBER values need to be handled.
One approach is to over-engineer and add additional overprovisioning to the ECC (which would increase the die cost).
But can we do better than that?
Can we utilize the existing overprovisioning without adding cost to the system?
This amazing photograph of the Grand Canyon cliffs stands for the two independent protection levels: the ECC and the RAID, with dedicated overprovisioning for each.
Significantly higher ECC correction capability is achieved by combining the two protection levels and leveraging the RAID overprovisioning through joint RAID and ECC information exchange.
Once the entire overprovisioning is available to the ECC, lower UBER and DPPM are achieved on one hand, and higher endurance on the other (increasing endurance includes extending the specs for more cycling, data retention, read disturb, and cross temperature).
The best way to understand these concepts is to examine a case study.
In this example we have 31 information dies and one overprovisioning die, which is the XOR of all the other dies.
On the page level, each page is protected by an LDPC ECC with 10% overprovisioning.
A single XOR page for every 31 data pages adds ~3% overprovisioning.
The bottom line is that this scheme can correct only one failure per stripe.
"Error Correction Using Multiple Data Sources" discloses how the error correction capability of a data storage device is improved by combining multiple inputs of data having different reliabilities in order to generate a combined input to the decoder of the data storage device.
For example, the data storage device may replace unreliable bits of a first logical page with bits of a second logical page to generate the combined input.
If the multiple inputs are combined prior to attempting to decode any of the multiple inputs, reliability of the inputs is taken into account prior to initiating a decoding operation.
The techniques illustrated herein may therefore enable use of reliability information in connection with a "hard" decoder, such as a hard Bose-Chaudhuri-Hocquenghem (BCH) decoder, that does not use soft bit information.
Alternatively or in addition, techniques of the present disclosure may be used to increase error correction capability of a soft decoder.
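The bit-replacement idea described above can be sketched as (toy data, hypothetical helper name):

```python
def combine(hard_bits, xor_recovered_bits, unreliable):
    """Build the combined decoder input: keep the reliable bits of the page
    read from flash, and take the unreliable positions from the
    XOR-recovered page instead."""
    return [x if u else h
            for h, x, u in zip(hard_bits, xor_recovered_bits, unreliable)]

page = [0, 1, 1, 0, 1]                          # hard read from flash
xor_page = [0, 1, 0, 0, 0]                      # recovered via RAID XOR
unreliable = [False, False, True, False, True]  # flags from the soft-bit read
print(combine(page, xor_page, unreliable))      # [0, 1, 0, 0, 0]
```

The combined word is then handed to the hard decoder as an ordinary input, which is how reliability information reaches a BCH decoder that cannot consume soft bits directly.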
The XOR RAID is based on two 32 page stripes, one across even planes and one across odd planes.
This means that for every 31 data pages, one extra XOR parity page is stored.
Each data page is protected by its own LDPC codeword C_i with ~X% redundancy, represented by a bipartite graph G_i.
So we have the bits in red, and the parity checks in blue.
The graph defines the parity check constraints that each bit has to satisfy.
Each ECC is decoded iteratively by exchanging reliability information between the bits and their parity-check constraints.
The RAID introduces an extra XOR codeword C_31 (represented by G_31), incurring an additional ~3% redundancy:
C_31 = C_0 + … + C_30
Enhanced correction capability can be achieved by iteratively decoding the unified graph G_0, …, G_31 with ~(X+3)% redundancy.
It also has a longer code length, which provides better code properties.
The suboptimal approach sacrifices the extrinsic information exchange to save complexity and RAM.
Use the “hard” RAID XOR page as a “virtual” SB page that is fed into the LDPC with special LLR tables
Pros:
– Low cost and complexity
– No additional RAM
– Hard XOR
– Low latency
Cons:
– Compromised correction capability
So far we have talked about information exchange between RAID and LDPC.
This exchange is based on soft reliability information as LDPC is a soft decoder.
But what about systems employing hard ECC decoder?
Our next challenge is finding a way to exchange information between the RAID and a hard ECC decoder, even though hard decoders cannot directly use soft reliability information.
An example of a hard decoder is BCH. BCH decoders are very fast and have simple HW implementations.
However, BCH cannot utilize soft information and hence has lower correction capability, about 3 times lower than LDPC with the same overprovisioning.
This is also an obstacle if we want to exchange information between the RAID and the ECC.
Let's recall what soft information is:
Assume that we have 4 states with 4 cell voltage distributions. Read thresholds distinguish between the different states.
Cells with voltage far from the read threshold are reliable cells and have very low BER.
Cells near the read threshold are unreliable cells exhibiting high BER.
By sensing at ±Δ around the read threshold, cells are divided into two distinct populations: the reliable cells and the unreliable cells.
The Soft-Bit page identifies error prone cells.
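The ±Δ sensing described above can be sketched as follows (the threshold and Δ values are illustrative, and the below-threshold-reads-1 convention is an assumption of this toy model):

```python
# Toy soft-bit read: cells whose voltage falls within +/-delta of the read
# threshold are flagged as unreliable. All values are illustrative.
READ_THRESHOLD = 2.0   # volts, hypothetical
DELTA = 0.15

def soft_bit_read(voltages):
    """Return (hard_bits, soft_bits): the hard bit comes from a single sense
    at the threshold; soft bit = 1 marks an unreliable (near-threshold) cell."""
    hard = [1 if v < READ_THRESHOLD else 0 for v in voltages]
    soft = [1 if abs(v - READ_THRESHOLD) < DELTA else 0 for v in voltages]
    return hard, soft

hard, soft = soft_bit_read([1.2, 1.95, 2.08, 2.7])
print(hard)  # [1, 1, 0, 0]
print(soft)  # [0, 1, 1, 0]
```

The two middle cells sit close to the threshold and get flagged, which is exactly the error-prone population the Soft-Bit page identifies.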