2. Table of contents
• Abstract
• Introduction (3 related studies)
• Errors in flash & protective techniques
• Flash reliability from the operator's view
• Raw bit error rate (lab versus field study)
• Uncorrectable errors (relationship with other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
3. Abstract
• The goal of this paper is to provide an overview of what we have
learned about flash reliability in production and, where appropriate,
to contrast it with prior studies based on controlled experiments.
4. Introduction (3 related studies)
• Uncorrectable errors in flash-based SSDs in Facebook's servers
• A range of different errors and types of hardware failures in SSDs
in Google data centers
• Fail-stop failures of SSDs in Microsoft data centers
• Data centers versus lab & assumptions
5. Focus on what?
• The different types of errors experienced by flash drives and their
frequency in the field
• Raw bit error rates (RBERs) and their relationship with other errors
• Uncorrectable errors
• The field characteristics of different types of hardware failures
• Fail-stop events
• Comparison (different flash technologies)
• Comparison (differences between HDDs and flash drives)
6. 1
• Errors in flash & protective techniques
• Flash reliability from the operator's view
• Raw bit error rate (lab versus field study)
• Uncorrectable errors (relationship with other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
8. Other Sources of Data Loss or Corruption in SSDs
• Flash drives’ firmware
Data corruption due to firmware bugs (documented as serious for HDDs, but not for SSDs)
• Power loss
Power faults are a known cause of data corruption in SSDs
9. Device-Level Protection Against Errors
• The drives at Google mark a block bad after it experiences an
uncorrectable error or a failed program or erase operation
• Simply remove a bad chip from further usage and continue to operate
with reduced capacity
• Organize drive-internal chips in a RAID-like structure to tolerate chip failures
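The marking policy above can be summarized in a short sketch. A minimal sketch in Python, assuming a hypothetical bad-blocks-per-chip threshold (the actual firmware logic and thresholds of the drives in the study are not public):

```python
# Minimal sketch of the bad-block / bad-chip policy described above.
# The threshold and names are illustrative assumptions, not the
# actual firmware logic of the drives in the study.

BAD_BLOCKS_PER_CHIP_LIMIT = 50  # assumed threshold, for illustration


class Chip:
    def __init__(self, chip_id: int) -> None:
        self.chip_id = chip_id
        self.bad_blocks: set[int] = set()
        self.retired = False  # True once the chip is removed from use

    def report_block_error(self, block: int) -> None:
        """Mark a block bad after an uncorrectable error or a failed
        program/erase operation on it."""
        self.bad_blocks.add(block)
        # Retire the whole chip once it accumulates too many bad blocks;
        # the drive then continues operating with reduced capacity.
        if len(self.bad_blocks) > BAD_BLOCKS_PER_CHIP_LIMIT:
            self.retired = True
```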
11. 2
• Flash reliability from the operator's view
• Raw bit error rate (lab versus field study)
• Uncorrectable errors (relationship with other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
12. Drive repair and replacements
These counts include only replacements that were due to suspected hardware issues with the drive.
13. Uncorrectable Errors
• Uncorrectable errors are common: depending on the drive model, between 26% and more than 90% of drives experience at least one uncorrectable error
14. Differences between the 3 studies
• The Facebook study (nonstandard SSD design):
20%–35% of drives had uncorrectable errors
UBERs: 10^-9 ~ 10^-11
• The Microsoft study:
UBERs: 10^-11 ~ 10^-14
Counts of uncorrectable errors follow a highly variable, heavy-tailed distribution
16. Differences between the 3 studies
• As the drives in the Microsoft study are commodity drives,
it makes sense to compare the fail-stop rates with the annual failure rates one
would expect based on vendor specifications
• Fail-stop rates at Microsoft are lower than drive repair and replacement rates at Google
Possible reasons for the lower fail-stop rates in the Microsoft study:
Fail-stop events might not be the only events that trigger drive replacements
Differences in how the drives are used
The Google drive models have larger capacities
17. Comparison with HDDs
• Replacement rates for SSDs are significantly lower than for HDDs.
• SSDs have significantly higher rates of nontransparent errors
(e.g., uncorrectable errors)
• Only 3.4% of HDDs develop latent sector errors over a 32-month period
• For nearly all SSD models, more than 20% of the drives in the field experience an
uncorrectable error.
18. 3
• Raw bit error rate (lab versus field study)
• Uncorrectable errors (relationship with other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
19. What is RBER?
A common metric to quantify flash reliability is the RBER, which is
computed as the number of corrupted bits divided by the number of
bits read.
For drives in the first generation, each page reports only the number
of corrupted bits in the data chunk that had the most corrupted bits.
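Written out as a formula (a restatement of the definition above):

$$\mathrm{RBER} = \frac{\text{number of corrupted bits}}{\text{number of bits read}}$$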
20. What factors impact RBER in the field?
Wear out
Age
Workload
Lithography
MLC vs. SLC
Other factors
21. Wear out?
• Our goal is to study in detail how RBER grows with P/E cycles in the field.
• Both the median and the 95th-percentile RBER increase as a function of the
number of P/E cycles (a linear increase).
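A sketch of the kind of curve fitting alluded to above, on synthetic data (the study's raw measurements are not public); it compares a linear against an exponential fit of RBER versus P/E cycles:

```python
# Sketch of the curve fitting mentioned above, on synthetic data
# (the study's raw measurements are not public): compare a linear
# and an exponential model of RBER as a function of P/E cycles.
import numpy as np

rng = np.random.default_rng(0)
pe_cycles = np.linspace(100, 3000, 50)
# Assume roughly linear growth plus noise, as the field data suggests.
rber = (1e-6 + 4e-9 * pe_cycles) * (1 + 0.1 * rng.standard_normal(50))

# Linear model: rber ~ a * cycles + b
lin = np.polyfit(pe_cycles, rber, 1)
lin_sse = np.sum((rber - np.polyval(lin, pe_cycles)) ** 2)

# Exponential model: log(rber) ~ c * cycles + d
exp_coef = np.polyfit(pe_cycles, np.log(rber), 1)
exp_sse = np.sum((rber - np.exp(np.polyval(exp_coef, pe_cycles))) ** 2)

print(f"linear fit SSE:      {lin_sse:.3e}")
print(f"exponential fit SSE: {exp_sse:.3e}")
```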
22. Observations on wear out
• Vendors' quoted P/E cycle limits are conservative
• The limits cannot be measured using the data from this study
• The quoted P/E cycle figures are based on an older generation
23. Age?
• Older drives are more likely to have higher P/E cycle counts,
which are correlated with RBER.
• An older drive has had more retention time between P/E cycles than a
younger drive with the same number of P/E cycles.
Longer retention times might have led to retention errors that increase the RBER.
24. Workload?
• Some error modes are correlated with workload:
• Retention errors
• Read disturb errors
• Write errors
• When field data shows a correlation between RBER and the number of read
operations, it might indicate the presence of read disturb; this study finds
no such correlation.
25. Lithography?
• Models with a smaller lithography tend to have higher RBER
• Differences in lithography might also explain why the RBER for the
eMLC drives is several orders of magnitude higher than that of the
MLC drives.
26. MLC vs. SLC?
• MLC cells store multiple bits per cell, and as a result the voltage
window separating different values is smaller
• MLC drives also have a significantly lower P/E cycle limit
27. Other factors?
• The RBER for a particular drive model varies depending on the cluster
where the drive is deployed, even when controlling for P/E cycles.
• One possible reason could be different types of workloads running in
different clusters.
28. What factors impact RBER in the field?
Wear out (linear increase)
Age (older drives higher)
Workload (no influence)
Lithography (smaller lithography, higher RBER)
MLC vs. SLC (MLC higher)
Other factors (varies across clusters)
29. 4
• Uncorrectable errors (relationship with other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
30. UBER is not useful
• None of the measures provides any evidence for a correlation
between the number of uncorrectable errors and the number of read
operations or the amount of data read.
• UBER is not a meaningful metric to compare the reliability of different
drives (or drive types) in the field.
$$\mathrm{UBER} = \frac{\text{number of uncorrectable bit errors}}{\text{number of bits read}}$$
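A small Python sketch of why this normalization misleads (illustrative numbers, not from the paper): the same uncorrectable-error count yields very different UBERs once divided by bits read.

```python
# Illustrative sketch (hypothetical numbers, not from the paper):
# dividing the same UE count by bits read makes a lightly-read drive
# look orders of magnitude worse than a heavily-read one.

def uber(uncorrectable_bit_errors: int, bits_read: int) -> float:
    """Uncorrectable bit error rate: UE count normalized by bits read."""
    return uncorrectable_bit_errors / bits_read

lightly_read = uber(2, 10**12)  # same 2 errors, few bits read
heavily_read = uber(2, 10**15)  # same 2 errors, many bits read

print(f"lightly read drive: {lightly_read:.1e}")  # 2.0e-12
print(f"heavily read drive: {heavily_read:.1e}")  # 2.0e-15
```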
31. What factors impact UEs in the field?
Wear out
Infant mortality
Workload intensity
Workload patterns
Temperature
Lithography
MLC vs. SLC
32. Wear out?
Similarly to RBER, the probability of UEs grows continuously with
P/E cycles, and visual inspection as well as curve fitting suggest
a linear growth rate.
The models with the lowest RBER are not necessarily those with the
lowest incidence of UEs.
33. Infant mortality?
Drives go through an early detection period, during which bad blocks are
detected and swapped out (infant mortality), before wear-out starts.
34. Workload intensity?
• Neither the Google data nor the Facebook data exhibit a significant
correlation between read operations and the number of UEs
• No correlation
35. Workload patterns?
• As more DRAM buffer is used, the rate of uncorrectable errors increases
• This is because DRAM buffer usage is higher when data is sparsely allocated,
as more metadata is needed for the same total amount of data stored
36. Temperature?
Drive-internal mechanisms deployed by the SSD controller, which try to
protect the drive under higher temperatures by throttling workload
and power, might explain the stable or decreasing failure rates under
higher temperatures for some models.
37. Lithography?
• The effects of lithography are much less obvious in the case of
uncorrectable errors
• It is possible that fabrication process improvements can compensate
for the challenges of smaller feature sizes, and that some contributors
to UEs like firmware bugs do not depend on lithography at all.
38. SLC vs. MLC?
• This paper does not find that SLC drives are superior for those reliability
metrics that matter most in practice:
(neither the rate of repairs and replacements nor the rate of
nontransparent errors)
• While SLC drives might be more reliable at very high cycle counts,
they are not generally more reliable than MLC drives when comparing
the two drive types within the cycle limit of MLC drives.
39. What factors impact UEs in the field?
Wear out (linear growth)
Infant mortality (3 phases)
Workload intensity (no evidence)
Workload patterns (more DRAM buffer used, higher UE rate)
Temperature (drive-internal mechanisms compensate)
Lithography (no strong correlation)
MLC vs. SLC (no strong correlation)
40. Correlations between errors
• Same drive:
Both the Google and the Facebook studies provide clear evidence of
correlations between UEs on the same drive.
• Different drives:
There are also correlations between errors in different SSDs on the same
machine.
42. 5
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
43. Bad blocks
The number of bad blocks per drive is highly variable, with a long tail:
most drives with bad blocks develop only a small number of them
(medians are in the 2–4 range), but once a drive exceeds this number
it is likely to develop many more bad blocks.
(Figure: bad-block counts per drive; MLC models in blue, SLC models in red)
44. Factory bad blocks
• The vast majority (more than 97%) of drives are shipped with factory bad blocks.
• The drives above the 95th percentile of factory bad blocks experience a higher
rate of uncorrectable errors in the field.
45. Bad chips (2%–7%)
Table 1 shows that failed chips are not a rare occurrence.
We find that in two thirds of the cases,
a chip was marked bad because of the number of failed blocks it had experienced.
46. 6
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
47. Symptoms & predictors
• Data errors
(triggered by the cyclic-redundancy-check)
• Program or erase failures
(they are often symptomatic of block or chip failures)
• SATA downshift
• Reallocated sectors
(the number of sectors that the drive declared bad)
50. Accelerated life tests
• When aging and wear out become a factor, it is common to use
techniques for test acceleration.
• RBER in the field is markedly higher than what the accelerated tests
had indicated.
• Some error mechanisms seem to be difficult to trigger in
accelerated testing.
51. Why doesn't it work?
• One of the main difficulties is likely that workload characteristics in the
field can vary widely and are not always captured by standard tests.
• There are also workload-related reasons why error rates in the field can
turn out higher than under test.
(e.g., read disturb errors)
52. Projecting reliability based on RBER
• The reason RBER is still a widely used metric for flash reliability is that
it can be measured easily for raw flash chips and then used as an indicator
for the likelihood of experiencing UEs when using these chips inside an SSD.
53. Projecting reliability based on RBER (cont.)
• The analysis was also performed at an even finer time granularity.
• This paper also studied the relationship between RBER and a number of other
types of errors, but finds that correlation coefficients are even lower
for other error types.
• It concludes that per-drive RBER is a poor predictor of UEs or
other types of errors seen in the field.
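A sketch of such a per-drive correlation check on synthetic data (the study's per-drive records are not public), using a rank correlation between per-drive RBER and UE counts:

```python
# Sketch of a per-drive correlation analysis like the one described
# above, on synthetic data (the study's per-drive records are not
# public). A low rank correlation between per-drive RBER and UE
# counts is what "poor predictor" means here.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_drives = 1000
rber = rng.lognormal(mean=-18, sigma=1.0, size=n_drives)
# Assume UEs are driven mostly by factors other than RBER.
ue_counts = rng.poisson(lam=0.5, size=n_drives)

rho, p_value = spearmanr(rber, ue_counts)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```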
55. Future work
• Repeating prior analyses with an emphasis on controlling for confounding
factors would be useful.
• Study patterns of UEs in more detail to gather additional evidence
• Data on repair actions, or results from failure analyses performed
by the manufacturer on returned devices, could provide additional insights
• Analysis of bad blocks and bad chips
56. Summary
• RBER, the standard metric for drive reliability, is not a good predictor of
those failure modes that are the major concern in practice.
A common root cause of UEs in the field is defects or
firmware/controller bugs, rather than single-cell errors that accumulate.
57. Summary
• No correlation between UEs and number of reads, so normalizing
uncorrectable errors by the number of bits read will artificially inflate
the reported error rate for drives with low read count.
• Both RBER and the number of uncorrectable errors grow with P/E cycles.
(linear growth)
58. Summary
• SLC drives are not more reliable than the lower end MLC drives with
respect to uncorrectable errors for the P/E cycle ranges within the
MLC cycle limits.
• The effect of temperature is moderated by drive-internal protection
mechanisms that throttle drive operation under higher temperatures.
59. Summary
• While flash drives offer lower field replacement rates than HDDs, they have a
higher rate of problems that can impact the user, such as uncorrectable
errors.
• A drive with a large number of factory bad blocks has a higher chance of
developing more bad blocks in the field, as well as certain types of errors.