Reliability of NAND-Based SSDs
What Field Studies Tell Us
Speaker: Po-Chuan, Chen
Table of contents
• Abstract
• Introduction (3 related studies)
• Errors in flash & protective technique
• Flash reliability with operator’s view
• Raw bit error rate (Lab versus Field study)
• Uncorrectable errors (relations between other characteristics)
• Hardware failure (bad blocks & bad chips)
• Forecasting field reliability
• Future work & Summary
Abstract
• The goal of this paper is to provide an overview of what we have
learned about flash reliability in production, and where appropriate
contrasting it with prior studies performing controlled experiments.
Introduction (3 related studies)
 Examines uncorrectable errors in flash-based SSDs in Facebook’s servers.
 Examines a range of different errors and types of hardware failures in SSDs
in Google data centers.
 Examines fail-stop failures of SSDs in Microsoft data centers.
 Data centers versus lab & assumptions
Focus on what ?
• The different types of errors experienced by flash drives and their
frequency in the field
• Raw bit error rates (RBERs) and their relationship with other errors
• Uncorrectable errors
• The field characteristics of different types of hardware failures
• Fail-stop events
• Comparison (different flash technologies)
• Comparison (differences between HDDs and flash drives)
1
• Errors in flash & protective technique
• Flash reliability with operator’s view
• Raw bit error rate (Lab versus Field study)
• Uncorrectable errors (relations between other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
Flash-Specific Sources of Errors
• Retention errors
• Read disturb errors
• Write errors
• Wear out
Other Sources of Data Loss or Corruption in SSDs
• Flash drives’ firmware
Data corruption (serious in HDDs, but not in SSDs)
• Power loss
The reason that SSDs have data corruption
Device-Level Protection Against Errors
• The drives at Google mark a block bad after it experiences an
uncorrectable error or a failed program or erase operation
• Simply remove a bad chip from further usage and continue to operate
with reduced capacity
• Drive-internal chips in a RAID-like structure
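The block-retirement and chip-retirement behaviour listed above can be sketched as a small model. This is only an illustration: the class names, the event strings, and the `CHIP_BAD_BLOCK_LIMIT` threshold are assumptions made for the example, not details reported by any of the studies.

```python
from dataclasses import dataclass, field


@dataclass
class Chip:
    """One flash chip; tracks which of its blocks have been marked bad."""
    num_blocks: int
    bad_blocks: set = field(default_factory=set)
    retired: bool = False


class Drive:
    # Hypothetical threshold: the slides do not say how many bad blocks retire a chip.
    CHIP_BAD_BLOCK_LIMIT = 3

    def __init__(self, num_chips: int, blocks_per_chip: int):
        self.chips = [Chip(blocks_per_chip) for _ in range(num_chips)]

    def report_event(self, chip_id: int, block_id: int, event: str) -> None:
        """Mark a block bad after an uncorrectable error or a failed program/erase."""
        if event not in {"uncorrectable_error", "program_fail", "erase_fail"}:
            return  # other events (e.g. correctable errors) retire nothing here
        chip = self.chips[chip_id]
        chip.bad_blocks.add(block_id)
        if len(chip.bad_blocks) >= self.CHIP_BAD_BLOCK_LIMIT:
            chip.retired = True  # the drive keeps operating with reduced capacity

    def usable_blocks(self) -> int:
        """Count blocks that are neither bad nor on a retired chip."""
        return sum(c.num_blocks - len(c.bad_blocks)
                   for c in self.chips if not c.retired)
```

With two chips of ten blocks each, three failing blocks on chip 0 retire that chip and leave ten usable blocks on the other one.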
Uncorrectable errors in flash-based SSD
Different errors and types of hardware failures in SSDs
Fail-stop failures of SSDs
2
• Flash reliability with operator’s view
• Raw bit error rate (Lab versus Field study)
• Uncorrectable errors (relations between other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
Drive repair and replacements
These replacements include only replacements
that were due to suspected hardware issues
with the drive
Uncorrectable Errors
• Uncorrectable errors are common:
depending on the drive model,
from 26% to more than 90% of
drives experience at least one
uncorrectable error
Difference between 3 studies
• The Facebook study (nonstandard SSD design):
20% ~ 35% of drives have uncorrectable errors
UBERs: 10^-9 ~ 10^-11
• The Microsoft study:
UBERs: 10^-11 ~ 10^-14
Counts of uncorrectable errors have a
highly variable, heavy-tailed distribution
Fail-Stop Failures
• Nearly 80% of the fail-stopped SSDs were replaced
Difference between 3 studies
• As the drives in the Microsoft study are commodity drives,
it makes sense to compare the fail-stop rates with the annual failure rates one
would expect based on vendor specifications
• Fail-stop rates at Microsoft are lower than the rates of drive repairs and replacements at Google
Possible reasons for the lower fail-stop rates in the Microsoft study:
 Fail-stop events might not be the only events that trigger drive replacements at Microsoft
 The usage of the drives
 The Google drive models have larger capacities
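For the comparison with vendor specifications mentioned above, a datasheet MTBF has to be turned into an expected annual failure rate. A minimal sketch, assuming a constant failure rate (exponential lifetime model); the 2,000,000-hour MTBF is a hypothetical datasheet value, not a figure from the studies.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual failure rate implied by an MTBF, under an exponential lifetime model."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical datasheet value, chosen only for illustration.
    print(f"AFR: {afr_from_mtbf(2_000_000):.2%}")
```

An observed fail-stop rate can then be set against this number, keeping in mind that the constant-failure-rate assumption ignores infant mortality and wear out.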
Comparison with HDDs
• Replacement rates for SSDs are significantly lower than for HDDs.
• SSDs have significantly higher rates of nontransparent error.
(Ex : uncorrectable error)
• Only 3.4% of HDDs develop latent sector errors over a 32-month period
• For nearly all SSD models, more than 20% of the drives in the field experience an
uncorrectable error.
3
• Raw bit error rate (Lab versus Field study)
• Uncorrectable errors (relations between other characteristics)
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
What is RBER ?
A common metric to quantify flash reliability is the RBER, which is
computed as the number of corrupted bits divided by the number of
bits read.
Drives in the first generation:
for each page they only report the number of
corrupted bits in the data chunk that had the most
corrupted bits
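The RBER definition, and the coarser per-page reporting of the first-generation drives, can be illustrated as follows. The function names and the per-chunk error counts are made up for the example.

```python
def rber(corrupted_bits: int, bits_read: int) -> float:
    """Raw bit error rate: corrupted bits divided by bits read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return corrupted_bits / bits_read


def first_generation_report(chunk_error_counts: list) -> int:
    """First-generation drives report only the worst chunk of each page."""
    return max(chunk_error_counts)


if __name__ == "__main__":
    # Hypothetical page split into four data chunks.
    chunk_errors = [0, 3, 1, 0]
    print("true corrupted bits:", sum(chunk_errors))
    print("reported by a first-generation drive:", first_generation_report(chunk_errors))
```

Because only the worst chunk is reported, a count derived from such drives is a lower bound on the number of corrupted bits in the page.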
What factors impact RBER in the field ?
 Wear out
 Age
 Workload
 Lithography
 MLC vs. SLC
 Other factors
Wear out ?
• Our goal is to study in detail how RBER grows with P/E cycles in the field.
• Both median and 95th percentile RBER increase as a function of the
number of P/E cycles. (Linear increase)
4 X
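One way to check a claim of linear growth is an ordinary least-squares fit of RBER against P/E cycles. The sketch below uses synthetic data points invented for the example; it is not the fitting procedure or the data of the study.

```python
def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic (hypothetical) median RBER values at increasing P/E cycle counts.
    pe_cycles = [0, 1000, 2000, 3000]
    median_rber = [1e-8, 2e-8, 3e-8, 4e-8]
    slope, intercept = fit_line(pe_cycles, median_rber)
    print(f"RBER grows by about {slope:.2e} per P/E cycle")
```

A linear fit that tracks the data well up to and beyond the vendor's P/E cycle limit is what distinguishes this field observation from the faster growth one might otherwise expect near end of life.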
Observation in wear out
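 The P/E cycle limits given by vendors are conservative
 Cannot be measured using the data from this experiment
 The P/E cycle figures given are from an older version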
Age ?
• Older drives are more likely to
have higher P/E cycles counts,
which are correlated with RBER.
• An older drive has had longer retention times than a younger drive
with the same number of P/E cycles.
Longer retention times might have led to retention errors that increase the RBER.
Workload ?
• Errors are correlated with workload :
• Retention error
• Read disturb error
• Write error
• Field data shows no correlation between RBER and the number
of read operations
(a correlation might have indicated the presence of read disturb)
Lithography ?
• Models with a smaller lithography tend to have higher RBER
• Differences in lithography might also explain why the RBER for the
eMLC drives is several orders of magnitude higher than that of the
MLC drives.
MLC vs. SLC ?
• MLC cells store multiple bits per cell and as a result the voltage
window separating different values is smaller
• MLC drives also have a significantly lower P/E cycle limit
Other factors ?
• The RBER for a particular drive model varies depending on the cluster
where the drive is deployed, even when controlling for P/E cycles.
• One possible reason could be different types of workloads running in
different clusters.
What factors impact RBER in the field ?
 Wear out (Linear increase)
 Age (Old drives higher)
 Workload (No influence)
 Lithography (smaller higher)
 MLC vs. SLC (MLC higher)
 Other factors
(different clusters)
4
• Uncorrectable errors (relations between other characteristics)
• Fail-stop Failures (symptoms & predictors)
• Hardware failure (bad blocks & bad chips)
• Forecasting field reliability
• Future work & Summary
UBER is not useful
• None of the measures provides any evidence for a correlation
between the number of uncorrectable errors and the number of read
operations or the amount of data read.
• UBER is not a meaningful metric to compare the reliability of different
drives (or drive types) in the field.
$$\mathrm{UBER} = \frac{\text{number of uncorrectable bit errors}}{\text{number of bits read}}$$
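The argument against UBER can be made concrete with a small calculation. It assumes the conventional definition of UBER (uncorrectable bit errors divided by bits read), and the drive figures are hypothetical.

```python
def uber(uncorrectable_bit_errors: int, bits_read: int) -> float:
    """Uncorrectable bit error rate, normalized by the number of bits read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return uncorrectable_bit_errors / bits_read


if __name__ == "__main__":
    # Two hypothetical drives with the same number of uncorrectable errors
    # but very different read volumes.
    lightly_read = uber(10, 10**12)
    heavily_read = uber(10, 10**15)
    print(f"lightly read drive: {lightly_read:.1e}")
    print(f"heavily read drive: {heavily_read:.1e}")
    print(f"ratio: {lightly_read / heavily_read:.0f}x")
```

If uncorrectable errors do not scale with reads, the lightly read drive looks far worse on UBER even though both drives experienced the same number of errors.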
What factors impact UEs in the field ?
 Wear out
 Infant Mortality
 Workload intensity
 Workload patterns
 Temperature
 Lithography
 MLC vs. SLC
Wear out ?
Similarly to RBER, the probability of UEs grows
continuously with P/E cycles and visual inspection as
well as curve fitting suggest a linear growth rate.
The models with the lowest
RBER are not necessarily
those with the lowest
incidence of UEs
Infant Mortality ?
 Infant mortality
 Mark bad blocks and swap them out
 Wear out starts
Early detection period
Workload intensity ?
• Neither the Google data nor the Facebook data exhibit a significant
correlation between read operations and the number of UEs
• No correlation
Workload patterns ?
• As more DRAM buffer is used, the rate of uncorrectable errors increases
• This is because DRAM buffer usage is higher when data is sparsely allocated,
as more metadata is needed for the same total amount of data stored
Temperature ?
Drive-internal mechanisms deployed by the SSD controller, which try to
protect the drive under higher temperatures by throttling workload
and power, might explain the stable or decreasing failure rates under
higher temperatures for some models.
Lithography ?
• The effects of lithography are much less obvious in the case of
uncorrectable errors
• It is possible that fabrication process improvements can compensate
for the challenges of smaller feature sizes, and that some contributors
to UEs like firmware bugs do not depend on lithography at all.
SLC vs. MLC ?
• This paper does not find that SLC drives are superior for those reliability
metrics that matter most in practice:
(Neither the rate of repairs and replacements nor the rate of
nontransparent errors)
• While SLC drives might be more reliable at very high cycle counts,
they are not generally more reliable than MLC drives when comparing
the two drive types within the cycle limit of MLC drives.
What factors impact UEs in the field ?
 Wear out (linear growth)
 Infant Mortality (3 steps)
 Workload intensity (no evidence)
 Workload patterns (more DRAM buffers used, get higher UEs)
 Temperature (Drive-internal mechanisms affect)
 Lithography (no strong correlation)
 MLC vs. SLC (no strong correlation)
Correlations within and between drives
• Same drives :
Both the Google and the Facebook studies provide clear evidence of
correlations between UEs on the same drive.
• Different drives :
There are correlations between errors in different SSDs on the same
machine.
Between other types of errors
5
• Hardware failure (bad blocks & bad chips)
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
Bad Blocks
The number of bad blocks per drive is highly variable with a long tail:
most drives with bad blocks develop only
a small number of them (medians are in the 2–4 range),
but once a drive exceeds this number
it is likely to develop many more bad blocks
MLC : blue
SLC : red
Factory bad blocks
• The vast majority (more than 97%) of drives are shipped with factory bad blocks.
• The drives above the 95th percentile of factory bad blocks experience a higher
rate of uncorrectable errors in the field.
Bad chips (2% ~ 7 %)
Table 1 shows that failed chips are not a rare occurrence.
We find that in two thirds of the cases,
a chip was marked bad because of the number of failed blocks it had experienced.
6
• Fail-stop Failures (symptoms & predictors)
• Forecasting field reliability
• Future work & Summary
Symptoms & predictors
• Data errors
(triggered by the cyclic-redundancy-check)
• Program or erase failures
(they are often symptomatic of block or chips failures)
• SATA downshift
• Reallocated sectors
(the number of sectors that the drive declared bad)
Fail-stop Failures
38% fewer
7
• Forecasting field reliability
• Future work & Summary
Accelerated life tests
• When aging and wear out become a factor, it is common to use
techniques for test acceleration.
• RBER in the field is markedly higher than what the accelerated tests
had indicated.
• This suggests that some error mechanisms are difficult to trigger in
accelerated testing.
Why it doesn’t work ?
• One of the main difficulties is likely that workload characteristics in the
field can vary widely and are not always captured by standard tests.
• There are also workload-related reasons why error rates in the field can
turn out higher than under test.
Ex : Read disturb errors
Projecting reliability based on RBER
• The reason RBER is still a widely used metric for flash reliability is that it can be
measured easily for raw flash chips and then be used as an indicator for
the likelihood of experiencing UEs when using these chips inside an SSD.
• Performing an analysis at an even finer time granularity
• Also studied the relationship between RBER and a number of other
types of errors, but find that correlation coefficients are even lower
for other error types.
• This paper concludes that per-drive RBER is a poor predictor of UEs or
other types of errors seen in the field.
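To test whether per-drive RBER predicts UEs, one can compute a rank correlation between the two across drives. The slides do not say which coefficient was used; Spearman's rank correlation is shown here as one common choice, and the function names are assumptions for the example.

```python
import math


def ranks(values: list) -> list:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs: list, ys: list) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Called as `spearman(per_drive_rber, per_drive_ue_count)`, a coefficient close to zero would support the conclusion that RBER carries little information about UEs.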
8
• Future work & Summary
Future work
• Repeating prior analyses with an emphasis on controlling for confounding
factors would be useful.
• Study patterns of UEs in more detail to gather additional evidence
• Data on repair actions or results from failure analyses that might be performed
by the manufacturer on returned devices could provide additional insights
• Analysis of bad blocks and bad chips
Summary
• RBER, the standard metric for drive reliability, is not a good predictor of
those failure modes that are the major concern in practice.
• A common root cause of UEs in the field is defects or
firmware/controller bugs, rather than single cell errors that accumulate.
Summary
• No correlation between UEs and number of reads, so normalizing
uncorrectable errors by the number of bits read will artificially inflate
the reported error rate for drives with low read count.
• Both RBER and the number of uncorrectable errors grow with P/E cycles.
(linear growth)
Summary
• SLC drives are not more reliable than the lower end MLC drives with
respect to uncorrectable errors for the P/E cycle ranges within the
MLC cycle limits.
• The effect of temperature :
drive internal protection mechanisms that throttle drive operation
under higher temperatures.
Summary
• While flash drives offer lower field replacement rates than HDDs, they have a
higher rate of problems that can impact the user, such as uncorrectable
errors.
• A drive with a large number of factory bad blocks has a higher chance of
developing more bad blocks in the field, as well as certain types of errors.

More Related Content

Similar to Reliability of NAND-Based SSDs What Field Studies Tell Us.pptx

Cassandra Applications Benchmarking
Cassandra Applications BenchmarkingCassandra Applications Benchmarking
Cassandra Applications Benchmarkingniallmilton
 
Performance Testing Principles
Performance Testing PrinciplesPerformance Testing Principles
Performance Testing PrinciplesDariusz Kozon
 
How Do I Know My SQL & Virtual Environments Are Ready for SSD?
How Do I Know My SQL & Virtual Environments Are Ready for SSD?How Do I Know My SQL & Virtual Environments Are Ready for SSD?
How Do I Know My SQL & Virtual Environments Are Ready for SSD?SolarWinds
 
Google Study: Could those failures be caused by design flaws
Google Study: Could those failures be caused by design flawsGoogle Study: Could those failures be caused by design flaws
Google Study: Could those failures be caused by design flawsBarbara Aichinger
 
How do you know you really need ssd
How do you know you really need ssdHow do you know you really need ssd
How do you know you really need ssdJohn McDonald
 
Load Testing Best Practices
Load Testing Best PracticesLoad Testing Best Practices
Load Testing Best PracticesApica
 
Tutorial databasetestingusingsql
Tutorial databasetestingusingsqlTutorial databasetestingusingsql
Tutorial databasetestingusingsqlRenuka Ballal
 
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...Aaron Shilo
 
Neotys PAC - Stephen Townshend
Neotys PAC - Stephen TownshendNeotys PAC - Stephen Townshend
Neotys PAC - Stephen TownshendNeotys_Partner
 
שבוע אורקל 2016
שבוע אורקל 2016שבוע אורקל 2016
שבוע אורקל 2016Aaron Shilo
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Maarten Smeets
 
Performance Testing
Performance TestingPerformance Testing
Performance TestingAnu Shaji
 
Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...
Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...
Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...DataArt
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningScott Jenner
 
Performance Tuning
Performance TuningPerformance Tuning
Performance TuningJannet Peetz
 
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus SystemsDDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus SystemsBarbara Aichinger
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB
 

Similar to Reliability of NAND-Based SSDs What Field Studies Tell Us.pptx (20)

Cassandra Applications Benchmarking
Cassandra Applications BenchmarkingCassandra Applications Benchmarking
Cassandra Applications Benchmarking
 
Performance Testing Principles
Performance Testing PrinciplesPerformance Testing Principles
Performance Testing Principles
 
Dariusz Kozon - Performance testing principles
Dariusz Kozon - Performance testing principles Dariusz Kozon - Performance testing principles
Dariusz Kozon - Performance testing principles
 
How Do I Know My SQL & Virtual Environments Are Ready for SSD?
How Do I Know My SQL & Virtual Environments Are Ready for SSD?How Do I Know My SQL & Virtual Environments Are Ready for SSD?
How Do I Know My SQL & Virtual Environments Are Ready for SSD?
 
Google Study: Could those failures be caused by design flaws
Google Study: Could those failures be caused by design flawsGoogle Study: Could those failures be caused by design flaws
Google Study: Could those failures be caused by design flaws
 
How do you know you really need ssd
How do you know you really need ssdHow do you know you really need ssd
How do you know you really need ssd
 
Load Testing Best Practices
Load Testing Best PracticesLoad Testing Best Practices
Load Testing Best Practices
 
Tutorial databasetestingusingsql
Tutorial databasetestingusingsqlTutorial databasetestingusingsql
Tutorial databasetestingusingsql
 
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
Exploring Oracle Database Performance Tuning Best Practices for DBAs and Deve...
 
Neotys PAC - Stephen Townshend
Neotys PAC - Stephen TownshendNeotys PAC - Stephen Townshend
Neotys PAC - Stephen Townshend
 
שבוע אורקל 2016
שבוע אורקל 2016שבוע אורקל 2016
שבוע אורקל 2016
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!
 
Performance Testing
Performance TestingPerformance Testing
Performance Testing
 
Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...
Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...
Владимир Бронников (Senior .NET Developer, Perfectial) “Performance optimizat...
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance Tuning
 
Storage and I/O
Storage and I/OStorage and I/O
Storage and I/O
 
Performance Tuning
Performance TuningPerformance Tuning
Performance Tuning
 
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus SystemsDDR4 Memory Compliance Testing   Barbara Aichinger FuturePlus Systems
DDR4 Memory Compliance Testing Barbara Aichinger FuturePlus Systems
 
Fastest Servlets in the West
Fastest Servlets in the WestFastest Servlets in the West
Fastest Servlets in the West
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
 

More from Po-Chuan Chen

E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfE-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfPo-Chuan Chen
 
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Po-Chuan Chen
 
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfQuark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfPo-Chuan Chen
 
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Po-Chuan Chen
 
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdfOn the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdfPo-Chuan Chen
 
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...Po-Chuan Chen
 
A Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfA Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfPo-Chuan Chen
 
A Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfA Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfPo-Chuan Chen
 
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfAdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfPo-Chuan Chen
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...Po-Chuan Chen
 
Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfPo-Chuan Chen
 
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdfOffline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdfPo-Chuan Chen
 
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfCold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfPo-Chuan Chen
 
Image_to_Prompts.pdf
Image_to_Prompts.pdfImage_to_Prompts.pdf
Image_to_Prompts.pdfPo-Chuan Chen
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfPo-Chuan Chen
 
Evaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfEvaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfPo-Chuan Chen
 
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfPo-Chuan Chen
 
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfA Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfPo-Chuan Chen
 
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Is Reinforcement Learning (Not) for Natural
Language Processing.pdfIs Reinforcement Learning (Not) for Natural
Language Processing.pdf
Is Reinforcement Learning (Not) for Natural Language Processing.pdfPo-Chuan Chen
 
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfHyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfPo-Chuan Chen
 

More from Po-Chuan Chen (20)

E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdfE-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation.pdf
 
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
Effective Structured Prompting by Meta-Learning and Representative Verbalizer...
 
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdfQuark: Controllable Text Generation with Reinforced [Un]learning.pdf
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf
 
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible...
 
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdfOn the Effectiveness of Offline RL for Dialogue Response Generation.pdf
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf
 
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor...
 
A Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdfA Statistical Perspective on Retrieval-Based Models.pdf
A Statistical Perspective on Retrieval-Based Models.pdf
 
A Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdfA Neural Corpus Indexer for Document Retrieval.pdf
A Neural Corpus Indexer for Document Retrieval.pdf
 
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdfAdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
 
Active Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdfActive Retrieval Augmented Generation.pdf
Active Retrieval Augmented Generation.pdf
 
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdfOffline Reinforcement Learning for Informal Summarization in Online Domains.pdf
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf
 
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfCold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
 
Image_to_Prompts.pdf
Image_to_Prompts.pdfImage_to_Prompts.pdf
Image_to_Prompts.pdf
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
 
Evaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdfEvaluating Parameter Efficient Learning for Generation.pdf
Evaluating Parameter Efficient Learning for Generation.pdf
 
Off-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdfOff-Policy Deep Reinforcement Learning without Exploration.pdf
Off-Policy Deep Reinforcement Learning without Exploration.pdf
 
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdfA Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf
 
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
Is Reinforcement Learning (Not) for Natural
Language Processing.pdfIs Reinforcement Learning (Not) for Natural
Language Processing.pdf
Is Reinforcement Learning (Not) for Natural Language Processing.pdf
 
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfHyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
 

Reliability of NAND-Based SSDs What Field Studies Tell Us.pptx

  • 1. Reliability of NAND-Based SSDs What Field Studies Tell Us Speaker: Po-Chuan, Chen
  • 2. Table of contents • Abstract • Introduction (3 related studies) • Errors in flash & protective techniques • Flash reliability from the operator’s view • Raw bit error rate (lab versus field study) • Uncorrectable errors (relations with other characteristics) • Hardware failure (bad blocks & bad chips) • Forecasting field reliability • Future work & Summary
  • 3. Abstract • The goal of this paper is to provide an overview of what we have learned about flash reliability in production and, where appropriate, to contrast it with prior studies that performed controlled experiments.
  • 4. Introduction (3 related studies)  Examines uncorrectable errors in flash-based SSDs in Facebook’s servers.  A range of different errors and types of hardware failures in SSDs in Google data centers.  Fail-stop failures of SSDs at Microsoft data centers  Data centers versus lab & assumptions
  • 5. Focus on what? • The different types of errors experienced by flash drives and their frequency in the field • Raw bit error rates (RBERs) and their relationship with other errors • Uncorrectable errors • The field characteristics of different types of hardware failures • Fail-stop events • Comparison (different flash technologies) • Comparison (differences between HDDs and flash drives)
  • 6. 1 • Errors in flash & protective techniques • Flash reliability from the operator’s view • Raw bit error rate (lab versus field study) • Uncorrectable errors (relations with other characteristics) • Hardware failure (bad blocks & bad chips) • Fail-stop Failures (symptoms & predictors) • Forecasting field reliability • Future work & Summary
  • 7. Flash-Specific Sources of Errors • Retention errors • Read disturb errors • Write errors • Wear out
  • 8. Other Sources of Data Loss or Corruption in SSDs • Flash drives’ firmware Data corruption (serious in HDDs, but not in SSDs) • Power loss The reason that SSDs have data corruption
  • 9. Device-Level Protection Against Errors • The drives at Google mark a block bad after it experiences an uncorrectable error or a failed program or erase operation • Simply remove a bad chip from further usage and continue to operate with reduced capacity • Drive-internal chips in a RAID-like structure
  • 10. Uncorrectable errors in flash-based SSD Different errors and types of hardware failures in SSDs Fail-stop failures of SSDs
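The bad-block rule described on slide 9 (a block is marked bad after an uncorrectable error or a failed program or erase operation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Event` names and the `BlockTable` class are hypothetical and are not taken from any drive firmware or from the studies.

```python
from enum import Enum, auto


class Event(Enum):
    """Hypothetical per-block events a controller might observe."""
    CORRECTABLE_ERROR = auto()
    UNCORRECTABLE_ERROR = auto()
    PROGRAM_FAILURE = auto()
    ERASE_FAILURE = auto()


# Events that, per slide 9, cause a block to be marked bad.
BAD_BLOCK_EVENTS = {
    Event.UNCORRECTABLE_ERROR,
    Event.PROGRAM_FAILURE,
    Event.ERASE_FAILURE,
}


class BlockTable:
    """Tracks which blocks have been retired (illustrative only)."""

    def __init__(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        self.bad_blocks: set[int] = set()

    def record(self, block: int, event: Event) -> bool:
        """Record an event; return True if the block is now marked bad."""
        if not 0 <= block < self.num_blocks:
            raise IndexError("block out of range")
        if event in BAD_BLOCK_EVENTS:
            self.bad_blocks.add(block)
        return block in self.bad_blocks

    def usable_blocks(self) -> int:
        """Remaining capacity in blocks after retiring bad ones."""
        return self.num_blocks - len(self.bad_blocks)
```

A correctable error leaves the block in service, while any of the three listed events retires it and reduces the usable capacity by one block.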
  • 11. 2 • Flash reliability from the operator’s view • Raw bit error rate (lab versus field study) • Uncorrectable errors (relations with other characteristics) • Hardware failure (bad blocks & bad chips) • Fail-stop Failures (symptoms & predictors) • Forecasting field reliability • Future work & Summary
  • 12. Drive repair and replacements These replacements include only replacements that were due to suspected hardware issues with the drive
  • 13. Uncorrectable Errors • Uncorrectable errors are common: depending on the drive model between 26% to more than 90% of drives experience at least one uncorrectable error
  • 14. Differences between the 3 studies • The Facebook study (nonstandard SSD design): 20% to 35% of drives have uncorrectable errors; UBERs: 10^-9 to 10^-11 • The Microsoft study: UBERs: 10^-11 to 10^-14 • Counts of uncorrectable errors have a highly variable distribution with heavy tails
  • 15. Fail-Stop Failures • Nearly 80% of the fail-stopped SSDs were replaced
  • 16. Differences between the 3 studies • As the drives in the Microsoft study are commodity drives, it makes sense to compare the fail-stop rates with the annual failure rates one would expect based on vendor specifications • Fail-stop rates at Microsoft are lower than the rates of drive repairs and replacements at Google. Possible reasons for the lower fail-stop rates in the Microsoft study:  Fail-stop events might not be the only events that trigger drive replacements at Microsoft  The usage of the drives  The Google drive models have larger capacities
  • 17. Comparison with HDDs • Replacement rates for SSDs are significantly lower than for HDDs. • SSDs have significantly higher rates of nontransparent errors (e.g., uncorrectable errors). • Only 3.4% of HDDs develop latent sector errors over a 32-month period • For nearly all SSD models, more than 20% of the drives in the field experience an uncorrectable error.
  • 18. 3 • Raw bit error rate (lab versus field study) • Uncorrectable errors (relations with other characteristics) • Hardware failure (bad blocks & bad chips) • Fail-stop Failures (symptoms & predictors) • Forecasting field reliability • Future work & Summary
  • 19. What is RBER? A common metric to quantify flash reliability is the RBER, which is computed as the number of corrupted bits divided by the number of bits read. First-generation drives: for each page, they report only the number of corrupted bits in the data chunk that had the most corrupted bits
  • 20. What factors impact RBER in the field?  Wear out  Age  Workload  Lithography  MLC vs. SLC  Other factors
  • 21. Wear out ? • Our goal is to study in detail how RBER grows with P/E cycles in the field. • Both median and 95th percentile RBER increase as a function of the number of P/E cycles. (Linear increase) 4 X
  • 22. Observations on wear out  The P/E cycle limits given by vendors are conservative  They cannot be measured using the data from this study  The P/E cycle figures given are from an older version
  • 23. Age? • Older drives are more likely to have higher P/E cycle counts, which are correlated with RBER. • An older drive has had more retention time than a younger drive with the same number of P/E cycles. Longer retention times might have led to retention errors that increase the RBER.
  • 24. Workload? • Errors are correlated with workload: • Retention errors • Read disturb errors • Write errors • Field data shows no correlation between RBER and the number of read operations (a correlation might indicate the presence of read disturb)
  • 25. Lithography ? • Models with a smaller lithography tend to have higher RBER • Differences in lithography might also explain why the RBER for the eMLC drives is several orders of magnitude higher than that of the MLC drives.
  • 26. MLC vs. SLC? • MLC cells store multiple bits per cell and, as a result, the voltage window separating different values is smaller • MLC drives also have a significantly lower P/E cycle limit
  • 27. Other factor ? • The RBER for a particular drive model varies depending on the cluster where the drive is deployed, even when controlling for P/E cycles. • One possible reason could be different types of workloads running in different clusters.
  • 28. What factors impact RBER in the field?  Wear out (linear increase)  Age (older drives higher)  Workload (no influence)  Lithography (smaller is higher)  MLC vs. SLC (MLC higher)  Other factors (different clusters)
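The RBER definition from slide 19 and the per-P/E-cycle view from slide 21 can be sketched as follows. This is a minimal illustration, not the studies' analysis code: the sample tuples, the bucket size of 1000 cycles, and the function names are assumptions made for the example.

```python
from statistics import median


def rber(corrupted_bits: int, bits_read: int) -> float:
    """Raw bit error rate: corrupted bits divided by bits read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return corrupted_bits / bits_read


def median_rber_by_pe_bucket(samples, bucket_size=1000):
    """Group (pe_cycles, corrupted_bits, bits_read) samples by P/E-cycle
    bucket and return {bucket_start: median RBER}, sorted by bucket."""
    buckets = {}
    for pe_cycles, corrupted, read in samples:
        start = (pe_cycles // bucket_size) * bucket_size
        buckets.setdefault(start, []).append(rber(corrupted, read))
    return {start: median(vals) for start, vals in sorted(buckets.items())}


# Hypothetical samples: (P/E cycles, corrupted bits, bits read).
samples = [(100, 1, 10**9), (900, 3, 10**9), (1500, 8, 10**9)]
trend = median_rber_by_pe_bucket(samples)
```

Plotting or fitting the resulting bucket medians against the bucket start is one way to check whether growth with P/E cycles looks linear, as the slides report.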
  • 29. 4 • Uncorrectable errors (relations between other characteristics) • Fail-stop Failures (symptoms & predictors) • Hardware failure (bad blocks & bad chips) • Forecasting field reliability • Future work & Summary
  • 30. UBER is not useful • None of the measures provides any evidence for a correlation between the number of uncorrectable errors and the number of read operations or the amount of data read. • UBER is not a meaningful metric to compare the reliability of different drives (or drive types) in the field. UBER = (number of uncorrectable bit errors) / (number of bits read)
  • 31. What factors impact UEs in the field?  Wear out  Infant mortality  Workload intensity  Workload patterns  Temperature  Lithography  MLC vs. SLC
  • 32. Wear out ? Similarly to RBER, the probability of UEs grows continuously with P/E cycles and visual inspection as well as curve fitting suggest a linear growth rate. The models with the lowest RBER are not necessarily those with the lowest incidence of UEs
  • 33. Infant mortality?  Infant mortality  Make bad blocks and swap  Wear out starts Early detection period
  • 34. Workload intensity ? • Neither the Google data nor the Facebook data exhibit a significant correlation between read operations and the number of UEs • No correlation
  • 35. Workload patterns ? • As more DRAM buffer is used, the rate of uncorrectable errors increases • This is because DRAM buffer usage is higher when data is sparsely allocated, as more metadata is needed for the same total amount of data stored
  • 36. Temperature ? Drive-internal mechanisms deployed by the SSD controller, which try to protect the drive under higher temperatures by throttling workload and power, might explain the stable or decreasing failure rates under higher temperatures for some models.
  • 37. Lithography ? • The effects of lithography are much less obvious in the case of uncorrectable errors • It is possible that fabrication process improvements can compensate for the challenges of smaller feature sizes, and that some contributors to UEs like firmware bugs do not depend on lithography at all.
  • 38. SLC vs. MLC? • This paper does not find that SLC drives are superior for the reliability metrics that matter most in practice (neither the rate of repairs and replacements nor the rate of nontransparent errors) • While SLC drives might be more reliable at very high cycle counts, they are not generally more reliable than MLC drives when comparing the two drive types within the cycle limit of MLC drives.
  • 39. What factors impact UEs in the field?  Wear out (linear growth)  Infant mortality (3 steps)  Workload intensity (no evidence)  Workload patterns (more DRAM buffer used, higher UEs)  Temperature (drive-internal mechanisms have an effect)  Lithography (no strong correlation)  MLC vs. SLC (no strong correlation)
  • 40. Correlations within and between drives • Same drive: Both the Google and the Facebook studies provide clear evidence of correlations between UEs on the same drive. • Different drives: There are correlations between errors in different SSDs on the same machine.
  • 41. Between other types of errors
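The argument on slide 30 can be made concrete with a small calculation. If uncorrectable errors do not scale with the amount of data read, two drives with the same number of uncorrectable bit errors get very different UBER values purely because of their read volume. The drive names and numbers below are hypothetical and chosen only to show the effect.

```python
def uber(uncorrectable_bit_errors: int, bits_read: int) -> float:
    """Uncorrectable bit error rate: uncorrectable bit errors per bit read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return uncorrectable_bit_errors / bits_read


# Hypothetical drives: same error count, very different read volume.
drives = {
    "low_read_drive": (2, 10**12),
    "high_read_drive": (2, 10**15),
}

ubers = {name: uber(errors, read) for name, (errors, read) in drives.items()}

# The low-read drive looks this many times worse, although both drives
# experienced the same number of uncorrectable bit errors.
inflation = ubers["low_read_drive"] / ubers["high_read_drive"]
```

In this example the low-read drive appears roughly 1000 times less reliable by UBER, which is the artificial inflation the slides warn about.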
  • 42. 5 • Hardware failure (bad blocks & bad chips) • Fail-stop Failures (symptoms & predictors) • Forecasting field reliability • Future work & Summary
  • 43. Bad Blocks It is highly variable with a long tail: most drives with bad blocks develop only a small number of them (medians are in the 2–4 range), but once a drive exceeds this number it is likely to develop many more bad blocks MLC : blue SLC : red
  • 44. Factory bad blocks • The vast majority (more than 97%) of drives are shipped with factory bad blocks. • The drives above the 95th percentile of factory bad blocks experience a higher rate of uncorrectable errors in the field.
  • 45. Bad chips (2% ~ 7 %) Table 1 shows that failed chips are not a rare occurrence. We find that in two thirds of the cases, a chip was marked bad because of the number of failed blocks it had experienced.
  • 46. 6 • Fail-stop Failures (symptoms & predictors) • Forecasting field reliability • Future work & Summary
  • 47. Symptoms & predictors • Data errors (detected by the cyclic redundancy check) • Program or erase failures (often symptomatic of block or chip failures) • SATA downshift • Reallocated sectors (the number of sectors that the drive declared bad)
  • 49. 7 • Forecasting field reliability • Future work & Summary
  • 50. Accelerated life tests • When aging and wear out become a factor, it is common to use techniques for test acceleration. • RBER in the field is markedly higher than what the accelerated tests had indicated. • Some error mechanisms seem to be difficult to trigger in accelerated testing.
  • 51. Why it doesn’t work ? • One of the main difficulties is likely that workload characteristics in the field can vary widely and are not always captured by standard tests. • There are also workload-related reasons why error rates in the field can turn out higher than under test. Ex : Read disturb errors
  • 52. Projecting reliability based on RBER • One reason RBER is still a widely used metric for flash reliability is that it can be measured easily for raw flash chips and then be used as an indicator of the likelihood of experiencing UEs when using these chips inside an SSD.
  • 53. • Performing an analysis at an even finer time granularity • The paper also studied the relationship between RBER and a number of other types of errors, but found that correlation coefficients are even lower for other error types. • This paper concludes that per-drive RBER is a poor predictor of UEs or other types of errors seen in the field.
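Slide 53 refers to correlation coefficients between per-drive RBER and other error types. The slides do not state which coefficient was used; the sketch below assumes a Spearman rank correlation as one reasonable choice and implements it with the standard library only. All function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Applied to per-drive RBER values and per-drive UE counts, a coefficient near zero would be consistent with the slides' conclusion that RBER is a poor predictor of UEs.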
  • 54. 8 • Future work & Summary
  • 55. Future work • Repeating prior analyses with an emphasis on controlling for confounding factors would be useful. • Study patterns of UEs in more detail to gather additional evidence • Data on repair actions, or results from failure analyses that might be performed by the manufacturer on returned devices, could provide additional insights • Analysis of bad blocks and bad chips
  • 56. Summary • RBER, the standard metric for drive reliability, is not a good predictor of those failure modes that are the major concern in practice. • Common root causes of UEs in the field are defects or firmware/controller bugs, rather than single cell errors that accumulate.
  • 57. Summary • No correlation between UEs and number of reads, so normalizing uncorrectable errors by the number of bits read will artificially inflate the reported error rate for drives with low read count. • Both RBER and the number of uncorrectable errors grow with P/E cycles. (linear growth)
  • 58. Summary • SLC drives are not more reliable than the lower end MLC drives with respect to uncorrectable errors for the P/E cycle ranges within the MLC cycle limits. • The effect of temperature : drive internal protection mechanisms that throttle drive operation under higher temperatures.
  • 59. Summary • Flash drives offer lower field replacement rates than HDDs, but they have a higher rate of problems that can impact the user, such as uncorrectable errors. • A drive with a large number of factory bad blocks has a higher chance of developing more bad blocks in the field, as well as certain types of errors.