Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07), February 2007




                    Failure Trends in a Large Disk Drive Population
                       Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
                                               Google Inc.
                                         1600 Amphitheatre Pkwy
                                         Mountain View, CA 94043
                                 {edpin,wolf,luiz}@google.com


Abstract

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.

We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.

Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.


1 Introduction

The tremendous advances in low-cost, high-capacity magnetic disk drives have been among the key factors helping establish a modern society that is deeply reliant on information technology. High-volume, consumer-grade disk drives have become such a successful product that their deployments range from home computers and appliances to large-scale server farms. In 2002, for example, it was estimated that over 90% of all new information produced was stored on magnetic media, most of it being hard disk drives [12]. It is therefore critical to improve our understanding of how robust these components are and what main factors are associated with failures. Such understanding can be particularly useful for guiding the design of storage systems as well as devising deployment and maintenance strategies.

Despite the importance of the subject, there are very few published studies on failure characteristics of disk drives. Most of the available information comes from the disk manufacturers themselves [2]. Their data are typically based on extrapolation from accelerated life test data of small populations or from returned unit databases. Accelerated life tests, although useful in providing insight into how some environmental factors can affect disk drive lifetime, have been known to be poor predictors of actual failure rates as seen by customers in the field [7]. Statistics from returned units are typically based on much larger populations, but since there is little or no visibility into the deployment characteristics, the analysis lacks valuable insight into what actually happened to the drive during operation. In addition, since units are typically returned during the warranty period (often three years or less), manufacturers' databases may not be as helpful for the study of long-term effects.

A few recent studies have shed some light on field failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20]. However, these studies have either reported on relatively modest populations or did not monitor the disks closely enough during deployment to provide insights into the factors that might be associated with failures.

Disk drives are generally very reliable but they are also very complex components. This combination means that although they fail rarely, when they do fail, the possible causes of failure can be numerous. As a result, detailed studies of very large populations are the only way to collect enough failure statistics to enable meaningful conclusions. In this paper we present one such study by examining the population of hard drives under deployment within Google's computing infrastructure.

We have built an infrastructure that collects vital information about all Google's systems every few minutes, and a repository that stores these data in time-series format (essentially forever) for further analysis.
The information collected includes environmental factors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indicators of disk drive health. We mine through these data and attempt to find evidence that corroborates or contradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime.

Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deployment visibility and detailed lifetime follow-up that only an end-user study can provide. Our key findings are:

  • Contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels.

  • Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large impact on failure probability.

  • Given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone.

2 Background

In this section we describe the infrastructure that was used to gather and process the data used in this study, the types of disk drives included in the analysis, and information on how they are deployed.

2.1 The System Health Infrastructure

The System Health infrastructure is a large distributed software system that collects and stores hundreds of attribute-value pairs from all of Google's servers, and provides the interface for arbitrary analysis jobs to process that data.

The architecture of the System Health infrastructure is shown in Figure 1. It consists of a data collection layer, a distributed repository and an analysis framework. The collection layer is responsible for getting information from each of thousands of individual servers into a centralized repository. Different flavors of collectors exist to gather different types of data. Much of the health information is obtained from the machines directly. A daemon runs on every machine and gathers local data related to that machine's health, such as environmental parameters, utilization information of various resources, error indications, and configuration information. It is imperative that this daemon's resource usage be very light, so as not to interfere with the applications. One way to assure this is to have the machine-level collector poll individual machines relatively infrequently (every few minutes). Other slower-changing data (such as configuration information) and data from other existing databases can be collected even less frequently than that. Most notably for this study, data regarding machine repairs and disk swaps are pulled in from another database.

Figure 1: Collection, storage, and analysis architecture.

The System Health database is built upon Bigtable [3], a distributed data repository widely used within Google, which itself is built upon the Google File System (GFS) [8]. Bigtable takes care of all the data layout, compression, and access chores associated with a large data store. It presents the abstraction of a 2-dimensional table of data cells, with different versions over time making up a third dimension. It is a natural fit for keeping track of the values of different variables (columns) for different machines (rows) over time. The System Health database thus retains a complete time-ordered history of the environment, utilization, error, configuration, and repair events in each machine's life.

Analysis programs run on top of the System Health database, looking at information from individual machines, or mining the data across thousands of machines. Large-scale analysis programs are typically built upon Google's Mapreduce [5] framework. Mapreduce automates the mechanisms of large-scale distributed computation (such as work distribution, load balancing, tolerance of failures), allowing the user to focus simply on the algorithms that make up the heart of the computation.

The analysis pipeline used for this study consists of a Mapreduce job written in the Sawzall language and framework [15] to extract and clean up periodic SMART data and repair data related to disks, followed by a pass through R [1] for statistical analysis and final graph generation.
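
The actual pipeline runs on Bigtable, Sawzall, and R, none of which is reproduced here. Purely to make the data model concrete, the following Python sketch (machine names, variable names, and values are all invented) keys each observation by machine, variable, and timestamp, echoing Bigtable's row/column/version abstraction, and runs a toy map/reduce-style pass over it.

    from collections import defaultdict

    # Toy stand-in for the System Health repository: one cell per
    # (row=machine, column=variable, version=timestamp), as in Bigtable.
    health_db = defaultdict(dict)  # {(machine, variable): {timestamp: value}}

    def record(machine, variable, timestamp, value):
        """Collector daemon output: store one attribute-value pair."""
        health_db[(machine, variable)][timestamp] = value

    # A few made-up observations (hypothetical machines and readings).
    record("machine0001", "disk0/smart/temperature", 0, 28)
    record("machine0001", "disk0/smart/temperature", 600, 31)
    record("machine0002", "disk0/smart/temperature", 0, 35)

    def map_phase():
        """Emit (machine, value) pairs for one variable, as a mapper might."""
        for (machine, variable), series in health_db.items():
            if variable.endswith("temperature"):
                for value in series.values():
                    yield machine, value

    def reduce_phase(pairs):
        """Average the mapped values per machine, as a reducer might."""
        totals = defaultdict(lambda: [0, 0])  # machine -> [sum, count]
        for machine, value in pairs:
            totals[machine][0] += value
            totals[machine][1] += 1
        return {m: s / n for m, (s, n) in totals.items()}

    print(reduce_phase(map_phase()))  # {'machine0001': 29.5, 'machine0002': 35.0}
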
2.2 Deployment Details

The data in this study are collected from a large number of disk drives, deployed in several types of systems across all of Google's services. More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units in this study were put into production in or after 2001. The population contains several models from many of the largest disk drive manufacturers and from at least nine different models. The data used for this study were collected between December 2005 and August 2006.

As is common in server-class deployments, the disks were powered on, spinning, and generally in service for essentially all of their recorded life. They were deployed in rack-mounted servers and housed in professionally-managed datacenter facilities.

Before being put into production, all disk drives go through a short burn-in process, which consists of a combination of read/write stress tests designed to catch many of the most common assembly, configuration, or component-level problems. The data shown here do not include the fall-out from this phase, but instead begin when the systems are officially commissioned for use. Therefore our data should be consistent with what a regular end-user should see, since most equipment manufacturers put their systems through similar tests before shipment.

2.3 Data Preparation

Definition of Failure. Narrowly defining what constitutes a failure is a difficult task in such a large operation. Manufacturers and end-users often see different statistics when computing failures since they use different definitions for it. While drive manufacturers often quote yearly failure rates below 2% [2], user studies have seen rates as high as 6% [9]. Elerath and Shah [7] report that between 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers upon returning the unit. Hughes et al. [11] observe between 20-30% “no problem found” cases after analyzing failed drives from their study of 3477 disks.

From an end-user's perspective, a defective drive is one that misbehaves in a serious or consistent enough manner in the user's specific deployment scenario that it is no longer suitable for service. Since failures are sometimes the result of a combination of components (i.e., a particular drive with a particular controller or cable, etc.), it is no surprise that a good number of drives that fail for a given user could still be considered operational in a different test harness. We have observed that phenomenon ourselves, including situations where a drive tester consistently “green lights” a unit that invariably fails in the field. Therefore, the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure. Note that this definition implicitly excludes drives that were replaced due to an upgrade.

Since it is not always clear when exactly a drive failed, we consider the time of failure to be when the drive was replaced, which can sometimes be a few days after the observed failure event. It is also important to mention that the parameters we use in this study were not in use as part of the repairs diagnostics procedure at the time that these data were collected. Therefore there is no risk of false (forced) correlations between these signals and repair outcomes.

Filtering. With such a large number of units monitored over a long period of time, data integrity issues invariably show up. Information can be lost or corrupted along our collection pipeline. Therefore, some cleaning up of the data is necessary. In the case of missing values, the individual values are marked as not available and that specific piece of data is excluded from the detailed studies. Other records for that same drive are not discarded.

In cases where the data are clearly spurious, the entire record for the drive is removed, under the assumption that one piece of spurious data draws into question other fields for the same drive. Identifying spurious data, however, is a tricky task. Because part of the goal of studying the data is to learn what the numbers mean, we must be careful not to discard too much data that might appear invalid. So we define spurious simply as negative counts or data values that are clearly impossible. For example, some drives have reported temperatures that were hotter than the surface of the sun. Others have had negative power cycles. These were deemed spurious and removed. On the other hand, we have not filtered any suspiciously large counts from the SMART signals, under the hypothesis that large counts, while improbable as raw numbers, are likely to be good indicators of something really bad with the drive. Filtering for spurious values reduced the sample set size by less than 0.1%.
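
As an illustration of the cleaning rules described above, the sketch below applies them to a single record. The field names and the 150C sanity bound are hypothetical placeholders, not the values used in the production pipeline.

    # Illustrative only: a minimal filter in the spirit of the cleaning step
    # described in Section 2.3. Field names and the 150C bound are assumptions.
    NOT_AVAILABLE = None

    def clean_record(record):
        """Return (record, keep): missing fields become 'not available';
        clearly impossible values (negative counts, absurd temperatures)
        discard the whole record for this drive."""
        counters = ["scan_errors", "reallocations", "power_cycles"]
        for field in counters:
            value = record.get(field)
            if value is None:
                record[field] = NOT_AVAILABLE        # keep drive, skip this field
            elif value < 0:
                return record, False                 # negative count: spurious
        temp = record.get("temperature_c")
        if temp is not None and not (0 <= temp <= 150):
            return record, False                     # hotter than any real drive
        # Suspiciously large SMART counts are deliberately NOT filtered.
        return record, True

    sample = {"scan_errors": 2, "reallocations": -1, "temperature_c": 38}
    print(clean_record(sample))  # (..., False): negative count marks it spurious
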


3 Results

We now analyze the failure behavior of our fleet of disk drives using detailed monitoring data collected over a nine-month observation window. During this time we recorded failure events as well as all the available environmental and activity data and most of the SMART parameters from the drives themselves. Failure information spanning a much longer interval (approximately five years) was also mined from an older repairs database. All the results presented here were tested for their statistical significance using the appropriate tests.

3.1 Baseline Failure Rates

Figure 2 presents the average Annualized Failure Rates (AFR) for all drives in our study, aged zero to 5 years, and is derived from our older repairs database. The data are broken down by the age a drive was when it failed. Note that this implies some overlap between the sample sets for the 3-month, 6-month, and 1-year ages, because a drive can reach its 3-month, 6-month and 1-year age all within the observation period. Beyond 1 year there is no more overlap.

Figure 2: Annualized failure rates broken down by age groups.

While it may be tempting to read this graph as strictly failure rate with drive age, drive model factors are strongly mixed into these data as well. We tend to source a particular drive model only for a limited time (as new, more cost-effective models are constantly being introduced), so it is often the case that when we look at sets of drives of different ages we are also looking at a very different mix of models. Consequently, these data are not directly useful in understanding the effects of disk age on failure rates (the exception being the first three data points, which are dominated by a relatively stable mix of disk drive models). The graph is nevertheless a good way to provide a baseline characterization of failures across our population. It is also useful for later studies in the paper, where we can judge how consistent the impact of a given parameter is across these diverse drive model groups. A consistent and noticeable impact across all groups indicates strongly that the signal being measured has a fundamentally powerful correlation with failures, given that it is observed across widely varying ages and models.

The observed range of AFRs (see Figure 2) varies from 1.7%, for drives that were in their first year of operation, to over 8.6%, observed in the 3-year old population. The higher baseline AFR for 3 and 4 year old drives is more strongly influenced by the underlying reliability of the particular models in that vintage than by disk drive aging effects. It is interesting to note that our 3-month, 6-month and 1-year data points do seem to indicate a noticeable influence of infant mortality phenomena, with 1-year AFR dropping significantly from the AFR observed in the first three months.

3.2 Manufacturers, Models, and Vintages

Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18]. Our results do not contradict this fact. For example, Figure 2 changes significantly when we normalize failure rates per each drive model. Most age-related results are impacted by drive vintages. However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.

Interestingly, this does not change our conclusions. In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix. None of our SMART data results change significantly when normalized by drive model. The only exception is seek error rate, which is dependent on one specific drive manufacturer, as we discuss in section 3.5.5.
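
The paper does not spell out the exact bucketing behind Figure 2, but the usual definition of an annualized failure rate (failures divided by accumulated drive-years of exposure) is enough to read the chart. A minimal sketch of that textbook definition, with entirely made-up failure and exposure numbers:

    # AFR = failures per drive-year of observation, expressed as a percentage.
    # Bucket boundaries and all numbers below are hypothetical.
    def afr_percent(failures, drive_days):
        drive_years = drive_days / 365.0
        return 100.0 * failures / drive_years

    # Hypothetical per-age-group exposure and failure tallies.
    buckets = {
        "3 months": (120, 900_000),    # (failures, drive-days observed)
        "1 year":   (170, 3_600_000),
        "3 years":  (310, 1_300_000),
    }
    for age, (failures, days) in buckets.items():
        print(f"{age}: AFR = {afr_percent(failures, days):.1f}%")
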
3.3 Utilization

The literature generally refers to utilization metrics by employing the term duty cycle, which unfortunately has no consistent and precise definition, but can be roughly characterized as the fraction of time a drive is active out of the total powered-on time. What is widely reported in the literature is that higher duty cycles affect disk drives negatively [4, 21].

It is difficult for us to arrive at a meaningful numerical utilization metric given that our measurements do not provide enough detail to derive what 100% utilization might be for any given disk model. We choose instead to measure utilization in terms of weekly averages of read/write bandwidth per drive. We categorize utilization in three levels: low, medium and high, corresponding respectively to the lowest 25th percentile, the 50-75th percentiles, and the top 75th percentile. This categorization is performed for each drive model, since the maximum bandwidths have significant variability across drive families. We note that using number of I/O operations and bytes transferred as utilization metrics provides very similar results. Figure 3 shows the impact of utilization on AFR across the different age groups.

Figure 3: Utilization AFR.

Overall, we expected to notice a very strong and consistent correlation between high utilization and higher failure rates. However our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.

One possible explanation for this behavior is the survival of the fittest theory. It is possible that the failure modes that are associated with higher utilization are more prominent early in the drive's lifetime. If that is the case, the drives that survive the infant mortality phase are the least susceptible to that failure mode, and result in a population that is more robust with respect to variations in utilization levels.

Another possible explanation is that previous observations of high correlation between utilization and failures have been based on extrapolations from manufacturers' accelerated life experiments. Those experiments are likely to better model early life failure characteristics, and as such they agree with the trend we observe for the young age groups. It is possible, however, that longer term population studies could uncover a less pronounced effect later in a drive's lifetime.

When we look at these results across individual models we again see a complex pattern, with varying patterns of failure behavior across the three utilization levels. Taken as a whole, our data indicate a much weaker correlation between utilization levels and failures than previous work has suggested.
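
The low/medium/high categorization can be expressed mechanically as per-model percentile cutoffs on the weekly bandwidth averages. The sketch below follows that description; the drive names, models, and bandwidth figures are invented, and the study's exact binning may differ in detail.

    import statistics
    from collections import defaultdict

    def categorize_utilization(weekly_bw_by_drive, model_of):
        """Label each drive low/medium/high relative to its own model's
        distribution of weekly read/write bandwidth averages."""
        by_model = defaultdict(list)
        for drive, bw in weekly_bw_by_drive.items():
            by_model[model_of[drive]].append(bw)
        cutoffs = {}
        for model, values in by_model.items():
            q1, _, q3 = statistics.quantiles(values, n=4)  # ~25th and ~75th pct
            cutoffs[model] = (q1, q3)
        labels = {}
        for drive, bw in weekly_bw_by_drive.items():
            q1, q3 = cutoffs[model_of[drive]]
            labels[drive] = "low" if bw <= q1 else "high" if bw >= q3 else "medium"
        return labels

    # Hypothetical drives: two models with different bandwidth envelopes (MB/s).
    model_of = {f"d{i}": ("A" if i < 6 else "B") for i in range(12)}
    weekly_bw = {f"d{i}": bw for i, bw in enumerate(
        [5, 8, 12, 20, 25, 30,
         40, 55, 60, 75, 90, 110])}
    print(categorize_utilization(weekly_bw, model_of))
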
3.4 Temperature

Temperature is often quoted as the most important environmental factor affecting disk drive reliability. Previous studies have indicated that temperature deltas as low as 15C can nearly double disk drive failure rates [4]. Here we take temperature readings from the SMART records every few minutes during the entire 9-month window of observation and try to understand the correlation between temperature levels and failure rates.

We have aggregated temperature readings in several different ways, including averages, maxima, fraction of time spent above a given temperature value, number of times a temperature threshold is crossed, and last temperature before failure. Here we report data on averages and note that other aggregation forms have shown similar trends and therefore suggest the same conclusions.

We first look at the correlation between average temperature during the observation period and failure. Figure 4 shows the distribution of drives with average temperature in increments of one degree and the corresponding annualized failure rates. The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.

Figure 4: Distribution of average temperatures and failure rates.

Figure 5 looks at the average temperatures for different age groups. The distributions are in sync with Figure 4, showing a mostly flat failure rate at mid-range temperatures and a modest increase at the low end of the temperature distribution. What stands out are the 3 and 4-year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced.

Figure 5: AFR for average drive temperature.

Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.
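
The aggregation forms listed above are simple to state precisely. The sketch below computes three of them (average, maximum, and fraction of time above a threshold) for one drive's periodic SMART temperature readings; the 45C cutoff and the samples are arbitrary examples, not values from the study.

    def temperature_aggregates(readings_c, threshold_c=45):
        """Summarize one drive's sequence of periodic temperature samples."""
        n = len(readings_c)
        return {
            "average": sum(readings_c) / n,
            "maximum": max(readings_c),
            "fraction_above": sum(1 for t in readings_c if t > threshold_c) / n,
        }

    readings = [38, 41, 44, 47, 46, 43, 40]   # made-up samples, one per poll
    print(temperature_aggregates(readings))
    # {'average': 42.71..., 'maximum': 47, 'fraction_above': 0.2857...}
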
3.5 SMART Data Analysis

We now look at the various self-monitoring signals that are available from virtually all of our disk drives through the SMART standard interface. Our analysis indicates that some signals appear to be more relevant to the study of failures than others. We first look at those in detail, and then list a summary of our findings for the remaining ones. At the end of this section we discuss our results and reason about the usefulness of SMART parameters in obtaining predictive models for individual disk drive failures.

We present results in three forms. First we compare the AFR of drives with zero and non-zero counts for a given parameter, broken down by the same age groups as in Figures 2 and 3. We also find it useful to plot the probability of survival of drives over the nine-month observation window for different ranges of parameter values. Finally, in addition to the graphs, we devise a single metric that could relay how relevant the values of a given SMART parameter are in predicting imminent failures. To that end, for each SMART parameter we look for thresholds that increased the probability of failure in the next 60 days by at least a factor of 10 with respect to drives that have zero counts for that parameter. We report such Critical Thresholds whenever we are able to find them with high confidence (> 95%).
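
A direct reading of the Critical Threshold definition above can be sketched as follows: find the smallest parameter value whose 60-day failure probability is at least ten times that of the zero-count group. The confidence test (> 95%) used in the paper is omitted here, and the synthetic population below is invented purely to exercise the code.

    def critical_threshold(drives, factor=10):
        """drives: list of (param_value, failed_within_60_days) tuples.
        Return the smallest non-zero value whose group fails at least
        `factor` times more often than drives with a zero count."""
        zero = [failed for value, failed in drives if value == 0]
        base = sum(zero) / len(zero)          # failure prob. of zero-count drives
        if base == 0:
            return None
        for threshold in sorted({v for v, _ in drives if v > 0}):
            group = [failed for value, failed in drives if value >= threshold]
            if sum(group) / len(group) >= factor * base:
                return threshold
        return None

    population = [(0, False)] * 970 + [(0, True)] * 5 + \
                 [(1, True)] * 12 + [(1, False)] * 10 + [(3, True)] * 3
    print(critical_threshold(population))  # -> 1 for this synthetic population
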
Figure 6: AFR for scan errors.

Figure 7: AFR for reallocation counts.

3.5.1 Scan Errors

Drives typically scan the disk surface in the background and report errors as they discover them. Large scan error counts can be indicative of surface defects, and therefore are believed to be indicative of lower reliability. In our population, fewer than 2% of the drives show scan errors and they are nearly uniformly spread across various disk models.

Figure 6 shows the AFR values of two groups of drives, those without scan errors and those with one or more. We plot bars across all age groups in which we have statistically significant data. We find that the group of drives with scan errors are ten times more likely to fail than the group with no errors. This effect is also noticed when we further break down the groups by disk model.

From Figure 8 we see a drastic and quick decrease in survival probability after the first scan error (left graph). A little over 70% of the drives survive the first 8 months after their first scan error. The dashed lines represent the 95% confidence interval. The middle plot in Figure 8 separates the population into four age groups (in months), and shows an effect that is not visible in the AFR plots. It appears that scan errors affect the survival probability of young drives more dramatically very soon after the first scan error occurs, but after the first month the curve flattens out. Older drives, however, continue to see a steady decline in survival probability throughout the 8-month period. This behavior could be another manifestation of infant mortality phenomena. The right graph in Figure 8 looks at the effect of multiple scan errors. While drives with one error are more likely to fail than those with none, drives with multiple errors fail even more quickly.

Figure 8: Impact of scan errors on survival probability. The left figure shows aggregate survival probability for all drives after the first scan error. The middle figure breaks down survival probability by drive age in months. The right figure breaks down drives by their number of scan errors.

The critical threshold analysis confirms what the charts visually imply: the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors.

3.5.2 Reallocation Counts

When the drive's logic believes that a sector is damaged (typically as a result of recurring soft errors or a hard error) it can remap the faulty sector number to a new physical sector drawn from a pool of spares. Reallocation counts reflect the number of times this has happened, and are seen as an indication of drive surface wear. About 9% of our population has reallocation counts greater than zero. Although some of our drive models show higher absolute values than others, the trends we observe are similar across all models.

As with scan errors, the presence of reallocations seems to have a consistent impact on AFR for all age groups (Figure 7), even if slightly less pronounced. Drives with one or more reallocations do fail more often than those with none. The average impact on AFR appears to be between a factor of 3-6x.

Figure 11 shows the survival probability after the first reallocation. We truncate the graph to 8.5 months, due to a drastic decrease in the confidence levels after that point. In general, as the left graph shows, about 85% of the drives survive past 8 months after the first reallocation. The effect is more pronounced (middle graph) for drives in the age ranges [10,20) and [20,60] months, while newer drives in the range [0,5) months suffer more than their next generation. This could again be due to infant mortality effects, although it appears to be less drastic in this case than for scan errors.

After their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also one.
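
For intuition only, the survival curves of Figures 8 and 11 can be approximated by the fraction of drives still in service t months after their first error event. A real estimator would also have to handle censoring and confidence intervals (for example with a Kaplan-Meier estimate), which this toy version ignores; the months-to-failure values below are invented.

    def empirical_survival(months_to_failure, horizon=8):
        """months_to_failure: months from first error to failure, or None if
        the drive did not fail within the observation window."""
        n = len(months_to_failure)
        curve = []
        for month in range(1, horizon + 1):
            alive = sum(1 for m in months_to_failure if m is None or m > month)
            curve.append((month, alive / n))
        return curve

    after_first_scan_error = [0.5, 1, 2, 2, 4, 6, 7, None, None, None]
    for month, prob in empirical_survival(after_first_scan_error):
        print(f"month {month}: {prob:.2f} still surviving")
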
Figure 9: AFR for offline reallocation count.

Figure 10: AFR for probational count.

Figure 11: Impact of reallocation count values on survival probability. The left figure shows aggregate survival probability for all drives after the first reallocation. The middle figure breaks down survival probability by drive age in months. The right figure breaks down drives by their number of reallocations.

3.5.3 Offline Reallocations

Offline reallocations are defined as a subset of the reallocation counts studied previously, in which only reallocated sectors found during background scrubbing are counted. In other words, it should exclude sectors that are reallocated as a result of errors found during actual I/O operations. Although this definition mostly holds, we see evidence that certain disk models do not implement this definition. For instance, some models show more offline reallocations than total reallocations. Since the impact of offline reallocations appears to be significant and not identical to that of total reallocations, we decided to present it separately (Figure 9). About 4% of our population shows non-zero values for offline reallocations, and they tend to be concentrated in a particular subset of drive models.

Overall, the effects on survival probability of offline reallocations seem to be more drastic than those of total reallocations, as seen in Figure 12 (as before, some curves are clipped at 8 months because our data for those points were not within high confidence intervals). Drives in the older age groups appear to be more highly affected by it, although we are unable to attribute this effect to age given the different model mixes in the various age groups.

After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is again more drastic than that of total reallocations.

Our data suggest that, although offline reallocations could be an important parameter affecting failures, it is particularly important to interpret trends in these values within specific models, since there is some evidence that different drive models may classify reallocations differently.

Figure 12: Impact of offline reallocation on survival probability. The left figure shows aggregate survival probability for all drives after the first offline reallocation. The middle figure breaks down survival probability by drive age in months. The right figure breaks down drives by their number of offline reallocations.

Figure 13: Impact of probational count values on survival probability. The left figure shows aggregate survival probability for all drives after the first probational count. The middle figure breaks down survival probability by drive age in months. The right figure breaks down drives by their number of probational counts.

3.5.4 Probational Counts

Disk drives put suspect bad sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems. Probational counts, therefore, can be seen as a softer error indication. They could provide earlier warning of possible problems but might also be a weaker signal, in that sectors on probation may indeed never be reallocated. About 2% of our drives had non-zero probational count values. We note that this number is lower than both online and offline reallocation counts, likely indicating that sectors may be removed from probation after further observation of their behavior. Once more, the distribution of drives with non-zero probational counts is somewhat skewed towards a subset of disk drive models.

Figures 10 and 13 show that probational count trends are generally similar to those observed for offline reallocations, with the age group breakdown being somewhat less pronounced. The critical threshold for probational counts is also one: after the first event, drives are 16 times more likely to fail within 60 days than drives with zero probational counts.

3.5.5 Miscellaneous Signals

In addition to the SMART parameters described in the previous sections, which we have found to most closely impact failure rates, we have also studied several other parameters from the SMART set as well as other environmental factors. Here we briefly mention our relevant findings for some of those parameters.

Seek Errors. Seek errors occur when a disk drive fails to properly track a sector and needs to wait for another revolution to read or write from or to a sector. Drives report it as a rate, and it is meant to be used in combination with model-specific thresholds. When examining our population, we find that seek errors are widespread within drives of one manufacturer only, while others are more conservative in showing this kind of error. For this one manufacturer, the trend in seek errors is not clear, changing from one vintage to another. For other manufacturers, there is no correlation between failure rates and seek errors.

CRC Errors. Cyclic redundancy check (CRC) errors are detected during data transmission between the physical media and the interface. Although we do observe some correlation between higher CRC counts and failures, those effects are somewhat less pronounced; CRC errors are less indicative of drive failures than of problems with cables and connectors. About 2% of our population had CRC errors.

Power Cycles. The power cycles indicator counts the number of times a drive is powered up and down. In a server-class deployment, in which drives are powered continuously, we do not expect to reach high enough power cycle counts to see any effects on failure rates. Our results find that for drives aged up to two years this is true: there is no significant correlation between failures and high power cycle counts. But for drives 3 years and older, higher power cycle counts can increase the absolute failure rate by over 2%. We believe this is due more to our population mix than to aging effects. Moreover, this correlation could be the effect (not the cause) of troubled machines that require many repair iterations and thus many power cycles to be fixed.

Calibration Retries. We were unable to reach a consistent and clear definition of this SMART parameter from public documents as well as consultations with some of the disk manufacturers. Nevertheless, our observations do not indicate that this is a particularly useful parameter for the goals of this study. Under 0.3% of our drives have calibration retries, and of that group only about 2% have failed, making this a very weak and imprecise signal when compared with other SMART parameters.

Spin Retries. This parameter counts the number of retries when the drive is attempting to spin up. We did not register a single count within our entire population.

Power-on Hours. Although we do not dispute that power-on hours might have an effect on drive lifetime, it happens that in our deployment the age of the drive is an excellent approximation for that parameter, given that our drives remain powered on for most of their lifetime.

Vibration. This is not a parameter that is part of the SMART set, but it is one that is of general concern in designing drive enclosures, as most manufacturers describe how vibration can affect both performance and reliability of disk drives. Unfortunately we do not have sensor information to measure this effect directly for drives in service. We attempted to indirectly infer vibration effects by considering the differences in failure rates between systems with a single drive and those with multiple drives, but those experiments were not controlled enough for other possible factors to allow us to reach any conclusions.

3.5.6 Predictive Power of SMART Parameters

Given how strongly correlated some SMART parameters were found to be with higher failure rates, we were hopeful that accurate predictive failure models based on SMART signals could be created. Predictive models are very useful in that they can reduce service disruption due to failed components and allow for more efficient scheduled maintenance processes to replace the less efficient (and reactive) repairs procedures. In fact, one of the main motivations for SMART was to provide enough insight into disk drive behavior to enable such models to be built.

After our initial attempts to derive such models yielded relatively unimpressive results, we turned to the question of what might be the upper bound of the accuracy of any model based solely on SMART parameters. Our results are surprising, if not somewhat disappointing. Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives. Figure 14 shows that even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables. This population includes seek error rates, which we have observed to be widespread in our population (> 72% of our drives have it), which further reduces the sample size of drives without any errors.

It is difficult to add temperature to this analysis since, despite it being reported as part of SMART, there are no crisp thresholds that directly indicate errors. However, if we arbitrarily assume that spending more than 50% of the observed time above 40C is an indication of a possible problem, and add those drives to the set of predictable failures, we are still left with about 36% of all drives with no failure signals at all. Actual useful models, which need to have small false-positive rates, are in fact likely to do much worse than these limits might suggest.

We conclude that it is unlikely that SMART data alone can be effectively used to build models that predict failures of individual drives. SMART parameters still appear to be useful in reasoning about the aggregate reliability of large disk populations, which is still very important for logistics and supply-chain planning. It is possible, however, that models that use parameters beyond those provided by SMART could achieve significantly better accuracies. For example, performance anomalies and other application or operating system signals could be useful in conjunction with SMART data to create more powerful models. We plan to explore this possibility in our future work.
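
The upper-bound argument above reduces to a simple count: the fraction of failed drives that never showed a non-zero value in any of the signals a model would be allowed to use. The sketch below computes that fraction over a fabricated list of failed-drive records; in the paper the corresponding figures are over 56% for the four strong signals and over 36% with the remaining non-temperature SMART signals added.

    STRONG_SIGNALS = ["scan_errors", "reallocations",
                      "offline_reallocations", "probational_count"]

    def unpredictable_fraction(failed_drives, signals=STRONG_SIGNALS):
        """Fraction of failed drives with zero counts across all given signals;
        no SMART-only model can flag these drives ahead of time."""
        silent = sum(1 for d in failed_drives
                     if all(d.get(s, 0) == 0 for s in signals))
        return silent / len(failed_drives)

    failed = [
        {"scan_errors": 1, "reallocations": 0},
        {"offline_reallocations": 2},
        {},                      # failed with no SMART signal at all
        {"probational_count": 0},
    ]
    print(f"{unpredictable_fraction(failed):.0%} of failed drives had no signal")
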
Figure 14: Percentage of failed drives with SMART errors.

4 Related Work

Previous studies in this area generally fall into two categories: vendor (disk drive or storage appliance) technical papers and user experience studies. Disk vendor studies provide valuable insight into the electromechanical characteristics of disks, along with both model-based and experimental data that suggest how several environmental factors and usage activities can affect device lifetime. Yang and Sun [21] and Cole [4] describe the processes and experimental setup used by Quantum and Seagate to test new units, and the models that attempt to make long-term reliability predictions based on accelerated life tests of small populations. Power-on hours, duty cycle, and temperature are identified as the key deployment parameters that impact failure rates, each of them having the potential to double failure rates when going from nominal to extreme values. For example, Cole presents thermal de-rating models showing that MTBF could degrade by as much as 50% when going from operating temperatures of 30C to 40C. Cole's report also presents yearly failure rates from Seagate's warranty database, indicating a linear decrease in annual failure rates from 1.2% in the first year to 0.39% in the third (and last) year of record. In our study, we did not find much correlation between failure rate and either elevated temperature or utilization, which is the most surprising result of our study. Our annualized failure rates were generally higher than those reported by vendors, and more consistent with other user experience studies.

Shah and Elerath have written several papers based on the behavior of disk drives inside Network Appliance storage products [6, 7, 19]. They use a reliability database that includes field failure statistics as well as support logs, and their position as an appliance vendor gives them more control and visibility into actual deployments than a typical disk drive vendor might have. Although they do not report directly on the correlation between SMART parameters or environmental factors and failures (possibly for confidentiality reasons), their work is useful in enabling a qualitative understanding of the factors that affect disk drive reliability. For example, they comment that end-user failure rates can be as much as ten times higher than what the drive manufacturer might expect [7]; they report in [6] a strong experimental correlation between number of heads and higher failure rates (an effect that is also predicted by the models in [4]); and they observe that different failure mechanisms are at play at different phases of a drive's lifetime [19]. Generally, our findings are in line with these results.

User experience studies may lack the depth of insight into the device's inner workings that is possible in manufacturer reports, but they are essential in understanding device behavior in real-world deployments. Unfortunately, there are very few such studies to date, probably due to the large number of devices needed to observe statistically significant results and the complex infrastructure required to track failures and their contributing factors.

Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen-month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives under deployment at the Internet Archive are presented by Schwarz et al. [17]. They report a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3% to 6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light on the statistical properties of disk drive failures. That study uses failure data from several large-scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors.

We are aware of two groups that have attempted to correlate SMART parameters with failure statistics: Hughes et al. [11, 13, 14] and Hamerly and Elkan [10]. The largest populations studied by these groups were of 3744 and 1934 drives, and they derive failure models that achieve predictive rates as high as 30% at false-positive rates of about 0.2% (that false-positive rate corresponded to between 20% and 43% of the drives that actually failed in their studies; a back-of-the-envelope illustration of this trade-off appears at the end of this section). Hughes et al. also cite an annualized failure rate of 4-6%, based on their 2-3 month long experiment, which appears to use stress test logs provided by a disk manufacturer.

Our study takes a next step towards a better understanding of disk drive failure characteristics by essentially combining some of the best characteristics of studies from vendor database analysis, namely population size, with the kind of visibility into a real-world deployment that is only possible with end-user data.
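To put the false-positive rates reported by Hughes et al. and Hamerly and Elkan in perspective, the back-of-the-envelope sketch below (the window failure rate is assumed; only the 3744-drive population size comes from the studies above) shows how even a 0.2% false-positive rate produces false alarms comparable to a large fraction of the real failures.

# Back-of-the-envelope sketch with assumed numbers: relates a classifier's
# false-positive rate to the number of drives that actually fail during the
# observation window.

population = 3744            # drives monitored (size of the larger cited study)
window_failure_rate = 0.01   # assumed fraction of drives failing in the window
false_positive_rate = 0.002  # healthy drives incorrectly flagged

failed = population * window_failure_rate
false_alarms = (population - failed) * false_positive_rate

# False alarms as a fraction of real failures. With a 1% window failure rate
# this is about 0.2; with 0.5% it rises to about 0.4, spanning the 20-43%
# range quoted above.
print(round(false_alarms / failed, 2))  # -> 0.2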
5 Conclusions

In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, they provide strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated with higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART.
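The 39-times figure above is a comparison of short-horizon failure probabilities conditioned on a first scan error. A minimal sketch of that style of critical-threshold analysis is shown below, over a hypothetical per-drive event log; the study's actual computation and significance testing are more involved.

# Minimal sketch of a critical-threshold comparison over a hypothetical
# per-drive log. For drives that reported a scan error we know the day of the
# first error and the day of failure, if any; for error-free drives we only
# know the day of failure, if any.

HORIZON = 60  # days

def fail_within_horizon(drives_with_error, horizon=HORIZON):
    """P(failure within `horizon` days of the first scan error)."""
    hits = sum(1 for first_err, fail_day in drives_with_error
               if fail_day is not None and 0 <= fail_day - first_err <= horizon)
    return hits / len(drives_with_error)

def baseline_rate(error_free_failure_days, observed_days, horizon=HORIZON):
    """Crude estimate of P(failure in any `horizon`-day window) for drives
    that never reported a scan error, spreading their failures evenly over
    the observation window."""
    failures = sum(1 for day in error_free_failure_days if day is not None)
    windows = observed_days / horizon
    return failures / len(error_free_failure_days) / windows

# Toy data: (first_scan_error_day, failure_day) pairs, and failure days only.
with_errors = [(100, 130), (200, None), (50, 90), (10, 300)]
without_errors = [None] * 990 + [400] * 10   # 1% of error-free drives fail

ratio = fail_within_horizon(with_errors) / baseline_rate(without_errors, observed_days=270)
print(round(ratio, 1))  # relative risk of failing within 60 days of a first scan error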
Acknowledgments

We wish to acknowledge the contribution of numerous Google colleagues, particularly in the Platforms and Hardware Operations teams, who made this study possible, directly or indirectly; among them: Xiaobo Fan, Greg Slaughter, Don Yang, Jeremy Kubica, Jim Winget, Caio Villela, Justin Moore, Henry Green, Taliver Heath, and Walt Drummond. We are also thankful to our shepherd, Mary Baker, for comments and guidance. A special thanks to Urs Hölzle for his extensive feedback on our drafts.

References

[1] The R Project for Statistical Computing. http://www.r-project.org.

[2] Dave Anderson, Jim Dykes, and Erik Riedel. More than an interface - SCSI vs. ATA. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST'03), pages 245-257, February 2003.

[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06), November 2006.

[4] Gerry Cole. Estimating drive reliability in desktop computers and consumer electronics systems. Seagate Technology Paper TP-338.1, November 2000.

[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI'04), pages 137-150, December 2004.

[6] Jon G. Elerath and Sandeep Shah. Disk drive reliability case study: Dependence upon fly-height and quantity of heads. In Proceedings of the Annual Symposium on Reliability and Maintainability, January 2003.

[7] Jon G. Elerath and Sandeep Shah. Server class disk drives: How reliable are they? In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 151-156, January 2004.

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29-43, December 2003.

[9] Jim Gray and Catherine van Ingen. Empirical measurements of disk failure rates and error rates. Technical Report MSR-TR-2005-166, December 2005.

[10] Greg Hamerly and Charles Elkan. Bayesian approaches to failure prediction for disk drives. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), June 2001.

[11] Gordon F. Hughes, Joseph F. Murray, Kenneth Kreutz-Delgado, and Charles Elkan. Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3):350-357, September 2002.

[12] Peter Lyman and Hal R. Varian. How much information? October 2003. http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/index.htm.

[13] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Hard drive failure prediction using non-parametric statistical methods. Proceedings of ICANN/ICONIP, June 2003.

[14] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: A multiple-instance application. J. Mach. Learn. Res., 6:783-816, 2005.

[15] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4):227-298.

[16] Bianca Schroeder and Garth A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST'07), February 2007.

[17] Thomas Schwartz, Mary Baker, Steven Bassi, Bruce Baumgart, Wayne Flagg, Catherine van Ingen, Kobus Joste, Mark Manasse, and Mehul Shah. Disk failure investigations at the Internet Archive. 14th NASA Goddard, 23rd IEEE Conference on Mass Storage Systems and Technologies, May 2006.

[18] Sandeep Shah and Jon G. Elerath. Disk drive vintage and its effect on reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 163-167, January 2004.

[19] Sandeep Shah and Jon G. Elerath. Reliability analysis of disk drive failure mechanisms. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 226-231, January 2005.

[20] Nisha Talagala and David Patterson. An analysis of error behavior in a large storage system. Technical Report CSD-99-1042, University of California, Berkeley, February 1999.

[21] Jimmy Yang and Feng-Bin Sun. A comprehensive review of hard-disk drive reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 403-409, January 1999.

Disk Failures

  • 1. Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07), February 2007 Failure Trends in a Large Disk Drive Population Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andr´ Barroso e Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043 {edpin,wolf,luiz}@google.com Abstract for guiding the design of storage systems as well as de- vising deployment and maintenance strategies. It is estimated that over 90% of all new information produced Despite the importance of the subject, there are very in the world is being stored on magnetic media, most of it on few published studies on failure characteristics of disk hard disk drives. Despite their importance, there is relatively drives. Most of the available information comes from little published work on the failure patterns of disk drives, and the disk manufacturers themselves [2]. Their data are the key factors that affect their lifetime. Most available data typically based on extrapolation from accelerated life are either based on extrapolation from accelerated aging exper- test data of small populations or from returned unit iments or from relatively modest sized field studies. Moreover, databases. Accelerated life tests, although useful in pro- larger population studies rarely have the infrastructure in place viding insight into how some environmental factors can to collect health signals from components in operation, which affect disk drive lifetime, have been known to be poor is critical information for detailed failure analysis. predictors of actual failure rates as seen by customers We present data collected from detailed observations of a in the field [7]. Statistics from returned units are typi- large disk drive population in a production Internet services de- cally based on much larger populations, but since there ployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statis- is little or no visibility into the deployment characteris- tics, we analyze the correlation between failures and several tics, the analysis lacks valuable insight into what actu- parameters generally believed to impact longevity. ally happened to the drive during operation. In addition, Our analysis identifies several parameters from the drive’s since units are typically returned during the warranty pe- self monitoring facility (SMART) that correlate highly with riod (often three years or less), manufacturers’ databases failures. Despite this high correlation, we conclude that mod- may not be as helpful for the study of long-term effects. els based on SMART parameters alone are unlikely to be useful A few recent studies have shed some light on field for predicting individual drive failures. Surprisingly, we found failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20]. that temperature and activity levels were much less correlated However, these studies have either reported on relatively with drive failures than previously reported. modest populations or did not monitor the disks closely enough during deployment to provide insights into the 1 Introduction factors that might be associated with failures. Disk drives are generally very reliable but they are The tremendous advances in low-cost, high-capacity also very complex components. 
This combination magnetic disk drives have been among the key factors means that although they fail rarely, when they do fail, helping establish a modern society that is deeply reliant the possible causes of failure can be numerous. As a on information technology. High-volume, consumer- result, detailed studies of very large populations are the grade disk drives have become such a successful prod- only way to collect enough failure statistics to enable uct that their deployments range from home computers meaningful conclusions. In this paper we present one and appliances to large-scale server farms. In 2002, for such study by examining the population of hard drives example, it was estimated that over 90% of all new in- under deployment within Google’s computing infras- formation produced was stored on magnetic media, most tructure. of it being hard disk drives [12]. It is therefore critical We have built an infrastructure that collects vital in- to improve our understanding of how robust these com- formation about all Google’s systems every few min- ponents are and what main factors are associated with utes, and a repository that stores these data in time- failures. Such understanding can be particularly useful series format (essentially forever) for further analysis.
  • 2. The information collected includes environmental fac- tors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indi- cators of disk drive health. We mine through these data and attempt to find evidence that corroborates or con- tradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime. Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deploy- ment visibility and detailed lifetime follow-up that only an end-user study can provide. Our key findings are: • Contrary to previously reported results, we found very little correlation between failure rates and ei- ther elevated temperature or activity levels. • Some SMART parameters (scan errors, realloca- tion counts, offline reallocation counts, and proba- tional counts) have a large impact on failure proba- bility. Figure 1: Collection, storage, and analysis architecture. • Given the lack of occurrence of predictive SMART resources, error indications, and configuration informa- signals on a large fraction of failed drives, it is un- tion. It is imperative that this daemon’s resource usage likely that an accurate predictive failure model can be very light, so not to interfere with the applications. be built based on these signals alone. One way to assure this is to have the machine-level col- lector poll individual machines relatively infrequently 2 Background (every few minutes). Other slower changing data (such as configuration information) and data from other exist- ing databases can be collected even less frequently than In this section we describe the infrastructure that was that. Most notably for this study, data regarding ma- used to gather and process the data used in this study, chine repairs and disk swaps are pulled in from another the types of disk drives included in the analysis, and in- database. formation on how they are deployed. The System Health database is built upon Bigtable [3], a distributed data repository widely used within 2.1 The System Health Infrastructure Google, which itself is built upon the Google File Sys- tem (GFS) [8]. Bigtable takes care of all the data layout, The System Health infrastructure is a large distributed compression, and access chores associated with a large software system that collects and stores hundreds of data store. It presents the abstraction of a 2-dimensional attribute-value pairs from all of Google’s servers, and table of data cells, with different versions over time mak- provides the interface for arbitrary analysis jobs to pro- ing up a third dimension. It is a natural fit for keeping cess that data. track of the values of different variables (columns) for The architecture of the System Health infrastructure different machines (rows) over time. The System Health is shown in Figure 1. It consists of a data collection database thus retains a complete time-ordered history of layer, a distributed repository and an analysis frame- the environment, utilization, error, configuration, and re- work. The collection layer is responsible for getting in- pair events in each machine’s life. formation from each of thousands of individual servers Analysis programs run on top of the System Health into a centralized repository. 
Different flavors of col- database, looking at information from individual ma- lectors exist to gather different types of data. Much of chines, or mining the data across thousands of machines. the health information is obtained from the machines di- Large-scale analysis programs are typically built upon rectly. A daemon runs on every machine and gathers Google’s Mapreduce [5] framework. Mapreduce auto- local data related to that machine’s health, such as envi- mates the mechanisms of large-scale distributed compu- ronmental parameters, utilization information of various
  • 3. tation (such as work distribution, load balancing, toler- the user site are found to have no defect by the manu- ance of failures), allowing the user to focus simply on facturers upon returning the unit. Hughes et al. [11] ob- the algorithms that make up the heart of the computa- serve between 20-30% “no problem found” cases after tion. analyzing failed drives from their study of 3477 disks. The analysis pipeline used for this study consists of From an end-user’s perspective, a defective drive is a Mapreduce job written in the Sawzall language and one that misbehaves in a serious or consistent enough framework [15] to extract and clean up periodic SMART manner in the user’s specific deployment scenario that data and repair data related to disks, followed by a pass it is no longer suitable for service. Since failures are through R [1] for statistical analysis and final graph gen- sometimes the result of a combination of components eration. (i.e., a particular drive with a particular controller or ca- ble, etc), it is no surprise that a good number of drives that fail for a given user could be still considered op- 2.2 Deployment Details erational in a different test harness. We have observed that phenomenon ourselves, including situations where The data in this study are collected from a large num- a drive tester consistently “green lights” a unit that in- ber of disk drives, deployed in several types of systems variably fails in the field. Therefore, the most accurate across all of Google’s services. More than one hundred definition we can present of a failure event for our study thousand disk drives were used for all the results pre- is: a drive is considered to have failed if it was replaced sented here. The disks are a combination of serial and as part of a repairs procedure. Note that this definition parallel ATA consumer-grade hard disk drives, ranging implicitly excludes drives that were replaced due to an in speed from 5400 to 7200 rpm, and in size from 80 to upgrade. 400 GB. All units in this study were put into production Since it is not always clear when exactly a drive failed, in or after 2001. The population contains several models we consider the time of failure to be when the drive was from many of the largest disk drive manufacturers and replaced, which can sometimes be a few days after the from at least nine different models. The data used for observed failure event. It is also important to mention this study were collected between December 2005 and that the parameters we use in this study were not in use August 2006. as part of the repairs diagnostics procedure at the time As is common in server-class deployments, the disks that these data were collected. Therefore there is no risk were powered on, spinning, and generally in service for of false (forced) correlations between these signals and essentially all of their recorded life. They were deployed repair outcomes. in rack-mounted servers and housed in professionally- managed datacenter facilities. Filtering. With such a large number of units monitored Before being put into production, all disk drives go over a long period of time, data integrity issues invari- through a short burn-in process, which consists of a ably show up. Information can be lost or corrupted along combination of read/write stress tests designed to catch our collection pipeline. Therefore, some cleaning up of many of the most common assembly, configuration, or the data is necessary. In the case of missing values, the component-level problems. 
The data shown here do not individual values are marked as not available and that include the fall-out from this phase, but instead begin specific piece of data is excluded from the detailed stud- when the systems are officially commissioned for use. ies. Other records for that same drive are not discarded. Therefore our data should be consistent with what a reg- In cases where the data are clearly spurious, the entire ular end-user should see, since most equipment manu- record for the drive is removed, under the assumption facturers put their systems through similar tests before that one piece of spurious data draws into question other shipment. fields for the same drive. Identifying spurious data, how- ever, is a tricky task. Because part of the goal of studying the data is to learn what the numbers mean, we must be 2.3 Data Preparation careful not to discard too much data that might appear invalid. So we define spurious simply as negative counts Definition of Failure. Narrowly defining what consti- or data values that are clearly impossible. For exam- tutes a failure is a difficult task in such a large opera- ple, some drives have reported temperatures that were tion. Manufacturers and end-users often see different hotter than the surface of the sun. Others have had neg- statistics when computing failures since they use differ- ative power cycles. These were deemed spurious and ent definitions for it. While drive manufacturers often removed. On the other hand, we have not filtered any quote yearly failure rates below 2% [2], user studies have suspiciously large counts from the SMART signals, un- seen rates as high as 6% [9]. Elerath and Shah [7] report der the hypothesis that large counts, while improbable as between 15-60% of drives considered to have failed at
  • 4. raw numbers, are likely to be good indicators of some- thing really bad with the drive. Filtering for spurious values reduced the sample set size by less than 0.1%. 3 Results We now analyze the failure behavior of our fleet of disk drives using detailed monitoring data collected over a nine-month observation window. During this time we recorded failure events as well as all the available en- vironmental and activity data and most of the SMART parameters from the drives themselves. Failure informa- tion spanning a much longer interval (approximately five Figure 2: Annualized failure rates broken down by age groups years) was also mined from an older repairs database. All the results presented here were tested for their statis- ulation. The higher baseline AFR for 3 and 4 year old tical significance using the appropriate tests. drives is more strongly influenced by the underlying re- liability of the particular models in that vintage than by 3.1 Baseline Failure Rates disk drive aging effects. It is interesting to note that our 3-month, 6-months and 1-year data points do seem to Figure 2 presents the average Annualized Failure Rates indicate a noticeable influence of infant mortality phe- (AFR) for all drives in our study, aged zero to 5 years, nomena, with 1-year AFR dropping significantly from and is derived from our older repairs database. The data the AFR observed in the first three months. are broken down by the age a drive was when it failed. Note that this implies some overlap between the sample sets for the 3-month, 6-month, and 1-year ages, because 3.2 Manufacturers, Models, and Vintages a drive can reach its 3-month, 6-month and 1-year age Failure rates are known to be highly correlated with drive all within the observation period. Beyond 1-year there is models, manufacturers and vintages [18]. Our results do no more overlap. not contradict this fact. For example, Figure 2 changes While it may be tempting to read this graph as strictly significantly when we normalize failure rates per each failure rate with drive age, drive model factors are drive model. Most age-related results are impacted by strongly mixed into these data as well. We tend to source drive vintages. However, in this paper, we do not show a a particular drive model only for a limited time (as new, breakdown of drives per manufacturer, model, or vintage more cost-effective models are constantly being intro- due to the proprietary nature of these data. duced), so it is often the case that when we look at sets Interestingly, this does not change our conclusions. In of drives of different ages we are also looking at a very contrast to age-related results, we note that all results different mix of models. Consequently, these data are shown in the rest of the paper are not affected signifi- not directly useful in understanding the effects of disk cantly by the population mix. None of our SMART data age on failure rates (the exception being the first three results change significantly when normalized by drive data points, which are dominated by a relatively stable model. The only exception is seek error rate, which is mix of disk drive models). The graph is nevertheless a dependent on one specific drive manufacturer, as we dis- good way to provide a baseline characterization of fail- cuss in section 3.5.5. ures across our population. It is also useful for later studies in the paper, where we can judge how consistent the impact of a given parameter is across these diverse 3.3 Utilization drive model groups. 
A consistent and noticeable impact across all groups indicates strongly that the signal being The literature generally refers to utilization metrics by measured has a fundamentally powerful correlation with employing the term duty cycle which unfortunately has failures, given that it is observed across widely varying no consistent and precise definition, but can be roughly ages and models. characterized as the fraction of time a drive is active out The observed range of AFRs (see Figure 2) varies of the total powered-on time. What is widely reported in from 1.7%, for drives that were in their first year of op- the literature is that higher duty cycles affect disk drives eration, to over 8.6%, observed in the 3-year old pop- negatively [4, 21].
  • 5. It is difficult for us to arrive at a meaningful numer- ical utilization metric given that our measurements do not provide enough detail to derive what 100% utiliza- tion might be for any given disk model. We choose in- stead to measure utilization in terms of weekly averages of read/write bandwidth per drive. We categorize utiliza- tion in three levels: low, medium and high, correspond- ing respectively to the lowest 25th percentile, 50-75th percentiles and top 75th percentile. This categorization is performed for each drive model, since the maximum bandwidths have significant variability across drive fam- ilies. We note that using number of I/O operations and bytes transferred as utilization metrics provide very sim- ilar results. Figure 3 shows the impact of utilization on AFR across the different age groups. Figure 3: Utilization AFR Overall, we expected to notice a very strong and con- sistent correlation between high utilization and higher 3.4 Temperature failure rates. However our results appear to paint a more complex picture. First, only very young and very old Temperature is often quoted as the most important envi- age groups appear to show the expected behavior. Af- ronmental factor affecting disk drive reliability. Previous ter the first year, the AFR of high utilization drives is studies have indicated that temperature deltas as low as at most moderately higher than that of low utilization 15C can nearly double disk drive failure rates [4]. Here drives. The three-year group in fact appears to have the we take temperature readings from the SMART records opposite of the expected behavior, with low utilization every few minutes during the entire 9-month window drives having slightly higher failure rates than high uti- of observation and try to understand the correlation be- lization ones. tween temperature levels and failure rates. One possible explanation for this behavior is the sur- We have aggregated temperature readings in several vival of the fittest theory. It is possible that the fail- different ways, including averages, maxima, fraction of ure modes that are associated with higher utilization are time spent above a given temperature value, number of more prominent early in the drive’s lifetime. If that is the times a temperature threshold is crossed, and last tem- case, the drives that survive the infant mortality phase perature before failure. Here we report data on averages are the least susceptible to that failure mode, and result and note that other aggregation forms have shown sim- in a population that is more robust with respect to varia- ilar trends and and therefore suggest the same conclu- tions in utilization levels. sions. Another possible explanation is that previous obser- We first look at the correlation between average tem- vations of high correlation between utilization and fail- perature during the observation period and failure. Fig- ures has been based on extrapolations from manufactur- ure 4 shows the distribution of drives with average tem- ers’ accelerated life experiments. Those experiments are perature in increments of one degree and the correspond- likely to better model early life failure characteristics, ing annualized failure rates. The figure shows that fail- and as such they agree with the trend we observe for the ures do not increase when the average temperature in- young age groups. It is possible, however, that longer creases. 
In fact, there is a clear trend showing that lower term population studies could uncover a less pronounced temperatures are associated with higher failure rates. effect later in a drive’s lifetime. Only at very high temperatures is there a slight reversal of this trend. When we look at these results across individual mod- els we again see a complex pattern, with varying pat- Figure 5 looks at the average temperatures for differ- terns of failure behavior across the three utilization lev- ent age groups. The distributions are in sync with Figure els. Taken as a whole, our data indicate a much weaker 4 showing a mostly flat failure rate at mid-range temper- correlation between utilization levels and failures than atures and a modest increase at the low end of the tem- previous work has suggested. perature distribution. What stands out are the 3 and 4- year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced. Overall our experiments can confirm previously re-
  • 6. ones. At the end of this section we discuss our results and reason about the usefulness of SMART parameters in obtaining predictive models for individual disk drive failures. We present results in three forms. First we compare the AFR of drives with zero and non-zero counts for a given parameter, broken down by the same age groups as in figures 2 and 3. We also find it useful to plot the probability of survival of drives over the nine-month ob- servation window for different ranges of parameter val- ues. Finally, in addition to the graphs, we devise a sin- gle metric that could relay how relevant the values of a given SMART parameter are in predicting imminent failures. To that end, for each SMART parameter we Figure 4: Distribution of average temperatures and failures look for thresholds that increased the probability of fail- rates. ure in the next 60 days by at least a factor of 10 with respect to drives that have zero counts for that parame- ter. We report such Critical Thresholds whenever we are able to find them with high confidence (> 95%). 3.5.1 Scan Errors Drives typically scan the disk surface in the background and report errors as they discover them. Large scan error counts can be indicative of surface defects, and therefore are believed to be indicative of lower reliability. In our population, fewer than 2% of the drives show scan errors and they are nearly uniformly spread across various disk models. Figure 6 shows the AFR values of two groups of drives, those without scan errors and those with one or Figure 5: AFR for average drive temperature. more. We plot bars across all age groups in which we have statistically significant data. We find that the group ported temperature effects only for the high end of our of drives with scan errors are ten times more likely to fail temperature range and especially for older drives. In the than the group with no errors. This effect is also noticed lower and middle temperature ranges, higher tempera- when we further break down the groups by disk model. tures are not associated with higher failure rates. This is From Figure 8 we see a drastic and quick decrease in a fairly surprising result, which could indicate that data- survival probability after the first scan error (left graph). center or server designers have more freedom than pre- A little over 70% of the drives survive the first 8 months viously thought when setting operating temperatures for after their first scan error. The dashed lines represent the equipment that contains disk drives. We can conclude 95% confidence interval. The middle plot in Figure 8 that at moderate temperature ranges it is likely that there separates the population in four age groups (in months), are other effects which affect failure rates much more and shows an effect that is not visible in the AFR plots. It strongly than temperatures do. appears that scan errors affect the survival probability of young drives more dramatically very soon after the first scan error occurs, but after the first month the curve flat- 3.5 SMART Data Analysis tens out. Older drives, however, continue to see a steady decline in survival probability throughout the 8-month We now look at the various self-monitoring signals that period. This behavior could be another manifestation of are available from virtually all of our disk drives through infant mortality phenomenon. The right graph in figure 8 the SMART standard interface. Our analysis indicates looks at the effect of multiple scan errors. 
While drives that some signals appear to be more relevant to the study with one error are more likely to fail than those with of failures than others. We first look at those in detail, none, drives with multiple errors fail even more quickly. and then list a summary of our findings for the remaining
  • 7. Figure 6: AFR for scan errors. Figure 7: AFR for reallocation counts. Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of scan errors. The critical threshold analysis confirms what the groups (Figure 7), even if slightly less pronounced. charts visually imply: the critical threshold for scan er- Drives with one or more reallocations do fail more of- rors is one. After the first scan error, drives are 39 times ten than those with none. The average impact on AFR more likely to fail within 60 days than drives without appears to be between a factor of 3-6x. scan errors. Figure 11 shows the survival probability after the first reallocation. We truncate the graph to 8.5 months, due to a drastic decrease in the confidence levels after that 3.5.2 Reallocation Counts point. In general, the left graph shows, about 85% of the When the drive’s logic believes that a sector is damaged drives survive past 8 months after the first reallocation. (typically as a result of recurring soft errors or a hard er- The effect is more pronounced (middle graph) for drives ror) it can remap the faulty sector number to a new phys- in the age ranges [10,20) and [20, 60] months, while ical sector drawn from a pool of spares. Reallocation newer drives in the range [0,5) months suffer more than counts reflect the number of times this has happened, their next generation. This could again be due to infant and is seen as an indication of drive surface wear. About mortality effects, although it appears to be less drastic in 9% of our population has reallocation counts greater this case than for scan errors. than zero. Although some of our drive models show After their first reallocation, drives are over 14 times higher absolute values than others, the trends we observe more likely to fail within 60 days than drives without are similar across all models. reallocation counts, making the critical threshold for this As with scan errors, the presence of reallocations parameter also one. seems to have a consistent impact on AFR for all age
Figure 9: AFR for offline reallocation count.
Figure 10: AFR for probational count.
Figure 11: Impact of reallocation count values on survival probability. Left figure shows aggregate survival probability for all drives after first reallocation. Middle figure breaks down survival probability by drive age in months. Right figure breaks down drives by their number of reallocations.

3.5.3 Offline Reallocations

Offline reallocations are defined as a subset of the reallocation counts studied previously, in which only reallocated sectors found during background scrubbing are counted. In other words, it should exclude sectors that are reallocated as a result of errors found during actual I/O operations. Although this definition mostly holds, we see evidence that certain disk models do not implement it; for instance, some models show more offline reallocations than total reallocations. Since the impact of offline reallocations appears to be significant and not identical to that of total reallocations, we decided to present it separately (Figure 9). About 4% of our population shows non-zero values for offline reallocations, and they tend to be concentrated in a particular subset of drive models.

Overall, the effects of offline reallocations on survival probability seem to be more drastic than those of total reallocations, as seen in Figure 12 (as before, some curves are clipped at 8 months because our data for those points were not within high confidence intervals). Drives in the older age groups appear to be more highly affected by it, although we are unable to attribute this effect to age given the different model mixes in the various age groups.

After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is again more drastic than for total reallocations.

Our data suggest that, although offline reallocations could be an important parameter affecting failures, it is particularly important to interpret trends in these values within specific models, since there is some evidence that different drive models may classify reallocations differently.

Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives after first offline reallocation. Middle figure breaks down survival probability by drive age in months. Right figure breaks down drives by their number of offline reallocations.
Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all drives after first probational count. Middle figure breaks down survival probability by drive age in months. Right figure breaks down drives by their number of probational counts.

3.5.4 Probational Counts

Disk drives put suspect bad sectors "on probation" until they either fail permanently and are reallocated or continue to work without problems. Probational counts, therefore, can be seen as a softer error indication. They could provide earlier warning of possible problems but might also be a weaker signal, in that sectors on probation may indeed never be reallocated. About 2% of our drives had non-zero probational count values. We note that this number is lower than both online and offline reallocation counts, likely indicating that sectors may be removed from probation after further observation of their behavior. Once more, the distribution of drives with non-zero probational counts is somewhat skewed towards a subset of disk drive models.

Figures 10 and 13 show that probational count trends are generally similar to those observed for offline reallocations, with the age-group effect being somewhat less pronounced. The critical threshold for probational counts is also one: after the first event, drives are 16 times more likely to fail within 60 days than drives with zero probational counts.
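The survival curves in Figures 8 and 11-13 are empirical estimates over right-censored observations: after its first error event, a drive either fails within the observation window or leaves the study still working. As a rough illustration of how such a curve can be computed, the sketch below applies a simple Kaplan-Meier-style estimator to made-up durations; the paper does not specify the exact estimator used, so this is only one reasonable assumption.

    # Months from a drive's first reallocation until it failed (observed=True)
    # or was last seen still working (observed=False, i.e. right-censored).
    # The durations below are made up for illustration.
    samples = [
        (0.5, True), (1.0, False), (2.0, True), (3.5, False),
        (4.0, True), (6.0, False), (7.5, True), (8.0, False),
    ]

    def kaplan_meier(samples):
        """Return (time, survival) pairs of a basic Kaplan-Meier estimate."""
        n_at_risk = len(samples)
        survival = 1.0
        curve = []
        for t, observed in sorted(samples):
            if observed:                    # a failure at time t
                survival *= 1.0 - 1.0 / n_at_risk
                curve.append((t, survival))
            n_at_risk -= 1                  # failure or censoring leaves the risk set
        return curve

    for t, s in kaplan_meier(samples):
        print(f"{t:4.1f} months after first reallocation: {s:.2f} still surviving")

Confidence intervals, which motivate the truncation of the curves at 8.5 months above, would be layered on top of such an estimate, for example via Greenwood's variance formula.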
3.5.5 Miscellaneous Signals

In addition to the SMART parameters described in the previous sections, which we have found to most closely impact failure rates, we have also studied several other parameters from the SMART set as well as other environmental factors. Here we briefly mention our relevant findings for some of those parameters.

Seek Errors. Seek errors occur when a disk drive fails to properly track a sector and needs to wait for another revolution to read from or write to a sector. Drives report this as a rate, and it is meant to be used in combination with model-specific thresholds. When examining our population, we find that seek errors are widespread within drives of one manufacturer only, while others are more conservative in reporting this kind of error. For this one manufacturer, the trend in seek errors is not clear, changing from one vintage to another. For the other manufacturers, there is no correlation between failure rates and seek errors.

CRC Errors. Cyclic redundancy check (CRC) errors are detected during data transmission between the physical media and the interface. Although we do observe some correlation between higher CRC counts and failures, those effects are somewhat less pronounced; CRC errors are more indicative of problems with cables and connectors than of drive failures. About 2% of our population had CRC errors.

Power Cycles. The power cycles indicator counts the number of times a drive is powered up and down. In a server-class deployment, in which drives are powered continuously, we do not expect to reach high enough power cycle counts to see any effects on failure rates. Our results find that for drives aged up to two years this is true: there is no significant correlation between failures and high power cycle counts. But for drives 3 years and older, higher power cycle counts can increase the absolute failure rate by over 2%. We believe this is due more to our population mix than to aging effects. Moreover, this correlation could be the effect (not the cause) of troubled machines that require many repair iterations and thus many power cycles to be fixed.

Calibration Retries. We were unable to reach a consistent and clear definition of this SMART parameter from public documents as well as consultations with some of the disk manufacturers. Nevertheless, our observations do not indicate that this is a particularly useful parameter for the goals of this study. Under 0.3% of our drives have calibration retries, and of that group only about 2% have failed, making this a very weak and imprecise signal when compared with other SMART parameters.

Spin Retries. This parameter counts the number of retries when the drive is attempting to spin up. We did not register a single count within our entire population.

Power-on Hours. Although we do not dispute that power-on hours might have an effect on drive lifetime, it happens that in our deployment the age of the drive is an excellent approximation for that parameter, given that our drives remain powered on for most of their lifetime.

Vibration. This is not a parameter that is part of the SMART set, but it is one of general concern in designing drive enclosures, as most manufacturers describe how vibration can affect both the performance and reliability of disk drives. Unfortunately we do not have sensor information to measure this effect directly for drives in service. We attempted to indirectly infer vibration effects by considering the differences in failure rates between systems with a single drive and those with multiple drives, but those experiments were not controlled enough for other possible factors to allow us to reach any conclusions.
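The per-signal observations above (how prevalent a signal is and whether drives showing it fail more often) amount to a simple tabulation over the population. A minimal sketch of that tabulation is shown below; the records, field names, and flags are hypothetical.

    # Hypothetical per-drive summary: whether each signal was ever observed
    # and whether the drive failed during the observation year.
    population = [
        {"crc_errors": True,  "calibration_retries": False, "failed": True},
        {"crc_errors": False, "calibration_retries": True,  "failed": False},
        {"crc_errors": False, "calibration_retries": False, "failed": False},
        {"crc_errors": True,  "calibration_retries": False, "failed": False},
        {"crc_errors": False, "calibration_retries": False, "failed": True},
    ]

    def failure_rate(group):
        """Fraction of drives in the group that failed."""
        return sum(d["failed"] for d in group) / len(group) if group else float("nan")

    for signal in ("crc_errors", "calibration_retries"):
        with_sig = [d for d in population if d[signal]]
        without_sig = [d for d in population if not d[signal]]
        prevalence = len(with_sig) / len(population)
        print(f"{signal:20s} prevalence={prevalence:6.1%} "
              f"AFR(with)={failure_rate(with_sig):6.1%} "
              f"AFR(without)={failure_rate(without_sig):6.1%}")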
3.5.6 Predictive Power of SMART Parameters

Given how strongly correlated some SMART parameters were found to be with higher failure rates, we were hopeful that accurate predictive failure models based on SMART signals could be created. Predictive models are very useful in that they can reduce service disruption due to failed components and allow for more efficient scheduled maintenance processes to replace the less efficient (and reactive) repair procedures. In fact, one of the main motivations for SMART was to provide enough insight into disk drive behavior to enable such models to be built.

After our initial attempts to derive such models yielded relatively unimpressive results, we turned to the question of what might be the upper bound on the accuracy of any model based solely on SMART parameters. Our results are surprising, if not somewhat disappointing. Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives. Figure 14 shows that even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero counts on all variables. This population includes seek error rates, which we have observed to be widespread in our population (more than 72% of our drives report them), which further reduces the sample size of drives without any errors.

It is difficult to add temperature to this analysis since, despite it being reported as part of SMART, there are no crisp thresholds that directly indicate errors. However, if we arbitrarily assume that spending more than 50% of the observed time above 40C is an indication of a possible problem, and add those drives to the set of predictable failures, we are still left with about 36% of all failed drives showing no failure signals at all. Actual useful models, which need to have small false-positive rates, are in fact likely to do much worse than these limits might suggest.

We conclude that it is unlikely that SMART data alone can be effectively used to build models that predict failures of individual drives. SMART parameters still appear to be useful in reasoning about the aggregate reliability of large disk populations, which is still very important for logistics and supply-chain planning. It is possible, however, that models that use parameters beyond those provided by SMART could achieve significantly better accuracies. For example, performance anomalies and other application or operating system signals could be useful in conjunction with SMART data to create more powerful models. We plan to explore this possibility in our future work.
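To make the upper-bound argument concrete, the sketch below measures the fraction of failed drives whose counts are zero across the four strong signals; any model that fires only on those signals cannot flag these drives, so this fraction caps its recall. The attribute names and the records are hypothetical stand-ins, not the study's actual schema.

    # Hypothetical SMART snapshots for failed drives only; the keys stand in
    # for the four strong signals discussed above.
    failed_drives = [
        {"scan_errors": 2, "realloc": 0, "offline_realloc": 0, "probational": 0},
        {"scan_errors": 0, "realloc": 0, "offline_realloc": 0, "probational": 0},
        {"scan_errors": 0, "realloc": 5, "offline_realloc": 1, "probational": 0},
        {"scan_errors": 0, "realloc": 0, "offline_realloc": 0, "probational": 0},
    ]

    STRONG_SIGNALS = ("scan_errors", "realloc", "offline_realloc", "probational")

    # Failed drives with zero counts on every strong signal are invisible to a
    # detector that fires only on those signals.
    silent = sum(all(d[s] == 0 for s in STRONG_SIGNALS) for d in failed_drives)
    fraction_silent = silent / len(failed_drives)

    print(f"failed drives with no strong-signal counts: {fraction_silent:.0%}")
    print(f"upper bound on recall of a strong-signal-only model: {1 - fraction_silent:.0%}")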
Figure 14: Percentage of failed drives with SMART errors.

4 Related Work

Previous studies in this area generally fall into two categories: vendor (disk drive or storage appliance) technical papers and user experience studies. Disk vendor studies provide valuable insight into the electro-mechanical characteristics of disks and both model-based and experimental data that suggest how several environmental factors and usage activities can affect device lifetime. Yang and Sun [21] and Cole [4] describe the processes and experimental setup used by Quantum and Seagate to test new units and the models that attempt to make long-term reliability predictions based on accelerated life tests of small populations. Power-on hours, duty cycle, and temperature are identified as the key deployment parameters that impact failure rates, each of them having the potential to double failure rates when going from nominal to extreme values. For example, Cole presents thermal de-rating models showing that MTBF could degrade by as much as 50% when going from operating temperatures of 30C to 40C. Cole's report also presents yearly failure rates from Seagate's warranty database, indicating a linear decrease in annual failure rates from 1.2% in the first year to 0.39% in the third (and last) year of record. In our study, we did not find much correlation between failure rate and either elevated temperature or utilization; this is the most surprising result of our study. Our annualized failure rates were generally higher than those reported by vendors, and more consistent with other user experience studies.

Shah and Elerath have written several papers based on the behavior of disk drives inside Network Appliance storage products [6, 7, 19]. They use a reliability database that includes field failure statistics as well as support logs, and their position as an appliance vendor gives them more control and visibility into actual deployments than a typical disk drive vendor might have. Although they do not report directly on the correlation between SMART parameters or environmental factors and failures (possibly for confidentiality reasons), their work is useful in enabling a qualitative understanding of the factors that affect disk drive reliability. For example, they comment that end-user failure rates can be as much as ten times higher than what the drive manufacturer might expect [7]; they report in [6] a strong experimental correlation between number of heads and higher failure rates (an effect that is also predicted by the models in [4]); and they observe that different failure mechanisms are at play at different phases of a drive lifetime [19]. Generally, our findings are in line with these results.

User experience studies may lack the depth of insight into the device inner workings that is possible in manufacturer reports, but they are essential in understanding device behavior in real-world deployments. Unfortunately, there are very few such studies to date, probably due to the large number of devices needed to observe statistically significant results and the complex infrastructure required to track failures and their contributing factors.

Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen-month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives deployed at the Internet Archive are presented by Schwarz et al. [17]. They report a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3-6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light on the statistical properties of disk drive failures. The study uses failure data from several large-scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors.

We are aware of two groups that have attempted to correlate SMART parameters with failure statistics: Hughes et al. [11, 13, 14] and Hamerly and Elkan [10]. The largest populations studied by these groups were of 3744 and 1934 drives, and they derive failure models that achieve predictive rates as high as 30% at false-positive rates of about 0.2% (that false-positive rate corresponded to a number of drives between 20-43% of the drives that actually failed in their studies). Hughes et al. also cite an annualized failure rate of 4-6%, based on their 2-3 month long experiment, which appears to use stress test logs provided by a disk manufacturer.
Our study takes a next step towards a better understanding of disk drive failure characteristics by essentially combining some of the best characteristics of studies based on vendor databases, namely population size, with the kind of visibility into a real-world deployment that is only possible with end-user data.

5 Conclusions

In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, they provide strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART.

Acknowledgments

We wish to acknowledge the contribution of numerous Google colleagues, particularly in the Platforms and Hardware Operations teams, who made this study possible, directly or indirectly; among them: Xiaobo Fan, Greg Slaughter, Don Yang, Jeremy Kubica, Jim Winget, Caio Villela, Justin Moore, Henry Green, Taliver Heath, and Walt Drummond. We are also thankful to our shepherd, Mary Baker, for comments and guidance. A special thanks to Urs Hölzle for his extensive feedback on our drafts.

References

[1] The R Project for Statistical Computing. http://www.r-project.org.

[2] Dave Anderson, Jim Dykes, and Erik Riedel. More than an interface - SCSI vs. ATA. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST'03), pages 245-257, February 2003.

[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI'06), November 2006.

[4] Gerry Cole. Estimating drive reliability in desktop computers and consumer electronics systems. Seagate Technology Paper TP-338.1, November 2000.

[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI'04), pages 137-150, December 2004.

[6] Jon G. Elerath and Sandeep Shah. Disk drive reliability case study: Dependence upon fly-height and quantity of heads. In Proceedings of the Annual Symposium on Reliability and Maintainability, January 2003.

[7] Jon G. Elerath and Sandeep Shah. Server class disk drives: How reliable are they? In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 151-156, January 2004.

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29-43, December 2003.

[9] Jim Gray and Catherine van Ingen. Empirical measurements of disk failure rates and error rates. Technical Report MSR-TR-2005-166, December 2005.

[10] Greg Hamerly and Charles Elkan. Bayesian approaches to failure prediction for disk drives. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), June 2001.

[11] Gordon F. Hughes, Joseph F. Murray, Kenneth Kreutz-Delgado, and Charles Elkan. Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3):350-357, September 2002.

[12] Peter Lyman and Hal R. Varian. How much information? October 2003. http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/index.htm.

[13] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Hard drive failure prediction using non-parametric statistical methods. Proceedings of ICANN/ICONIP, June 2003.

[14] Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: A multiple-instance application. Journal of Machine Learning Research, 6:783-816, 2005.

[15] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4):227-298.

[16] Bianca Schroeder and Garth A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST'07), February 2007.

[17] Thomas Schwarz, Mary Baker, Steven Bassi, Bruce Baumgart, Wayne Flagg, Catherine van Ingen, Kobus Joste, Mark Manasse, and Mehul Shah. Disk failure investigations at the Internet Archive. In 14th NASA Goddard / 23rd IEEE Conference on Mass Storage Systems and Technologies, May 2006.

[18] Sandeep Shah and Jon G. Elerath. Disk drive vintage and its effect on reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 163-167, January 2004.

[19] Sandeep Shah and Jon G. Elerath. Reliability analysis of disk drive failure mechanisms. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 226-231, January 2005.

[20] Nisha Talagala and David Patterson. An analysis of error behavior in a large storage system. Technical Report CSD-99-1042, University of California, Berkeley, February 1999.

[21] Jimmy Yang and Feng-Bin Sun. A comprehensive review of hard-disk drive reliability. In Proceedings of the Annual Symposium on Reliability and Maintainability, pages 403-409, January 1999.