Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. DRAM Errors in the Wild: A Large-Scale Field Study Bianca Schroeder Eduardo Pinheiro Wolf-Dietrich Weber Dept. of Computer Science Google Inc. Google Inc. University of Toronto Mountain View, CA Mountain View, CA Toronto, Canada bianca@cs.toronto.eduABSTRACT were last written. Memory errors can be caused by elec-Errors in dynamic random access memory (DRAM) are a common trical or magnetic interference (e.g. due to cosmic rays),form of hardware failure in modern compute clusters. Failures are can be due to problems with the hardware (e.g. a bit beingcostly both in terms of hardware replacement costs and service permanently damaged), or can be the result of corruptiondisruption. While a large body of work exists on DRAM in labo- along the data path between the memories and the process-ratory conditions, little has been reported on real DRAM failures ing elements. Memory errors can be classified into soft er-in large production clusters. In this paper, we analyze measure-ments of memory errors in a large fleet of commodity servers over rors, which randomly corrupt bits but do not leave physicala period of 2.5 years. The collected data covers multiple vendors, damage; and hard errors, which corrupt bits in a repeatableDRAM capacities and technologies, and comprises many millions manner because of a physical defect.of DIMM days. The consequence of a memory error is system dependent. The goal of this paper is to answer questions such as the follow- In systems using memory without support for error correc-ing: How common are memory errors in practice? What are their tion and detection, a memory error can lead to a machinestatistical properties? How are they affected by external factors,such as temperature and utilization, and by chip-specific factors, crash or applications using corrupted data. Most memorysuch as chip density, memory technology and DIMM age? systems in server machines employ error correcting codes We find that DRAM error behavior in the field differs in many (ECC) [5], which allow the detection and correction of onekey aspects from commonly held assumptions. For example, we or multiple bit errors. If an error is uncorrectable, i.e. theobserve DRAM error rates that are orders of magnitude higher number of affected bits exceed the limit of what the ECCthan previously reported, with 25,000 to 70,000 errors per billion can correct, typically a machine shutdown is forced. Indevice hours per Mbit and more than 8% of DIMMs affectedby errors per year. We provide strong evidence that memory many production environments, including ours, a single un-errors are dominated by hard errors, rather than soft errors, which correctable error is considered serious enough to replace theprevious work suspects to be the dominant error mode. We find dual in-line memory module (DIMM) that caused it.that temperature, known to strongly impact DIMM error rates in Memory errors are costly in terms of the system failureslab conditions, has a surprisingly small effect on error behavior they cause and the repair costs associated with them. In pro-in the field, when taking all other factors into account. Finally, duction sites running large-scale systems, memory compo-unlike commonly feared, we don’t observe any indication thatnewer generations of DIMMs have worse error behavior. nent replacements rank near the top of component replace- ments [20] and memory errors are one of the most commonCategories and Subject Descriptors: B.8 [Hardware]: hardware problems to lead to machine crashes [19]. More-Performance and Reliability; C.4 [Computer Systems Orga- over, recent work shows that memory errors can cause secu-nization]: Performance of Systems; rity vulnerabilities [7,22]. There is also a fear that advancingGeneral Terms: Reliability. densities in DRAM technology might lead to increased mem-Keywords: DRAM, DIMM, memory, reliability, data cor- ory errors, exacerbating this problem in the future [3,12,13].ruption, soft error, hard error, large-scale systems. Despite the practical relevance of DRAM errors, very little is known about their prevalence in real production systems. Existing studies are mostly based on lab experiments us-1. INTRODUCTION ing accelerated testing, where DRAM is exposed to extreme Errors in dynamic random access memory (DRAM) de- conditions (such as high temperature) to artificially inducevices have been a concern for a long time [3, 11, 15–17, 23]. errors. It is not clear how such results carry over to realA memory error is an event that leads to the logical state production systems. The few existing studies that are basedof one or multiple bits being read differently from how they on measurements in real systems are small in scale, such as recent work by Li et al. [10], who report on DRAM errors in 300 machines over a period of 3 to 7 months. One main reason for the limited understanding of DRAMPermission to make digital or hard copies of all or part of this work for errors in real systems is the large experimental scale requiredpersonal or classroom use is granted without fee provided that copies are to obtain interesting measurements. A detailed study of er-not made or distributed for profit or commercial advantage and that copies rors requires data collection over a long time period (severalbear this notice and the full citation on the first page. To copy otherwise, to years) and thousands of machines, a scale that researchersrepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. cannot easily replicate in their labs. Production sites, whichSIGMETRICS/Performance’09, June 15–19, 2009, Seattle, WA, USA. run large-scale systems, often do not collect and record errorCopyright 2009 ACM 978-1-60558-511-6/09/06 ...$5.00.
  2. 2. Breadcrumbsdata rigorously, or are reluctant to share it because of thesensitive nature of data related to failures. This paper provides the first large-scale study of DRAM Raw Datamemory errors in the field. It is based on data collected Collectorfrom Google’s server fleet over a period of more than two Computing Nodeyears making up many millions of DIMM days. The DRAMin our study covers multiple vendors, DRAM densities and Aggregated Raw Datatechnologies (DDR1, DDR2, and FBDIMM). The paper addresses the following questions: How com- Real Timemon are memory errors in practice? What are their statis-tical properties? How are they affected by external factors, Bigtablesuch as temperature, and system utilization? And how do Resultsthey vary with chip-specific factors, such as chip density, Selected Raw Datamemory technology and DIMM age? We find that in many aspects DRAM errors in the field be- Summary Data Sawzallhave very differently than commonly assumed. For example, Analysis Toolwe observe DRAM error rates that are orders of magnitudehigher than previously reported, with FIT rates (failures intime per billion device hours) of 25,000 to 70,000 per Mbit Figure 1: Collection, storage, and analysis architec-and more than 8% of DIMMs affected per year. We provide ture.strong evidence that memory errors are dominated by harderrors, rather than soft errors, which most previous workfocuses on. We find that, out of all the factors that impact because of a physical defect (e.g. “stuck bits”). Our mea-a DIMM’s error behavior in the field, temperature has a surement infrastructure captures both hard and soft errors,surprisingly small effect. Finally, unlike commonly feared, but does not allow us to reliably distinguish these types ofwe don’t observe any indication that per-DIMM error rates errors. All our numbers include both hard and soft errors.increase with newer generations of DIMMs. Single-bit soft errors in the memory array can accumu- late over time and turn into multi-bit errors. In order to2. BACKGROUND AND METHODOLOGY avoid this accumulation of single-bit errors, memory systems can employ a hardware scrubber [14] that scans through the2.1 Memory errors and their handling memory, while the memory is otherwise idle. Any memory Most memory systems in use in servers today are pro- words with single-bit errors are written back after correction,tected by error detection and correction codes. The typical thus eliminating the single-bit error if it was soft. Three ofarrangement is for a memory access word to be augmented the six hardware platforms (Platforms C, D and F) we con-with additional bits to contain the error code. Typical error sider make use of memory scrubbers. The typical scrubbingcodes in commodity server systems today fall in the single rate in those systems is 1GB every 45 minutes. In the othererror correct double error detect (SECDED) category. That three hardware platforms (Platforms A, B, and E) errors aremeans they can reliably detect and correct any single-bit er- only detected on access.ror, but they can only detect and not correct multiple biterrors. More powerful codes can correct and detect more er- 2.2 The systemsror bits in a single memory word. For example, a code family Our data covers the majority of machines in Google’s fleetknown as chip-kill [6], can correct up to 4 adjacent bits at and spans nearly 2.5 years, from January 2006 to June 2008.once, thus being able to work around a completely broken Each machine comprises a motherboard with some proces-4-bit wide DRAM chip. We use the terms correctable error sors and memory DIMMs. We study 6 different hardware(CE) and uncorrectable error (UE) in this paper to general- platforms, where a platform is defined by the motherboardize away the details of the actual error codes used. and memory generation. If done well, the handling of correctable memory errors is The memory in these systems covers a wide variety of thelargely invisible to application software. Correction of the most commonly used types of DRAM. The DIMMs comeerror and logging of the event can be performed in hardware from multiple manufacturers and models, with three differ-for a minimal performance impact. However, depending on ent capacities (1GB, 2GB, 4GB), and cover the three mosthow much of the error handling is pushed into software, the common DRAM technologies: Double Data Rate (DDR1),impact can be more severe, with high error rates causing a Double Data Rate 2 (DDR2) and Fully-Buffered (FBDIMM).significant degradation of overall system performance. DDR1 and DDR2 have a similar interface, except that DDR2 Uncorrectable errors typically lead to a catastrophic fail- provides twice the per-data-pin throughput (400 Mbit/s andure of some sort. Either there is an explicit failure action in 800 Mbit/s respectively). FBDIMM is a buffering interfaceresponse to the memory error (such as a machine reboot), around what is essentially a DDR2 technology inside.or there is risk of a data-corruption-induced failure such as akernel panic. In the systems we study, all uncorrectable er- 2.3 The measurement methodologyrors are considered serious enough to shut down the machine Our collection infrastructure (see Figure 1) consists of lo-and replace the DIMM at fault. cally recording events every time they happen. The logged Memory errors can be classified into soft errors, which ran- events of interest to us are correctable errors, uncorrectabledomly corrupt bits, but do not leave any physical damage; errors, CPU utilization, temperature, and memory allocated.and hard errors, which corrupt bits in a repeatable manner These events (”breadcrumbs”) remain in the host machine
  3. 3. and are collected periodically (every 10 minutes) and archivedin a Bigtable [4] for later processing. This collection happens Table 1: Memory errors per year: Per machinecontinuously in the background. Platf. Tech. CE CE CE CE UE The scale of the system and the data being collected make Incid. Rate Rate Median Incid.the analysis non-trivial. Each one of many ten-thousands of (%) Mean C.V. Affct. (%)machines in the fleet logs every ten minutes hundreds of pa- A DDR1 45.4 19,509 3.5 611 0.17rameters, adding up to many TBytes. It is therefore imprac- B DDR1 46.2 23,243 3.4 366 –tical to download the data to a single machine and analyze it C DDR1 22.3 27,500 17.7 100 2.15 D DDR2 12.3 20,501 19.0 63 1.21with standard tools. We solve this problem by using a paral- E FBD – – – – 0.27lel pre-processing step (implemented in Sawzall [18]), which F DDR2 26.9 48,621 16.1 25 4.15runs on several hundred nodes simultaneously and performs Overall – 32.2 22,696 14.0 277 1.29basic data clean-up and filtering. We then perform the re-mainder of our analysis using standard analysis tools. Per DIMM Platf. Tech. CE CE CE CE UE2.4 Analytical methodology Incid. Rate Rate Median Incid. The metrics we consider are the rate and probability of (%) Mean C.V. Affct. (%)errors over a given time period. For uncorrectable errors, A DDR1 21.2 4530 6.7 167 0.05we focus solely on probabilities, since a DIMM is expected B DDR1 19.6 4086 7.4 76 – C DDR1 3.7 3351 46.5 59 0.28to be removed after experiencing an uncorrectable error. D DDR2 2.8 3918 42.4 45 0.25 As part of this study, we investigate the impact of temper- E FBD – – – – 0.08ature and utilization (as measured by CPU utilization and F DDR2 2.9 3408 51.9 15 0.39amount of memory allocated) on memory errors. The ex- Overall – 8.2 3751 36.3 64 0.22act temperature and utilization levels at which our systemsoperate are sensitive information. Instead of giving abso-lute numbers for temperature, we therefore report tempera- chine, we begin by looking at the frequency of memory errorsture values “normalized” by the smallest observed tempera- per machine. We then focus on the frequency of memory er-ture. That is a reported temperature value of x, means the rors for individual DIMMs.temperate was x degrees higher than the smallest observedtemperature. The same approach does not work for CPU 3.1 Errors per machineutilization, since the range of utilization levels is obvious Table 1 (top) presents high-level statistics on the frequency(ranging from 0-100%). Instead, we report CPU utilization of correctable errors and uncorrectable errors per machineas multiples of the average utilization, i.e. a utilization of per year of operation, broken down by the type of hardwarex, corresponds to a utilization level that is x times higher platform. Blank lines indicate lack of sufficient data.than the average utilization. We follow the same approach Our first observation is that memory errors are not rarefor allocated memory. events. About a third of all machines in the fleet experience When studying the effect of various factors on memory at least one memory error per year (see column CE Incid.errors, we often want to see how much higher or lower the %) and the average number of correctable errors per yearmonthly rate of errors is compared to an average month (in- is over 22,000. These numbers vary across platforms, withdependent of the factor under consideration). We therefore some platforms (e.g. Platform A and B) seeing nearly 50% ofoften report “normalized” rates and probabilities, i.e. we their machines affected by correctable errors, while in othersgive rates and probabilities as multiples of the average. For only 12–27% are affected. The median number of errors perexample, when we say the normalized probability of an un- year for those machines that experience at least one errorcorrectable error is 1.5 for a given month, that means the ranges from 25 to 611.uncorrectable error probability is 1.5 times higher than in Interestingly, for those platforms with a lower percentagean average month. This has the additional advantage that of machines affected by correctable errors, the average num-we can plot results for platforms with very different error ber of correctable errors per machine per year is the sameprobabilities in the same graph. or even higher than for the other platforms. We will take Finally, when studying the effect of factors, such as tem- a closer look at the differences between platforms and tech-perature, we report error rates as a function of percentiles nologies in Section 3.2.of the observed factor. For example, we might report that We observe that for all platforms the number of errorsthe monthly correctable error rate is x if the temperature per machine is highly variable with coefficients of variationlies in the first temperature decile (i.e. the temperature is between 3.4 and 20 1 . Some machines develop a very largein the range of the lowest 10% of reported temperature mea- number of correctable errors compared to others. We findsurements). This has the advantage that the error rates for that for all platforms, 20% of the machines with errors makeeach temperature range that we report on are based on the up more than 90% of all observed errors for that platform.same number of data samples. Since error rates tend to be One explanation for the high variability might be correla-highly variable, it is important to compare data points that tions between errors. A closer look at the data confirmsare based on a similar number of samples. this hypothesis: in more than 93% of the cases a machine that sees a correctable error experiences at least one more3. BASELINE STATISTICS correctable error in the same year. We start our study with the basic question of how common 1 These are high C.V. values compared, for example, to anmemory errors are in the field. Since a single uncorrectable exponential distribution, which has a C.V. of 1, or a Poissonerror in a machine leads to the shut down of the entire ma- distribution, which has a C.V. of 1/mean.
  4. 4. 0 10 Table 2: Errors per DIMM by DIMM Fraction of correctable errors type/manufacturer −1 10 Incid. Incid. Mean C.V. CEs/ Pf Mfg GB CE UE CE CE GB −2 (%) (%) rate 10 1 1 20.6 0.03 4242 6.9 4242 1 2 19.7 0.07 4487 5.9 2244 −3 A 2 1 6.6 1496 11.9 1469 10 3 1 27.1 0.04 5821 6.2 5821 Platform A Platform B 4 1 5.3 0.03 1128 13.8 1128 Platform C 1 20.3 – 3980 7.5 3980 −4 Platform D 1 10 2 18.4 – 5098 6.8 2549 −5 −4 −3 −2 −1 0 B 10 10 10 10 10 10 1 7.9 – 1841 11.0 1841 Fraction of dimms with correctable errors 2 2 18.1 – 2835 8.9 1418 1 1 3.6 0.21 2516 69.7 2516Figure 2: The distribution of correctable errors over C 4 1 2.6 0.43 2461 57.2 2461 5 2 4.7 0.22 10226 12.0 5113DIMMs: The graph plots the fraction Y of all errors in a 2 2.7 0.24 3666 39.4 1833platform that is made up by the fraction X of DIMMs with D 6 4 5.7 0.24 12999 23.0 3250the largest number of errors. 2 – 0 – – – 1 4 – 0.13 – – – 2 – 0.05 – – – While correctable errors typically do not have an immedi- E 2 4 – 0.27 – – –ate impact on a machine, uncorrectable errors usually result 2 – 0.06 – – – 5in a machine shutdown. Table 1 shows, that while uncor- 4 – 0.14 – – –rectable errors are less common than correctable errors, they 2 2.8 0.20 2213 53.0 1107 F 1do happen at a significant rate. Across the entire fleet, 1.3% 4 4.0 1.09 4714 42.8 1179of machines are affected by uncorrectable errors per year,with some platforms seeing as many as 2-4% affected. A closer look at the data also lets us rule out memory3.2 Errors per DIMM technology (DDR1, DDR2, or FBDIMM) as the main factor Since machines vary in the numbers of DRAM DIMMs responsible for the difference. Some platforms within theand total DRAM capacity, we next consider per-DIMM statis- same group use different memory technology (e.g. DDR1tics (Table 1 (bottom)). versus DDR2 in Platform C and D, respectively), while there Not surprisingly, the per-DIMM numbers are lower than are platforms in different groups using the same memorythe per-machine numbers. Across the entire fleet, 8.2% of technology (e.g. Platform A , B and C all use DDR1). Thereall DIMMs are affected by correctable errors and an average is not one memory technology that is clearly superior to theDIMM experiences nearly 4000 correctable errors per year. others when it comes to error behavior.These numbers vary greatly by platform. Around 20% of We also considered the possibility that DIMMs from dif-DIMMs in Platform A and B are affected by correctable ferent manufacturers might exhibit different error behav-errors per year, compared to less than 4% of DIMMs in ior. Table 2 shows the error rates broken down by thePlatform C and D. Only 0.05–0.08% of the DIMMs in Plat- most common DIMM types, where DIMM type is definedform A and Platform E see an uncorrectable error per year, by the combinations of platform and manufacturer. We notecompared to nearly 0.3% of the DIMMs in Platform C and that, DIMMs within the same platform exhibit similar er-Platform D. The mean number of correctable errors per ror behavior, even if they are from different manufacturers.DIMM are more comparable, ranging from 3351–4530 cor- Moreover, we observe that DIMMs from some manufacturersrectable errors per year. (Mfg1 , Mfg4 ) are used in a number of different platforms The differences between different platforms bring up the with very different error behavior. These observations showquestion of how chip-hardware specific factors impact the two things: the differences between platforms are not mainlyfrequency of memory errors. We observe that there are two due to differences between manufacturers and we do not seegroups of platforms with members of each group sharing manufacturers that are consistently good or bad.similar error behavior: there are Platform A , B, and E on While we cannot be certain about the cause of the differ-one side, and Platform C , D and F on the other. While both ences between platforms, we hypothesize that the observedgroups have mean correctable error rates that are on the differences in correctable errors are largely due to board andsame order of magnitude, the first group has a much higher DIMM design differences. We suspect that the differencesfraction of DIMMs affected by correctable errors, and the in uncorrectable errors are due to differences in the errorsecond group has a much higher fraction of DIMMs affected correction codes in use. In particular, Platforms C and Dby uncorrectable errors. are the only platforms that do not use a form of chip-kill [6]. We investigated a number of external factors that might Chip-kill is a more powerful code, that can correct certainexplain the difference in memory rates across platforms, in- types of multiple bit errors, while the codes in Platforms Ccluding temperature, utilization, DIMM age and capacity. and D can only correct single-bit errors.While we will see (in Section 5) that all these affect the We observe that for all platforms the number of correctablefrequency of errors, they are not sufficient to explain the errors per DIMM per year is highly variable, with coefficientsdifferences we observe between platforms. of variation ranging from 6 to 46. One might suspect that
  5. 5. 6 100 10 1 13X Platform D Platform A 64X 91X Platform C Platform C 80 5 10 Platform A 0.8 Platform D Number of CEs in month 35X 158X 228XCE probability (%) Autocorrelation 4 60 10 0.6 3 10 0.4 40 2 10 0.2 20 Platform A Platform C 1 Platform D 10 0 1 2 3 4 5 0 0 10 10 10 10 10 10 0 2 4 6 8 10 12 CE same month CE previous month Number of CEs in prev. month Lag (months) Figure 3: Correlations between correctable errors in the same DIMM: The left graph shows the probability of seeing a CE in a given month, depending on whether there were other CEs observed in the same month and the previous month. The numbers on top of each bar show the factor increase in probability compared to the CE probability in a random month (three left-most bars) and compared to the CE probability when there was no CE in the previous month (three right-most bars). The middle graph shows the expected number of CEs in a month as a function of the number of CEs in the previous month. The right graph shows the autocorrelation function for the number of CEs observed per month in a DIMM. this is because the majority of the DIMMs see zero errors, error is followed by at least one more correctable error in while those affected see a large number of them. It turns out the same month. Depending on the platform, this corre- that even when focusing on only those DIMMs that have ex- sponds to an increase in probability between 13X to more perienced errors, the variability is still high (not shown in than 90X, compared to an average month. Also seeing cor- table). The C.V. values range from 3–7 and there are large rectable errors in the previous month significantly increases differences between the mean and the median number of the probability of seeing a correctable error: The probability correctable errors: the mean ranges from 20, 000 − 140, 000, increases by factors of 35X to more than 200X, compared to while the median numbers are between 42 − 167. the case when the previous month had no correctable errors. Figure 2 presents a view of the distribution of correctable Seeing errors in the previous month not only affects the errors over DIMMs. It plots the fraction of errors made up probability, but also the expected number of correctable er- by the top x percent of DIMMs with errors. For all plat- rors in a month. Figure 3 (middle) shows the expected forms, the top 20% of DIMMs with errors make up over number of correctable errors in a month, as a function of 94% of all observed errors. For Platform C and D, the dis- the number of correctable errors observed in the previous tribution is even more skewed, with the top 20% of DIMMs month. As the graph indicates, the expected number of cor- comprising more than 99.6% of all errors. Note that the rectable errors in a month increases continuously with the graph in Figure 2 is plotted on a log-log scale and that the number of correctable errors in the previous month. lines for all platforms appear almost straight indicating a Figure 3 (middle) also shows that the expected number of power-law distribution. errors in a month is significantly larger than the observed To a first order, the above results illustrate that errors in number of errors in the previous month. For example, in DRAM are a valid concern in practice. This motivates us the case of Platform D , if the number of correctable errors to further study the statistical properties of errors (Section in the previous month exceeds 100, the expected number of 4) and how errors are affected by various factors, such as correctable errors in this month is more than 1,000. This is environmental conditions (Section 5). a 100X increase compared to the correctable error rate for a random month. We also consider correlations over time periods longer 4. A CLOSER LOOK AT CORRELATIONS than from one month to the next. Figure 3 (right) shows the In this section, we study correlations between correctable autocorrelation function for the number of errors observed errors within a DIMM, correlations between correctable and per DIMM per month, at lags up to 12 months. We observe uncorrectable errors in a DIMM, and correlations between that even at lags of several months the level of correlation errors in different DIMMs in the same machine. is still significant. Understanding correlations between errors might help iden- tify when a DIMM is likely to produce a large number of 4.2 Correlations between correctable and un- errors in the future and replace it before it starts to cause correctable errors serious problems. Since uncorrectable errors are simply multiple bit corrup- tions (too many for the ECC to correct), one might won- 4.1 Correlations between correctable errors der whether the presence of correctable errors increases the Figure 3 (left) shows the probability of seeing a correctable probability of seeing an uncorrectable error as well. This is error in a given month, depending on whether there were cor- the question we focus on next. rectable errors in the same month or the previous month. As The three left-most bars in Figure 4 (left) show how the the graph shows, for each platform the monthly correctable probability of experiencing an uncorrectable error in a given error probability increases dramatically in the presence of month increases if there are correctable errors in the same prior errors. In more than 85% of the cases a correctable month. The graph indicates that for all platforms, the prob-
  6. 6. 3 2.5 90 10 431X Platform A Platform A 88X Factor increase in UE probability Platform C 80 Platform C Platform D 60X Platform D 2 70 10X 193XUE probability (%) 2 Percentage (%) 60 10 1.5 50 40 6X 1 47X 1 10 30 32X 19X 20 15X Platform D 0.5 Platform C 10 0 Platform A 27X 9X 10 0 1 2 3 4 5 0 0 10 10 10 10 10 10 CE same month CE previous month CE same month CE prev month Number of CEs in same month Figure 4: Correlations between correctable and uncorrectable errors in the same DIMM: The left graph shows the UE probability in a month depending on whether there were CEs in the same month or in the previous month. The numbers on top of the bars give the increase in UE probability compared to a month without CEs (three left-most bars) and the case where there were no CEs in the previous month (three right-most bars). The middle graph shows how often a UE was preceded by a CE in the same/previous month. The right graph shows the factor increase in the probability of observing an UE as a function of the number of CEs in the same month. ability of an uncorrectable error is significantly larger in a We also experimented with more sophisticated methods month with correctable errors compared to a month with- for predicting uncorrectable errors, for example by building out correctable errors. The increase in the probability of an CART (Classification and regression trees) models based on uncorrectable error ranges from a factor of 27X (for Plat- parameters such as the number of CEs in the same and pre- form A ) to more than 400X (for Platform D ). While not vious month, CEs and UEs in other DIMMs in the machine, quite as strong, the presence of correctable errors in the pre- DIMM capacity and model, but were not able to achieve ceding month also affects the probability of uncorrectable er- significantly better prediction accuracy. Hence, replacing rors. The three right-most bars in Figure 4 (left) show that DIMMs solely based on correctable errors might be worth the probability of seeing a uncorrectable error in a month fol- the price only in environments where the cost of downtime lowing a month with at least one correctable errors is larger is high enough to outweigh the cost of the relatively high by a factor of 9X to 47X than if the previous month had no rate of false positives. correctable errors. The observed correlations between correctable errors and Figure 4 (right) shows that not only the presence, but also uncorrectable errors will be very useful in the remainder of the rate of observed correctable errors in the same month af- this study, when trying to understand the impact of var- fects the probability of an uncorrectable error. Higher rates ious factors (such as temperature, age, utilization) on the of correctable errors translate to a higher probability of un- frequency of memory errors. Since the frequency of cor- correctable errors. We see similar, albeit somewhat weaker rectable errors is orders of magnitudes higher than that of trends when plotting the probability of uncorrectable errors uncorrectable errors, it is easier to obtain conclusive results as a function of the number of correctable errors in the pre- for correctable errors than uncorrectable errors. For the re- vious month (not shown in figure). The uncorrectable error mainder of this study we focus mostly on correctable errors probabilities are about 8X lower than if the same number and how they are affected by various factors. We assume of correctable errors had happened in the same month, but that those factors that increase correctable error rates, are still significantly higher than in a random month. likely to also increase the probability of experiencing an un- Given the above observations, one might want to use cor- correctable error. rectable errors as an early warning sign for impending uncor- rectable errors. Another interesting view is therefore what 4.3 Correlations between DIMMs in the same fraction of uncorrectable errors are actually preceded by a machine correctable error, either in the same month or the previ- So far we have focused on correlations between errors ous month. Figure 4 (middle) shows that 65-80% of uncor- within the same DIMM. If those correlations are mostly due rectable errors are preceded by a correctable error in the to external factors (such as temperature or workload inten- same month. Nearly 20-40% of uncorrectable errors are pre- sity), we should also be able to observe correlations between ceded by a correctable error in the previous month. Note errors in different DIMMs in the same machine, since these that these probabilities are significantly higher than seeing are largely subject to the same external factors. a correctable error in an average month. Figure 5 shows the monthly probability of correctable and The above observations lead to the idea of early replace- uncorrectable errors, as a function of whether there was an ment policies, where a DIMM is replaced once it experi- error in another DIMM in the same machine. We observe ences a significant number of correctable errors, rather than significantly increased error probabilities, compared to an waiting for the first uncorrectable error. However, while average month, indicating a correlation between errors in uncorrectable error probabilities are greatly increased after different DIMMs in the same machine. However, the ob- observing correctable errors, the absolute probabilities of an served probabilities are lower as when an error was previ- uncorrectable error are still relatively low (e.g. 1.7–2.3% in ously seen in the same DIMM (compare with Figure 3 (left) the case of Platform C and Platform D , see Figure 4 (left)). and Figure 4 (left)).
  7. 7. 30 1 Platform A Platform A Platform C Platform C 25 Platform D Platform D 0.8 CE probability (%) UE probability (%) 20 0.6 15 0.4 10 0.2 5 0 0 CE in other DIMM UE in other DIMM CE in other DIMM UE in other DIMMFigure 5: Correlations between errors in different DIMMs in the same machine: The graphs show the monthlyCE probability (left) and UE probability (right) as a function of whether there was a CE or a UE in another DIMM in thesame machine in the same month. The fact that correlations between errors in different DIMMs 6 CE Prob Factor increase when doubling GBare significantly lower than those between errors in the same CE RateDIMM might indicate that there are strong factors in addi- 5 UE Probtion to environmental factors that affect error behavior. 45. THE ROLE OF EXTERNAL FACTORS 3 In this section, we study the effect of various factors oncorrectable and uncorrectable error rates, including DIMM 2capacity, temperature, utilization, and age. We considerall platforms, except for Platform F , for which we do not 1have enough data to allow for a fine-grained analysis, andPlatform E , for which we do not have data on CEs. 0 A−1 B−1 B−2 D−6 E−1 E−2 F−15.1 DIMM Capacity and chip size Since the amount of memory used in typical server systems Figure 6: Memory errors and DIMM capacity: Thekeeps growing from generation to generation, a commonly graph shows for different Platform-Manufacturer pairs theasked question when projecting for future systems, is how an factor increase in CE rates, CE probabilities and UE prob-increase in memory affects the frequency of memory errors. abilities, when doubling the capacity of a DIMM.In this section, we focus on one aspect of this question. Weask how error rates change, when increasing the capacity ofindividual DIMMs. To answer this question we consider all DIMM types (type built, since a given DIMM capacity can be achieved in mul-being defined by the combination of platform and manufac- tiple ways. For example, a one gigabyte DIMM with ECCturer) that exist in our systems in two different capacities. can be manufactured with 36 256-megabit chips, or 18 512-Typically, the capacities of these DIMM pairs are either 1GB megabit chips or with 9 one-gigabit chips.and 2GB, or 2GB and 4GB (recall Table 2). Figure 6 shows We studied the effect of chip sizes on correctable and un-for each of these pairs the factor by which the monthly prob- correctable errors, controlling for capacity, platform (dimmability of correctable errors, the correctable error rate and technology), and age. The results are mixed. When two chipthe probability of uncorrectable errors changes, when dou- configurations were available within the same platform, ca-bling capacity2 . pacity and manufacturer, we sometimes observed an increase Figure 6 indicates a trend towards worse error behavior in average correctable error rates and sometimes a decrease.for increased capacities, although this trend is not consis- This either indicates that chip size does not play a dom-tent. While in some cases the doubling of capacity has a inant role in influencing CEs or there are other, strongerclear negative effect (factors larger than 1 in the graph), confounders in our data that we did not control for.in others it has hardly any effect (factor close to 1 in the In addition to a correlation of chip size with error rates,graph). For example, for Platform A -Mfg1 and Platform F - we also looked for correlations of chip size with incidence ofMfg1 doubling the capacity increases uncorrectable errors, correctable and uncorrectable errors. Again we observe nobut not correctable errors. Conversely, for Platform D - clear trends. We also repeated the study of chip size effectMfg6 doubling the capacity affects correctable errors, but without taking information on the manufacturer and/or agenot uncorrectable error. into account, again without any clear trends emerging. The difference in how scaling capacity affects errors might The best we can conclude therefore is that any chip size ef-be due to differences in how larger DIMM capacities are fect is unlikely to dominate error rates given that the trends2 Some bars are omitted, as we do not have data on UEs for are not consistent across various other confounders such asPlatform B and data on CEs for Platform E . age and manufacturer.
  8. 8. 2.5 4 2.5 Platform A Temp high Temp high Platform B 3.5 Temp low Temp lowNormalized monthly CE rate Normalized monthly CE rate Normalized CEs per month 2 Platform C 2 Platform D 3 1.5 2.5 1.5 2 1 1.5 1 1 0.5 0.5 0.5 0 0 1 2 0 −1 0 1 0 −1 0 1 10 10 10 10 10 10 10 10 10 Normalized Temperature Normalized CPU utilization Normalized Allocated Memory Figure 7: The effect of temperature: The left graph shows the normalized monthly rate of experiencing a correctable error as a function of the monthly average temperature, in deciles. The middle and right graph show the monthly rate of experiencing a correctable error as a function of memory usage and CPU utilization, respectively, depending on whether the temperature was high (above median temperature) or low (below median temperature). We observe that when isolating temperature by controlling for utilization, it has much less of an effect. 5.2 Temperature Figure 7 (middle) and (right) we therefore isolate the effects Temperature is considered to (negatively) affect the re- of temperature from the effects of utilization. We divide liability of many hardware components due to the strong the utilization measurements (CPU utilization and allocated physical changes on materials that it causes. In the case memory, respectively) into deciles and report for each decile of memory chips, high temperature is expected to increase the observed error rate when temperature was “high” (above leakage current [2, 8] which in turn leads to a higher likeli- median temperature) or “low” (below median temperature). hood of flipped bits in the memory array. We observe that when controlling for utilization, the effects In the context of large-scale production systems, under- of temperature are significantly smaller. We also repeated standing the exact impact of temperature on system reli- these experiments with higher differences in temperature, ability is important, since cooling is a major cost factor. e.g. by comparing the effect of temperatures above the 9th There is a trade-off to be made between increased cooling decile to temperatures below the 1st decile. In all cases, for costs and increased downtime and maintenance costs due to the same utilization levels the error rates for high versus low higher failure rates. temperature are very similar. Our temperature measurements stem from a temperature sensor on the motherboard of each machine. For each plat- 5.3 Utilization form, the physical location of this sensor varies relative to the position of the DIMMs, hence our temperature measure- The observations in the previous subsection point to sys- ments are only an approximation of the actual temperature tem utilization as a major contributing factor in memory of the DIMMs. error rates. Ideally, we would like to study specifically the To investigate the effect of temperature on memory er- impact of memory utilization (i.e. number of memory ac- rors we turn to Figure 7 (left), which shows the normalized cesses). Unfortunately, obtaining data on memory utiliza- monthly correctable error rate for each platform, as a func- tion requires the use of hardware counters, which our mea- tion of temperature deciles (recall Section 2.4 for the reason surement infrastructure does not collect. Instead, we study of using deciles and the definition of normalized probabili- two signals that we believe provide indirect indication of ties). That is the first data point (x1 , y1 ) shows the monthly memory activity: CPU utilization and memory allocated. correctable error rate y1 , if the temperature is less than the CPU utilization is the load activity on the CPU(s) mea- first temperature decile (temperature x1 ). The second data sured instantaneously as a percentage of total CPU cycles point (x2 , y2 ) shows the correctable error rate y2 , if the tem- used out of the total CPU cycles available and are averaged perature is between the first and second decile (between x1 per machine for each month. and x2 ), and so on. Memory allocated is the total amount of memory marked Figure 7 (left) shows that for all platforms higher temper- as used by the operating system on behalf of processes. It atures are correlated with higher correctable error rates. In is a value in bytes and it changes as the tasks request and fact, for most platforms the correctable error rate increases release memory. The allocated values are averaged per ma- by a factor of 3 or more when moving from the lowest to the chine over each month. highest temperature decile (corresponding to an increase in Figure 8 (left) and (right) show the normalized monthly temperature by around 20C for Platforms B, C and D and rate of correctable errors as a function of CPU utilization an increase by slightly more than 10C for Platform A ). and memory allocated, respectively. We observe clear trends It is not clear whether this correlation indicates a causal of increasing correctable error rates with increasing CPU relationship, i.e. higher temperatures inducing higher error utilization and allocated memory. Averaging across all plat- rates. Higher temperatures might just be a proxy for higher forms, it seems that correctable error rates grow roughly system utilization, i.e. the utilization increases leading inde- logarithmically as a function of utilization levels (based on pendently to higher error rates and higher temperatures. In the roughly linear increase of error rates in the graphs, which have log scales on the X-axis).
  9. 9. 3 3 Platform A Platform A Platform B Platform B Normalized monthly CE rate 2.5 Normalized monthly CE rate 2.5 Platform C Platform C Platform D Platform D 2 2 1.5 1.5 1 1 0.5 0.5 0 −1 0 1 0 −1 0 1 10 10 10 10 10 10 Normalized CPU Utilization Normalized Allocated MemoryFigure 8: The effect of utilization: The normalized monthly CE rate as a function of CPU utilization (left) and memoryallocated (right). 2 2 CPU high Mem high 1.8 CPU low 1.8 Mem low Normalized monthly CE rate Normalized monthly CE rate 1.6 1.6 1.4 1.4 1.2 1.2 1 1 0.8 0.8 0.6 0.6 0.4 0 1 2 0.4 0 1 2 10 10 10 10 10 10 Normalized Temperature Normalized TemperatureFigure 9: Isolating the effect of utilization: The normalized monthly CE rate as a function of CPU utilization (left)and memory allocated (right), while controlling for temperature. One might ask whether utilization is just a proxy for tem- 5.4 Agingperature, where higher utilization leads to higher system Age is one of the most important factors in analyzingtemperatures, which then cause higher error rates. In Fig- the reliability of hardware components, since increased er-ure 9, we therefore isolate the effects of utilization from those ror rates due to early aging/wear-out limit the lifetime of aof temperature. We divide the observed temperature values device. As such, we look at changes in error behavior overinto deciles and report for each range the observed error time for our DRAM population, breaking it down by age,rates when utilization was ”high” or “low”. High utilization platform, technology, correctable and uncorrectable errors.means the utilization (CPU utilization and allocated mem-ory, respectively) is above median and low means the utiliza- 5.4.1 Age and Correctable Errorstion was below median. We observe that even when keeping Figure 10 shows normalized correctable error rates as atemperature fixed and focusing on one particular tempera- function of age for all platforms (left) and for four of the mostture decile, there is still a huge difference in the error rates, common DIMM configurations (platform, manufacturer anddepending on the utilization. For all temperature levels, the capacity). We observe that age clearly affects the correctablecorrectable error rates are by a factor of 2–3 higher for high error rates for all platforms.utilization compared to low utilization. For a more fine-grained view of the effects of aging, we The higher error rate for higher utilization levels might consider the mean cumulative function (MCF) of errors. In-simply be due to a higher detection rate of errors, not an tuitively, the MCF value for a given age x represents theincreased incidence of errors. For Platforms A and B, which expected number of errors a DIMM will have seen by age x.do not employ a memory scrubber, this might be the case. That is, for each age point, we compute the number ofHowever, we note that for Platforms C and D, which do use DIMMs with errors divided by the total number of DIMMsmemory scrubbing, the number of reported soft errors should at risk at that age and add this number to the previousbe the same, independent of utilization levels, since errors running sum, hence the term cumulative. The use of a cu-that are not found by a memory access, will be detected mulative mean function helps visualizing trends, as it allowsby the scrubber. The higher incidence of memory errors at us to plot points at discrete rates. A regular age versus ratehigher utilizations must therefore be due to a different error plot would be very noisy if plotted at such a fine-granularity.mechanism, such as hard errors or errors induced on the The left-most graph in Figure 11 shows the MCF for alldatapath, either in the DIMMs or on the motherboard. DIMMs in our population that were in production in Jan-
  10. 10. 5 4 D−Mfg6−4GB Normalized montly CE rate Platform A D−Mfg6−2GB 3.5 Platform B C−Mfg5−2GB Platform C 4 Normalized monthly CE rate C−Mfg1−1GB 3 Platform D 2.5 3 2 2 1.5 1 1 0.5 0 0 3 5 10 15 20 25 30 35 2 3 5 10 15 20 25 30 35 Age(months) Age (months)Figure 10: The effect of age: The normalized monthly rate of experiencing a CE as a function of age by platform (left)and for four common DIMM configurations (right). We consider only DIMMs manufactured after July 2005, to exclude veryold platforms (due to a rapidly decreasing population).uary 2007 and had a correctable error. We see that the figures, we see a sharp increase in correctable errors at earlycorrectable error rate starts to increase quickly as the pop- ages (3-5 months) and then a subsequent flattening of errorulation ages beyond 10 months up until around 20 months. incidence. This flattening is due to our policy of replacingAfter around 20 months, the correctable error incidence re- DIMMs that experience uncorrectable errors, and hence themains constant (flat slope). incidence of uncorrectable errors at very old ages is very low The flat slope means that the error incidence rates reach a (flat slope in the figures).constant level, implying that older DIMMs continue to have In summary, uncorrectable errors are strongly influencedcorrectable errors (even at an increased pace as shown by by age with slightly different behaviors depending on theFigure 10), but there is not a significant increase in the in- exact demographics of the DIMMs (platform, manufacturer,cidence of correctable error for other DIMMs. Interestingly, DIMM technology). Our replacement policy enforces thethis may indicate that older DIMMs that did not have cor- survival of the fittest.rectable errors in the past, possibly will not develop themlater on. Since looking at the MCF for the entire population might 6. RELATED WORKconfound many other factors, such as platform and DRAM Much work has been done in understanding the behav-technology, we isolate the aging effect by focusing on one in- ior of DRAM in the laboratory. One of the earliest pub-dividual platform. The second graph from the left in Figure lished work comes from May and Woods [11] and explains11 shows the MCF for correctable errors for Platform C , the physical mechanisms in which alpha-particles (presum-which uses only DDR1 RAM. We see a pattern very similar ably from cosmic rays) cause soft errors in DRAM. Sinceto that for the entire population. While not shown, due to then, other studies have shown that radiation and errorslack of space, the shape of the MCF is similar for all other happens at ground level [16], how soft error rates vary withplatforms. The only difference between platforms is the age altitude and shielding [23], and how device technology andwhen the MCF begins to steepen. scaling [3, 9] impact reliability of DRAM components. Bau- We also note the lack of infant mortality for almost all mann [3] shows that per-bit soft-error rates are going downpopulations: none of the MCF figures shows a steep incline with new generations, but that the reliability of the system-near very low ages. We attribute this behavior to the weed- level memory ensemble has remained fairly constant.ing out of bad DIMMs that happens during the burn-in of All the above work differs from ours in that it is limited toDIMMs prior to putting them into production. laboratory studies and focused on only soft errors. Very few In summary, our results indicate that age severely affects studies have examined DRAM errors in the field, in largecorrectable error rates: one should expect an increasing in- populations. One such study is the work by Li et al. whichcidence of errors as DIMMs get older, but only up to a cer- reports soft-error rates for clusters of up to 300 machines.tain point, when the incidence becomes almost constant (few Our work differs from Li’s in the scale of the DIMM-days ob-DIMMs start to have correctable errors at very old ages). served by several orders of magnitude. Moreover, our workThe age when errors first start to increase and the steepness reports on uncorrectable as well as correctable errors, andof the increase vary per platform, manufacturer and DRAM includes analysis of covariates commonly thought to be cor-technology, but is generally in the 10–18 month range. related with memory errors, such as age, temperature, and workload intensity.5.4.2 Age and Uncorrectable Errors We observe much higher error rates than previous work. We now turn to uncorrectable errors and aging effects. Li et al cite error rates in the 200–5000 FIT per Mbit rangeThe two right-most graphs in Figure 11 show the mean cu- from previous lab studies, and themselves found error ratesmulative function for uncorrectable errors for the entire pop- of < 1 FIT per Mbit. In comparison, we observe meanulation of DIMMs that were in production in January 2007, correctable error rates of 2000–6000 per GB per year, whichand for all DIMMs in Platform C , respectively. In these translate to 25,000–75,000 FIT per Mbit. Furthermore, for