International Journal of Electrical Engineering and Technology (IJEET), ISSN 0976 – 6545(Print), ISSN 0976 – 6553(Online) Volume 5, Issue 5, May (2014), pp. 57-73 © IAEME

INTERMITTENT FAILURES IN HARDWARE AND SOFTWARE

Dr. Michael Pecht, Anwar Mohammed
CALCE Electronic Products and Systems Center, University of Maryland, College Park, MD 20742, USA
Flextronics, 847 Gibraltar Drive, Milpitas, CA 95035, USA

ABSTRACT

Intermittent failures are a major concern in electronic systems because they are unpredictable and non-repeatable. They can be very expensive for companies, damage a company's reputation, or cause catastrophic damage in safety-critical systems such as nuclear plants. This paper discusses, at both the hardware and software level, the causes of intermittent failures and the methodology for diagnosing them. Mitigation strategies that help reduce the occurrence of these failures are discussed, and new, emerging technologies designed to minimize intermittent failures are reviewed. The paper concludes with recommendations designed to minimize the occurrence of intermittent failures.

1. INTRODUCTION

Intermittent failures are sporadic failures that are not easily repeatable. According to IEEE, an intermittent failure (IF) is the failure of an item for a limited period of time, following which the item recovers its ability to perform its required function without being subjected to any external corrective action [1]. When a product can no longer perform its designed function over the intended time frame, it is considered to have failed. When the product loses some of its function or performance characteristics for a limited time but subsequently recovers, it has experienced an intermittent failure. Intermittent failures are hard to replicate because of their erratic behavior.
Intermittent failures are often called “ghost failures” because they come and go and are hard to reproduce on the bench [2]. It is therefore more difficult to conduct failure analysis for intermittent failures, understand their root causes, and isolate their failure sites than it is for permanent failures. An intermittent failure is not necessarily repeatable; however, it often is [3]. An intermittent failure may lead to permanent failures in later stages of the life cycle.

During the inspection process in manufacturing, intermittent failures may be reported as rejected parts with no failure found (NFF). This means that a failure was observed in the system, but when the device was retested, a failure mode could not be identified or the failure could not be duplicated. This outcome is also known as trouble not identified (TNI), no trouble found (NTF), cannot duplicate (CND), or retest OK (RTOK) [3]. These failures are hard to identify or replicate, even though they are recurrent.

Many different factors can cause intermittent failures, such as process variations like a change in humidity level, manufacturing residuals like solder fluxes and epoxy bleed-out, radiation, vibration, wear-out leading to opens, and voltage and temperature fluctuations [3]. Such transient causes, seen in both hardware and software, are hard to reproduce and can lead to negative consequences such as mission aborts and flight and train delays or cancellations. They can increase system downtime and decrease system availability. A reduction in IFs will increase system availability more than a reduction in failure rate [4]. An intermittent failure can lead to unintended consequences such as increased operating cost, higher downtime, and a perception of lower quality, especially in sensitive industries such as aerospace. A system that has failed previous testing and then suddenly starts passing, showing no signs of failure, can erode trust in the testing methodology [5] and can cause an IF to be identified as a false alarm even though a real failure exists in the system.

Intermittent failures inflict a heavy toll on companies.
During retesting, when a failed part cannot be validated as a failed part, extra testing must be conducted to identify the failure. These extra tests impose additional costs. Because IFs cannot be replicated consistently, their retest and repair costs are higher than those for permanent failures: an effective repair cannot be made until the failure is validated. Maintenance can consume time and labor in unsuccessful attempts to identify a failure, sometimes resulting in the blind replacement of parts that are suspected of having a defect (without any specific problem being found), which increases the cost of inventory. For example, in 2001, fighter plane customers spent $10 million to replace parts that were tested as intermittent failures at the shop level [6]. In another case, in the 1980s, the thick film integrated (TFI) ignition modules of an automotive company were afflicted by intermittent failures, leading to a lawsuit settlement by the company [3]. A study carried out in 2005 found that IFs account for about 63% of the mobile phones returned to the manufacturer, costing the industry $4.5 billion per year [7]. Kimseng et al. [8] studied intermittent failures in the digital electronic cruise control modules made by a manufacturer for various automobiles and found that 96% of the modules returned to the manufacturer passed the manufacturer's bench tests. Kimseng concluded that the bench tests were not representative of the actual automotive environment and that the testing was not appropriate for assessing the original failure.

A holistic approach is helpful for understanding and eliminating intermittent failures. This approach would include better diagnostic capability and efficient mitigation techniques. Therefore, this paper discusses both hardware and software intermittent failures, including their causes, diagnosis, and mitigation methodologies.
Emerging developments in this technology space are also reviewed to help formulate better solutions.

2. HARDWARE INTERMITTENT FAILURES

Transient or temporary hardware malfunctions can cause intermittent failures in electronic devices. This section describes common hardware components that experience intermittent failures and their failure mechanisms. The diagnosis and mitigation of hardware intermittent failures are also examined, and some recent technologies designed to overcome these problems are covered.

2.1 Causes of Failure

Unlike permanent failures, which have persistent causes, the cause of an intermittent failure may no longer exist during testing because of changes in the working environment. Hardware intermittent failures can have different root causes, such as mismatched thermal expansion, vibration, corrosion, and electromigration. In this section, some key intermittent failure causes for hardware components are investigated.

2.1.1 Wire Bond and Connector Failures

Wire bonds and connectors cause a high percentage of hardware intermittent failures [9]. Some common causes include coefficient of thermal expansion (CTE) mismatch, component wear-out caused by age or repeated usage, and corrosion. For example, the CTE mismatch between the wire bonds and the copper bonding pads on a PCB can cause intermittent opens and shorts during temperature excursions. In another example, the contact resistance of a new, tin-plated contact may be a few milliohms, but after a thousand contact cycles, the resistance can become as high as several ohms. With more usage, intermittent failures that disappear in the next contact cycle may also occur [9]. Thermal and mechanical vibrations in the connectors can lead to fretting corrosion, causing the contact resistance to increase and thus inducing intermittent connection failures [10, 11]. It has been identified [3] that loose PCB interconnectors and aging connectors and components are some of the common causes of electronic system failure. Gibson et al. [12] concluded that over 50% of all electronic failures are triggered by interconnector-related problems. Other common causes are vibration, stress relaxation, and the movement of the wiring harness generated by the magnetic field [9].
The following paragraphs describe some of these failures in more detail.

A wire bond related intermittent failure occurs when a poorly connected wire bond temporarily dislodges because of thermal expansion at temperatures above room temperature. The wire bond may then return to its normal state once the thermal stress caused by the CTE mismatch is removed. The failure mode in such cases is usually an open circuit. On the other hand, a loose piece of conductive material floating in the package may connect with a wire bond on another part of the circuit, resulting in a short circuit. When this floating piece moves away from the failure site, because of vibration for example, the failure is no longer observed [13]. Loose materials can be detected by using appropriate screening methods, including X-ray, vibration, and acoustic testing. Screening and testing methodologies are designed based on the potential causes and effects of the short circuit on component performance [14].

Intermittent wire bond failures may also be induced by the molding process, which can damage wire bonds. This damage is not easily detectable and is attributed to the weakening and lifting of the gold bond during the molding process on the side of the package opposite to where the injection molding occurred [15]. Proper molding process control parameters and effective detection techniques would minimize such intermittent failures.

In a study of military aircraft, Sorensen [16] noted that 50% of all failures were intermittent failures and that 80% of those were related to solder joints and connector pins. In the aircraft industry, aging devices lead to IFs, quite often as a prelude to permanent failures. Many IFs are the result of the gradual degradation of a component or system. They may initially appear as small noise fluctuations but could lead to permanent failures. Filho et al. [17] point out that, for continuous monitoring methods, intermittent failures can appear long before open circuits are detected.

Corrosion can cause electrical degradation of the contact, which is initiated by a galvanic reaction between two metals within the electrical circuit. Corrosion on electronic parts can result in either of two scenarios: short circuits or an increase in the electrical resistance of the components. When corrosion occurs, it is rarely uniform on the affected surfaces, which may result in the appearance of an intermittent failure. With respect to the contacts of electronic parts, intermittent failures occur because of frequent connections and disconnections, as seen in the corrosion of copper connectors that have layers of nickel and gold to protect against wear-out. In harsh environments (high relative humidity and the presence of H2S), formation of the corrosion product Cu2S causes intermittent failure behavior [18]. With vibration and temperature fluctuations, this conductive path can be connected and disconnected, resulting in intermittent failures.

Intermittent failures due to corrosion generally occur in the early stages (the first 50%) of the product life cycle. Intermittent behavior aggravated by CTE mismatch or vibration generally appears during the later stages (the last 50%) of the life cycle of a product [19]. For example, electrochemical migration, which occurs between anodes and cathodes (and can be a reason behind IF reports), is a corrosion-related failure mechanism that forms dendrites between conductors at opposite bias and eventually results in short circuits. The driving forces for this corrosion process are the voltage bias, contaminated surfaces (lack of environmental control), and the fact that the commonly used metals (Sn, Pb, Cu, and Ag) are susceptible to corrosion. Since this process is not time induced, the intermittent failures are manifested early in the product life cycle.

Tin whiskering has been identified [20] as another common cause of intermittent failures. A PCB with a pure tin finish, having non-compressive internal stress, is known to create tin dendrites that can cause short failures. However, at elevated temperatures the dendrites may melt away and repair the short.

2.1.2 Digital Integrated Circuit Failures

Integrated circuit devices are being scaled down rapidly. This reduction in size makes digital integrated circuits more susceptible to permanent and intermittent failure behavior.
Intermittent failure modes in digital logic integrated circuits (ICs) have been categorized as timing violations, stuck-at-zero or stuck-at-one failures, intermittent shorts or opens, and electromigration failures [21].

An increase in the resistance of interconnects due to thermal or mechanical loads, electromigration, or material diffusion increases the signal propagation time and leads to a timing violation [22]. These failures are manifested because of thermal and electrical loads and signal frequency variations. Kothawade et al. [23] found that timing violations in a processor can be attributed to multiple factors, such as process variations, negative bias temperature instability (NBTI), temperature fluctuations, hot carrier injection (HCI), and voltage fluctuations. Since timing violations can be caused by many factors, it is challenging for processor designers to design fault tolerance mechanisms. Time-dependent HCI failures are generally permanent in nature. NBTI failures caused by AC stress tend to be intermittent, whereas failures caused by static stress usually manifest as permanent failures.

Within an integrated circuit, the thin oxide layers separating adjacent metal traces can also lead to intermittent shorts or opens, caused by traces coming into contact with each other or losing contact. Constantinescu [21] also studied the causes of intermittent behavior in ICs. The study identified voltage fluctuations across ICs as the cause of oxide layer breakdown. As ICs have become smaller, the thickness of the oxide layers has decreased, which increases the risk of oxide layer breakdown. When the oxide layer breaks down, it creates a conducting path, thereby increasing the leakage current. The introduction of high-k dielectrics reduces the rate of oxide breakdown, enabling the use of thinner dielectrics. However, this can also lead to timing violation failures.
Before a complete dielectric breakdown takes place and leads to a permanent failure, there is a stage known as dielectric soft breakdown. During this stage a device may exhibit intermittent failures.

Intermittent stuck-at-zero or stuck-at-one failures occur in storage elements. Digital circuits have two states, 0 or 1, and a fault occurs when a particular signal is tied to either 1 or 0. This produces a logical error. Pan et al. [24] developed a metric for stuck-at-zero/stuck-at-one faults that characterizes the vulnerability of a microprocessor to intermittent failures based on its structure. Experimental results show that the susceptibility varies significantly across different structures and that the vulnerability of the reorder buffer is much higher than that of the register file. These storage element intermittent failures have an active time and an inactive time. The active time is the time during which the failure is in process and causes unexpected behavior, while the inactive time is the time when the failure does not affect performance. The length of the active time determines how significantly the failure affects the performance of a microprocessor.

ICs are susceptible to intermittent failures due to electromigration. Electromigration is the movement of metal atoms caused by the flow of electrons through the metal. This movement of atoms can lead to an open or short circuit failure. In both cases the failures appear initially as intermittent failures and end up as permanent failures. As IC technology scales down, wire widths are reduced. When current flow is not scaled down proportionally, the ICs become vulnerable to electromigration [24].

2.1.3 Component Connection Failures

Another area of concern for intermittent failures is component pins, whether on a multi-pin IC, a resistor network, or a simple two-lead capacitor. Intermittent failures can be caused by imperfections in the solder process or by a fractured lead whose two broken ends intermittently make and break connection. Once the pin is broken, the failure may show up during thermal cycling or vibration testing. Resolution of these types of failure includes better methods of attaching larger components, such as a resistor network or large capacitor, to the circuit card. Studies [25] have shown that intrinsic design flaws and substandard manufacturing processes, such as soldering, play a big role in creating intermittent failures.

2.2 Diagnosis

Failure Modes, Mechanisms, and Effects Analysis (FMMEA) can be used to detect intermittent and permanent failures in hardware. Mathew et al. [26] have proposed the following methodology.

Figure 1: FMMEA Methodology [26]

The first two steps identified in Figure 1 are ‘define system and identify elements and functions to be analyzed’ and ‘identify potential failure modes.’ They are more challenging for intermittent failures than for permanent failures. In a complex system consisting of several intermeshed subsystems, it is difficult to determine which subsystem has the intermittent failure. A failure in one of the subsystems could affect another subsystem and result in its failure. Finding the subsystem with the initial failure is challenging, since intermittent failures are not always detected when the system is tested for faults. Identifying the correct failure modes is also not easy because of the erratic nature of intermittent failures; this requires extra work.

Kirkland [27] describes a variety of methods to detect failure modes for intermittent failures in electronic devices, including signal looping, pattern looping, signal stepping, frequency deviation, pattern adjustment in critical areas, signal strength variation, current path duplication, measuring capacitance variations, Vcc adjustments, resistive or impedance rebounce, temperature change application, and noise dissimilarity testing. Using these methods can help identify failure modes such as increased gate delays, degraded signals, increased leakage, and high-frequency failures. A minimum set of conditions (such as a voltage drop threshold and temperature variations) needs to be present to make the failure mode observable.

Another systematic approach to analyzing intermittent failures is to employ a cause and effect diagram, also known as a fishbone diagram. An example is shown in the Ishikawa fishbone diagram in Figure 2 [28].
Figure 2: Fishbone diagram for intermittent failures in hardware and software [28]

A cause and effect diagram defines the key failure (also known as the key effect), investigates the possible causes of each of the effects, and offers a list of all the possible causes leading to the failure. It is an effective method for analyzing failures in complex systems. For example, intermittent failures in plastic ball grid array packages have been analyzed using this method, which narrowed down the possible causes of failure and finally identified solder joint failure as the main cause of the intermittent failure [28].

Steadman et al. [29] developed a test methodology for intermittent faults in aircraft. This method subjected an avionics system to thermal and vibrational loads while simultaneously monitoring the system for faulty components, thus reducing the occurrence of intermittent failures. An improved approach should include online monitoring of critical avionics components while the system is in operation. This would reduce the overhead cost incurred by offline monitoring that uses load profiles that do not accurately replicate the operating conditions. Monitoring of current to detect intermittent failures has been recommended [30] because normal circuits carry a significantly different current load than damaged circuits.

In 1978, Savir [31] presented a model for detecting intermittent failures in a sequential circuit, which is a type of circuit with memory logic that is found in most digital systems. He recommended using both deterministic (non-random) and random test procedures to optimize the probability of IF detection. The intermittent failures are divided into two major categories: stationary failures (such as loose connections) and transient failures (such as failures induced by electromagnetic interference).
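
As a rough illustration of why repeated testing raises the chance of catching an intermittent fault, the sketch below computes the probability of observing the fault at least once in a series of test runs. It is not Savir's model [31]: it assumes, for illustration only, that the fault is active independently with probability `p_active` during each run and that a run always detects an active fault. The function names are hypothetical.

```python
import math


def detection_probability(p_active: float, repetitions: int) -> float:
    """Probability that at least one of `repetitions` runs catches the fault.

    Assumption (not from the paper): each run sees the fault active
    independently with probability `p_active`, and an active fault is
    always detected.
    """
    if not 0.0 <= p_active <= 1.0:
        raise ValueError("p_active must be in [0, 1]")
    if repetitions < 0:
        raise ValueError("repetitions must be non-negative")
    # Complement of "the fault is missed in every single run".
    return 1.0 - (1.0 - p_active) ** repetitions


def repetitions_needed(p_active: float, target: float) -> int:
    """Smallest number of runs whose detection probability reaches `target`."""
    if not 0.0 < p_active < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_active and target must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= target for the integer n.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_active))
```

Under these assumptions, a fault that is active in only 10% of runs needs a few dozen repetitions before detection becomes likely, which is consistent with the observation that a single bench retest often reports no failure found.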
In sequential circuits, the first manifestation of an active fault may induce the circuit to enter an incorrect state without producing an immediate output error. This state change may generate an output error later, when the fault has become inactive. The optimal value of the detection probability is obtained by developing a graph of all the input sequences and determining which sequences lead to intermittent failures.

A minimum set of conditions must be present for an intermittent failure to manifest and be detected [32]. The challenge is in determining the environmental conditions under which the failure occurred and re-creating them. Harsh ambient conditions, such as high humidity and the presence of halides, can initiate unintended conductive pathways on insulating surfaces. Such a pathway could eventually become a permanent failure, but it could manifest itself in the earlier stages as an intermittent failure. Figure 3 offers a brief list of potential causes of hardware-related intermittent failures.

Figure 3: List of Causes for Intermittent Failure

  • Component shifting during solder reflow
  • Contamination (including oxidation at test sites)
  • Chemical degradation (including creep corrosion, fretting, whiskers, electromigration, etc.)
  • Cracked substrates
  • Damaged circuits
  • ESD induced
  • Floating leads (or other conductive pieces)
  • Ionizing radiation in semiconductors
  • Insulation oxide layer breakdown
  • Irregular or altered current path
  • Loose connections (wire bonds, connectors, etc.)
  • Magnetic field variations
  • Materials degradation (aging, chemical, stress, etc.)
  • Overstress (for example, high voltage on a capacitor dielectric)
  • Partial delamination
  • PCB (warpage, via cracking, black pad, etc.)
  • Poor wire bonding (on high-k dielectric, etc.)
  • Temperature sensitivity (CTE mismatch, etc.)
  • Vibration induced
  • Voltage overstress
  • Weak solder joints (varying with temperature/stress)
  • Weak structural integrity
  • Wire sweep during molding

2.3 Mitigation

Integrated circuits try to compensate for breakdowns by having failure tolerance built into them. Failure tolerance masks the occurrence of failures from the end user (it prevents end users from experiencing performance drops). For example, most processors choose a maximum clock rate after guard-banding against unpredictable interactions and variations in the actual clock rate. ICs also have chip-level failure tolerance, such as error correcting codes, self-checking circuits, and hardware-implemented checkpointing and retries [22].

Three main methodologies to mitigate intermittent behavior in ICs are dynamic instruction delaying, core frequency scaling, and thread migration. When the processor takes more than the expected time to execute a process, time delays and timing violations occur. This fault may be avoided by using techniques such as dynamic instruction delaying, a type of algorithm that calculates scheduling priorities during the execution of the system. The objective is to respond dynamically to changing conditions and form a self-sustained, optimized configuration. Another approach to mitigating delay is core frequency scaling, which scales the CPU down to a lower frequency when less performance is needed and up to a higher frequency when more is needed. Thread migration is another technique used to overcome intermittent failures. A thread is an ordered set of instructions that tells a computer exactly what to do. When a specific thread encounters failures, the content of the thread within the faulty core is transferred to another thread within an idle core, where the problem is addressed and solved.

The intermittent failures in some avionic systems can be caused by failures in solder joints and multi-layer ribbon cables [29]. These failures may be initiated by variations in operating conditions, such as temperature or current, and may disappear due to re-melting of the solder, closing of the crack, or filling of the void caused by thermal fluctuations. Developing robust soldering processes that include appropriate material selection would mitigate soldering-related intermittent failures.
The plethora of solder choices, which include leaded, lead-free, low-temperature, low-silver, and soft solders, makes it even more critical to develop appropriate processes for solder attach and solder reflow. Since there is no known effective method to mitigate solder joint and multi-layer ribbon cable failures, more research on improving the robustness and consistency of solder joints is necessary, and self-repairing wire bonds should also be developed.

2.4 New Technology Trends

Recent technological developments aimed at solving hardware intermittent failures offer insight into future solutions. The industry is addressing the IF problem by developing innovative approaches, and the focus is shifting from failure detection to failure avoidance.

Intermittent failures on a silicon chip, such as those caused by time-dependent dielectric breakdown (TDDB) and electromigration (EM), result from gate wear-out due to extensive usage. Gate usage can be monitored in the form of gate toggles [33]. Researchers [34] discovered that vulnerability to intermittent failure could be monitored by tracking the number of gate toggles. They studied four OpenSPARC RTL modules (the IFU, EXU, FFU, and LSU) and tracked how each instruction moved through these modules while toggling different gates. They discovered that certain sub-modules, such as exu-alu within the EXU module and lsu-dcdp within the load store unit, display a relatively high amount of toggling regardless of the type of instruction being executed. This revealed that there could be groups of modules and sub-modules with a higher susceptibility to wear-out failures, resulting in intermittent failures. Higher vulnerability by itself is not a good predictor of failure rate, but when it is combined with operating conditions such as temperature, the degradation of a gate structure can be forecast.
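
The idea of tracking toggles can be illustrated in software by counting how many bits flip between successive values of a signal or register. The sketch below is only a loose analogue of the gate-level instrumentation described in [33, 34]: the function names, the 32-bit default width, and the use of sampled integer values are all illustrative assumptions.

```python
from typing import List, Sequence


def count_toggles(prev: int, curr: int, width: int = 32) -> int:
    """Number of bit positions that differ between two samples."""
    mask = (1 << width) - 1
    # XOR leaves a 1 exactly where the two samples disagree.
    return bin((prev ^ curr) & mask).count("1")


def per_bit_toggles(samples: Sequence[int], width: int = 32) -> List[int]:
    """Accumulate toggle counts per bit position over a sample trace.

    Index 0 of the result is the least significant bit.
    """
    mask = (1 << width) - 1
    counts = [0] * width
    for prev, curr in zip(samples, samples[1:]):
        diff = (prev ^ curr) & mask
        for bit in range(width):
            if (diff >> bit) & 1:
                counts[bit] += 1
    return counts
```

Bit positions (or, by analogy, sub-modules) with consistently high counts would be the candidates for closer wear-out monitoring; turning a count into a degradation forecast would still require operating data such as temperature, as noted above.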
Preemptive steps could also be taken during the design stage to avoid the occurrence of such intermittent failures.

The intermittent loss of connection between connectors is a very common failure in electrical systems [35]. In spite of extra caution during connector installation, this remains a problem in avionics and military equipment. In 2012 an approach was suggested [36] for an online methodology to detect intermittent failures caused by intermittent connections. The idea is based on the principle, derived from Faraday's law of induction, that any sudden flux change should create a large voltage, manifesting as an arc that propagates along the circuitry as a traveling wave. The arc is defined as the electrical discharge initiated by improper cable connections. Intermittent failures caused by loose connector connections can be detected by monitoring for the presence of this arc. The research describes an online monitoring methodology that detects the arc in order to flag connector disconnection failures.

Advances in semiconductor scaling technology have revealed greater exposure and vulnerability not only to single-event upsets (SEUs) in integrated memories but also to single-event transients (SETs) in high-speed logic [37]. SEUs are induced by environmental causes such as cosmic radiation or alpha particle radiation. They initiate current pulses at random times and locations in a digital circuit. SETs are caused by transient charge displacements, which generate logic errors in subsequent circuits. Both SEUs and SETs are responsible for creating intermittent failures. This problem is getting worse because of industry demand for semiconductor scaling. An estimation methodology for monitoring SEUs and SETs in combinational circuits using CMOS technology has been proposed [38]. Sources of alpha particle contamination include some packaging materials, such as the filler materials used in molding compounds, and the lead in leaded (non-lead-free) solders. SEU problems initiated by alpha particles have been essentially solved by the industry, but cosmic rays still pose significant SEU problems [28].

A paper published in 2012 by Pan et al. [39] strives to address the CMOS technology scaling problem from a different perspective.
The paper proposes a quantitative characterization of the vulnerability of the microprocessor structure to intermittent failures. This measure is called the intermittent vulnerability factor (IVF), and it is the probability that an intermittent fault in the microprocessor structure will manifest as an externally visible failure. Their research revealed that the intermittent stuck-at-one fault model has the most serious impact on program execution. The IVF is calculated after listing the causes of the intermittent failures, classifying them into different fault models, and setting parameters to determine when an intermittent fault will result in a visible error. This information is used to develop IVF computational algorithms for different intermittent fault models within a processor. The IVF data can then be used to improve microprocessor quality, reliability, and durability (QRD) through proper interventions during the design stage. The IVF could also be used for intermittent fault detection and error recovery.

In a paper published in 2012, Correcher et al. [40] introduce the concept of modeling intermittent failure dynamics. They propose two methodologies for characterizing the dynamics: a probabilistic model and a temporal model. The probabilistic model allows the intermittent failure probability to be computed at any time; however, it needs historical data, which may not always be available. The temporal model is more practical, and it offers a measurement of failure density. Research shows that the duration and frequency of intermittent failures increase with time, and the failure density and pseudo-period can help in predicting them. The pseudo-period is the average time difference between failures, normalized by the number of failures. It is related to the mean time between failures (MTBF) and is used to model the reliability of repairable systems.
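
To make the two temporal quantities concrete, the sketch below computes them from a log of failure episodes. The definitions used here are working assumptions for illustration and may differ from those in [40]: failure density is taken as the fraction of the observation window during which the fault was active, and the pseudo-period as the mean time between the onsets of successive failures. The function names and the `(start, end)` interval format are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of one failure episode


def failure_density(intervals: List[Interval], window: float) -> float:
    """Fraction of the observation window spent in the failed state.

    Assumes the episodes do not overlap and lie inside the window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    active_time = sum(end - start for start, end in intervals)
    return active_time / window


def pseudo_period(intervals: List[Interval]) -> float:
    """Mean time between the onsets of successive failure episodes."""
    starts = sorted(start for start, _ in intervals)
    if len(starts) < 2:
        raise ValueError("at least two failure episodes are required")
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)
```

Computing both values over successive windows would show the trend described above: a rising failure density and a shrinking pseudo-period as the component degrades.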
The pseudo-period can be used to predict the number of operations before replacement in determining whether the model should follow a linear or an exponential fit. A limitation of this approach is the difficulty of deriving optimal values for the failure density and pseudo-period.

Recent research on component residual life is helpful for predictive maintenance systems. The approach focuses not on avoiding intermittent failures but on predicting when their negative effects are no longer tolerable. A stochastic model has been proposed [41] to predict the residual life of live components of a coherent system. A coherent system is a system where, when a failed component is replaced by a new component, the system does not fail. The conditional reliability of components within a working system exhibiting an increasing failure rate has been shown to decrease with time. Also, when two coherent working systems comprising similar
components have ordered hazard rates, the corresponding residual lives are also stochastically ordered.

New approaches from de Kleer et al. [42] offer a framework for diagnosing intermittent failures in continuously operating machinery in which objects are transferred from one module to the next, as in a copier that moves paper from one site in the machine to another.

Research has shown [43, 44, 45] that by leveraging in-situ sensors, physics-of-failure models, and life cycle monitoring, one can predict the occurrence of failure and measure degradation and remaining useful life. Such information could become a building block for developing ways to troubleshoot intermittent failures.

3. SOFTWARE INTERMITTENT FAILURES

Software intermittent failures are generated when certain conditions occur simultaneously. For example, if the available memory and CPU processing power are both below a certain threshold because of other applications running on a computer, a program can exhibit intermittent failures due to insufficient resources. Software intermittent failures can also occur when two or more processes or threads run simultaneously and “collide”. When this happens, the computer can end up in a lock-up condition in which the software has no clear exit point, which may appear to the user as a “frozen screen”. These potential collisions may not be obvious when the code is being written for the many different subroutine modules used in the computer.
An example of such a collision involves a bank ATM: a customer may dip their ATM card to open a session at the same moment that branch personnel open the rear safe door of the ATM (out of view of the customer). The resulting condition causes the computer to “freeze up” with the screen stuck in one view, making the ATM non-responsive to the customer. Software may also contain bugs and exhibit intermittent failure whenever a user encounters the buggy parts of the program.

In the next sections, the causes of software intermittent behavior are investigated, and then the methods for identification and mitigation of these failures are described. Some recent research in this area is also briefly discussed.

3.1 Causes

Even though software intermittent failures occur in most software-based systems, the end user may not always experience a drop in performance. The ability to perceive a failure is known as observability of faults. The observability of software intermittent failures is affected by three factors: processor speed, memory capacity, and processor load. A low processor speed increases the likelihood that intermittent failures occur, whereas with a high processor speed, intermittent failures may be observed less frequently. A high memory capacity reduces the observability of software intermittent failures, whereas an increase in the processor load could increase the occurrence of intermittent failures. To reduce the frequency of intermittent behavior, both these factors and the fault causes of the intermittent behavior must be addressed.

Gracia et al. [46] classify the causes of software-related intermittent failures as timing failures, errors in memory, unhandled exceptions, errors in disks, and concurrency-related failures. Timing failures occur when process executions are delayed during processing or when the sequence of their execution is disturbed.
For example, because process executions are time-sensitive, the timing of parallel processes running simultaneously can experience a delay if one of the processes does not get completed within the expected time. Memory leaks and memory errors occur because of improper memory allocation or de-allocation. This can happen when the memory footprint, which is
the amount of main memory a program uses or references, becomes very high. This may be caused by prolonged memory usage and can result in intermittent freezes and crashes. Software failures because of unhandled exceptions happen when an unexpected error occurs during execution and this error is not handled by the software. For example, when the software tries to divide a number by zero, an error is generated. If this error is not handled, it could lead to an intermittent failure. Disk error failures are software intermittent failures resulting from physical errors in the disk drives. Concurrency-related failures occur when concurrent tasks are being executed, leading to heavy usage of the system.

3.2 Diagnosis

In software, many different configurations are possible. It is difficult, if not impossible, to test a product under all of these configurations, and intermittent failures can occur on configurations which have not been fully tested. While testing for intermittent behavior, the interaction between the hardware and software needs to be considered, because the hardware configuration can influence the frequency and length of intermittent software failures. Syed et al. [47] observed that software testing yields a different frequency of intermittent failures depending on the hardware configuration. For example, parameters such as processor speed, memory, hard drive capacity, and processor load led to a variation in the number of intermittent failures observed. Wei et al. [48] developed a test methodology that injects faults at the hardware architecture level to understand the effect of hardware intermittent failures on software failures. The authors discovered that different sites of the processor architecture affected the software execution differently.
They observed that the impact of a hardware fault on software depends on the origination site and length of the hardware fault.

For the detection of intermittent software failures, five techniques [47] are used. The first technique is known as deterministic replay debugging (DRB). It is the ability to replay precisely the same set of instructions that led up to a software failure. Essentially, the engineer records all instructions up to the point where the system crashes and then replays that recording to determine the root causes of the failure. It is used for bug detection, fault tolerance studies, and intrusion analysis [47]. It is effective in debugging issues in multithreaded and distributed applications.

The second technique is called fuzz testing (FT). It feeds the system random, invalid, or unexpected data and observes how the system reacts. Fuzz testing is generally used for detecting failures related to corrupted data, memory leaks, software crashes, and assertion failures [47]. FT is also used to enhance software security.

The third commonly used technique is termed high volume test automation (HVTA). In this approach the software automatically generates, executes, and evaluates a large number of test cases to detect failures. The high volume of automatically generated tests offers a higher probability of detecting failures. HVTA techniques are generally used in detecting failures such as buffer overruns, stack overflows, resource exhaustion, and timing-related errors.

The fourth failure detection technique is load testing, which includes tests such as stress testing (testing at the operating condition limits until the system breaks) and volume testing (operating on very large tasks). Load testing exerts a demand on a system or device while its response is monitored. It assists in determining the maximum operating capacity and identifying the bottlenecks and weak links in a system.
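To make the fuzz testing technique described above concrete, here is a minimal Python sketch. The target function `parse_record` and its two latent defects are hypothetical, invented only for this illustration; a real fuzzing campaign would normally use a dedicated tool rather than a hand-written loop.

```python
import random


def parse_record(data: bytes) -> float:
    """Hypothetical target: average payload value per declared field."""
    field_count = data[0]               # IndexError on empty input
    payload = data[1:]
    return sum(payload) / field_count   # ZeroDivisionError if field_count is 0


def fuzz(target, trials=500, max_len=8, seed=0):
    """Feed random byte strings to `target` and record unhandled exceptions."""
    rng = random.Random(seed)           # fixed seed so a failing run can be replayed
    failures = []
    for _ in range(trials):
        length = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            target(data)
        except Exception as exc:
            # Keep the triggering input so the failure can be reproduced later.
            failures.append((data, type(exc).__name__))
    return failures


failures = fuzz(parse_record)
kinds = {name for _, name in failures}
```

Recording the triggering input together with the exception type is the design choice that matters here: it turns a failure that would look intermittent in the field into one that can be replayed on demand.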
The last technique is called disturbance testing (DT). In this case, the normal operation of the system is disrupted by introducing physical failures, for example by unplugging the power cord. This technique is used for testing the fault tolerance and the overall quality of a system.

3.3 Mitigation

The aim of fault mitigation is to prevent unexpected outputs and control errors. Anderson et al. [49] discussed the phases that constitute fault mitigation: error detection, damage assessment, and
error recovery. Error detection is used to identify the source of intermittent faults, while damage assessment determines the extent of disruption and losses suffered by the system. Once the nature of the fault is clearly identified, the next phase, error recovery, mitigates these faults. This stage minimizes the negative effects experienced by the end user.

There are three techniques for error recovery: recovery blocks, N-version software, and self-checking software [50]. Recovery blocks were originally developed by Randell [51] to prevent faults in software components from affecting functionality at the system level. In this approach, results from sequences in a software component are verified by adjudicator software. Each of the outputs of the software component needs to pass an acceptance test by the adjudicator.

N-version programming (NVP) is also known as multi-version programming. In this method, multiple versions of functionally equivalent software are created independently from the same original specification. This assumes that independently generated software will have a sharply reduced probability of containing the same software faults. Statistical techniques are employed to determine the most common response among these multiple versions, and measures are undertaken to mitigate the responses. N-version software thus combines the advantages of redundancy (multiple software versions) with statistical techniques [52]. Even though the NVP approach is commonly used in software developed for electronic voting and switching trains, it is not free of controversy. Some critics do not agree that independently developed software versions will reduce common errors.
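The two recovery techniques described so far can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the schemes as specified in [51] or [52]: the integer square root task, the deliberately faulty `buggy_isqrt`, and the strict-majority voting rule are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order; return the first result the adjudicator accepts."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue                    # a crash counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version_vote(versions, x):
    """Run all versions and return the strict-majority answer."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass                        # a crashed version simply casts no vote
    if not results:
        raise RuntimeError("every version failed")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


def buggy_isqrt(n):
    """Hypothetical faulty version, wrong for most inputs."""
    return n // 2


def float_isqrt(n):
    """Independent version based on floating-point square root."""
    return int(n ** 0.5)


def is_integer_sqrt(n, r):
    """Acceptance test: r is the floor of the square root of n."""
    return r * r <= n < (r + 1) * (r + 1)


recovered = recovery_block([buggy_isqrt, math.isqrt], is_integer_sqrt, 16)
voted = n_version_vote([math.isqrt, float_isqrt, buggy_isqrt], 16)
```

In both cases the faulty version is masked: the recovery block falls through to the second alternate after the acceptance test rejects the first result, and the vote outnumbers the wrong answer two to one.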
Self-checking software [53] detects the occurrence of software errors, locates and identifies their causes, and stops the propagation of errors. For self-checking software to perform successfully, the system needs to monitor both the functional aspects of the process and the data. Functional monitoring checks for infinite loops and incorrect loop terminations in a software program, while data monitoring checks the integrity of defined data structures in software.

3.4 New Technology Trends

New approaches are being developed to overcome software-related intermittent failures. Data race issues can cause many intermittent failures in software. They are non-deterministic, hard to debug, and cause problems at runtime [54]. A data race occurs when two threads access the same memory location without synchronization and at least one of the accesses is a write. Because of this complexity, the C and C++ language specifications leave the behavior of such programs undefined [55], and the Java specification for such programs is complicated and known to be buggy [56]. Multithreaded programs are increasingly common because of the use of multicore processors, and multithreading is prone to data race issues.

One approach to the difficulty of data race detection was presented in 2013 by Wester et al. [57]. It is called parallelizing data race detection. They point out that traditional data race detectors are too slow to be used regularly, and they propose to increase the speed by spreading the detection work across multiple cores. Their strategy involves a process called uniparallelism, which executes time intervals of the program in parallel, providing scalability, while executing all threads of an interval on a single core to eliminate locking.

Another emerging research area is automated software repair. Heuristic and algorithmic approaches are leveraged for generating, evaluating, and repairing defective sites.
This approach has received attention in the fields of programming languages [58], operating systems [59], and software engineering [60]. Automated repair is effective in solving concurrency bugs which lead to IF issues [58]. Schulte et al. [61] presented a paper in 2013 outlining a methodology for applying automated repair to arbitrary and non-repeatable software defects in embedded systems. This process has been implemented on Nokia N900 smartphones. The algorithm used for localizing fault sites is based on Gaussian convolution and stochastic sampling. It reduces memory requirements by 85% for embedded systems. It is ten times faster and is suited for devices where direct instrumentation is not feasible.
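The data race defined earlier in this section can be illustrated with a short Python sketch. Because a real race is non-deterministic, the first function replays one bad interleaving by hand so that the lost update is reproducible; the second shows the same update protected by a lock. Both functions and their names are hypothetical and are not taken from any of the cited tools.

```python
import threading


def lost_update_demo():
    """Replay, step by step, one interleaving of two unsynchronized increments."""
    shared = {"value": 0}
    a = shared["value"]          # thread A reads 0
    b = shared["value"]          # thread B reads 0 before A has written
    shared["value"] = a + 1      # thread A writes 1
    shared["value"] = b + 1      # thread B writes 1, overwriting A's update
    return shared["value"]       # one increment has been lost


def synchronized_count(n_threads=4, increments=10_000):
    """Same read-modify-write, but made atomic with a lock."""
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:           # only one thread at a time may update the counter
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]


lost = lost_update_demo()
total = synchronized_count()
```

In a real program the bad interleaving only happens under particular scheduling, which is why the resulting failure appears intermittent and why detectors that explore or record schedules are needed.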
Sahoo et al. [62] published a paper in 2013 in which automatic diagnostic techniques are proposed for isolating the root causes of software-related intermittent failures. Self-generated likely program invariants are used with filtering techniques at sites close to the fault-triggering point to select a set of candidate sites as possible root causes. Likely program invariants are effective tools for detecting and diagnosing software errors [63]. They are program properties that are observed to hold in some set of successful executions but not necessarily in all executions. The set of candidate sites is trimmed down by dynamic backward slicing, a technique that can pinpoint precisely which instructions affect a particular value in a single execution of a program [64]. The list of candidates is further reduced by dependence filtering, which is based on the premise that if an invariant on one instruction fails, then a different dependent instruction may also have an invariant failure, but the underlying cause is the first invariant and not the second. The second filtering approach assumes that if multiple similar inputs result in the same failure symptom, they are likely to have the same cause. This is a promising approach for the automatic diagnosis of software root causes; however, this approach only works on deterministic detectors. Future work is planned to include non-deterministic detectors.

The use of multicore processors has resulted in concurrency errors in multithreaded programs. These errors can lead to intermittent failures arising from schedule-dependent failures. These failures are caused by interactions between threads that were not anticipated by the program developer [65].
An atomicity violation is another schedule-dependent failure that can cause intermittent failures. It occurs when a thread accessing shared state is inadvertently allowed to interleave between a pair of accesses in another thread. A paper from the University of Washington [65] in 2013 discusses the development of automated techniques for avoiding schedule-dependent failures such as concurrency errors and atomicity violations. The authors established a system for collecting relevant program events at run time. When a program fails, the information collected is analyzed to generate hypotheses about the cause of the failure. Leveraging the multiple instances of the deployed software in operation, a predictive statistical model and an empirical framework have been developed to identify which hypothesis is most likely to be correct. Corrective actions are taken by manipulating future program executions. The emphasis of the study is not on failure detection but on failure avoidance.

4. RECOMMENDATIONS

Intermittent failures should be treated seriously, not only because of their massive cost but also because they could be early indicators of permanent failures. For intermittent failures, it is better to focus on failure avoidance than on failure detection or failure mitigation. From the hardware design perspective, it is recommended that the minimum spacing requirements specified for circuit traces depend on the current they carry. With the increase in semiconductor scaling, preemptive design strategies need to be developed that leverage data such as the IVF (intermittent vulnerability factor) discussed in this paper. On the packaging side, it would be valuable to develop new materials which offer better shielding from cosmic radiation to prevent SEUs (single-event upsets). Self-repairing wire bonds and self-healing solder joints may sound futuristic, but they could diminish the occurrence of intermittent failures in hardware.
Since connector disconnections are a common cause of intermittent failures, it is recommended to develop effective methodologies for monitoring the traveling waves caused by sudden connector disconnections. For some avionic systems, it is recommended to develop an online test methodology, rather than relying on lab testing, to increase the probability of detecting intermittent failures. Software intermittent failures should always be studied within the context of the hardware being used, and it is important to focus on fault causes rather than on the observability of intermittent failures. There is a need for more detailed studies on solving system-level intermittent failures. With the increase in multicore processor usage, it is recommended to anticipate and preempt IF problems
caused by data races in multithreaded programming. Parallelizing techniques should be employed where possible to detect data race failures. It is recommended to use automated software repair for solving concurrency issues, and likely program invariants are encouraged in automatic diagnostic techniques for solving deterministic failures.

5. CONCLUSIONS

Intermittent failures are difficult to diagnose because, when they are investigated, the faults cannot be replicated consistently. This paper takes a broad approach by describing the various causes of, and diagnosis and mitigation strategies for, intermittent failures manifested at the hardware and software levels. Some promising upcoming technologies are highlighted that might help develop future solutions for intermittent failures. Since diagnosing intermittent failures is challenging, tables and methodologies have been presented to help detect the causes of hardware and software intermittent failures. Recommendations have been offered to help minimize the occurrence of intermittent failures in hardware and software. The paper strives to advance the state of the art and practice by covering a wide diversity of intermittent failures in both hardware and software, while offering an understanding of the underlying causes and proposing approaches and methodologies for diagnosis and mitigation.

6. ACKNOWLEDGEMENTS

The authors would like to acknowledge the personnel associated with the University of Maryland and CALCE (Center for Advanced Life Cycle Engineering) for their constant support and assistance in developing this paper. Special appreciation and thanks are due to Diganta Das, Kelly Smith, Mark Zimmerman, Faye Chai, Weifeng Liu and Ken Neubeck for guidance in the content, structure and presentation of this paper.

7. REFERENCES

[1] The Authoritative Dictionary of IEEE Standards Terms, 7th edition, IEEE 100, Standards Information Network, IEEE Press, 2000.
[2] K. Neubeck, “Practical Reliability Analysis,” Prentice Hall, 2004.
[3] D. A. Thomas, K. Ayers, and M. Pecht, “The ‘trouble not identified’ phenomenon in automotive electronics,” Microelectronics Reliability, vol. 42, no. 4–5, pp. 641–651, Apr. 2002.
[4] I. James, D. Lumbard, I. Willis, and J. Goble, “Investigating no fault found in the aerospace industry,” in Reliability and Maintainability Symposium, 2003. Annual, 2003, pp. 441–446.
[5] P. Söderholm, “A system view of the No Fault Found (NFF) phenomenon,” Reliability Engineering & System Safety, vol. 92, no. 1, pp. 1–14, Jan. 2007.
[6] B. Steadman, T. Pombo, I. Madison, J. Shively, and L. Kirkland, “Reducing No Fault Found using statistical processing and an expert system,” in AUTOTESTCON Proceedings, 2002. IEEE, 2002, pp. 872–878.
[7] WDS Global white paper, “No Fault Found returns cost the mobile industry $4.5 Billion per year,” 2006. Online: http://www.wds.co/news/whitepapers/20060717/MediaBulletinNFF.pdf.
[8] K. Kimseng, M. Hoit, and M. Pecht, “Physics of failure assessment of a cruise control module,” Microelectronics Reliability, vol. 39, no. 10, pp. 423–444, 1999.
[9] C. Maul, J. W. McBride, and J. Swingler, “Intermittency phenomena in electrical connectors,” Components and Packaging Technologies, IEEE Transactions on, vol. 24, no. 3, pp. 370–377, Sep. 2001.
[10] M. Antler, “Contact fretting of electronic connectors,” IEICE Trans. Electron, vol. E82-C, no. 1, 1994, pp. 3–12.
[11] C. Maul, J. McBride, and J. Swingler, “On the nature of intermittence in electrical contacts,” in 20th Int. Conf. Electrical Contacts, Stockholm, 2000, pp. 23–28.
[12] A. Gibson, S. Choi, T. Bieler, and K. Subramanian, “Environmental concerns and materials issues in manufactured solder joints,” Proceedings of the 1997 IEEE International Symposium on Electronics and the Environment, 1997, pp. 246–251.
[13] H. A. Schafft, “Failure Analysis of Wire Bonds,” in Reliability Physics Symposium, 1973. 11th Annual, 1973, pp. 98–104.
[14] R. E. McCullough, “Screening Techniques for Intermittent Shorts,” in Reliability Physics Symposium, 1972. 10th Annual, 1972, pp. 19–22.
[15] T. Koch, W. Richliug, J. Whitlock, and D. Hall, “A Bond Failure Mechanism,” in Reliability Physics Symposium, 1986. 24th Annual, 1986, pp. 55–60.
[16] B. Sorensen, “Digital averaging-the smoking gun behind No-Fault-Found,” Air Safety Week, February 24, 2003.
[17] W. C. Maia Filho, M. Brizoux, H. Fremont, and Y. Danto, “Improved Physical Understanding of Intermittent Failure in Continuous Monitoring Method,” Proceedings of 14th IPFA, 2007, pp. 141–146.
[18] M. Reid, J. Punch, G. Grace, L. F. Garfias, and S. Belochapkine, “Corrosion Resistance of Copper-Coated Contacts,” Journal of The Electrochemical Society, vol. 153, no. 12, p. B513, 2006.
[19] D. Minzari, M. S. Jellesen, P. Møller, and R. Ambat, “On the electrochemical migration mechanism of tin in electronics,” Corrosion Science, vol. 53, no. 10, pp. 3366–3379, Oct. 2011.
[20] B. Sood, M. Osterman, and M. Pecht, “Tin whisker analysis of Toyota's electronic throttle control,” Circuit World, vol. 37, no. 3, pp. 4–9, 2011.
[21] C. Constantinescu, “Intermittent faults and effects on reliability of integrated circuits,” in Reliability and Maintainability Symposium, 2008. RAMS 2008. Annual, 2008, pp. 370–374.
[22] D. T. Blaauw, C. Oh, V. Zolotov, and A. Dasgupta, “Static electromigration analysis for on-chip signal interconnects,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 22, no. 1, pp. 39–48, Jan. 2003.
[23] S. Kothawade, K. Chakraborty, S. Roy, and Y. Han, “Analysis of intermittent timing fault vulnerability,” Microelectronics Reliability, vol. 52, no. 7, pp. 1515–1522, Jul. 2012.
[24] S. Pan, Y. Hu, and X. Li, “IVF: Characterizing the vulnerability of microprocessor structures to intermittent faults,” in Design, Automation Test in Europe Conf. Exhibition, 2010, pp. 238–243.
[25] N. Vichare and M. Pecht, “Prognostics and health management of electronics,” IEEE Transactions on Components and Packaging Technologies, vol. 29, no. 1, pp. 222–229, 2006.
[26] S. Mathew, D. Das, R. Rossenberger, and M. Pecht, “Failure mechanisms based prognostics,” in Prognostics and Health Management, 2008. PHM 2008. International Conference, 2008, pp. 1–6.
[27] L. V. Kirkland, “When should intermittent failure detection routines be part of the legacy re-host TPS?” in AUTOTESTCON, 2011 IEEE, 2011, pp. 54–59.
[28] H. Qi, S. Ganesan, and M. Pecht, “No-fault-found and intermittent failures in electronic products,” Microelectronics Reliability, vol. 48, no. 5, pp. 663–674, May 2008.
[29] B. Steadman, F. Berghout, and N. Olsen, “Intermittent Fault Detection and Isolation System,” IEEE AUTOTESTCON, 2008.
[30] M. Pecht, Prognostics and health monitoring of electronics, John Wiley & Sons, Ltd, 2008.
[31] J. Savir, “Detection of Intermittent Faults in Sequential Circuits,” Stanford University, Rep. TR-120, 1978.
[32] L. Kirkland, “When should intermittent failure detection routines be part of the Legacy Re-Host TPS,” IEEE AUTOTESTCON, 2011, pp. 54–59.
[33] R. Vattikonda, W. Wang, and Y. Cao, “Modeling and minimization of PMOS NBTI effect for robust nanometer design,” in Proceedings of the Design Automation Conference, DAC 2006.
[34] M. Demertzi, B. Zandian, R. Rojas, and M. Annavaram, “Benchmarking ISA Reliability to Intermittent Failures,” IEEE International Symposium on Workload Characterization (IISWC), 2012, pp. 86–87.
[35] S. Hannel, S. Fouvry, P. Kapsa, and L. Vincent, “The fretting sliding transition as a criterion for electrical contact performance,” Wear, vol. 49, 2001, pp. 761–770.
[36] A. Ginart, I. Ali, J. Goldwin, P. Kalgren, M. Roemer, E. Balaban, and J. Celaya, “Sensing and characterization of EMI during Intermittent Connector Anomalies,” Aerospace Conference, IEEE, March 3–10, 2012, pp. 1–7.
[37] R. Rao, K. Chopra, D. Blaauw, and D. Sylvester, “An efficient static algorithm for computing the soft error rates of combinatorial circuits,” in Proceedings of Design, Automation and Test in Europe, vol. 1, March 2006, pp. 1–6.
[38] N. Kehl and W. Rosenstiel, “An efficient SER estimation method for Combinatorial Circuits,” IEEE Transactions on Reliability, vol. 60, no. 4, 2011, pp. 742–747.
[39] S. Pan, Y. Hu, and X. Li, “IVF: Characterizing the Vulnerability of Microprocessor Structures to Intermittent Faults,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 5, 2012, pp. 777–790.
[40] A. Correcher, E. Garcia, F. Morant, E. Quiles, and L. Rodriguez, “Intermittent Failure Dynamics Characterization,” IEEE Transactions on Reliability, vol. 61, no. 3, pp. 649–658, Sep. 2012.
[41] N. Balakrishnan and M. Asadi, “A proposed measure of Residual Life of Live Components of a Coherent System,” IEEE Trans. Rel., vol. 61, no. 1, pp. 41–49.
[42] J. Kleer, B. Price, L. Kuhn, M. Doh, and R. Zhou, “A framework for continuously estimating persistent and intermittent failure probabilities,” Palo Alto Research Center Publications, 2008.
[43] J. Xie and M. Pecht, “Applications of in-situ health monitoring and prognostic sensors,” The 9th Pan Pacific Microelectronics Symposium, Exhibits and Conference, 2004, pp. 10–12.
[44] S. Mathew, D. Das, M. Osterman, M. Pecht, and N. Ferebee, “Prognostic assessment of aluminum support structure on printed circuit boards,” ASME Journal of Electronic Packaging, vol. 128, no. 4, pp. 339–345, 2006.
[45] V. Shetty, D. Das, M. Pecht, D. Hiemstra, and S. Martin, “Remaining life assessment of shuttle remote manipulator system end effector,” Proceedings of the 22nd Space Simulation Conference, 2002, pp. 21–23.
[46] J. Gracia, L. Saiz, J. C. Baraza, D. Gil, and P. Gil, “Analysis of the influence of intermittent faults in a microcontroller,” in Design and Diagnostics of Electronic Circuits and Systems, 2008. DDECS 2008. 11th IEEE Workshop on, 2008, pp. 1–6.
[47] R. A. Syed, B. Robinson, and L. Williams, “Does Hardware Configuration and Processor Load Impact Software Fault Observability?” in Software Testing, Verification and Validation (ICST), 2010 Third International Conference on, 2010, pp. 285–294.
[48] J. Wei, L. Rashid, K. Pattabiraman, and S. Gopalakrishnan, “Comparing the effects of intermittent and transient hardware faults on programs,” in Dependable Systems and Networks Workshops (DSN-W), 2011 IEEE/IFIP 41st International Conference on, 2011, pp. 53–58.
[49] T. Anderson and J. C. Knight, “A Framework for Software Fault Tolerance in Real-Time Systems,” IEEE Transactions on Software Engineering, vol. SE-9, no. 3, pp. 355–364, May 1983.
[50] M. R. Lyu, Software Fault Tolerance. New York, NY, USA: John Wiley & Sons, Inc., 1995.
[51] B. Randell, “System structure for software fault tolerance,” in Proceedings of the International Conference on Reliable Software, New York, NY, USA, 1975, pp. 437–449.
[52] A. Avizienis, “The N-Version Approach to Fault-Tolerant Software,” IEEE Transactions on Software Engineering, vol. SE-11, no. 12, pp. 1491–1501, Dec. 1985.
[53] R. A. Rubinfeld, A mathematical theory of self-checking, self-testing and self-correcting programs, University of California at Berkeley, Berkeley, CA, 1991.
[54] N. Leveson and C. Turner, “An investigation of the Therac-25 accidents,” IEEE Computer, vol. 26, no. 7, pp. 18–41, July 1993.
[55] H. Boehm and S. Adve, “Foundations of the C++ concurrency memory model,” in Proc. 2008 ACM Conference on Programming Language Design and Implementation, pp. 69–78.
[56] J. Ševčík and D. Aspinall, “On validity of Program Transformations in the Java memory Model,” in Proc. 2008 European Conference on Object-Oriented Programming, pp. 27–51.
[57] B. Wester, D. Devecsery, P. Chen, J. Flinn, and S. Narayanasamy, “Parallelizing Data Race Detection,” in ASPLOS 2013, Houston, Texas, March 16–20, 2013.
[58] G. Jin, L. Song, W. Zhang, S. Lu, and B. Liblit, “Automated atomicity violation fixing,” in Programming Language Design and Implementation, 2011, pp. 389–400.
[59] J. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, and M. Carbin, “Automatically patching errors in deployed software,” in Symposium on Operating Systems Principles, 2009, pp. 87–102.
[60] Y. Wei, Y. Pei, C. Furia, L. Silva, S. Buchholz, B. Meyer, and A. Zeller, “Automated fixing of programs with contracts,” in International Symposium on Software Testing and Analysis, 2010, pp. 61–72.
[61] E. Schulte, J. DiLorenzo, W. Weimer, and S. Forrest, “Automated repair of binary and assembly programs for cooperating embedded devices,” in ASPLOS 2013, Houston, Texas, March 16–20, 2013.
[62] S. Sahoo, J. Criswell, C. Geigle, and V. Adve, “Using Likely Invariants for automated Software Fault Localization,” in ASPLOS 2013, Houston, Texas, March 16–20, 2013.
[63] M. Ernst, J. Cockrell, W. Griswold, and D. Notkin, “Dynamically discovering likely program invariants to support program evolution,” IEEE Trans. Software Eng., 2001.
[64] X. Zhang, R. Gupta, and Y. Zhang, “Precise dynamic slicing algorithms,” in Proceedings of the 25th International Conference on Software Engineering, 2003.
[65] B. Lucia and L. Ceze, “Cooperative Empirical Failure Avoidance for Multithreaded Programs,” in ASPLOS 2013, Houston, Texas, March 16–20, 2013.
[66] V. Yuvaraj and T. Vasanth, “Simulation, Control and Analysis of HTS Resistive and Power Electronic FCL for Fault Current Limitation and Voltage Sag Mitigation in Electrical Network,” International Journal of Electrical Engineering & Technology (IJEET), Volume 4, Issue 3, 2013, pp. 82–94, ISSN Print: 0976-6545, ISSN Online: 0976-6553.