DfR Seminar at Wyle Labs - Mike Silverman - Presentation


We Provide You Confidence in Your Product Reliability™
Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com

DESIGN FOR
RELIABILITY (DfR)
SEMINAR
at Wyle Labs

February 11, 2010
Mike Silverman // (408) 472-3889 // mikes@opsalacarte.com

Ops A La Carte LLC // www.opsalacarte.com
© 2009 Ops A La Carte 1

DfR Seminar Overview
Thurs, Feb 11, 2010
- DFR SEMINAR -

♦ 10:00-10:10am Introduction
♦ 10:10-10:30am DfR Overview/Introduction
♦ 10:30-11:00am FMEA
♦ 11:00-11:30am Using FMEA to Design a Better Reliability Test Program
♦ 11:30-11:50am HALT
♦ 11:50am-12:10pm Lunch Break
♦ 12:10-12:30pm ALT
♦ 12:30-1:00pm HALT vs. ALT – When to Use Which Technique?
♦ 1:00-1:15pm Reliability Demonstration Test (RDT)
♦ 1:15-1:45pm HALT vs. RDT – The HALT Calculator
♦ 1:45-2:00pm Wrap-Up/Questions
Note that this ½-day seminar is an abridged version of a 5-day DfX seminar that we will hold 3 times this year:
- Apr 16-20 in Santa Clara, CA
- May 17-21 in Huntsville, AL
- Oct 11-15 in Maryland

Product Life Cycle Reliability and Test Spectrum
Wyle and OPS Combined Capabilities

[Figure: services mapped across the product life cycle phases (Program Capture, Design, Build, Test & Eval, Qualify, Manufacture, Operate & Maintain). A key marks each service as provided by Wyle, by Ops, or by Wyle & Ops.]

Test Engineering Services: Test Quotes, Tech Requirements, Test Plans, Test Procedures, Test Data Analysis
Test Services: HALT, HASS, Dev Test, Qual Test, Acceptance
Reliability, Maintainability, Supportability Services: FMECA, Reliability Eng & Analysis, Configuration Management, Publications, Training, Asset Management, RCM, Lean, Six Sigma, NDI, TOC


What is DESIGN for
RELIABILITY?


First we must ask: What is Reliability?
Reliability is often considered quality over time.

Reliability is…
“The ability of a system or component to perform its required
functions under stated conditions for a specified period of time”

- IEEE 610.12-1990

♦ We shall revisit this when we discuss Reliability Goal Setting.


Different Views of Reliability
♦ Product development teams
View reliability as the domain to address mechanical,
electrical, and manufacturing issues.

♦ Customers
View reliability as a system-level issue, with minimal
concern placed on the distinction into sub-domains.

♦ Since the primary measure of reliability is made by the
customer, engineering teams must maintain a balance of
both views (system and sub-domain) in order to develop a
reliable product.

[Figure: Mechanical Reliability + Electrical Reliability + SW Reliability, combining into System Reliability]

Reliability vs. Cost
♦ Intuitively, emphasizing reliability to reduce
warranty and in-service costs results in some
minimal increase in development and
manufacturing costs.

♦ Use of the proper tools during the proper life
cycle phase will help to minimize total Life
Cycle Cost (LCC).


Reliability vs. Cost, continued
To minimize total Life Cycle Costs (LCC), an
organization must do two things:
1. Choose the best tools from all of the tools
available and apply these tools at the proper
phases of the product life cycle.
2. Properly integrate these tools together to assure
that the proper information is fed forwards and
backwards at the proper times.


Reliability Integration
“the process of seamlessly,
cohesively integrating reliability
tools together to maximize
reliability at the lowest
possible cost”


Reliability vs. Cost, continued
[Figure: HW Reliability & Costs. Axes: cost vs. reliability. Curves: reliability program costs, warranty costs, and the total cost curve, with the optimum cost point marked.]


ELEMENTS
OF A
RELIABILITY
PROGRAM

DfR Tool Selection

A reliability assessment is the recommended first
step in establishing a reliability program. This
mechanism is the appropriate forum for selecting
the best tools for each product life cycle phase.


RELIABILITY
ASSESSMENT


Reliability Program Assessment
• Initiate a Reliability Program
• Determine next best steps
• Reduce customer complaints
• Select right tools
• Improve reliability

A detailed evaluation of an organization’s approach and
processes involved in creating reliable products. The
assessment captures the current state and leads to an
actionable reliability program plan.

[Figure: path from "Now" (unknown reliability, field failures, complaints, $ unreliability) to "Goal" ($ profits, market share, satisfaction). Labeled steps: assessment interviews, statistical data analysis, benchmarking, gap analysis, program plan.]

Agenda

• motivation
• approach
• results
• findings
• observations
• next steps
• close


Assessment Motivation

• Identify systemic changes that impact
reliability
– Tie into culture and product
– Both enjoy benefits

• Provides roadmap for activities that
achieve results
– Matching of capabilities and expectations
– Cooperative approach


Assessment Approach
♦ Preparation

♦ Checklist

♦ Who to interview in organization

♦ Analysis, average scores and summary of
comments


Steps Involved

♦ selecting people to
survey
♦ selecting survey topics

♦ develop scoring system

♦ data analysis

♦ summary feedback
results
♦ review of results

♦ recommended actions


Select People to Survey
Hardware:
• Hardware manager
• Electrical engineering lead
• Mechanical engineering lead
• System engineering lead
• Reliability manager/engineer
• Procurement
• Manufacturing

Software:
• SW R&D manager
• SW R&D engineer
• SW test manager
• SW test engineer


Select Survey Topics
DFR Methods Survey
Scoring: 4 = 100%, top priority, always done
3 = >75%, use normally, expected
2 = 25% - 75%, variable use
1 = <25%, only occasional use
0 = not done or discontinued
- = not visible, no comment

Management:
□ Goal setting for division
□ Priority of quality & reliability improvement
□ Management attention & follow up (goal ownership)

Design:
□ Documented hardware design cycle
□ Goal setting by product or module

Example
♦ To what extent is FMEA used?
Design Engineer
Score = 1: Used only as a troubleshooting tool

Manufacturing Engineer
Score = 3: Commonly used on critical design elements

Reliability Engineer
Score = 4: Always used on all products

Results: Average score 2.7
Comments: Clearly a disconnect between reliability and
design engineering – indicative of a problem with the tool.
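
The averaging step in this example can be sketched in a few lines of Python. This is only an illustration: the `summarize` helper and its `spread_threshold` cutoff are hypothetical and not part of the survey method. A "-" response (not visible, no comment) is represented as `None` and left out of the average. For the three scores shown, the mean is 8/3, which rounds to 2.7.

```python
from statistics import mean

# Scores from the FMEA example; None would stand for "-" (not visible, no comment).
responses = {
    "Design Engineer": 1,
    "Manufacturing Engineer": 3,
    "Reliability Engineer": 4,
}

def summarize(scores, spread_threshold=2):
    """Return (average, spread, disconnect_flag) for one survey topic.

    spread_threshold is a hypothetical cutoff chosen for illustration only.
    """
    valid = [s for s in scores.values() if s is not None]
    average = round(mean(valid), 1)
    spread = max(valid) - min(valid)
    return average, spread, spread >= spread_threshold

average, spread, disconnect = summarize(responses)
print(f"average={average}, spread={spread}, disconnect={disconnect}")
```

A large spread between roles is the kind of signal the comment line records as a disconnect.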

Reliability Maturity Grid
• 5 levels of maturity
• Loosely based on IEEE 1332: “Reliability Program
for the Development and Production of Electronic
Products” (currently in draft form)
• Similar to Crosby’s Quality Management Maturity Grid
• On the following page is a matrix based on
Crosby’s as an example.
• Read across each row and find the statement that
seems most true for your organization.
• The center of mass of the levels is the
organization’s overall level.
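
One simple reading of "center of mass" is the arithmetic mean of the stage chosen in each row. The sketch below assumes that reading; the row selections are made-up values for a hypothetical organization, and `overall_level` is an illustrative name.

```python
# Hypothetical self-assessment: the stage (1 to 5) picked in each row of the matrix.
row_levels = {
    "Management understanding and attitude": 2,
    "Reliability status": 2,
    "Problem handling": 3,
    "Cost of reliability": 2,
    "Feedback process": 3,
    "DFR program status": 2,
    "Summation of reliability posture": 2,
}

def overall_level(levels):
    """Center of mass, taken here as the unweighted mean of the row stages."""
    return sum(levels.values()) / len(levels)

print(f"Overall maturity level: {overall_level(row_levels):.1f}")
```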

Reliability Maturity Matrix
Stages: I = Uncertainty, II = Awakening, III = Enlightenment, IV = Wisdom, V = Certainty

Measurement Category: Management Understanding and Attitude
Stage I: No comprehension of reliability as a management tool. Tend to blame reliability engineering for ‘reliability problems’.
Stage II: Recognizing that reliability management may be of value but not willing to provide money or time to make it happen.
Stage III: Still learning more about reliability management. Becoming supportive and helpful.
Stage IV: Participating. Understand absolutes of reliability management. Recognize their personal role in continuing emphasis.
Stage V: Consider reliability management an essential part of company system.

Measurement Category: Reliability status
Stage I: Reliability is hidden in manufacturing or engineering departments. Reliability testing probably not part of organization. Emphasis on initial product functionality.
Stage II: A stronger reliability leader appointed, yet main emphasis is still on an audit of initial product functionality. Reliability testing still not performed.
Stage III: Reliability manager reports to top management, with role in management of division.
Stage IV: Reliability manager is an officer of company; effective status reporting and preventive action. Involved with consumer affairs.
Stage V: Reliability manager is on board of directors. Prevention is main concern. Reliability is a thought leader.

Measurement Category: Problem handling
Stage I: Fire fighting; no root cause analysis or resolution; lots of yelling and accusations.
Stage II: Teams are set up to solve major problems. Long-range solutions are not identified or implemented.
Stage III: Corrective action process in place. Problems are recognized and solved in orderly way.
Stage IV: Problems are identified early in their development. All functions are open to suggestion and improvement.
Stage V: Except in the most unusual cases, problems are prevented.

Measurement Category: Cost of Reliability as % of net revenue
Stage I: Warranty: unknown. Reported: unknown. Actual: 20%
Stage II: Warranty: 3%. Reported: unknown. Actual: 18%
Stage III: Warranty: 4%. Reported: 8%. Actual: 12%
Stage IV: Warranty: 3%. Reported: 6.5%. Actual: 8%
Stage V: Warranty: 1.5%. Reported: 3%. Actual: 3%

Measurement Category: Feedback process
Stage I: None. No reliability testing. No field failure reporting other than customer complaints and returns.
Stage II: Some understanding of field failures and complaints. Designers and manufacturing do not get meaningful information.
Stage III: Accelerated testing of critical systems during design. System level modeling and testing. Field failures analyzed and root causes reported.
Stage IV: Refinement of testing systems – only testing critical or uncertain areas. Increased understanding of causes of failure allow deterministic failure rate prediction models.
Stage V: The few field failures are fully analyzed and product designs or procurement specifications altered. Reliability testing done to augment reliability models.

Measurement Category: DFR program status
Stage I: No organized activities. No understanding of such activities.
Stage II: Organization told reliability is important. DFR tools and processes inconsistently applied and only ‘when time permits’.
Stage III: Implementation of DFR program with thorough understanding and establishment of each tool.
Stage IV: DFR program active in all areas of division – not just design & mfg’ing. DFR normal part of R&D and manufacturing.
Stage V: Reliability improvement is a normal and continued activity.

Measurement Category: Summation of reliability posture
Stage I: “We don’t know why we have problems with reliability”
Stage II: “Is it absolutely necessary to always have problems with reliability?”
Stage III: “Through commitment and reliability improvement we are identifying and resolving our problems.”
Stage IV: “Failure prevention is a routine part of our operation.”
Stage V: “We know why we do not have problems with reliability.”


Reliability Maturity Matrix
Let's look at one row to get a better understanding.
Measurement Category: Problem handling
Stage I (Uncertainty): Fire fighting; no root cause analysis or resolution; lots of yelling and accusations.
Stage II (Awakening): Teams are set up to solve major problems. Long-range solutions are not identified or implemented.
Stage III (Enlightenment): Corrective action process in place. Problems are recognized and solved in orderly way.
Stage IV (Wisdom): Problems are identified early in their development. All functions are open to suggestion and improvement.
Stage V (Certainty): Except in the most unusual cases, problems are prevented.

Results & Meaning
• Looking for trends, gaps in process, skill mismatches,
over analysis, under analysis, etc.

• Looking for differences across the organization,
pockets of excellence, areas with good results

• Process provides snapshot of current system

• No one tool makes an entire reliability program. The
tools need to match the needs of the products and
the culture.

• A check step is critical before moving on to
recommendations for an improvement plan


Observations
What Companies Are Doing Best
♦ Prediction
♦ HALT
♦ Golden nuggets
♦ Fast reaction to fix problems

What Companies Are Weak at
♦ Goal setting/Planning
♦ Repair & warranty invisible
♦ Lessons learned capture
♦ Single owner of product reliability
♦ Multiple defect tracking systems
♦ Reliability Integration
♦ Statistics

Next Steps
• Determine current state of your organization
(Summary of Assessment)
– Identify strong and weak areas

• Goal Setting
– Market Analysis to gather requirements
– Benchmarking

• Gap Analysis

• Develop plan and implement


Failure Mode and Effect
Analysis (FMEA) Seminar


FMEA

An FMEA is a systematic method
of identifying and preventing
product and process problems
BEFORE they occur.


Not close enough to home yet?


FMEA Benefits
♦ Facilitates investigation of design alternatives to consider high
reliability at the conceptual stages of the design.
♦ Provides a basis for identifying the root causes of failures and
developing corrective actions.
♦ Determines the effects of each failure mode on system
performance.
♦ Aids in developing test methods and troubleshooting
techniques.
♦ Provides a foundation for qualitative analyses.

♦ Provides a structured forum for cross-functional discussions

♦ Provides common understanding and focus to reduce product
or process issues
♦ Provides documentation of the risk management effort


Types of FMEAs

• Design FMEA
• Process FMEA
• System FMEA
• Functional FMEA
• User FMEA
• Software FMEA
• Many others


When Is an FMEA Performed?

• FMEAs are begun early in the design process and
then updated throughout the life cycle of a product to
capture changes in the design.


The 10 Steps
♦ Step 1: Review the Process/Design
♦ Step 2: Brainstorm potential failure modes
♦ Step 3: List potential effects of each failure mode
♦ Step 4: Assign a severity rating for each effect
♦ Step 5: Assign an occurrence rating for failure modes
♦ Step 6: Assign a detection rating for modes/effects
♦ Step 7: Calculate the risk priority numbers
♦ Step 8: Prioritize the failure modes for action
♦ Step 9: Take action to eliminate/reduce high-risk failure modes
♦ Step 10: Calculate the resulting RPN


Step 1: Review the Design or Process
♦ Understand the topic of study
• Design – drawings, prototypes, etc.
• Process – flowcharts, assembly instructions, etc.
♦ Focus on developing common understanding of
design or process
♦ Designers or Process Experts available for questions


Step 2: Brainstorm potential failure
modes
♦ Have fun!
♦ How can the design/process fail?

♦ Break complex designs/processes into smaller
elements
♦ Combine like ideas (affinity plotting)
♦ May have more than one failure mode per item or
function
♦ List failure modes on worksheet
♦ Determine failure modes vs. failure mechanisms
♦ Use Boundary Interface Diagram Tool
♦ Use P-Diagram Tool

Common brainstorming tools
♦ Team dynamics
♦ Consensus-building techniques
♦ Team project documentation
♦ Idea-generation techniques
• Group brainstorming with a facilitator
• Affinity diagramming
♦ Flowcharting
♦ Boundary Interface Diagram
♦ P-Diagram
♦ Data analysis
♦ Graphing techniques

Step 3: List Potential effects of each
failure mode
♦ If the failure occurs, what are the consequences?

♦ List effect for each failure mode (not mechanism).

♦ List more than one effect, when necessary
• (note: list more than one effect if the ratings will be different, or
the solution would have to be different)


Step 4: Assign a severity rating for each
effect
♦ What is the consequence of the failure should it
occur?
♦ Assign a severity rating for each effect
♦ An estimation of how serious the effects would be if
the failure mode occurs
• Historical data
• Engineering judgment
• Experimentation, DOE, if needed


Severity
Severity is the assessment of the seriousness of the
effect of the failure mode to the next component,
subsystem, system or customer if it occurs.
Below is a typical Severity Rating Table.

Rating Description Definition
10 Dangerously High Catastrophic Failure Causing Replacement of the Entire System

9 Very high Failure of a FRU Component, MTTR > 1 Hour

8 High Failure of a FRU Component, MTTR < 1 Hour

6 Moderate Failure that results in reduced throughput

4 Minor Failure that requires a tool reset or recalibration

2 Very minor Failure that can be corrected during a PM cycle

1 None Failure that does not affect system performance


Step 5: Assign an occurrence rating for
each failure mode
♦ What is the probability of the failure occurring?

♦ List the potential causes of failure

♦ Use actual data when available for rating

♦ When real data is not available:
• Engineering estimates or models
• Consider the probabilities of the failure causes
• Rank order then assign rating


Probability of Occurrence
Probability of Occurrence can be in terms of failure rate or
can just be a scale of 1-10 relative to all other failure modes.
Below is a typical Probability Rating Table

10 Dangerously High Likely to Occur Chronically (Daily or Hourly)
9 Very High Likely to Occur during one week of operation

8 High Likely to occur during one month of operation.

6 Medium Likely to occur during one year of operation.

4 Moderate Is likely to Occur during the Life of the System.

2 Low A Remote Probability of Occurrence in the Life of the System

1 Remote An Unlikely Probability of Occurrence in the Life of the System


Step 6: Assign a detection rating for each
failure mode and/or effect
♦ What is the probability of the failure being detected
before the impact of the effect is realized?

♦ List known current controls
♦ Those items without controls are unlikely to be
detected (scoring 9 or 10)
♦ Again, use actual data when possible


Detection
A third factor used in assessing the risk of a failure is
likelihood of Detection of the failure before releasing the
product. The following table is an example of detection
scores (note that a high score indicates that the failure
is more difficult to detect).
Below is a typical Detection Rating Scale
Rating Description Definition
5 Very Low No ability to detect before it occurs and some ability to detect after (unconfirmed failures)
3 Moderate No ability to detect before it occurs but can detect after
2 High Some ability to detect before it occurs and can detect after
1 Almost Certain Very likely it will be detectable before it occurs and after

Note that the Detection Scale has been derated (scale 1-5 only). For many industries, the
key drivers are severity and probability.
In many industries, there is a high unconfirmed failure rate, yet there is a high
probability that failures will repeat when units go back to the field without
being confirmed. Hence the importance of health diagnostics and a condition-based
maintenance strategy built on these health monitoring diagnostics.

Step 7: Calculate the risk priority number
for each effect
♦ RPN = S x P x D

♦ Risk Priority Number equals
Severity rating times
Probability of Occurrence rating times
Detection rating
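
As a minimal sketch of Step 7, the RPN for each worksheet row can be computed and ranked as follows. The failure modes and ratings are invented for illustration, and the `FailureMode` class is a hypothetical structure rather than a standard FMEA worksheet format. Detection uses the derated 1-5 scale from the previous slide.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # S, 1-10
    occurrence: int  # P, 1-10
    detection: int   # D, 1-5 on the derated detection scale

    @property
    def rpn(self) -> int:
        # RPN = S x P x D
        return self.severity * self.occurrence * self.detection

# Hypothetical worksheet rows.
worksheet = [
    FailureMode("Connector corrosion", 8, 4, 3),
    FailureMode("Fan bearing wear", 6, 6, 2),
    FailureMode("Firmware lockup", 9, 2, 5),
]

# Highest RPN first, so the most critical items come out on top.
ranked = sorted(worksheet, key=lambda fm: fm.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:4d}  {fm.name}")
```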


Risk Priority Number
♦ Risk Priority Number (RPN)
• The RPN is the product of the Severity Score, the
Probability Score, and the Detection Score.
• Once all of the RPNs have been calculated, the data
can be sorted from highest to lowest RPN to show
which are the most critical items to work on.
• Below is an example of an RPN Table

RISK VALUE (RPN)
251-500 Intolerable Risk: Additional measures are required to ensure adequate safety.
101-250 Undesirable Risk: Risk is tolerable only if risk reduction is impractical or if reduction costs are grossly disproportionate to the improvement(s) gained. (Requires Executive Mgt. Approval.)
11-100 Tolerable Risk: The risk is tolerable if the cost of risk reduction will exceed the improvement(s) gained. (Requires Project Mgt. Approval.)
1-10 Negligible: Acceptable as implemented.
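
The banding in this example table can be expressed as a small lookup. The sketch below hard-codes the thresholds from the table (which assume the 1-500 range produced by the derated detection scale); `risk_band` is an illustrative name, and other organizations would use their own bands.

```python
def risk_band(rpn: int) -> str:
    """Map an RPN (1-500) to the risk band in the example table."""
    if not 1 <= rpn <= 500:
        raise ValueError("RPN must be between 1 and 500 for this table")
    if rpn >= 251:
        return "Intolerable Risk"
    if rpn >= 101:
        return "Undesirable Risk"
    if rpn >= 11:
        return "Tolerable Risk"
    return "Negligible"

for value in (8, 96, 180, 400):
    print(value, risk_band(value))
```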


Step 8: Prioritize the failure modes for
action

♦ Simple rank ordering from high to low based on RPN

♦ Decide on cutoff value
• Those above get attention & resources to improve
• Those below are left alone for now

♦ Consider including above the cut off any Severity
rating of 9 or 10
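
A rough sketch of this selection rule: rank by RPN, keep everything above the cutoff, and also keep any item with a Severity of 9 or 10 regardless of its RPN. The cutoff of 80 and the rows are hypothetical, and `select_for_action` is an illustrative helper.

```python
def select_for_action(modes, rpn_cutoff, severity_override=9):
    """Return failure modes that get attention, highest RPN first.

    modes is a list of (name, severity, rpn) tuples. An item is selected if
    its RPN is above the cutoff or its severity is at or above the override.
    """
    ranked = sorted(modes, key=lambda m: m[2], reverse=True)
    return [m for m in ranked if m[2] > rpn_cutoff or m[1] >= severity_override]

# Hypothetical worksheet rows: (name, severity, RPN).
modes = [
    ("Connector corrosion", 8, 96),
    ("Fan bearing wear", 6, 72),
    ("Firmware lockup", 9, 90),
    ("Battery venting", 10, 20),
]

selected = select_for_action(modes, rpn_cutoff=80)
for name, severity, rpn in selected:
    print(f"{rpn:4d}  S={severity:2d}  {name}")
```

Here the low-RPN "Battery venting" row is still selected because of its severity, which is the point of the override.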


Step 9: Take action to eliminate or reduce
the high risk failure modes
♦ Use an organized problem-solving process

♦ Identify and implement actions to eliminate or reduce
the high-risk failure modes

♦ Consider DOE as tool to break down and solve
multiple variable or complex issues


Step 10: Calculate the resulting RPN as
the failure modes are reduced or
eliminated
♦ Document progress in reducing product risk with an
update by team of resulting RPN.

♦ You should expect a 50% or greater reduction in total
RPN after an FMEA

♦ Continue to make improvements on highest risk items
until time, resources or overall ROI shift focus.
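
Progress against the 50% guideline can be tracked by comparing total RPN before and after the corrective actions. The numbers below are invented for illustration, and `total_rpn_reduction` is a hypothetical helper.

```python
# Hypothetical RPNs before and after corrective actions.
before = {"Connector corrosion": 96, "Fan bearing wear": 72, "Firmware lockup": 90}
after = {"Connector corrosion": 32, "Fan bearing wear": 36, "Firmware lockup": 36}

def total_rpn_reduction(before, after):
    """Percent reduction in total RPN across the failure modes in `before`."""
    total_before = sum(before.values())
    total_after = sum(after[name] for name in before)
    return 100.0 * (total_before - total_after) / total_before

reduction = total_rpn_reduction(before, after)
print(f"Total RPN reduction: {reduction:.0f}%")
```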


Linking FMEAs with Test Plans

In order to write better test plans,
we must first understand:
- the use environment
- the key risks to the design

The best tool for this is FMEA

Developing Better Test Plans

Stated another way, we cannot
know what to test for unless we
understand the key risks.

Therefore, FMEA is one of the
best sources of input for a
Reliability Test Plan.

Developing a Test Plan
without FMEA
What types of tests can you think of for
this device?

Developing a Test Plan
without FMEA
We used the IEC standards and came up
with a number of solid tests, including:
High/Low Temperature
Temperature Cycling
Vibration
Drop
Shock
Crush
Humidity
Altitude
Did we miss any?

FMEA Generated Tests

Then we performed an FMEA and
came up with the following:


Different cleaning solutions


Pen test
Lipstick test
Motor Oil Test
Cap Tether Test
Did we miss any?

Conclusion
FMEA is a development tactic
that can help solve the problem
of testing too little by uncovering
failure modes that require
tailored test methods rather than
only cookbook methods from
industry standards.

HALT
Highly Accelerated
Life Testing


HALT - Highly Accelerated
Life Test
Quickly discover design issues.
Evaluate & improve design margins.
Release mature product at market introduction.
Reduce development time & cost.
Eliminate design problems before release.
Evaluate cost reductions made to product.

Developmental HALT is not really a test you pass or fail,
it is a process tool for the design engineers.

There are no pre-established limits.


HALT, How It Works
1. Start low and step up the stress, testing the product
during the stressing.
2. Gradually increase stress level until a failure occurs.
3. Analyze the failure.
4. Make temporary improvements.
5. Increase stress and start process over.

[Figure: cycle of Stress (increase), Failure, Analysis, Improve, repeated up to the Fundamental Technological Limit]
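
The step-stress part of this loop can be sketched as below. It is only a toy model: `passes_at` stands in for the functional test run at each stress level, and the temperatures and step size are hypothetical. The failure analysis and improvement steps happen outside the code, after which the same search is run again.

```python
def step_stress(start, step, chamber_limit, passes_at):
    """Step the stress up from `start` until the unit fails.

    passes_at(level) is a hypothetical functional test returning True or False.
    Returns the first failing level, or None if the chamber limit is reached
    without a failure.
    """
    level = start
    while level <= chamber_limit:
        if not passes_at(level):
            return level
        level += step
    return None

# Hypothetical unit that stops operating above 95 degrees C.
first_failure = step_stress(20, 10, 150, passes_at=lambda temp: temp <= 95)
print("First failure at", first_failure)

# After a temporary improvement the unit survives to 125 degrees C; repeat the search.
next_failure = step_stress(20, 10, 150, passes_at=lambda temp: temp <= 125)
print("Next failure at", next_failure)
```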

HALT, Why It Works
Classic S-N Diagram
(stress vs. number of cycles)

S0 = Normal stress conditions
N0 = Projected normal life

[Figure: S-N curve with stress levels S2, S1, S0 on the stress axis and the corresponding lives N2, N1, N0 on the number-of-cycles axis. A second view of the same diagram marks the point at which failures become non-relevant.]

Margin Improvement Process

[Figure: stress axis showing, from left to right, the Lower Destruct Limit, Lower Operating Limit, Product Operational Specs, Upper Operating Limit, and Upper Destruct Limit. A second panel is captioned "This is what the product spec distribution really looks like". A third panel labels the Operating Margin and the Destruct Margin relative to the specs.]

Developmental HALT Process
Planning a HALT
Setting up for a HALT
Executing a HALT
Post Testing


When to Perform HALT?

Feasibility (P1-P2)
Perform HALT on 1 to 2 early prototypes. These samples may be
hand-made and test coverage may be low, but we can still get
clues as to gross design issues.

Development (Late P2)
Perform HALT on more samples. These samples will be closer to
final product and functional tests will be more refined with
higher test coverage.

Qualification (P3)
♦ Demonstrate 100% reliability target @ 80% C.L.
♦ Shipping / Packaging test
♦ Validation HALT can be performed here

Launch
♦ Tracking reliability through field data

Lessons learned feedback to next generation product


Summary of Results
- by stress -

Cold Step Stress: 14%

Hot Step Stress: 17%

Rapid Thermal Transitions: 4%

Vibration Step Stress: 45%

Combined Environment: 20%

Significance:
Without Combined Environment, 20% of all
failures would have been missed

Traditional vs HALT
Engineering Needs
Product Development Manpower Requirements

[Figure: spending rate vs. time for traditional development (DVT1 ... DVTn) and for development with HALT, with MR marked on each curve and the "$ Savings" region indicated.]


HALT
Cost Benefits
Reduced product time to market
Lowered warranty cost through higher MTBF
Faster DVT with fewer product samples
Accelerated screening (HASS) allowed


Accelerated Life Testing
(ALT)


Accelerated Life Test (ALT)
♦ An Accelerated Life Test (ALT) is the process of
determining the reliability of a product in a short period
of time by accelerating the use environment.
♦ ALTs are good for finding dominant failure
mechanisms.
♦ ALTs are usually performed on individual assemblies
rather than full systems.
♦ ALTs are also frequently used when there is a wear-out
mechanism involved.


Stress
• Anything applied to a product, either electrically or
environmentally, to accelerate finding possible
weaknesses

• Examples of Electrical Stress: Current, Voltage (DC
and AC), Power Cycling, and Frequency (line and
board)
• Examples of Environmental Stress: Temperature
Extremes, Temperature Cycling, Vibration, Shock,
Humidity, ESD, Drop, Altitude


Physical Acceleration
♦ Acceleration means that operating a unit at high
stress (temperature, voltage, humidity, duty cycle,
etc.) produces the same failures that would occur at
typical-use stresses, except that they happen much
more quickly.

♦ Failure may be due to mechanical fatigue, corrosion,
chemical reaction, diffusion, migration, etc. The
causes are the same; only the time scale is
different.
♦ Changing the stress is equivalent to transforming the
time scale. This is often a linear transform, which
means the time-to-fail at high stress is multiplied by a
constant (the acceleration factor) to obtain the equivalent
time-to-fail at use stress.
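
As one illustration of the linear time transform, the sketch below uses the Arrhenius model for temperature acceleration. The model choice, the 0.7 eV activation energy, and the temperatures are assumptions made for the example; the right model and parameters depend on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(activation_energy_ev, use_temp_c, stress_temp_c):
    """Arrhenius acceleration factor between a use and a stress temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)

# Hypothetical mechanism: Ea = 0.7 eV, 40 C in the field, 85 C on test.
af = arrhenius_af(0.7, use_temp_c=40, stress_temp_c=85)

# Linear transform: time-to-fail at use stress = AF x time-to-fail at high stress.
test_hours_to_fail = 1000
use_hours_to_fail = af * test_hours_to_fail
print(f"AF = {af:.1f}, equivalent use hours = {use_hours_to_fail:.0f}")
```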

Failure Mode Dependence
♦ Keep in mind that the acceleration factor is highly
dependent on the failure mechanism.
♦ Each failure mechanism will most likely have a
different acceleration factor.

♦ During testing, conduct thorough failure analysis and
separate the failure mechanisms for separate
analysis.

♦ Selecting the stress to apply must be done with the
expected failure mechanisms in mind.


Theory of ALT
Classic S-N Diagram

[Figure: S-N curve. Stress axis: S2, S1, S0. Number of Cycles axis: N2, N1, N0.]

When to Apply ALT

ALT Region of Application


ALT Parameters
In order to set up an ALT, we must know several different
parameters, including
• Length of Test
• Number of Samples
• Goal of Test
• Confidence Desired
• Accuracy Desired
• Cost
• Acceleration Factor
• Field Environment
• Test Environment
• Acceleration Factor Calculation
• Slope of Weibull Distribution (Beta parameter)
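
The role of the Weibull slope can be illustrated with the two-parameter Weibull reliability function R(t) = exp(-(t/eta)^beta). The characteristic life and the beta values below are hypothetical. A beta above 1 indicates an increasing failure rate (wear-out), a beta of 1 a constant failure rate, and a beta below 1 a decreasing failure rate.

```python
import math

def weibull_reliability(t, eta, beta):
    """Two-parameter Weibull reliability: fraction surviving to time t.

    eta is the characteristic life (same units as t); beta is the slope.
    """
    return math.exp(-((t / eta) ** beta))

eta = 10_000  # hypothetical characteristic life in hours
for beta in (0.5, 1.0, 3.0):
    r = weibull_reliability(2_000, eta, beta)
    print(f"beta={beta}: R(2000 h) = {r:.3f}")
```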

Review

♦ When wear-out is a dominant failure
mechanism, we must be able to predict or
characterize this wear-out mechanism to
assure that it occurs outside customer
expectations and outside the warranty period.

♦ ALT is an excellent method for doing this


HALT vs. ALT
When to Use Which
Technique?


Overview

HALT and ALT are two of the most
popular testing methods, but
engineers are often confused about
which to use when.


Overview
Highly Accelerated Life Testing (HALT) is a great
reliability technique to use for finding predominant
failure mechanisms in a hardware product.

However, in many cases, the predominant failure
mechanism is wear-out.

When this is the situation, we must be able to predict or
characterize this wear-out mechanism to assure that it
occurs outside customer expectations and outside the
warranty period.

The best technique to use for this is a slower test method:
Accelerated Life Testing (ALT).

Overview
In many cases, it is best to use both
because each technique is good at
finding different types of failure
mechanisms.

The proper use of both techniques
together will offer a complete picture
of the reliability of the product.


HALT
Highly Accelerated Life Testing
used for Product Ruggedization

ALT
Accelerated Life Testing
used to Characterize Predominant Failure Mechanisms,
Especially for Wearout


Comparison Between
ALT and HALT
FAILURE TESTING

HALT
OBJECTIVES
1. Root Cause Analysis
2. Corrective Action Identification
3. Design Robustness Determination
TESTING REQUIREMENTS
1. Detailed Product Knowledge
2. Engineering Experience

ALT
OBJECTIVES
1. Reliability Evaluation (e.g. Failure Rates)
2. Dominant Failure Mechanisms Identification
TESTING REQUIREMENTS
1. Detailed Parameters
(a) Test Length
(b) Number of Samples
(c) Confidence/Accuracy
(d) Acceleration Factors
(e) Test Environment
2. Test Metrology & Factors
(a) 4:2:1 Procedure or Other
(b) Costs
ANALYTICAL MODELS
1. Weibull Distribution
2. Arrhenius
3. Coffin-Manson
4. Norris-Landzberg
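
As an example of how one of these analytical models is applied, the sketch below uses the basic Coffin-Manson relation for thermal cycling, AF = (ΔT_test / ΔT_use)^m. The temperature swings and the exponent m = 2 are hypothetical; the exponent depends on the material and failure mechanism and would normally come from test data or the literature.

```python
def coffin_manson_af(delta_t_test, delta_t_use, exponent):
    """Basic Coffin-Manson acceleration factor for thermal cycling.

    delta_t_test and delta_t_use are the temperature swings (same units);
    exponent is the material-dependent Coffin-Manson exponent.
    """
    return (delta_t_test / delta_t_use) ** exponent

# Hypothetical case: 100 C swing on test, 25 C swing in the field, m = 2.
af = coffin_manson_af(delta_t_test=100, delta_t_use=25, exponent=2.0)
field_cycles = 16_000
test_cycles_needed = field_cycles / af
print(f"AF = {af:.0f}; {test_cycles_needed:.0f} test cycles represent {field_cycles} field cycles")
```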


Advantage of ALT over
HALT

One key advantage of ALT over HALT is when we
need to know the life of the product.
In HALT, we don’t concern ourselves with this
much because we are more interested in making
the product as reliable as we can, and measuring
the amount of reliability is not as important.
However, with mechanical items that wear over
time, it is very important to know the life of the
product as accurately as possible.


Advantage of ALT over HALT
Another advantage is that we often do not need any
environmental equipment. Benchtop testing is often adequate.


Advantage of HALT over
ALT
A big advantage of HALT over ALT is time. We
are not so worried about time to failure as we are
about which failure mode is dominant, and this we can
usually find out in a matter of days rather than
weeks or months.
This savings in time is also a big savings in money
since it takes less time at a test lab.
The number of samples is far fewer (usually 10 to
1).
We don’t need to calculate an acceleration factor.
We don’t need to stay with the same stresses as the
field environment because of the cross-over effect.

Combining ALT with HALT
Often we will run a product through HALT and then
run the subassemblies that were not good candidates
for HALT through ALT.

[Figure: HALT on System; ALT on System Fan]


Developing ALT from HALT
And at other times, we may develop the ALT based on the
HALT limits, using the same accelerants but lowering the
acceleration factors to measurable levels.

[Figure: HALT on System; ALT on System]


Examples of Products for
HALT and ALT
♦ Component
♦ Automobile
♦ Fan
♦ Hard Drive
♦ Robot
♦ Automotive Electronics
♦ Infusion Pump
♦ Medical Cabinet
♦ Cell Phone

These pictures are representative of products we have tested. The actual
products are not shown, to protect their proprietary nature.

Component

Characteristic                                         Accelerant
Aging                                                  High Temperature
Contamination, Package Hermeticity                     Temp/Humidity
Mismatch of Thermal Characteristics of Package Matls   Temp Cycling
Die Attachment, Bond Wires                             Vibration


Automobile

Test          Accelerant
Electronics   Temperature, Vibration, Humidity, Contamination
Mechanical    Repetitive cycling test


Fan

Test                  Accelerant
Spinning              Duty Cycle, Speed, Torque, Backpressure
Lubricant Longevity   Temperature, Humidity, Contamination


Hard Drive

Test                             Accelerant
Head Spinning                    Duty Cycle, Start/Stop, Speed, Temperature?, Vibration?
Contamination on Head Surface    Non-Operational Vibration
Board Derating                   Temperature/Voltage
Connectors – Power, Data         Duty Cycle, Force, Angle


Robot

Test Accelerant
Arm Movement (side to side) Duty Cycle, Speed, Torque

Z-Stage (up and down) Duty Cycle, Speed, Torque
Vacuum Hold-down Temperature, Altitude
Repeatability Duty Cycle


Automotive Electronics – GPS Receiver

Test             Accelerant
Electronics      Temperature, Vibration, Humidity, Contamination
Button Pushing   Duty Cycle, Force?, Angle


Infusion Pump

Test                                                              Accelerant
Battery Charging                                                  Duty Cycle, Deep Discharge, Speed of Charge
Touchscreen                                                       Duty Cycle, Location, Force?
Pumping                                                           Duty Cycle, Rate, Plunger Force
Connectors – Battery, Charger, Pole Clamp, IV Line, Cassette      Duty Cycle, Force, Angle


Drawer for
Medical Cabinet

Test Accelerant

Opening/Closing of Drawer Duty Cycle, Force, Angle

Locking Mechanism Duty Cycle, Force, Contamination


Cell Phone

Test                                      Accelerant
Button Pushing                            Duty Cycle, Force?, Angle
Touchscreen                               Duty Cycle, Location, Force?
Connectors – Headset, Battery, Charger    Duty Cycle, Force, Angle


Summary
When wear-out is not a dominant failure
mechanism, HALT is an excellent tool for
finding product weaknesses in a short
period of time.


Summary
When wear-out is a dominant failure
mechanism, we must be able to predict or
characterize this wear-out mechanism to
assure that it occurs outside customer
expectations and outside the warranty
period.

ALT is an excellent method for doing this.


RELIABILITY
DEMONSTRATION
TESTING (RDT)


Reliability Demonstration Testing (RDT)
♦ A sample of units is tested at accelerated
stresses for several months.
♦ The stresses are a bit lower than the HALT
stresses and they are held constant (or cycled
constantly) rather than gradually increasing.
♦ This enables us to calculate the acceleration
factor for the test.
♦ The RDT can be used to validate the reliability
prediction analyses.
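
As a minimal sketch of the acceleration-factor calculation: assuming temperature is the only accelerating stress and that the Arrhenius model applies (neither is stated in the slides), the factor between a use temperature and a constant test temperature could be computed as below. The activation energy of 0.7 eV and the temperatures are illustrative assumptions, not values from the seminar.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(use_temp_c, test_temp_c, activation_energy_ev=0.7):
    """Acceleration factor of a constant-temperature test relative to use conditions.

    Assumes the Arrhenius model. activation_energy_ev is an assumed value and
    must be chosen for the failure mechanism of interest.
    """
    use_k = use_temp_c + 273.15
    test_k = test_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / test_k)
    return math.exp(exponent)


def equivalent_field_hours(test_hours, acceleration_factor):
    """Field hours represented by test_hours spent at the accelerated stress."""
    return test_hours * acceleration_factor


if __name__ == "__main__":
    # Hypothetical conditions: 40 deg C in the field, 70 deg C in the chamber.
    af = arrhenius_acceleration_factor(use_temp_c=40.0, test_temp_c=70.0)
    print(f"Acceleration factor: {af:.1f}")
    print(f"1,000 test hours represent about {equivalent_field_hours(1000, af):,.0f} field hours")
```

Because the RDT stress is held constant, a single factor like this applies to the whole test, which is not the case for the stepped stresses of HALT.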


RDT vs. ALT
♦ RDT and ALT are very similar in that the stresses
are usually accelerated but at a lower level than
HALT.
♦ The main difference between RDT and ALT is that
ALT is usually used to characterize the wearout
region of the product whereas RDT is usually used
to demonstrate the MTBF in the steady state region
of the product.
♦ In an RDT, you CAN substitute samples for time.
♦ In an ALT, you CANNOT substitute samples for
time.
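
The samples-for-time trade in an RDT follows from the constant failure rate of the steady state region: only the total accumulated unit-hours matter. A minimal sketch, assuming an exponential life distribution and a test that ends with zero failures (both assumptions made for illustration, not stated in the slides), shows that doubling the sample size while halving the test time demonstrates the same MTBF. The function name and the numbers are hypothetical.

```python
import math


def demonstrated_mtbf_zero_failures(units, hours_per_unit, confidence=0.90,
                                    acceleration_factor=1.0):
    """One-sided lower confidence bound on MTBF for a test with zero failures.

    Assumes a constant failure rate (exponential life distribution), so the
    bound depends only on total equivalent unit-hours T: T / -ln(1 - confidence).
    """
    total_hours = units * hours_per_unit * acceleration_factor
    return total_hours / -math.log(1.0 - confidence)


if __name__ == "__main__":
    # Same total unit-hours in both plans, so the demonstrated MTBF is the same.
    few_units_long = demonstrated_mtbf_zero_failures(units=10, hours_per_unit=2000)
    many_units_short = demonstrated_mtbf_zero_failures(units=20, hours_per_unit=1000)
    print(f"10 units x 2000 h: {few_units_long:,.0f} h")
    print(f"20 units x 1000 h: {many_units_short:,.0f} h")
```

The same trade does not hold in the wear-out region, where the failure rate rises with age: many units run for a short time never reach the wear-out mechanism.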


RDT vs. ALT

[Figure: plot marking the ALT Region and the RDT Region]


HALT vs. RDT


Overview
Highly Accelerated Life Testing (HALT) is a great
reliability technique to use for finding predominant
failure mechanisms in a hardware product.

However, in many cases, customers need to know the
MTBF or Annualized Failure Rate (AFR) of a product in
the field.

When this is the situation, most people turn to RDT.

Recently, however, we have developed a method for
estimating MTBF from HALT data.
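
MTBF and AFR are two views of the same quantity when the failure rate is constant. As a minimal sketch, assuming an exponential life distribution and continuous operation of 8,760 hours per year (both assumptions, not stated in the slides), the conversion could look like this. The function names are hypothetical and are not part of any tool described here.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumes continuous (24x7) operation


def afr_from_mtbf(mtbf_hours, hours_per_year=HOURS_PER_YEAR):
    """Fraction of units expected to fail in one year, given MTBF in hours."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)


def mtbf_from_afr(afr, hours_per_year=HOURS_PER_YEAR):
    """MTBF in hours for an AFR expressed as a fraction (0 < afr < 1)."""
    return -hours_per_year / math.log(1.0 - afr)


if __name__ == "__main__":
    afr = afr_from_mtbf(40_000)
    print(f"MTBF 40,000 h -> AFR {afr:.1%}")
    print(f"AFR {afr:.1%} -> MTBF {mtbf_from_afr(afr):,.0f} h")
```

For MTBF values much larger than the annual operating hours, the AFR is close to hours_per_year / mtbf_hours.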


The AFR Estimator

The AFR Estimator is a patent-pending
mathematical model that, when provided with
the appropriate HALT and product
information, will accurately estimate the
product’s field Annualized Failure Rate (AFR).
This methodology has been used on a number
of products with significant positive financial
results.


Justification for the
AFR Estimator
HALT takes only a few days to run and to implement its
corrective action(s). Even if it took a bit longer, this would
still be far less time than waiting for an RDT to be run and its
corrective action(s) implemented. Applying this model can
therefore be a huge time and cost saver.
Because higher HALT limits equate to lower AFR, you now have a
tool that can accurately estimate the field AFR before
launching the product. The stress levels depicted in the
table in Section E are highly recommended for HALT. These
levels can assure the producer that the product will exceed
customer expectations and allow the producer to accurately
forecast warranty expenditures.


AFR Estimator
Performing HALT alone, rather than life tests, saves time and
money. This is not to say that life testing isn’t
important. It should be considered for new technologies and
for an existing part/design used in a different application, but
not as a process for accurately estimating AFR.
With seven to ten simple data entry points, most of them
coming from the HALT effort, the AFR Estimator instantly
provides a field AFR estimate with its associated 90%
statistical confidence limits. The inputs for HASS and HASA
are: whether you will perform HASS or HASA, the daily sample
size, and the shift in AFR you wish to detect.


AFR Estimator
The AFR Estimator has been validated on over twenty products
from diverse manufacturers and design environments.
The model can accommodate HALT sample sizes from one to
six, with the optimum size being four. Sample sizes greater
than four will default to four.
90% upper and lower confidence limits are calculated based on
the HALT AFR and the HALT Sample Size.


Recommendations when
using the AFR Estimator
An effective HALT needs to be done with at least three units,
and preferably four, although the model can
accommodate sample sizes from one to six.
Please realize that HALT sample sizes of three or fewer will
dramatically affect the ability to detect product defects; the
statistical confidence is impacted as well.


Recommendations when
using the AFR Estimator
1. Root Cause for Failures
2. Robust Protocols
3. Achieve at least the Guard Band Limits
4. For HASS or HASA, normalize chamber vibration tables
5. Obtain a copy of “HALT, HASS, & HASA Explained” by
Harry McLean and use it as a reference.


Recommendation 1:
Root Cause for Failures
For each issue encountered, the root cause must be
understood, corrective action implemented, and the fix then
verified in HALT under the same stress conditions in which the
defect was detected. The exceptions are limitations
that occur beyond the Guard Band Limits in the table following
Section E. Issues encountered beyond these levels should still
have root cause analysis performed, but whether to implement
corrective action is a business decision based on timeliness,
cost, and program delays.


Recommendation 3:
Achieve Guard Band Limits
For the maximum benefit of a low field AFR or a high MTBF,
it is suggested that the product achieve at least the levels shown
under the Guard Band Limits in Section E below. These are
very achievable with time and understanding within the
organization without having to use extended (more costly)
temperature range components.


How to Use the Estimator:


Calculated MTBF Estimate
The MTBF estimate in kHours can come from Telcordia, Relex,
or a similar tool. If this estimate is not available, use 40,000
as a default value for the estimator. This parameter has very
little effect on the final field AFR or MTBF estimate, because of
the highly variable processes followed and the many
assumptions used in estimating an MTBF. Enter this value in
the table following Section H. Please note that the estimator
will recommend an MTBF of 40,000 when a value of less than
40,000 is entered.


HALT Operating Limits
Enter the final operating limits achieved in HALT, as measured
on the product and not the chamber setpoint, in the table
following Section H:
♦ Hot Operating Limit (HOL)
♦ Cold Operating Limit (COL)
♦ Vibration Operating Limit (VOL)


Product Environment
These are the product’s published thermal operating specifications, in
°C. Try to match your product's Published Specifications to a
corresponding Level number listed in the table below; e.g., a
high-end consumer product equates to a Level 2.

Product's Published Specs   Category               Guard Band Limits   Level
0°C to +40°C                Consumer               -30°C to +80°C      1
0°C to +50°C                Hi-end Consumer        -30°C to +100°C     2
-10°C to +50°C              Hi Performance         -40°C to +110°C     3
-20°C to +50°C              Critical Application   -50°C to +110°C     4
-25°C to +65°C              Sheltered              -50°C to +110°C     5
-40°C to +85°C              All Outdoor            -65°C to +110°C     6
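
The table lends itself to a simple lookup. The sketch below encodes the six rows and checks whether the hot and cold operating limits measured in HALT reach the Guard Band Limits for a given Level. The names `GUARD_BANDS`, `level_for_spec`, and `meets_guard_band` are hypothetical; this is an illustration of the table, not part of the AFR Estimator itself.

```python
# Rows transcribed from the Product Environment table, keyed by Level.
# "spec" and "guard_band" are (low, high) temperature pairs in deg C.
GUARD_BANDS = {
    1: {"category": "Consumer", "spec": (0, 40), "guard_band": (-30, 80)},
    2: {"category": "Hi-end Consumer", "spec": (0, 50), "guard_band": (-30, 100)},
    3: {"category": "Hi Performance", "spec": (-10, 50), "guard_band": (-40, 110)},
    4: {"category": "Critical Application", "spec": (-20, 50), "guard_band": (-50, 110)},
    5: {"category": "Sheltered", "spec": (-25, 65), "guard_band": (-50, 110)},
    6: {"category": "All Outdoor", "spec": (-40, 85), "guard_band": (-65, 110)},
}


def level_for_spec(spec_low_c, spec_high_c):
    """Return the Level whose published spec matches exactly, or None."""
    for level, row in GUARD_BANDS.items():
        if row["spec"] == (spec_low_c, spec_high_c):
            return level
    return None


def meets_guard_band(level, cold_operating_limit_c, hot_operating_limit_c):
    """True if the HALT operating limits reach or exceed the guard band."""
    low, high = GUARD_BANDS[level]["guard_band"]
    return cold_operating_limit_c <= low and hot_operating_limit_c >= high


if __name__ == "__main__":
    level = level_for_spec(0, 50)  # a high-end consumer product
    print(f"Level: {level}")
    # Hypothetical HALT result: COL of -40 deg C, HOL of +105 deg C.
    print(f"Meets guard band: {meets_guard_band(level, -40, 105)}")
```

A product whose published specs fall between rows returns None here; choosing the closest Level in that case is a judgment call the sketch does not attempt.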


Running the Estimator
Once the Value for AFR Estimator column is completed, you
are ready to run the AFR Estimator and determine the
product’s AFR, MTBF, Confidence Limits, and days to detect
shift in AFR if HASS or HASA is being used.


WRAP-UP


Design for Reliability (DfR) Tools
♦ Reliability Assessment, Goal Setting, and Planning
♦ Reliability Modeling and Prediction
♦ Thermal Analysis
♦ Derating Analysis
♦ Failure Modes and Effects Analysis (FMEA)
♦ Fault Tree Analysis (FTA)
♦ Design of Experiments (DoE)
♦ Human Engineering/Human Factors Analysis
♦ Highly Accelerated Life Test (HALT)
♦ Accelerated Life Test (ALT)
♦ RDT and ORT
♦ Highly Accelerated Stress Screen (HASS)
♦ Root Cause Analysis (RCA)
♦ Restriction of Hazardous Substances (RoHS)
♦ Outsourced Engineering and Reliability
♦ Field Data Analysis
Red shows tools we introduced today.

Thank you for your
participation!

