DfR Seminar at Wyle Labs - Mike Silverman - Presentation1. &
We Provide You Confidence in Your Product Reliability™
Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com
2. DESIGN FOR
RELIABILITY (DfR)
SEMINAR
at
February 11, 2010
Mike Silverman // (408) 472-3889 // mikes@opsalacarte.com
Ops A La Carte LLC // www.opsalacarte.com
© 2009 Ops A La Carte 1
3. DfR Seminar Overview
Thurs, Feb 11, 2010
- DFR SEMINAR -
♦ 10:00-10:10am Introduction
♦ 10:10-10:30am DfR Overview/Introduction
♦ 10:30-11:00am FMEA
♦ 11:00-11:30am Using FMEA to Design a Better Reliability Test Program
♦ 11:30-11:50am HALT
♦ 11:50-12:10pm Lunch Break
♦ 12:10-12:30pm ALT
♦ 12:30-1:00pm HALT vs. ALT – When to Use Which Technique?
♦ 1:00-1:15pm Reliability Demonstration Test (RDT)
♦ 1:15-1:45pm HALT vs. RDT – The HALT Calculator
♦ 1:45-2:00pm Wrap-Up/Questions
Note that this ½ day seminar is an abridged version of a 5 day DfX seminar we will be holding 3 times this year:
- Apr 16-20 in Santa Clara, CA
- May 17-21 in Huntsville, AL
- Oct 11-15 in Maryland
4. Product Life Cycle Reliability and Test Spectrum
Wyle and OPS Combined Capabilities
[Figure: services mapped across the life cycle phases Program Capture, Design, Build, Test & Eval, Qualify, Manufacture, Operate & Maintain. Key: Wyle, Ops, Wyle & OPS.]
Test Engineering Services: Test Quotes, Tech Requirements, Test Plans, Test Procedures, Test Data Analysis
Test Services: HALT, HASS, Dev Test, Qual Test, Acceptance
Reliability, Maintainability, Supportability Services: FMECA, Reliability Eng & Analysis, Configuration Management, Publications, Training, Asset Management, RCM, Lean Six Sigma, NDI, TOC
6. First we must ask: What is Reliability?
Reliability is often considered quality over time.
Reliability is…
“The ability of a system or component to perform its required
functions under stated conditions for a specified period of time”
- IEEE 610.12-1990
♦ We shall revisit this when we discuss Reliability Goal Setting.
7. Different Views of Reliability
♦ Product development teams
View reliability as the domain that addresses mechanical, electrical, and
manufacturing issues.
♦ Customers
View reliability as a system-level issue, with minimal concern placed on the
distinction into sub-domains.
♦ Since the primary measure of reliability is made by the customer,
engineering teams must maintain a balance of both views (system and
sub-domain) in order to develop a reliable product.
[Figure: Mechanical Reliability + Electrical Reliability + SW Reliability, combining into System]
8. Reliability vs. Cost
♦ Intuitively, an emphasis on reliability to
achieve a reduction in warranty and in-service
costs results in some minimal increase in
development and manufacturing costs.
♦ Use of the proper tools during the proper life
cycle phase will help to minimize total Life
Cycle Cost (LCC).
9. Reliability vs. Cost, continued
To minimize total Life Cycle Costs (LCC), an
organization must do two things:
1. Choose the best tools from all of the tools
available and apply these tools at the proper
phases of the product life cycle.
2. Properly integrate these tools together to assure
that the proper information is fed forwards and
backwards at the proper times.
10. Reliability Integration
“The process of seamlessly and
cohesively integrating reliability
tools together to maximize
reliability at the lowest
possible cost”
11. Reliability vs. Cost, continued
[Figure: HW reliability and costs. Cost is plotted against reliability with three curves: reliability program costs, warranty costs, and the total cost curve, with the optimum cost point marked on the total cost curve.]
12. ELEMENTS
OF A
RELIABILITY
PROGRAM
13. DfR Tool Selection
A reliability assessment is the recommended first
step in establishing a reliability program. This
mechanism is the appropriate forum for selecting
the best tools for each product life cycle phase.
15. Reliability Program Assessment
• Initiate a Reliability Program
• Determine next best steps
• Reduce customer complaints
• Select right tools
• Improve reliability
A detailed evaluation of an organization’s approach and
processes involved in creating reliable products. The assessment
captures the current state and leads to an actionable reliability
program plan.
[Figure: path from Now (field failures, complaints, $ unreliability, unknown reliability) through Assessment (Interviews, Benchmarking, Statistical Data Analysis, Gap Analysis, Program Plan) to Goal ($ profits, market share, satisfaction)]
16. Agenda
• motivation
• approach
• results
• findings
• observations
• next steps
• close
17. Assessment Motivation
• Identify systemic changes that impact
reliability
– Tie into culture and product
– Both enjoy benefits
• Provides roadmap for activities that
achieve results
– Matching of capabilities and expectations
– Cooperative approach
19. Steps Involved
♦ selecting people to
survey
♦ selecting survey topics
♦ develop scoring system
♦ data analysis
♦ summary feedback
results
♦ review of results
♦ recommended actions
20. Select People to Survey
Hardware:
• Hardware manager
• Electrical engineering lead
• Mechanical engineering lead
• System engineering lead
• Reliability manager/engineer
• Procurement
• Manufacturing
Software:
• SW R&D manager
• SW R&D engineer
• SW test manager
• SW test engineer
21. Select Survey Topics
DFR Methods Survey
Scoring: 4 = 100%, top priority, always done
3 = >75%, use normally, expected
2 = 25% - 75%, variable use
1 = <25%, only occasional use
0 = not done or discontinued
- = not visible, no comment
Management:
□ Goal setting for division
□ Priority of quality & reliability improvement
□ Management attention & follow up (goal ownership)
Design:
□ Documented hardware design cycle
□ Goal setting by product or module
22. Example
♦ To what extent is FMEA used?
Design Engineer
Score = 1: Used only as a troubleshooting tool
Manufacturing Engineer
Score = 3: Commonly used on critical design elements
Reliability Engineer
Score = 4: Always used on all products
Results: Score 2.6
Comments: Clearly a disconnect between reliability and
design engineering – indicative of a problem with the tool.
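The example above can be sketched in a few lines of Python. This is only an illustration, not the assessment tool itself: the function name `summarize` and the `gap_threshold` of 2 points are assumptions made here to flag a "disconnect" between roles. The mean of the three scores shown works out to about 2.67 (the slide reports 2.6).

```python
from statistics import mean

# Survey answers for one topic on the 0-4 scale from the DFR Methods Survey.
fmea_scores = {
    "Design Engineer": 1,
    "Manufacturing Engineer": 3,
    "Reliability Engineer": 4,
}

def summarize(scores, gap_threshold=2):
    """Return the mean score, the spread between roles, and a disconnect flag.

    gap_threshold is a hypothetical cutoff: a spread this large or larger
    between the lowest and highest answer is flagged for discussion.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "mean": round(mean(values), 2),
        "spread": spread,
        "disconnect": spread >= gap_threshold,
    }

summary = summarize(fmea_scores)
print(summary)
```

Answers marked "-" (not visible, no comment) would need to be left out of the dictionary before averaging.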
23. Reliability Maturity Grid
• 5 levels of maturity
• Loosely based on IEEE 1332: “Reliability Program
for the Development and Production of Electronic
Products” (currently in draft form)
• Similar to Crosby’s Quality Maturity
• On the following page is a matrix based on
Crosby’s as an example.
• Read across each row and find the statement that
seems most true for your organization.
• The center of mass of the levels is the
organization’s overall level.
24. Reliability Maturity Matrix
Stages: I = Uncertainty, II = Awakening, III = Enlightenment, IV = Wisdom, V = Certainty
Measurement Category: Management Understanding and Attitude
• Stage I: No comprehension of reliability as a management tool. Tend to blame reliability engineering for ‘reliability problems’.
• Stage II: Recognizing that reliability management may be of value but not willing to provide money or time to make it happen.
• Stage III: Still learning more about reliability management. Becoming supportive and helpful.
• Stage IV: Participating. Understand absolutes of reliability management. Recognize their personal role in continuing emphasis.
• Stage V: Consider reliability management an essential part of company system.
Measurement Category: Reliability status
• Stage I: Reliability is hidden in manufacturing or engineering departments. Reliability testing probably not part of organization. Emphasis on initial product functionality.
• Stage II: A stronger reliability leader appointed, yet main emphasis is still on an audit of initial product functionality. Reliability testing still not performed.
• Stage III: Reliability manager reports to top management, with role in management of division.
• Stage IV: Reliability manager is an officer of company; effective status reporting and preventive action. Involved with consumer affairs.
• Stage V: Reliability manager is on board of directors. Prevention is main concern. Reliability is a thought leader.
Measurement Category: Problem handling
• Stage I: Fire fighting; no root cause analysis or resolution; lots of yelling and accusations.
• Stage II: Teams are set up to solve major problems. Long-range solutions are not identified or implemented.
• Stage III: Corrective action process in place. Problems are recognized and solved in orderly way.
• Stage IV: Problems are identified early in their development. All functions are open to suggestion and improvement.
• Stage V: Except in the most unusual cases, problems are prevented.
Measurement Category: Cost of Reliability as % of net revenue
• Stage I: Warranty: unknown. Reported: unknown. Actual: 20%
• Stage II: Warranty: 3%. Reported: unknown. Actual: 18%
• Stage III: Warranty: 4%. Reported: 8%. Actual: 12%
• Stage IV: Warranty: 3%. Reported: 6.5%. Actual: 8%
• Stage V: Warranty: 1.5%. Reported: 3%. Actual: 3%
Measurement Category: Feedback process
• Stage I: None. No reliability testing. No field failure reporting other than customer complaints and returns.
• Stage II: Some understanding of field failures and complaints. Designers and manufacturing do not get meaningful information.
• Stage III: Accelerated testing of critical systems during design. System level modeling and testing. Field failures analyzed and root causes reported.
• Stage IV: Refinement of testing systems – only testing critical or uncertain areas. Increased understanding of causes of failure allows deterministic failure rate prediction models.
• Stage V: The few field failures are fully analyzed and product designs or procurement specifications altered. Reliability testing done to augment reliability models.
Measurement Category: DFR program status
• Stage I: No organized activities. No understanding of such activities.
• Stage II: Organization told reliability is important. DFR tools and processes inconsistently applied and only ‘when time permits’.
• Stage III: Implementation of DFR program with thorough understanding and establishment of each tool.
• Stage IV: DFR program active in all areas of division – not just design & mfg’ing. DFR normal part of R&D and manufacturing.
• Stage V: Reliability improvement is a normal and continued activity.
Measurement Category: Summation of reliability posture
• Stage I: “We don’t know why we have problems with reliability.”
• Stage II: “Is it absolutely necessary to always have problems with reliability?”
• Stage III: “Through commitment and reliability improvement we are identifying and resolving our problems.”
• Stage IV: “Failure prevention is a routine part of our operation.”
• Stage V: “We know why we do not have problems with reliability.”
25. Reliability Maturity Matrix
Let's look at one row to get a better understanding.
Measurement Category: Problem handling
• Stage I (Uncertainty): Fire fighting; no root cause analysis or resolution; lots of yelling and accusations.
• Stage II (Awakening): Teams are set up to solve major problems. Long-range solutions are not identified or implemented.
• Stage III (Enlightenment): Corrective action process in place. Problems are recognized and solved in orderly way.
• Stage IV (Wisdom): Problems are identified early in their development. All functions are open to suggestion and improvement.
• Stage V (Certainty): Except in the most unusual cases, problems are prevented.
26. Results & Meaning
• Looking for trends, gaps in process, skill mismatches,
over analysis, under analysis, etc.
• Looking for differences across the organization,
pockets of excellence, areas with good results
• Process provides snapshot of current system
• No one tool makes an entire reliability program. The
tools need to match the needs of the products and
the culture.
• A check step is critical before moving to
recommendations for an improvement plan
27. Observations
What Companies Are Doing Best
♦ Prediction
♦ HALT
♦ Golden nuggets
♦ Fast reaction to fix problems
What Companies Are Weak at
♦ Goal setting/Planning
♦ Repair & warranty invisible
♦ Lessons learned capture
♦ Single owner of product reliability
♦ Multiple defect tracking systems
♦ Reliability Integration
♦ Statistics
28. Next Steps
• Determine current state of your organization
(Summary of Assessment)
– Identify strong and weak areas
• Goal Setting
– Market Analysis to gather requirements
– Benchmarking
• Gap Analysis
• Develop plan and implement
30. FMEA
An FMEA is a systematic method
of identifying and preventing
product and process problems
BEFORE they occur.
35. FMEA Benefits
♦ Facilitates investigation of design alternatives to consider high
reliability at the conceptual stages of the design.
♦ Provides a basis for identifying root cause failures and
developing corrective actions.
♦ Determines the effects of each failure mode on system
performance.
♦ Aids in developing test methods and troubleshooting
techniques.
♦ Provides a foundation for qualitative analyses.
♦ Provides a structured forum for cross-functional discussions.
♦ Provides a common understanding and focus to reduce product
or process issues.
♦ Provides documentation of the risk management effort.
36. Types of FMEAs
• Design FMEA
• Process FMEA
• System FMEA
• Functional FMEA
• User FMEA
• Software FMEA
• Many others
37. When Is an FMEA Performed?
• FMEAs are begun early in the design process and
then updated throughout the life cycle of a product to
capture changes in the design.
38. The 10 Steps
♦ Step 1: Review the Process/Design
♦ Step 2: Brainstorm potential failure modes
♦ Step 3: List potential effects of each failure mode
♦ Step 4: Assign a severity rating for each effect
♦ Step 5: Assign an occurrence rating for failure modes
♦ Step 6: Assign a detection rating for modes/effects
♦ Step 7: Calculate the risk priority numbers
♦ Step 8: Prioritize the failure modes for action
♦ Step 9: Take action to eliminate/reduce high-risk failure modes
♦ Step 10: Calculate the resulting RPN
39. Step 1: Review the Design or Process
♦ Understand the topic of study
• Design – drawings, prototypes, etc.
• Process – flowcharts, assembly instructions, etc.
♦ Focus on developing common understanding of
design or process
♦ Designers or Process Experts available for questions
40. Step 2: Brainstorm potential failure
modes
♦ Have fun!
♦ How can the design/process fail?
♦ Break complex designs/processes into smaller
elements
♦ Combine like ideas (affinity plotting)
♦ May have more than one failure mode per item or
function
♦ List failure modes on worksheet
♦ Determine failure modes vs. failure mechanisms
♦ Use Boundary Interface Diagram Tool
♦ Use P-Diagram Tool
41. Common brainstorming tools
♦ Team dynamics
♦ Consensus-building techniques
♦ Team project documentation
♦ Idea-generation techniques
• Group brainstorming with a facilitator
• Affinity diagramming
♦ Flowcharting
♦ Boundary Interface Diagram
♦ P-Diagram
♦ Data analysis
♦ Graphing techniques
42. Step 3: List Potential effects of each
failure mode
♦ If the failure occurs, what are the consequences?
♦ List effect for each failure mode (not mechanism).
♦ List more than one effect, when necessary
• (note: list more than one effect if the ratings will be different, or
the solution would have to be different)
43. Step 4: Assign a severity rating for each
effect
♦ What is the consequence of the failure should it
occur?
♦ Assign a severity rating for each effect
♦ An estimation of how serious the effects would be if
the failure mode occurs
• Historical data
• Engineering judgment
• Experimentation, DOE, if needed
44. Severity
Severity is the assessment of the seriousness of the
effect of the failure mode to the next component,
subsystem, system or customer if it occurs.
Below is a typical Severity Rating Table.
Rating  Description       Definition
10      Dangerously High  Catastrophic Failure Causing Replacement of the Entire System
9       Very high         Failure of a FRU Component, MTTR > 1 Hour
8       High              Failure of a FRU Component, MTTR < 1 Hour
6       Moderate          Failure that results in reduced throughput
4       Minor             Failure that requires a tool reset or recalibration
2       Very minor        Failure that can be corrected during a PM cycle
1       None              Failure that does not affect system performance
45. Step 5: Assign an occurrence rating for
each failure mode
♦ What is the probability of the failure occurring?
♦ List the potential causes of failure
♦ Use actual data when available for rating
♦ When real data is not available:
• Engineering estimates or models
• Consider the probabilities of the failure causes
• Rank order, then assign rating
46. Probability of Occurrence
Probability of Occurrence can be in terms of failure rate or
can just be a scale of 1-10 relative to all other failure modes.
Below is a typical Probability Rating Table.
Rating  Description       Definition
10      Dangerously High  Likely to Occur Chronically (Daily or Hourly)
9       Very High         Likely to Occur during one week of operation
8       High              Likely to occur during one month of operation
6       Medium            Likely to occur during one year of operation
4       Moderate          Is likely to Occur during the Life of the System
2       Low               A Remote Probability of Occurrence in the Life of the System
1       Remote            An Unlikely Probability of Occurrence in the Life of the System
47. Step 6: Assign a detection rating for each
failure mode and/or effect
♦ What is the probability of the failure being detected
before the impact of the effect is realized?
♦ List known current controls
♦ Those items without controls are unlikely to be
detected (scoring 9 or 10)
♦ Again, use actual data when possible
48. Detection
A third factor used in assessing the risk of a failure is the
likelihood of Detection of the failure before releasing the
product. The following table is an example of detection
scores (note that a high score indicates that the failure
is more difficult to detect).
Below is a typical Detection Rating Scale.
Rating  Description     Definition
5       Very Low        No ability to detect before it occurs and some ability to detect after (unconfirmed failures)
3       Moderate        No ability to detect before it occurs but can detect after
2       High            Some ability to detect before it occurs but can detect after
1       Almost Certain  Very likely it will be detectable before it occurs and after
Note that the Detection Scale has been derated (scale 1-5 only). For many industries, the
key drivers are severity and probability.
In many industries, there is a high unconfirmed failure rate. Yet there is a high
probability of failures repeating themselves when units go back to the field after not
being confirmed, hence the importance of health diagnostics and a condition-based
maintenance strategy built on these health monitoring diagnostics.
49. Step 7: Calculate the risk priority number
for each effect
♦ RPN = S x P x D
♦ Risk Priority Number equals
Severity rating times
Probability of Occurrence rating times
Detection rating
50. Risk Priority Number
♦ Risk Priority Number (RPN)
• The RPN is the product of the Severity Score, the
Probability Score, and the Detection Score.
• Once all of the RPNs have been calculated, the data
can be sorted from highest to lowest RPN to show
which are the most critical items to work on.
• Below is an example of an RPN Table
RISK VALUE (RPN)
251-500  Intolerable Risk: Additional measures are required to ensure adequate safety.
101-250  Undesirable Risk: Risk is tolerable only if risk reduction is impractical or if reduction costs are grossly disproportionate to the improvement(s) gained. (Requires Executive Mgt. Approval.)
11-100   Tolerable Risk: The risk is tolerable if the cost of risk reduction will exceed the improvement(s) gained. (Requires Project Mgt. Approval.)
1-10     Negligible: Acceptable as implemented.
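As a minimal sketch of the calculation, the Python below multiplies the three ratings and looks the result up in the example risk bands above. It assumes the rating scales shown in this seminar (Severity and Probability on 1-10, Detection derated to 1-5, so the maximum RPN is 500); the function names and the sample ratings are illustrative only.

```python
# Risk bands copied from the example RPN table: (upper bound, label).
RISK_BANDS = [
    (10, "Negligible"),
    (100, "Tolerable Risk"),
    (250, "Undesirable Risk"),
    (500, "Intolerable Risk"),
]

def rpn(severity, probability, detection):
    """RPN = S x P x D, with S and P on 1-10 and D on the derated 1-5 scale."""
    if not (1 <= severity <= 10 and 1 <= probability <= 10 and 1 <= detection <= 5):
        raise ValueError("rating out of range for the scales assumed here")
    return severity * probability * detection

def risk_band(value):
    """Return the label of the first band whose upper bound covers the RPN."""
    for upper, label in RISK_BANDS:
        if value <= upper:
            return label
    raise ValueError("RPN above 500 is not possible on these scales")

# Hypothetical failure mode: high severity, medium probability, moderate detection.
example = rpn(severity=8, probability=6, detection=3)
print(example, risk_band(example))
```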
51. Step 8: Prioritize the failure modes for
action
♦ Simple rank ordering from high to low based on RPN
♦ Decide on cutoff value
• Those above get attention & resources to improve
• Those below are left alone for now
♦ Consider including above the cut off any Severity
rating of 9 or 10
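The prioritization rule above can be sketched as follows. The failure modes, their ratings, and the cutoff of 100 are hypothetical values chosen for illustration; the only logic taken from the slide is the rank ordering by RPN, the cutoff, and the option to pull any Severity 9 or 10 item above the cutoff.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int
    probability: int
    detection: int

    @property
    def rpn(self):
        return self.severity * self.probability * self.detection

def prioritize(modes, cutoff, severity_override=9):
    """Rank by RPN (high to low) and split into 'act now' and 'leave alone for now'."""
    ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
    act = [m for m in ranked if m.rpn > cutoff or m.severity >= severity_override]
    defer = [m for m in ranked if m not in act]
    return act, defer

# Hypothetical worksheet entries.
modes = [
    FailureMode("Connector corrosion", 6, 6, 3),      # RPN 108
    FailureMode("Fan bearing wear", 8, 4, 2),         # RPN 64
    FailureMode("Battery thermal runaway", 10, 1, 2), # RPN 20, but Severity 10
    FailureMode("Label fading", 2, 4, 1),             # RPN 8
]

act, defer = prioritize(modes, cutoff=100)
print([m.name for m in act])
print([m.name for m in defer])
```

Note how the low-RPN but Severity 10 item is still selected for action, which is the point of the override.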
52. Step 9: Take action to eliminate or reduce
the high risk failure modes
♦ Use an organized problem-solving process
♦ Identify and implement actions to eliminate or reduce
the high-risk failure modes
♦ Consider DOE as tool to break down and solve
multiple variable or complex issues
53. Step 10: Calculate the resulting RPN as
the failure modes are reduced or
eliminated
♦ Document progress in reducing product risk with an
update by the team of the resulting RPN.
♦ You should expect a 50% or greater reduction in total
RPN after an FMEA.
♦ Continue to make improvements on highest risk items
until time, resources or overall ROI shift focus.
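A simple way to track this is to total the RPNs before and after corrective actions and compute the percentage reduction. The sketch below uses hypothetical RPN values; only the 50% expectation comes from the slide.

```python
def total_rpn_reduction(before, after):
    """Percentage reduction in total RPN after corrective actions."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    return 100.0 * (total_before - total_after) / total_before

# Hypothetical RPNs per failure mode, before and after actions.
before = {"Connector corrosion": 108, "Fan bearing wear": 64, "Battery thermal runaway": 20}
after = {"Connector corrosion": 36, "Fan bearing wear": 32, "Battery thermal runaway": 20}

reduction = total_rpn_reduction(before, after)
print(f"Total RPN reduced by {reduction:.1f}%")
print("Meets 50% expectation:", reduction >= 50.0)
```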
54. Linking FMEAs with Test Plans
In order to write better test plans,
we must first understand:
- the use environment
- the key risks to the design
The best tool for this is FMEA
55. Developing Better Test Plans
Stated another way, we cannot
know what to test for unless we
understand the key risks.
Therefore, FMEA is one of the
best sources of input for a
Reliability Test Plan.
57. Developing a Test Plan
without FMEA
What types of tests can you think of for
this device?
58. Developing a Test Plan
without FMEA
We used the IEC standards and came up
with a number of solid tests, including:
High/Low Temperature
Temperature Cycling
Vibration
Drop
Shock
Crush
Humidity
Altitude
Did we miss any?
65. FMEA Generated Tests
Then we performed an FMEA and
came up with the following:
Different cleaning solutions
Pen test
Lipstick test
Motor Oil Test
Cap Tether Test
Did we miss any?
66. Conclusion
FMEA is a development tactic
that can help solve the problem
of testing too little by uncovering
failure modes that require
tailored test methods rather than
only cookbook methods from
industry standards.
68. HALT - Highly Accelerated
Life Test
Quickly discover design issues.
Evaluate & improve design margins.
Release mature product at market introduction.
Reduce development time & cost.
Eliminate design problems before release.
Evaluate cost reductions made to product.
Developmental HALT is not really a test you pass or fail;
it is a process tool for the design engineers.
There are no pre-established limits.
69. HALT, How It Works
[Figure: cycle diagram showing the step Stress]
Start low and step up the stress, testing the product
during the stressing
70. HALT, How It Works
[Figure: cycle diagram showing the steps Stress, Failure]
Gradually increase stress level until a failure occurs
71. HALT, How It Works
[Figure: cycle diagram showing the steps Stress, Failure, Analysis]
Analyze the failure
72. HALT, How It Works
[Figure: cycle diagram showing the steps Stress, Failure, Analysis, Improve]
Make temporary improvements
73. HALT, How It Works
[Figure: cycle diagram showing the steps Stress (increase), Failure, Analysis, Improve]
Increase stress and start process over
74. HALT, How It Works
[Figure: cycle diagram showing the steps Stress (increase), Failure, Analysis, Improve]
Fundamental Technological Limit
75. HALT, Why It Works
Classic S-N Diagram
(stress vs. number of cycles)
S0 = Normal Stress conditions
N0 = Projected Normal Life
[Figure: S-N curve with stress levels S0, S1, S2 and corresponding cycle counts N0, N1, N2]
76. HALT, Why It Works
Classic S-N Diagram
(stress vs. number of cycles)
Point at which failures become non-relevant
S0 = Normal Stress conditions
N0 = Projected Normal Life
[Figure: the same S-N curve, marking the point at which failures become non-relevant]
77. Margin Improvement Process
[Figure: stress axis marked, from left to right, with Lower Destruct Limit, Lower Oper. Limit, Product Operational Specs, Upper Oper. Limit, Upper Destruct Limit]
78. Margin Improvement Process
This is what the product spec distribution really looks like
[Figure: the same stress axis and limits]
79. Margin Improvement Process
[Figure: the same stress axis and limits, with the Operating Margin and Destruct Margin marked]
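To make the margin idea concrete, the sketch below computes margins from a set of limits found in HALT. It assumes that both the operating margin and the destruct margin are measured from the product operational spec to the corresponding limit, which is a common reading of this kind of diagram but is not stated on the slide. All temperatures are hypothetical.

```python
def margins(spec_low, spec_high, oper_low, oper_high, destruct_low, destruct_high):
    """Distance from the product spec to each operating and destruct limit.

    Assumption: margins are measured from the spec edge to the limit,
    in the same units as the stress (here, degrees C).
    """
    return {
        "lower_operating_margin": spec_low - oper_low,
        "upper_operating_margin": oper_high - spec_high,
        "lower_destruct_margin": spec_low - destruct_low,
        "upper_destruct_margin": destruct_high - spec_high,
    }

# Hypothetical thermal limits found during a HALT.
result = margins(spec_low=0, spec_high=50,
                 oper_low=-40, oper_high=90,
                 destruct_low=-60, destruct_high=110)
for name, value in result.items():
    print(f"{name}: {value} C")
```

Re-running the calculation after each design improvement shows whether the margins are actually growing.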
81. When to Perform HALT?
Feasibility (P1-P2)
Perform HALT on 1 to 2 early prototypes. These samples may be hand-made and test coverage may be low, but we can still get clues as to gross design issues.
Development (Late P2)
Perform HALT on more samples. These samples will be closer to final product and functional tests will be more refined with higher test coverage.
Qualification (P3)
♦ Demonstrate 100% reliability target @ 80% C.L.
♦ Shipping / Packaging test
♦ Validation HALT can be performed here
Launch
♦ Tracking reliability through field data
Lessons learned feedback to next generation product
82. Summary of Results
- by stress -
Cold Step Stress: 14%
Hot Step Stress: 17%
Rapid Thermal Transitions: 4%
Vibration Step Stress: 45%
Combined Environment: 20%
Significance:
Without Combined Environment, 20% of all
failures would have been missed
83. Traditional vs HALT
Engineering Needs
[Figure: Product Development Manpower Requirements. Spending rate plotted against time, with markers DVT1 ... DVTn and MR on the curves, and a region labeled $ Savings.]
84. HALT
Cost Benefits
Reduced product time to market
Lowered warranty cost through higher MTBF
Faster DVT with fewer product samples
Accelerated screening (HASS) allowed
86. Accelerated Life Test (ALT)
♦ An Accelerated Life Test (ALT) is the process of
determining the reliability of a product in a short period
of time by accelerating the use environment.
♦ ALTs are good for finding dominant failure
mechanisms.
♦ ALTs are usually performed on individual assemblies
rather than full systems.
♦ ALTs are also frequently used when there is a wear-out
mechanism involved.
87. Stress
• Anything applied to a product, either electrically or
environmentally, to accelerate finding possible
weaknesses
• Examples of Electrical Stress: Current, Voltage (DC
and AC), Power Cycling, and Frequency (line and
board)
• Examples of Environmental Stress: Temperature
Extremes, Temperature Cycling, Vibration, Shock,
Humidity, ESD, Drop, Altitude
88. Physical Acceleration
♦ Acceleration means that operating a unit at high
stress (temperature, voltage, humidity, or duty cycle,
etc.) produces the same failures that would occur at
typical-use stresses, except that they happen much
more quickly.
♦ Failure may be due to mechanical fatigue, corrosion,
chemical reaction, diffusion, migration, etc. The
causes are the same; only the time scale is
different.
♦ Changing the stress is equivalent to transforming the
time scale. This is often a linear transform, which
means the time-to-fail at high stress is multiplied by a
constant (acceleration factor) to obtain the equivalent
time-to-fail at use stress.
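As one illustration of the linear time transform, the sketch below uses the Arrhenius model for a temperature-driven mechanism. The activation energy of 0.7 eV and the temperatures are assumptions chosen for the example; in practice the activation energy depends on the failure mechanism and has to come from data or literature for that mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_af(t_use_c, t_stress_c, activation_energy_ev):
    """Arrhenius acceleration factor between a use and a stress temperature (deg C)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)

def equivalent_use_time(time_to_fail_at_stress, acceleration_factor):
    """Linear transform: time-to-fail at use stress = AF x time-to-fail at high stress."""
    return time_to_fail_at_stress * acceleration_factor

# Hypothetical case: 40 C in the field, 85 C in the chamber, Ea assumed to be 0.7 eV.
af = arrhenius_af(t_use_c=40.0, t_stress_c=85.0, activation_energy_ev=0.7)
use_hours = equivalent_use_time(1000.0, af)
print(f"AF = {af:.1f}; 1000 h at stress is roughly {use_hours:.0f} h at use conditions")
```

The same two-step pattern (compute the factor, then scale the time) applies to other acceleration models; only the formula for the factor changes.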
89. Failure Mode Dependence
♦ Keep in mind that the acceleration factor is highly
dependent on the failure mechanism.
♦ Each failure mechanism will most likely have a
different acceleration factor.
♦ During testing, conduct thorough failure analysis and
separate the failure mechanisms for separate
analysis.
♦ Selecting the stress to apply must be done with the
expected failure mechanisms in mind.
90. Theory of ALT
Classic S-N Diagram
(stress vs. number of cycles)
S0 = Normal Stress conditions
N0 = Projected Normal Life
[Figure: S-N curve; vertical axis Stress (S0, S1, S2), horizontal axis Number of Cycles (N2, N1, N0)]
91. When to Apply ALT
ALT Region of Application
92. ALT Parameters
In order to set up an ALT, we must know several different
parameters, including
• Length of Test
• Number of Samples
• Goal of Test
• Confidence Desired
• Accuracy Desired
• Cost
• Acceleration Factor
• Field Environment
• Test Environment
• Acceleration Factor Calculation
• Slope of Weibull Distribution (Beta parameter)
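Several of these parameters trade off against each other. As a hedged illustration, the sketch below uses the zero-failure (success-run) relation with a Weibull shape parameter to estimate how many samples are needed to demonstrate a reliability goal at a given confidence. This is one common planning formula, not necessarily the method used in the seminar, and the goal, confidence and beta values are assumptions.

```python
import math

def zero_failure_sample_size(reliability, confidence, test_to_life_ratio=1.0, weibull_beta=1.0):
    """Samples needed to demonstrate `reliability` at `confidence` with no failures.

    n = ln(1 - C) / (L**beta * ln(R)), where L is the ratio of equivalent
    test time per unit to the life being demonstrated and beta is the
    assumed Weibull slope. Rounded up to a whole unit.
    """
    n = math.log(1.0 - confidence) / (test_to_life_ratio ** weibull_beta * math.log(reliability))
    return math.ceil(n)

# Hypothetical goal: 90% reliability at 80% confidence.
n_one_life = zero_failure_sample_size(0.90, 0.80)
n_two_lives = zero_failure_sample_size(0.90, 0.80, test_to_life_ratio=2.0, weibull_beta=2.0)
print("Units needed, testing each for one life:", n_one_life)
print("Units needed, testing each for two lives (beta = 2):", n_two_lives)
```

Testing each unit longer than one life reduces the sample count, and the reduction is stronger when the assumed Weibull slope is steeper, which is why the Beta parameter appears in the list above.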
93. Review
♦ When wear-out is a dominant failure
mechanism, we must be able to predict or
characterize this wear-out mechanism to
assure that it occurs outside customer
expectations and outside the warranty period.
♦ ALT is an excellent method for doing this
95. Overview
HALT and ALT are two of the most
popular testing methods, but
engineers are often confused about
which to use when.
96. Overview
Highly Accelerated Life Testing (HALT) is a great
reliability technique to use for finding predominant
failure mechanisms in a hardware product.
However, in many cases, the predominant failure
mechanism is wear-out.
When this is the situation, we must be able to predict or
characterize this wear-out mechanism to assure that it
occurs outside customer expectations and outside the
warranty period.
The best technique to use for this is a slower test method:
Accelerated Life Testing (ALT).
97. Overview
In many cases, it is best to use both
because each technique is good at
finding different types of failure
mechanisms.
The proper use of both techniques
together will offer a complete picture
of the reliability of the product.
98. HALT
Highly Accelerated Life Testing
used for Product Ruggedization
ALT
Accelerated Life Testing
used to Characterize Predominant Failure Mechanisms,
Especially for Wearout
99. Comparison Between
ALT and HALT
FAILURE TESTING
HALT
OBJECTIVES
1. Root Cause Analysis
2. Corrective Action Identification
3. Design Robustness Determination
TESTING REQUIREMENTS
1. Detailed Product Knowledge
2. Engineering Experience
ALT
OBJECTIVES
1. Reliability Evaluation (e.g. Failure Rates)
2. Dominant Failure Mechanisms Identification
TESTING REQUIREMENTS
1. Detailed Parameters
(a) Test Length
(b) Number of Samples
(c) Confidence/Accuracy
(d) Acceleration Factors
(e) Test Environment
2. Test Metrology & Factors
(a) 4:2:1 Procedure Or Other
(b) Costs
ANALYTICAL MODELS
1. Weibull Distribution
2. Arrhenius
3. Coffin-Manson
4. Norris-Landzberg
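As a small example of one of the analytical models listed, the sketch below applies the basic Coffin-Manson relation for thermal cycling, in which the acceleration factor is the ratio of temperature swings raised to an exponent. The exponent of 2.5 and the temperature swings are assumptions for illustration; the exponent is material- and mechanism-dependent.

```python
def coffin_manson_af(delta_t_stress, delta_t_use, exponent):
    """Basic Coffin-Manson acceleration factor: (dT_stress / dT_use) ** m."""
    return (delta_t_stress / delta_t_use) ** exponent

def equivalent_field_cycles(test_cycles, acceleration_factor):
    """Field cycles represented by a number of accelerated test cycles."""
    return test_cycles * acceleration_factor

# Hypothetical case: 100 C swing in test, 40 C swing in the field, m assumed to be 2.5.
cm_af = coffin_manson_af(delta_t_stress=100.0, delta_t_use=40.0, exponent=2.5)
field_cycles = equivalent_field_cycles(500, cm_af)
print(f"AF = {cm_af:.2f}; 500 test cycles represent about {field_cycles:.0f} field cycles")
```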
100. Advantage of ALT over
HALT
One key advantage of ALT over HALT is when we
need to know the life of the product.
In HALT, we don’t concern ourselves with this
much because we are more interested in making
the product as reliable as we can, and measuring
the amount of reliability is not as important.
However, with mechanical items that wear over
time, it is very important to know the life of the
product as accurately as possible.
101. Advantage of ALT over HALT
Another advantage is that we often do not need any
environmental equipment. Benchtop testing is often adequate.
102. Advantage of HALT over
ALT
A big advantage of HALT over ALT is time. We
are not so worried about time to failure as we are
which failure mode is dominant. And this we can
usually find out in a matter of days rather than
weeks or months.
This savings in time is also a big savings in money
since it takes less time at a test lab.
The number of samples is far fewer (usually 10 to
1)
We don’t need to calculate acceleration factor
We don’t need to stay with the same stresses as the
field environment because of the cross-over effect
103. Combining ALT with HALT
Often we will run a product through HALT and then
run the subassemblies that were not good candidates
for HALT through ALT.
HALT on System ALT on System Fan
104. Developing ALT from HALT
And at other times, we may develop the ALT based on the
HALT limits, using the same accelerants but lowering the
acceleration factors to measurable levels.
HALT on System ALT on System
105. Examples of Products for
HALT and ALT
Component
Robot
Fan
Infusion Pump
Hard Drive
Medical Cabinet
Automotive Electronics
Cell Phone
Automobile
These pictures are representative of the kinds of products we have tested. They are
not the actual products, in order to protect the proprietary nature of the products we test.
106. Component
Characteristic                                             Accelerant
Aging                                                      High Temperature
Contamination, Package Hermeticity                         Temp/Humidity
Mismatch of Thermal Characteristics of Package Materials   Temp Cycling
Die Attachment, Bond Wires                                 Vibration
107. Automobile
Test          Accelerant
Electronics   Temperature, Vibration, Humidity, Contamination
Mechanical    Repetitive cycling test
108. Fan
Test                  Accelerant
Spinning              Duty Cycle, Speed, Torque, Backpressure
Lubricant Longevity   Temperature, Humidity, Contamination
109. Hard Drive
Test                            Accelerant
Head Spinning                   Duty Cycle, Start/Stop, Speed, Temperature?, Vibration?
Contamination on Head Surface   Non-Operational Vibration
Board Derating                  Temperature/Voltage
Connectors – Power, Data        Duty Cycle, Force, Angle
110. Robot
Test Accelerant
Arm Movement (side to side) Duty Cycle, Speed, Torque
Z-Stage (up and down) Duty Cycle, Speed, Torque
Vacuum Hold-down Temperature, Altitude
Repeatability Duty Cycle
111. Automotive Electronics – GPS Receiver
Test             Accelerant
Electronics      Temperature, Vibration, Humidity, Contamination
Button Pushing   Duty Cycle, Force?, Angle
112. Infusion Pump
Test                                                               Accelerant
Battery Charging                                                   Duty Cycle, Deep Discharge, Speed of Charge
Touchscreen                                                        Duty Cycle, Location, Force?
Pumping                                                            Duty Cycle, Rate, Plunger Force
Connectors – Battery, Charger, Pole Clamp, IV Line, Cassette       Duty Cycle, Force, Angle
113. Drawer for
Medical Cabinet
Test Accelerant
Opening/Closing of Drawer Duty Cycle, Force, Angle
Locking Mechanism Duty Cycle, Force, Contamination
114. Cell Phone
Test                                     Accelerant
Button Pushing                           Duty Cycle, Force?, Angle
Touchscreen                              Duty Cycle, Location, Force?
Connectors – Headset, Battery, Charger   Duty Cycle, Force, Angle
115. Summary
When wear-out is not a dominant failure
mechanism, HALT is an excellent tool for
finding product weaknesses in a short
period of time.
116. Summary
When wear-out is a dominant failure
mechanism, we must be able to predict or
characterize this wear-out mechanism to
ensure that it occurs beyond the life the
customer expects and beyond the warranty
period.
ALT is an excellent method for doing this.
118. Reliability Demonstration Testing (RDT)
♦ A sample of units is tested at accelerated
stresses for several months.
♦ The stresses are a bit lower than the HALT
stresses and they are held constant (or cycled
constantly) rather than gradually increasing.
♦ This enables us to calculate the acceleration
factor for the test.
♦ The RDT can be used to validate the reliability
prediction analyses.
119. RDT vs. ALT
♦ RDT and ALT are very similar in that the stresses
are usually accelerated but at a lower level than
HALT.
♦ The main difference between RDT and ALT is that
ALT is usually used to characterize the wearout
region of the product whereas RDT is usually used
to demonstrate the MTBF in the steady state region
of the product.
♦ In an RDT, you CAN substitute samples for time.
♦ In an ALT, you CANNOT substitute samples for
time.
120. RDT vs. ALT
ALT Region
RDT Region
124. Overview
Highly Accelerated Life Testing (HALT) is a great
reliability technique to use for finding predominant
failure mechanisms in a hardware product.
However, in many cases, customers need to know the
MTBF or Annualized Failure Rate (AFR) of a product in
the field.
When this is the situation, most people turn to RDT.
However, recently we have developed a method for
estimating MTBF from HALT data.
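Since MTBF and AFR are used side by side here, a short sketch of the usual conversion between them may help. It assumes a constant failure rate and continuous (24x7) operation; it is the generic textbook relationship, not the AFR Estimator itself, and the 100,000-hour MTBF is a hypothetical input.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumes the product runs continuously


def afr_from_mtbf(mtbf_hours, hours_per_year=HOURS_PER_YEAR):
    """Fraction of units expected to fail in one year of operation."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)


def mtbf_from_afr(afr, hours_per_year=HOURS_PER_YEAR):
    """Inverse of afr_from_mtbf; afr is a fraction between 0 and 1."""
    return -hours_per_year / math.log(1.0 - afr)


if __name__ == "__main__":
    afr = afr_from_mtbf(100_000.0)  # hypothetical MTBF in hours
    print("AFR:", afr)
    print("MTBF recovered from AFR:", mtbf_from_afr(afr))
```

For products that are powered only part of the time, pass the actual operating hours per year instead of 8760.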
125. The AFR Estimator
The AFR Estimator is a patent-pending
mathematical model that, when provided with
the appropriate HALT and product
information, will accurately estimate the
product’s field AFR (Annualized Failure Rate).
This methodology has been used on a number
of products with significant positive financial
results.
126. Justification for the
AFR Estimator
HALT takes only a few days to run and to implement its
corrective action(s). Even if it took a bit longer, this time
would still be far less than waiting for an RDT to be run and
its corrective action(s) to be implemented. The application of this
model can therefore be a huge time and cost saver.
As higher HALT limits equate to lower AFR, you now have a
tool that can accurately estimate the field AFR before
launching the product. Stress levels that are depicted in the
table in Section E are highly recommended for HALT. These
levels can assure the producer that the product will exceed
customer expectations and allow the producer to accurately
forecast warranty expenditures.
127. Justification for the
AFR Estimator
By not performing life tests and simply doing HALT, time and
money will be saved. This is not to say that life testing isn’t
important. It should be considered for new technologies and
for an existing part/design with a different application but not
as a process to accurately estimate AFR.
With seven to ten simple data entry points, most of them
coming from the HALT effort, the AFR Estimator will provide
an accurate field AFR instantaneously, along with its associated 90%
statistical confidence limits. The inputs for HASS and HASA
are: whether you will perform HASS or HASA, the daily sample size,
and the shift in the AFR you wish to detect.
128. Justification for the
AFR Estimator
The AFR Estimator has been validated on over twenty products
from diverse manufacturers and design environments.
The model can accommodate HALT sample sizes from one to
six with the optimum size being four. Sample sizes of greater
than four will default to four.
90% upper and lower confidence limits are calculated based on
the HALT AFR and the HALT Sample Size.
129. Recommendations when
using the AFR Estimator
An effective HALT needs to be done with at least three units,
and preferably four, although the model can
accommodate sample sizes from one to six.
Please realize that HALT sample sizes of three or fewer will
dramatically reduce the ability to detect product defects, and
hence the statistical confidence is likewise reduced.
130. Recommendations when
using the AFR Estimator
1. Root Cause for Failures
2. Robust Protocols
3. Achieve at least the Guard Band Limits
4. For HASS or HASA, normalize chamber vibration tables
5. Obtain a copy of “HALT, HASS, & HASA Explained” by
Harry McLean and use it as a reference.
131. Recommendation 1:
Root Cause for Failures
Each issue encountered needs to have its root cause
understood and corrective action implemented, and the fix then
verified in HALT under the same stress conditions in which the
defect was detected. Exceptions to this would be limitations
that occur beyond the Guard Band Limits in the table following
Section E. Issues encountered beyond these levels should still have
root cause analysis performed, but whether to implement corrective
action is a business decision based on timeliness, cost,
and program delays.
132. Recommendation 3:
Achieve Guard Band Limits
For the maximum benefit of a low field AFR or a high MTBF,
it is suggested that the product achieve at least the levels shown
under the Guard Band Limits in Section E below. These are
very achievable with time and understanding within the
organization without having to use extended (more costly)
temperature range components.
133. How to Use the Estimator:
134. How to Use the Estimator:
Calculated MTBF Estimate
The MTBF estimate in kHours can be from Telcordia, Relex,
or a similar tool. If this estimate is not available, use 40,000
as a default value for the estimator. This parameter has very
little effect on the final field AFR or MTBF estimate because
of the highly variable processes and the many
assumptions involved in estimating an MTBF. Enter this value in
the table following Section H. Please note that the estimator
will recommend an MTBF of 40,000 when a value of less than
40,000 is used.
135. How to Use the Estimator:
HALT Operating Limits
Hot Operating Limit (HOL): the final hot operating limit achieved
in HALT, as measured on the product and not the chamber setpoint.
Enter this value in the table following Section H.
Cold Operating Limit (COL): the final cold operating limit achieved
in HALT, as measured on the product and not the chamber setpoint.
Enter this value in the table following Section H.
Vibration Operating Limit (VOL): the final vibration operating limit
achieved in HALT, as measured on the product and not the chamber
setpoint. Enter this value in the table following Section H.
136. How to Use the Estimator:
Product Environment
Use the product’s published thermal operating specifications, in
°C. Try to match your product's Published Specifications to a
corresponding Level number listed in the table below, e.g., a
high-end consumer product equates to a Level 2.
Product's Published Specs   Category               Guard Band Limits   Level
0°C to +40°C                Consumer               -30°C to +80°C      1
0°C to +50°C                Hi-end Consumer        -30°C to +100°C     2
-10°C to +50°C              Hi Performance         -40°C to +110°C     3
-20°C to +50°C              Critical Application   -50°C to +110°C     4
-25°C to +65°C              Sheltered              -50°C to +110°C     5
-40°C to +85°C              All Outdoor            -65°C to +110°C     6
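As a small illustration of how the table might be used, the sketch below encodes the Guard Band Limits by Level and checks whether the thermal operating limits found in HALT reach them. The dictionary values are copied from the table above; the function name and the example limits are hypothetical, and this check is not part of the AFR Estimator itself.

```python
# Guard Band Limits (cold, hot) in degrees C, keyed by Level, copied from the table.
GUARD_BAND_LIMITS_C = {
    1: (-30, 80),    # Consumer
    2: (-30, 100),   # Hi-end Consumer
    3: (-40, 110),   # Hi Performance
    4: (-50, 110),   # Critical Application
    5: (-50, 110),   # Sheltered
    6: (-65, 110),   # All Outdoor
}


def meets_guard_band(level, col_c, hol_c):
    """True if the HALT cold/hot operating limits reach the level's guard band."""
    cold_limit, hot_limit = GUARD_BAND_LIMITS_C[level]
    return col_c <= cold_limit and hol_c >= hot_limit


if __name__ == "__main__":
    # Hypothetical HALT result for a Level 2 product: COL = -35 C, HOL = 105 C.
    print(meets_guard_band(2, col_c=-35.0, hol_c=105.0))
```

The limits passed in should be the values measured on the product, not the chamber setpoints.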
137. How to Use the Estimator:
Running the Estimator
Once the Value for AFR Estimator column is completed, you
are ready to run the AFR Estimator and determine the
product’s AFR, MTBF, Confidence Limits, and days to detect
shift in AFR if HASS or HASA is being used.
139. Design for Reliability (DfR) Tools
♦ Reliability Assessment, Goal Setting, and Planning
♦ Reliability Modeling and Prediction
♦ Thermal Analysis
♦ Derating Analysis
♦ Failure Modes and Effects Analysis (FMEA)
♦ Fault Tree Analysis (FTA)
♦ Design of Experiments (DoE)
♦ Human Engineering/Human Factors Analysis
♦ Highly Accelerated Life Test (HALT)
♦ Accelerated Life Test (ALT)
♦ RDT and ORT
♦ Highly Accelerated Stress Screen (HASS)
♦ Root Cause Analysis (RCA)
♦ Restriction of Hazardous Substances (RoHS)
♦ Outsourced Engineering and Reliability
♦ Field Data Analysis
Red shows tools we introduced today.