Design for Reliability Seminar for WYLE Labs - Feb 2010 - Mike Silverman Webinar
1. We Provide You Confidence in Your Product Reliability™
Ops A La Carte / (408) 654-0499 / askops@opsalacarte.com / www.opsalacarte.com
2. DESIGN FOR
RELIABILITY (DfR)
SEMINAR
at
February 11, 2010
Mike Silverman // (408) 472-3889 // mikes@opsalacarte.com
Ops A La Carte LLC // www.opsalacarte.com
© 2009 Ops A La Carte 1
3. DfR Seminar Overview
Thurs, Feb 11, 2010
- DFR SEMINAR -
10:00-10:10am Introduction
10:10-10:30am DfR Overview/Introduction
10:30-11:00am FMEA
11:00-11:30am Using FMEA to Design a Better Reliability Test Program
11:30-11:50am HALT
11:50-12:10pm Lunch Break
12:10-12:30pm ALT
12:30-1:00pm HALT vs. ALT – When to Use Which Technique?
1:00-1:15pm Reliability Demonstration Test (RDT)
1:15-1:45pm HALT vs. RDT – The HALT Calculator
1:45-2:00pm Wrap-Up/Questions
Note that this ½ day seminar is an abridged version of a 5 day DfX seminar we will be holding 3 times this year:
- Apr 16-20 in Santa Clara, CA
- May 17-21 in Huntsville, AL
- Oct 11-15 in Maryland
4. Product Life Cycle Reliability and Test Spectrum
Wyle and OPS Combined Capabilities
[Figure: life cycle phases (Program Capture, Design, Build, Test & Eval, Qualify, Manufacture, Operate & Maintain) mapped against three service groups, with a key marking each service as Wyle, Ops, or Wyle & OPS:
- Test Engineering Services: Test Quotes, Tech Requirements, Test Plans, Procedures, Test Data Analysis
- Test Services: HALT, HASS, Dev Test, Qual Test, Acceptance
- Reliability, Maintainability, Supportability Services: FMECA, Reliability Eng & Analysis, Configuration Management, Publications, Training, Asset Management, RCM, Lean Six Sigma, NDI, TOC]
5. Jim Pinyan
Director of Business Development
Test, Engineering & Research Division
(310) 563-6651
Jim.pinyan@wyle.com
128 Maryland Street
El Segundo, CA 90245
6. RELIABILITY CONSULTING
Company Overview
7. Ops A La Carte provides clients with integrated reliability solutions across the
Product Life Cycle.
We have the unique ability to assess a product and understand the key
reliability elements necessary to measure/improve product performance
and customer satisfaction.
Our strength lies in our ability to tailor a solution to fit your needs based on
your product reliability requirements, schedule and budget.
8. HALT and HASS Labs
• Our own lab facility located in Northern California in the heart of Silicon Valley. We
provide HALT/HASS services on a world-wide basis, using partner labs for tests
outside California.
• Second oldest HALT facility in the world, established in 1995 (originally owned by
QualMark)
• HALT equipment has all latest technology – only lab in region
• Highly-experienced staff with over 100 years of combined experience in HALT and
HASS
• Tested over 500 products in over 30 different industries
9. The following presentation materials are
copyright protected property of
Ops A La Carte LLC.
These materials may not be distributed
outside of your company.
11. First we must ask: What is Reliability?
Reliability is often considered quality over time.
Reliability is…
“The ability of a system or component to perform its required
functions under stated conditions for a specified period of time”
- IEEE 610.12-1990
We shall revisit this when we discuss Reliability Goal Setting.
12. Different Views of Reliability
Product development teams
View reliability as the domain to address mechanical, electrical, and
manufacturing issues.
Customers
View reliability as a system-level issue, with minimal concern placed on the
distinction into sub-domains.
Since the primary measure of reliability is made by the customer,
engineering teams must maintain a balance of both views (system and
sub-domain) in order to develop a reliable product.
[Figure: Mechanical Reliability + Electrical Reliability + SW Reliability, combined at the System level]
13. Reliability vs. Cost
Intuitively, emphasizing reliability to reduce
warranty and in-service costs results in some
minimal increase in development and
manufacturing costs.
Use of the proper tools during the proper life
cycle phase will help to minimize total Life
Cycle Cost (LCC).
14. Reliability vs. Cost, continued
To minimize total Life Cycle Costs (LCC), an
organization must do two things:
1. Choose the best tools from all of the tools
available and apply these tools at the proper
phases of the product life cycle.
2. Properly integrate these tools together to assure
that the proper information is fed forwards and
backwards at the proper times.
15. Reliability Integration
“the process of seamlessly,
cohesively integrating reliability
tools together to maximize
reliability at the lowest
possible cost”
16. Reliability vs. Cost, continued
[Figure: HW Reliability & Costs. Cost (vertical axis) vs. reliability (horizontal axis), showing the reliability program costs curve, the warranty costs curve, and the total cost curve with its optimum cost point]
17. ELEMENTS
OF A
RELIABILITY
PROGRAM
18. DfR Tool Selection
A reliability assessment is the recommended first
step in establishing a reliability program. This
mechanism is the appropriate forum for selecting
the best tools for each product life cycle phase.
20. Reliability Program Assessment
• Initiate a Reliability Program
• Determine next best steps
• Reduce customer complaints
• Select right tools
• Improve reliability
A detailed evaluation of an organization’s approach and
processes involved in creating reliable products. The assessment
captures the current state and leads to an actionable reliability
program plan.
[Figure: path from "Now" (field failures, complaints, unknown reliability, $ unreliability) to "Goal" ($ profits, market share, satisfaction) through Assessment Interviews, Statistical Data Analysis, Benchmarking, Gap Analysis, and Program Plan]
21. Agenda
• motivation
• approach
• results
• findings
• observations
• next steps
• close
22. Assessment Motivation
• Identify systemic changes that impact
reliability
– Tie into culture and product
– Both enjoy benefits
• Provides roadmap for activities that
achieve results
– Matching of capabilities and expectations
– Cooperative approach
24. Steps Involved
selecting people to survey
selecting survey topics
develop scoring system
data analysis
summary feedback
results
review of results
recommended actions
25. Select People to Survey
Hardware:
Hardware manager
Electrical engineering lead
Mechanical engineering lead
System engineering lead
Reliability manager/engineer
Procurement
Manufacturing
Software:
SW R&D manager
SW R&D engineer
SW test manager
SW test engineer
26. Select Survey Topics
DFR Methods Survey
Scoring: 4 = 100%, top priority, always done
3 = >75%, use normally, expected
2 = 25% - 75%, variable use
1 = <25%, only occasional use
0 = not done or discontinued
- = not visible, no comment
Management:
□ Goal setting for division
□ Priority of quality & reliability improvement
□ Management attention & follow up (goal ownership)
Design:
□ Documented hardware design cycle
□ Goal setting by product or module
27. Example
To what extent is FMEA used?
Design Engineer
Score = 1: Used only as a troubleshooting tool
Manufacturing Engineer
Score = 3: Commonly used on critical design elements
Reliability Engineer
Score = 4: Always used on all products
Results: Score 2.7 (average of the three scores)
Comments: Clearly a disconnect between reliability and
design engineering – indicative of a problem with the tool.
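The scoring in this example can be sketched in a few lines. This is only an illustration, assuming one 0-4 score per respondent; the function name and the 2-point spread threshold for flagging a disconnect are hypothetical choices, not part of the survey method.

```python
from statistics import mean


def summarize_scores(scores):
    """Return (mean, spread) for one survey question.

    `scores` maps a respondent role to a 0-4 score. Respondents who
    answered "-" (not visible, no comment) are simply left out.
    """
    values = list(scores.values())
    return mean(values), max(values) - min(values)


# Scores from the FMEA example above.
fmea_scores = {
    "Design Engineer": 1,
    "Manufacturing Engineer": 3,
    "Reliability Engineer": 4,
}

avg, spread = summarize_scores(fmea_scores)
# Assumption: a spread of 2 or more points suggests a disconnect.
disconnect = spread >= 2
print(f"mean={avg:.2f} spread={spread} disconnect={disconnect}")
```

Reporting the spread next to the mean keeps the disagreement between groups visible, which the average alone would hide.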
28. Reliability Maturity Grid
• 5 levels of maturity
• Loosely based on IEEE 1332: “Reliability Program
for the Development and Production of Electronic
Products” (currently in draft form)
• Similar to Crosby’s Quality Maturity
• On the following page is a matrix based on
Crosby’s as an example.
• Read across each row and find the statement that
seems most true for your organization.
• The center of mass of the levels is the
organization’s overall level.
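One simple reading of "center of mass" is the average of the stage picked in each row. The sketch below assumes equal weight per row, and the picks are hypothetical.

```python
def overall_maturity(row_levels):
    """Average the stage (1-5) picked for each row of the matrix."""
    return sum(row_levels.values()) / len(row_levels)


# Hypothetical self-assessment: one stage per matrix row.
picks = {
    "Management understanding and attitude": 2,
    "Reliability status": 2,
    "Problem handling": 3,
    "Cost of reliability": 2,
    "Feedback process": 3,
    "DFR program status": 2,
    "Summation of reliability posture": 3,
}

print(round(overall_maturity(picks), 2))
```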
29. Reliability Maturity Matrix
Stages: I Uncertainty, II Awakening, III Enlightenment, IV Wisdom, V Certainty

Measurement Category: Management Understanding and Attitude
Stage I: No comprehension of reliability as a management tool. Tend to blame reliability engineering for ‘reliability problems’.
Stage II: Recognizing that reliability management may be of value but not willing to provide money or time to make it happen.
Stage III: Still learning more about reliability management. Becoming supportive and helpful.
Stage IV: Participating. Understand absolutes of reliability management. Recognize their personal role in continuing emphasis.
Stage V: Consider reliability management an essential part of company system.

Measurement Category: Reliability status
Stage I: Reliability is hidden in manufacturing or engineering departments. Reliability testing probably not part of organization. Emphasis on initial product functionality.
Stage II: A stronger reliability leader appointed, yet main emphasis is still on an audit of initial product functionality. Reliability testing still not performed.
Stage III: Reliability manager reports to top management, with role in management of division.
Stage IV: Reliability manager is an officer of company; effective status reporting and preventive action. Involved with consumer affairs.
Stage V: Reliability manager is on board of directors. Prevention is main concern. Reliability is a thought leader.

Measurement Category: Problem handling
Stage I: Fire fighting; no root cause analysis or resolution; lots of yelling and accusations.
Stage II: Teams are set up to solve major problems. Long-range solutions are not identified or implemented.
Stage III: Corrective action process in place. Problems are recognized and solved in orderly way.
Stage IV: Problems are identified early in their development. All functions are open to suggestion and improvement.
Stage V: Except in the most unusual cases, problems are prevented.

Measurement Category: Cost of Reliability as % of net revenue
Stage I: Warranty: unknown. Reported: unknown. Actual: 20%
Stage II: Warranty: 3%. Reported: unknown. Actual: 18%
Stage III: Warranty: 4%. Reported: 8%. Actual: 12%
Stage IV: Warranty: 3%. Reported: 6.5%. Actual: 8%
Stage V: Warranty: 1.5%. Reported: 3%. Actual: 3%

Measurement Category: Feedback process
Stage I: None. No reliability testing. No field failure reporting other than customer complaints and returns.
Stage II: Some understanding of field failures and complaints. Designers and manufacturing do not get meaningful information.
Stage III: Accelerated testing of critical systems during design. System level modeling and testing. Field failures analyzed and root causes reported.
Stage IV: Refinement of testing systems – only testing critical or uncertain areas. Increased understanding of causes of failure allow deterministic failure rate prediction models.
Stage V: The few field failures are fully analyzed and product designs or procurement specifications altered. Reliability testing done to augment reliability models.

Measurement Category: DFR program status
Stage I: No organized activities. No understanding of such activities.
Stage II: Organization told reliability is important. DFR tools and processes inconsistently applied and only ‘when time permits’.
Stage III: Implementation of DFR program with thorough understanding and establishment of each tool.
Stage IV: DFR program active in all areas of division – not just design & mfg’ing. DFR normal part of R&D and manufacturing.
Stage V: Reliability improvement is a normal and continued activity.

Measurement Category: Summation of reliability posture
Stage I: “We don’t know why we have problems with reliability”
Stage II: “Is it absolutely necessary to always have problems with reliability?”
Stage III: “Through commitment and reliability improvement we are identifying and resolving our problems.”
Stage IV: “Failure prevention is a routine part of our operation.”
Stage V: “We know why we do not have problems with reliability.”
30. Reliability Maturity Matrix
Let's look at one row to get a better understanding.
Measurement Category: Problem handling
Stage I (Uncertainty): Fire fighting; no root cause analysis or resolution; lots of yelling and accusations.
Stage II (Awakening): Teams are set up to solve major problems. Long-range solutions are not identified or implemented.
Stage III (Enlightenment): Corrective action process in place. Problems are recognized and solved in orderly way.
Stage IV (Wisdom): Problems are identified early in their development. All functions are open to suggestion and improvement.
Stage V (Certainty): Except in the most unusual cases, problems are prevented.
31. Results & Meaning
• Looking for trends, gaps in process, skill mismatches,
over analysis, under analysis, etc.
• Looking for differences across the organization,
pockets of excellence, areas with good results
• Process provides snapshot of current system
• No one tool makes an entire reliability program. The
tools need to match the needs of the products and
the culture.
• A check step is critical before moving on to
recommendations for an improvement plan
32. Observations
What Companies Are Doing Best
Prediction
HALT
Golden nuggets
Fast reaction to fix problems
What Companies Are Weak at
Goal setting/Planning
Repair & warranty invisible
Lessons learned capture
Single owner of product reliability
Multiple defect tracking systems
Reliability Integration
Statistics
33. Next Steps
• Determine current state of your organization
(Summary of Assessment)
– Identify strong and weak areas
• Goal Setting
– Market Analysis to gather requirements
– Benchmarking
• Gap Analysis
• Develop plan and implement
35. FMEA
An FMEA is a systematic method
of identifying and preventing
product and process problems
BEFORE they occur.
40. FMEA Benefits
Facilitates investigation of design alternatives to consider high
reliability at the conceptual stages of the design.
Provides a basis for identifying root cause failures and
developing corrective actions.
Determines the effects of each failure mode on system
performance.
Aids in developing test methods and troubleshooting
techniques.
Provides a foundation for qualitative analyses.
Provides a structured forum for cross-functional discussions
Provides common understanding and focus to reduce product
or process issues
Provides documentation of the risk management effort
41. Types of FMEAs
Design FMEA
Process FMEA
System FMEA
Functional FMEA
User FMEA
Software FMEA
Many others
42. When Is an FMEA Performed?
FMEAs are begun early in the design process and
then updated throughout the life cycle of a product to
capture changes in the design.
43. The 10 Steps
Step 1: Review the Process/Design
Step 2: Brainstorm potential failure modes
Step 3: List potential effects of each failure mode
Step 4: Assign a severity rating for each effect
Step 5: Assign an occurrence rating for failure modes
Step 6: Assign a detection rating for modes/effects
Step 7: Calculate the risk priority numbers
Step 8: Prioritize the failure modes for action
Step 9: Take action to eliminate/reduce high-risk failure modes
Step 10: Calculate the resulting RPN
44. Step 1: Review the Design or Process
Understand the topic of study
Design – drawings, prototypes, etc.
Process – flowcharts, assembly instructions, etc.
Focus on developing common understanding of
design or process
Designers or Process Experts available for questions
45. Step 2: Brainstorm potential failure
modes
Have fun!
How can the design/process fail?
Break complex designs/processes into smaller
elements
Combine like ideas (affinity plotting)
May have more than one failure mode per item or
function
List failure modes on worksheet
Determine failure modes vs. failure mechanisms
Use Boundary Interface Diagram Tool
Use P-Diagram Tool
46. Common brainstorming tools
Team dynamics
Consensus-building techniques
Team project documentation
Idea-generation techniques
Group brainstorming with a facilitator
Affinity diagramming
Flowcharting
Boundary Interface Diagram
P-Diagram
Data analysis
Graphing techniques
47. Step 3: List Potential effects of each
failure mode
If the failure occurs, what are the consequences?
List effect for each failure mode (not mechanism).
List more than one effect, when necessary
(note: list more than one effect if the ratings will be different, or the
solution would have to be different)
48. Step 4: Assign a severity rating for each
effect
What is the consequence of the failure should it
occur?
Assign a severity rating for each effect
An estimation of how serious the effects would be if
the failure mode occurs
Historical data
Engineering judgment
Experimentation, DOE, if needed
49. Severity
Severity is the assessment of the seriousness of the
effect of the failure mode to the next component,
subsystem, system or customer if it occurs.
Below is a typical Severity Rating Table.
Rating  Description       Definition
10      Dangerously High  Catastrophic Failure (Causing Replacement of the Entire System)
9       Very high         Failure of a FRU Component, MTTR > 1 Hour
8       High              Failure of a FRU Component, MTTR < 1 Hour
6       Moderate          Failure that results in reduced throughput
4       Minor             Failure that requires a tool reset or recalibration
2       Very minor        Failure that can be corrected during a PM cycle
1       None              Failure that does not affect system performance
50. Step 5: Assign an occurrence rating for
each failure mode
What is the probability of the failure occurring?
List the potential causes of failure
Use actual data when available for rating
When real data is not available:
Engineering estimates or models
Consider the probabilities of the failure causes
Rank order then assign rating
51. Probability of Occurrence
Probability of Occurrence can be in terms of failure rate or
can just be a scale of 1-10 relative to all other failure modes.
Below is a typical Probability Rating Table
Rating  Description       Definition
10      Dangerously High  Likely to Occur Chronically (Daily or Hourly)
9       Very High         Likely to Occur during one week of operation
8       High              Likely to occur during one month of operation
6       Medium            Likely to occur during one year of operation
4       Moderate          Is likely to Occur during the Life of the System
2       Low               A Remote Probability of Occurrence in the Life of the System
1       Remote            An Unlikely Probability of Occurrence in the Life of the System
52. Step 6: Assign a detection rating for each
failure mode and/or effect
What is the probability of the failure being detected
before the impact of the effect is realized?
List known current controls
Those items without controls are unlikely to be
detected (scoring 9 or 10)
Again, use actual data when possible
53. Detection
A third factor used in assessing the risk of a failure is
likelihood of Detection of the failure before releasing the
product. The following table is an example of detection
scores (note that a high score indicates that the failure
is more difficult to detect).
Below is a typical Detection Rating Scale
Rating  Description     Definition
5       Very Low        No ability to detect before it occurs and some ability to detect after (unconfirmed failures)
3       Moderate        No ability to detect before it occurs but can detect after
2       High            Some ability to detect before it occurs and can detect after
1       Almost Certain  Very likely it will be detectable before it occurs and after
Note that the Detection Scale has been derated (scale 1-5 only). For many industries, the
key drivers are severity and probability.
In many industries, there is a high unconfirmed failure rate. Yet there is a high
probability of failures repeating themselves when units go back to the field after the
failure was not confirmed – hence the importance of health diagnostics and a
condition-based maintenance strategy built on these health monitoring diagnostics.
54. Step 7: Calculate the risk priority number
for each effect
RPN = S x P x D
Risk Priority Number equals
Severity rating times
Probability of Occurrence rating times
Detection rating
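The formula can be written as a short function. This is a minimal sketch; the range checks assume the scales used in this seminar (severity and probability 1-10, detection derated to 1-5), and the function name is hypothetical.

```python
def rpn(severity, probability, detection):
    """Risk Priority Number = S x P x D.

    Range checks follow the scales used in this seminar:
    severity and probability 1-10, detection derated to 1-5.
    """
    if not 1 <= severity <= 10:
        raise ValueError("severity must be 1-10")
    if not 1 <= probability <= 10:
        raise ValueError("probability must be 1-10")
    if not 1 <= detection <= 5:
        raise ValueError("detection must be 1-5")
    return severity * probability * detection


print(rpn(8, 6, 3))    # 144
print(rpn(10, 10, 5))  # 500, the maximum on these scales
```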
55. Risk Priority Number
Risk Priority Number (RPN)
The RPN is the product of the Severity Score, the
Probability Score, and the Detection Score.
Once all of the RPNs have been calculated, the data
can be sorted from highest to lowest RPN to show
which are the most critical items to work on.
Below is an example of an RPN Table
RISK VALUE (RPN)
RPN      Risk Level        Action
251-500  Intolerable Risk  Additional measures are required to ensure adequate safety.
101-250  Undesirable Risk  Risk is tolerable only if risk reduction is impractical or if reduction costs are grossly disproportionate to the improvement(s) gained. (Requires Executive Mgt. Approval.)
11-100   Tolerable Risk    The risk is tolerable if the cost of risk reduction will exceed the improvement(s) gained. (Requires Project Mgt. Approval.)
1-10     Negligible        Acceptable as implemented.
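The example RPN table maps directly onto a lookup. The sketch below uses the band edges from that example table; other organizations will draw the bands differently, and the names are hypothetical.

```python
# Bands taken from the example RPN table above: (low, high, label).
RISK_BANDS = [
    (251, 500, "Intolerable Risk"),
    (101, 250, "Undesirable Risk"),
    (11, 100, "Tolerable Risk"),
    (1, 10, "Negligible"),
]


def risk_band(rpn_value):
    """Return the risk label for an RPN between 1 and 500."""
    for low, high, label in RISK_BANDS:
        if low <= rpn_value <= high:
            return label
    raise ValueError("RPN must be between 1 and 500")


for value in (500, 144, 40, 8):
    print(value, risk_band(value))
```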
56. Step 8: Prioritize the failure modes for
action
Simple rank ordering from high to low based on RPN
Decide on cutoff value
Those above get attention & resources to improve
Those below are left alone for now
Consider including above the cut off any Severity
rating of 9 or 10
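The prioritization rule can be sketched as follows. The failure modes, their ratings, and the cutoff of 100 are hypothetical; the severity override of 9 or higher follows the last point above.

```python
def prioritize(failure_modes, rpn_cutoff, severity_floor=9):
    """Split failure modes into (act_on, defer), highest RPN first.

    A mode gets attention if its RPN is above the cutoff, or if its
    severity rating is at or above `severity_floor` regardless of RPN.
    """
    ranked = sorted(failure_modes, key=lambda m: m["rpn"], reverse=True)
    act_on = [
        m for m in ranked
        if m["rpn"] > rpn_cutoff or m["severity"] >= severity_floor
    ]
    defer = [m for m in ranked if m not in act_on]
    return act_on, defer


# Hypothetical failure modes with precomputed RPNs.
modes = [
    {"name": "Connector corrosion", "severity": 6, "rpn": 144},
    {"name": "Fan bearing wear", "severity": 8, "rpn": 96},
    {"name": "Battery thermal event", "severity": 10, "rpn": 40},
    {"name": "Label fading", "severity": 2, "rpn": 8},
]

act_on, defer = prioritize(modes, rpn_cutoff=100)
print([m["name"] for m in act_on])
print([m["name"] for m in defer])
```

In this illustration the battery item is pulled above the cutoff despite its low RPN, because a severity of 10 is judged too serious to leave alone.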
57. Step 9: Take action to eliminate or reduce
the high risk failure modes
Use an organized problem-solving process
Identify and implement actions to eliminate or reduce
the high-risk failure modes
Consider DOE as tool to break down and solve
multiple variable or complex issues
58. Step 10: Calculate the resulting RPN as
the failure modes are reduced or
eliminated
Document progress in reducing product risk with an
update by the team of the resulting RPN.
You should expect a 50% or greater reduction in total
RPN after an FMEA.
Continue to make improvements on highest risk items
until time, resources or overall ROI shift focus.
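Tracking the reduction in total RPN is simple arithmetic. The before and after values below are hypothetical and only illustrate the calculation.

```python
def rpn_reduction(before, after):
    """Percent reduction in total RPN across the same failure modes."""
    total_before = sum(before)
    total_after = sum(after)
    return 100.0 * (total_before - total_after) / total_before


# Hypothetical RPNs for four failure modes, before and after actions.
before = [144, 96, 40, 8]
after = [48, 32, 20, 8]

print(f"{rpn_reduction(before, after):.1f}% reduction in total RPN")
```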
59. Linking FMEAs with Test Plans
In order to write better test plans,
we must first understand:
- the use environment
- the key risks to the design
The best tool for this is FMEA
60. Developing Better Test Plans
Stated another way, we cannot
know what to test for unless we
understand the key risks.
Therefore, FMEA is one of the
best sources of input for a
Reliability Test Plan.
62. Developing a Test Plan
without FMEA
What types of tests can you think of for
this device?
63. Developing a Test Plan
without FMEA
We used the IEC standards and came up
with a number of solid tests, including:
High/Low Temperature
Temperature Cycling
Vibration
Drop
Shock
Crush
Humidity
Altitude
Did we miss any?
70. FMEA Generated Tests
Then we performed an FMEA and
came up with the following:
Different cleaning solutions
Pen test
Lipstick test
Motor Oil Test
Cap Tether Test
Did we miss any?
71. Conclusion
FMEA is a development tactic
that can help solve the problem
of testing too little by uncovering
failure modes that require
tailored test methods rather than
only cookbook methods from
industry standards.
73. HALT - Highly Accelerated
Life Test
Quickly discover design issues.
Evaluate & improve design margins.
Release mature product at market introduction.
Reduce development time & cost.
Eliminate design problems before release.
Evaluate cost reductions made to product.
Developmental HALT is not really a test you pass or fail,
it is a process tool for the design engineers.
There are no pre-established limits.
74. HALT, How It Works
Start low and step up the
stress, testing the product
during the stressing
75. HALT, How It Works
Gradually increase
stress level until a
failure occurs
76. HALT, How It Works
Analyze
the failure
77. HALT, How It Works
Make
temporary
improvements
78. HALT, How It Works
Increase
stress and
start
process
over
79. HALT, How It Works
Fundamental
Technological
Limit
80. HALT, Why It Works
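Classic S-N Diagram
(stress vs. number of cycles)
S0 = Normal Stress conditions
N0 = Projected Normal Life
[Figure: S-N curve with stress levels S2, S1, S0 (top to bottom) plotted against cycles to failure N2, N1, N0 (left to right)]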
81. HALT, Why It Works
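Classic S-N Diagram
(stress vs. number of cycles)
S0 = Normal Stress conditions
N0 = Projected Normal Life
[Figure: S-N curve with stress levels S2, S1, S0 (top to bottom) plotted against cycles to failure N2, N1, N0 (left to right), with a marker labeled "Point at which failures become non-relevant"]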
82. Margin Improvement Process
[Figure: stress axis showing, from left to right, the Lower Destruct Limit, Lower Oper. Limit, Product Operational Specs, Upper Oper. Limit, and Upper Destruct Limit]
83. Margin Improvement Process
This is what the product spec distribution really looks like
[Figure: stress axis showing, from left to right, the Lower Destruct Limit, Lower Oper. Limit, Product Operational Specs, Upper Oper. Limit, and Upper Destruct Limit, each drawn as a distribution]
84. Margin Improvement Process
[Figure: stress axis showing, from left to right, the Lower Destruct Limit, Lower Oper. Limit, Product Operational Specs, Upper Oper. Limit, and Upper Destruct Limit, with the Operating Margin and Destruct Margin marked]
85. Developmental HALT Process
Planning a HALT
Setting up for a HALT
Executing a HALT
Post Testing
86. When to Perform HALT ?
Feasibility (P1-P2) → Development (Late P2) → Qualification (P3) → Launch
Feasibility:
Perform HALT on 1 to 2 early prototypes. These samples may be hand-made
and test coverage may be low, but we can still get clues as to gross
design issues.
Development:
Perform HALT on more samples. These samples will be closer to final
product and functional tests will be more refined with higher test
coverage.
Qualification:
Demonstrate 100% reliability target @ 80% C.L.
Shipping / Packaging test
Validation HALT can be performed here
Launch:
Tracking reliability through field data
Lessons learned feedback to next generation product
87. Summary of Results
- by stress -
Cold Step Stress: 14%
Hot Step Stress: 17%
Rapid Thermal Transitions: 4%
Vibration Step Stress: 45%
Combined Environment: 20%
Significance:
Without Combined Environment, 20% of all
failures would have been missed
88. Traditional vs HALT
Engineering Needs
Product Development Manpower Requirements
[Figure: spending rate (0 to 6) vs. time (1 to 26), comparing traditional development and HALT, with labels DVT1 ... DVTn, MR (twice), and $ Savings]
89. HALT
Cost Benefits
Reduced product time to market
Lowered warranty cost through higher MTBF
Faster DVT with fewer product samples
Accelerated screening (HASS) allowed
91. Accelerated Life Test (ALT)
An Accelerated Life Test (ALT) is the process of
determining the reliability of a product in a short period
of time by accelerating the use environment.
ALTs are good for finding dominant failure
mechanisms.
ALTs are usually performed on individual assemblies
rather than full systems.
ALTs are also frequently used when there is a wear-out
mechanism involved.
92. Stress
Anything applied to a product, either electrically or
environmentally, to accelerate finding possible
weaknesses
Examples of Electrical Stress: Current, Voltage (DC
and AC), Power Cycling, and Frequency (line and
board)
Examples of Environmental Stress: Temperature
Extremes, Temperature Cycling, Vibration, Shock,
Humidity, ESD, Drop, Altitude
93. Physical Acceleration
Acceleration means that operating a unit at high
stress (temperature, voltage, humidity, or duty cycle,
etc.) produces the same failures that would occur at
typical-use stresses, except that they happen much
quicker.
Failure may be due to mechanical fatigue, corrosion,
chemical reaction, diffusion, migration, etc. The
causes are the same, the time scale is simply
different.
Changing the stress is equivalent to transforming the
time scale. This is often a linear transform, which
means the time-to-fail at high stress is multiplied by a
constant (acceleration factor) to obtain the equivalent
time-to-fail at use stress.
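The linear transform can be sketched with a temperature example. The Arrhenius model is only one of several acceleration models and applies to thermally activated mechanisms. The activation energy of 0.7 eV and both temperatures are assumptions for illustration; in practice they depend on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(ea_ev, use_temp_c, stress_temp_c):
    """Arrhenius acceleration factor between use and stress temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k))


def equivalent_use_time(time_at_stress, af):
    """Linear time-scale transform: time at stress times the AF."""
    return time_at_stress * af


# Assumed values: Ea = 0.7 eV, 40 C in the field, 85 C in test.
af = arrhenius_af(ea_ev=0.7, use_temp_c=40.0, stress_temp_c=85.0)
print(f"AF = {af:.1f}")
print(f"1000 h at stress ~ {equivalent_use_time(1000.0, af):.0f} h at use")
```

Because the factor depends exponentially on the activation energy, a different failure mechanism with a different energy will give a very different acceleration factor from the same test.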
94. Failure Mode Dependence
Keep in mind that the acceleration factor is highly
dependent on the failure mechanism.
Each failure mechanism will most likely have a
different acceleration factor.
During testing, conduct thorough failure analysis and
separate the failure mechanisms for separate
analysis.
Selecting the stress to apply must be done with the
expected failure mechanisms in mind.
95. Theory of ALT
Classic S-N Diagram
(stress vs. number of cycles)
S0 = Normal Stress conditions
N0 = Projected Normal Life
[Figure: S-N curve with Stress on the vertical axis (S2, S1, S0 from top to bottom) and Number of Cycles on the horizontal axis (N2, N1, N0 from left to right)]
96. When to Apply ALT
ALT Region of Application
97. ALT Parameters
In order to set up an ALT, we must know several different
parameters, including
Length of Test
Number of Samples
Goal of Test
Confidence Desired
Accuracy Desired
Cost
Acceleration Factor
• Field Environment
• Test Environment
• Acceleration Factor Calculation
Slope of Weibull Distribution (Beta parameter)
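Several of these parameters trade off against each other. One common planning relation, sketched below, assumes a zero-failure test, a Weibull life distribution with a known Beta, and a single acceleration factor for the failure mechanism of interest. The function name and all numeric inputs are hypothetical.

```python
import math


def zero_failure_sample_size(reliability, confidence, life, test_time,
                             af=1.0, beta=1.0):
    """Units needed to demonstrate `reliability` at `life` with zero failures.

    Each unit runs for `test_time`, which counts as af * test_time of
    field exposure. `beta` is the assumed Weibull shape parameter.
    """
    ratio = (af * test_time) / life
    n = math.log(1.0 - confidence) / (ratio ** beta * math.log(reliability))
    return math.ceil(n)


# Goal: 90% reliability at 10,000 h with 80% confidence.
# 1,000 h test at AF = 10 covers exactly one life per unit.
print(zero_failure_sample_size(0.90, 0.80, life=10000, test_time=1000, af=10))
# Doubling the test length with Beta = 2 (wear-out) cuts the sample count.
print(zero_failure_sample_size(0.90, 0.80, life=10000, test_time=2000, af=10,
                               beta=2.0))
```

The second call shows why Beta matters: when wear-out dominates (Beta above 1), testing longer reduces the number of samples much faster than it would for random failures.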
98. Review
When wear-out is a dominant failure
mechanism, we must be able to predict or
characterize this wear-out mechanism to
assure that it occurs outside customer
expectations and outside the warranty period.
ALT is an excellent method for doing this
100. Overview
HALT and ALT are two of the most
popular testing methods, but
engineers are often confused about
which to use when.
101. Overview
Highly Accelerated Life Testing (HALT) is a great
reliability technique to use for finding predominant
failure mechanisms in a hardware product.
However, in many cases, the predominant failure
mechanism is wear-out.
When this is the situation, we must be able to predict or
characterize this wear-out mechanism to assure that it
occurs outside customer expectations and outside the
warranty period.
The best technique to use for this is a slower test method,
Accelerated Life Testing (ALT).
102. Overview
In many cases, it is best to use both
because each technique is good at
finding different types of failure
mechanisms.
The proper use of both techniques
together will offer a complete picture
of the reliability of the product.
103. HALT
Highly Accelerated Life Testing
used for Product Ruggedization
ALT
Accelerated Life Testing
used to Characterize Predominant Failure Mechanisms,
Especially for Wearout
104. Comparison Between
ALT and HALT
FAILURE TESTING
HALT
OBJECTIVES
1. Root Cause Analysis
2. Corrective Action Identification
3. Design Robustness Determination
TESTING REQUIREMENTS
1. Detailed Product Knowledge
2. Engineering Experience
ALT
OBJECTIVES
1. Reliability Evaluation (e.g. Failure Rates)
2. Dominant Failure Mechanisms Identification
TESTING REQUIREMENTS
1. Detailed Parameters
(a) Test Length
(b) Number of Samples
(c) Confidence/Accuracy
(d) Acceleration Factors
(e) Test Environment
2. Test Metrology & Factors
(a) 4:2:1 Procedure Or Other
(b) Costs
ANALYTICAL MODELS
1. Weibull Distribution
2. Arrhenius
3. Coffin-Manson
4. Norris-Landzberg
105. Advantage of ALT over HALT
One key advantage of ALT over HALT arises when we
need to know the life of the product.
In HALT, we are not much concerned with this,
because we are more interested in making the
product as reliable as we can; measuring the
amount of reliability is less important.
However, with mechanical items that wear over
time, it is very important to know the life of the
product as accurately as possible.
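To make "knowing the life of the product" concrete, here is a minimal sketch using the two-parameter Weibull distribution, one of the analytical models commonly fitted to ALT data. The shape and scale values in the example are hypothetical placeholders, not results from any product discussed in this seminar.

```python
import math


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t for a two-parameter Weibull.

    beta is the shape parameter (beta > 1 indicates wear-out),
    eta is the characteristic life in the same units as t.
    """
    return math.exp(-((t / eta) ** beta))


def weibull_b_life(fraction_failed, beta, eta):
    """Time by which the given fraction of units is expected to have failed.

    For example, fraction_failed=0.10 gives the B10 life.
    """
    return eta * (-math.log(1.0 - fraction_failed)) ** (1.0 / beta)


if __name__ == "__main__":
    beta, eta = 2.0, 50000.0  # hypothetical fitted values, eta in hours
    b10 = weibull_b_life(0.10, beta, eta)
    print("B10 life (hours):", b10)
    print("Reliability at 1 year (8760 h):", weibull_reliability(8760.0, beta, eta))
```

Comparing a B-life such as B10 against the warranty period is one simple way to check whether wear-out falls outside it.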
106. Advantage of ALT over HALT
Another advantage is that we often do not need any
environmental equipment. Benchtop testing is often adequate.
107. Advantage of HALT over ALT
A big advantage of HALT over ALT is time. We
are less concerned with time to failure than with
which failure mode is dominant, and we can
usually find this out in a matter of days rather
than weeks or months.
This time saving is also a big cost saving,
since it takes less time at a test lab.
HALT needs far fewer samples than ALT (a ratio of
about 10 to 1 is typical).
We don’t need to calculate an acceleration factor.
We don’t need to stay with the same stresses as the
field environment, because of the cross-over effect.
108. Combining ALT with HALT
We often run a product through HALT and then run the
subassemblies that were not good candidates for HALT
through ALT.
[Figure: HALT on System; ALT on System Fan]
109. Developing ALT from HALT
And at other times, we may develop the ALT based on the
HALT limits, using the same accelerants but lowering the
acceleration factors to measurable levels.
[Figure: HALT on System; ALT on System]
110. Examples of Products for HALT and ALT
Component
Robot
Fan
Infusion Pump
Hard Drive
Medical Cabinet
Automotive Electronics
Cell Phone
Automobile
The pictures shown are representative of products we have tested. They are
not the actual products, in order to protect the proprietary nature of the
products we test.
111. Component
Characteristic                                       | Accelerant
Aging                                                | High Temperature
Contamination, Package Hermeticity                   | Temp/Humidity
Mismatch of Thermal Characteristics of Package Matls | Temp Cycling
Die Attachment, Bond Wires                           | Vibration
112. Automobile
Test        | Accelerant
Electronics | Temperature, Vibration, Humidity, Contamination
Mechanical  | Repetitive cycling test
113. Fan
Test                | Accelerant
Spinning            | Duty Cycle, Speed, Torque, Backpressure
Lubricant Longevity | Temperature, Humidity, Contamination
114. Hard Drive
Test                          | Accelerant
Head Spinning                 | Duty Cycle, Start/Stop, Speed, Temperature?, Vibration?
Contamination on Head Surface | Non-Operational Vibration
Board Derating                | Temperature/Voltage
Connectors – Power, Data      | Duty Cycle, Force, Angle
115. Robot
Test                        | Accelerant
Arm Movement (side to side) | Duty Cycle, Speed, Torque
Z-Stage (up and down)       | Duty Cycle, Speed, Torque
Vacuum Hold-down            | Temperature, Altitude
Repeatability               | Duty Cycle
116. Automotive Electronics – GPS Receiver
Test           | Accelerant
Electronics    | Temperature, Vibration, Humidity, Contamination
Button Pushing | Duty Cycle, Force?, Angle
117. Infusion Pump
Test                                                              | Accelerant
Battery Charging                                                  | Duty Cycle, Deep Discharge, Speed of Charge
Touchscreen                                                       | Duty Cycle, Location, Force?
Pumping                                                           | Duty Cycle, Rate, Plunger Force
Connectors – Battery, Charger, Pole Clamp, IV Line, Cassette      | Duty Cycle, Force, Angle
118. Drawer for Medical Cabinet
Test                      | Accelerant
Opening/Closing of Drawer | Duty Cycle, Force, Angle
Locking Mechanism         | Duty Cycle, Force, Contamination
119. Cell Phone
Test                                    | Accelerant
Button Pushing                          | Duty Cycle, Force?, Angle
Touchscreen                             | Duty Cycle, Location, Force?
Connectors – Headset, Battery, Charger  | Duty Cycle, Force, Angle
120. Summary
When wear-out is not a dominant failure
mechanism, HALT is an excellent tool for
finding product weaknesses in a short
period of time.
121. Summary
When wear-out is a dominant failure
mechanism, we must be able to predict or
characterize this wear-out mechanism to
assure that it occurs outside customer
expectations and outside the warranty
period.
ALT is an excellent method for doing this.
123. Reliability Demonstration Testing (RDT)
A sample of units is tested at accelerated
stresses for several months.
The stresses are a bit lower than the HALT
stresses and they are held constant (or cycled
constantly) rather than gradually increasing.
This enables us to calculate the acceleration
factor for the test.
The RDT can be used to validate the reliability
prediction analyses.
124. RDT vs. ALT
RDT and ALT are very similar in that the stresses
are usually accelerated but at a lower level than
HALT.
The main difference between RDT and ALT is that
ALT is usually used to characterize the wearout
region of the product whereas RDT is usually used
to demonstrate the MTBF in the steady state region
of the product.
In an RDT, you CAN substitute samples for time.
In an ALT, you CANNOT substitute samples for
time.
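The samples-for-time trade in an RDT can be sketched with the standard zero-failure demonstration formula for a constant failure rate. This is a minimal illustration, not the planning method used in this seminar: the function name is hypothetical, and the calculation assumes an exponential (steady state) failure distribution, zero failures allowed, and a known acceleration factor. It is exactly this constant-failure-rate assumption that breaks down in the wear-out region, which is why the same trade does not work for ALT.

```python
import math


def rdt_hours_per_unit(mtbf_goal_hours, confidence, n_units, accel_factor=1.0):
    """Test hours needed on each unit for a zero-failure RDT.

    Assumes a constant failure rate, so only total accumulated unit-hours
    matter: required field-equivalent unit-hours = -MTBF * ln(1 - confidence).
    """
    total_unit_hours = -mtbf_goal_hours * math.log(1.0 - confidence)
    return total_unit_hours / (accel_factor * n_units)


if __name__ == "__main__":
    # Hypothetical goal: demonstrate 100,000 h MTBF at 90% confidence, AF = 10.
    print(rdt_hours_per_unit(100000.0, 0.90, 10, 10.0))  # about 2303 hours each
    # Doubling the sample size halves the test time per unit.
    print(rdt_hours_per_unit(100000.0, 0.90, 20, 10.0))  # about 1151 hours each
```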
125. RDT vs. ALT
[Figure: plot marking the ALT Region and the RDT Region]
129. Overview
Highly Accelerated Life Testing (HALT) is a great
reliability technique to use for finding predominant
failure mechanisms in a hardware product.
However, in many cases, customers need to know the
MTBF or Annualized Failure Rate (AFR) of a product in
the field.
When this is the situation, most people turn to RDT.
Recently, however, we have developed a method for
estimating MTBF from HALT data.
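For readers who want to move between the two metrics, here is a minimal sketch of the usual MTBF-to-AFR conversion. It assumes a constant failure rate and continuous operation (8760 hours per year); the function names are hypothetical, and this is a generic conversion, not the AFR Estimator described in this seminar.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumes the product runs 24 hours a day, all year


def afr_from_mtbf(mtbf_hours, hours_per_year=HOURS_PER_YEAR):
    """Fraction of units expected to fail in one year, given MTBF in hours."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)


def mtbf_from_afr(afr, hours_per_year=HOURS_PER_YEAR):
    """MTBF in hours implied by an annualized failure rate (0 < afr < 1)."""
    return -hours_per_year / math.log(1.0 - afr)


if __name__ == "__main__":
    # Hypothetical product with a 100,000 h MTBF: AFR is about 8.4%.
    print(afr_from_mtbf(100000.0))
    print(mtbf_from_afr(0.02))
```

For large MTBF values the AFR is close to 8760 divided by the MTBF, which is the approximation often quoted.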
130. The AFR Estimator
The AFR Estimator is a patent-pending
mathematical model that, when provided with
the appropriate HALT and product
information, will accurately estimate the
product’s field Annualized Failure Rate (AFR).
This methodology has been used on a number
of products with significant positive financial
results.
131. Justification for the AFR Estimator
HALT takes only a few days to run and to implement its
corrective action(s). Even if it took a bit longer, this time
would still be far less than waiting for an RDT to be run and
its corrective action(s) to be implemented. The application of
this model can therefore be a huge time and cost saver.
Because higher HALT limits equate to lower AFR, you now have a
tool that can accurately estimate the field AFR before
launching the product. The stress levels depicted in the
table in Section E are highly recommended for HALT. These
levels can assure the producer that the product will exceed
customer expectations and allow the producer to accurately
forecast warranty expenditures.
132. Justification for the AFR Estimator
Doing HALT alone, without life tests, saves time and money.
This is not to say that life testing isn’t important: it
should be considered for new technologies and for an existing
part or design used in a different application, but not as a
process for accurately estimating AFR.
With seven to ten simple data entries, most of them coming
from the HALT effort, the AFR Estimator immediately provides
an accurate field AFR estimate with its associated 90%
statistical confidence limits. The inputs for HASS and HASA
are whether you will perform HASS or HASA, the daily sample
size, and the shift in AFR you wish to detect.
133. Justification for the AFR Estimator
The AFR Estimator has been validated on over twenty products
from diverse manufacturers and design environments.
The model can accommodate HALT sample sizes from one to
six, with the optimum size being four. Sample sizes greater
than four will default to four.
The 90% upper and lower confidence limits are calculated based
on the HALT AFR and the HALT sample size.
134. Recommendations when using the AFR Estimator
An effective HALT needs to be done with at least three units,
and preferably four, although the model can accommodate
sample sizes from one to six.
Please realize that HALT sample sizes of three or fewer will
dramatically reduce the ability to detect product defects, and
the statistical confidence is reduced accordingly.
135. Recommendations when
using the AFR Estimator
1. Root Cause for Failures
2. Robust Protocols
3. Achieve at least the Guard Band Limits
4. For HASS or HASA, normalize chamber vibration tables
5. Obtain a copy of, “HALT, HASS, & HASA Explained”, by
Harry McLean and use it as a reference.
136. Recommendation 1: Root Cause for Failures
For each issue encountered, the root cause needs to be
understood and corrective action implemented, then verified in
HALT under the same stress conditions in which the defect was
detected. The exception is limitations that occur beyond the
Guard Band Limits in the table following Section E. Issues
encountered beyond these levels should still have root cause
analysis performed, but whether to implement corrective action
is a business decision based on timeliness, cost, and program
delays.
137. Recommendation 3:
Achieve Guard Band Limits
For the maximum benefit of a low field AFR or a high MTBF,
it is suggested that the product achieve at least the levels shown
under the Guard Band Limits in Section E below. These are
very achievable with time and understanding within the
organization without having to use extended (more costly)
temperature range components.
138. How to Use the Estimator:
139. How to Use the Estimator: Calculated MTBF Estimate
The MTBF estimate, in kHours, can come from Telcordia, Relex,
or a similar tool. If this estimate is not available, use 40,000
as a default value for the estimator. This parameter has very
little effect on the final field AFR or MTBF estimate, because
MTBF predictions depend on highly variable processes and many
assumptions. Enter this value in the table following
Section H. Please note that the estimator will recommend an
MTBF of 40,000 when a value less than 40,000 is used.
140. How to Use the Estimator: HALT Operating Limits
Enter each of the following values in the table following
Section H. Each limit is the value measured on the product,
not the chamber setpoint.
The final Hot Operating Limit (HOL) achieved in HALT.
The final Cold Operating Limit (COL) achieved in HALT.
The final Vibration Operating Limit (VOL) achieved in HALT.
141. How to Use the Estimator: Product Environment
Enter the product’s published thermal operating specifications,
in degrees C. Try to match your product's published
specifications to a corresponding Level number listed in the
table below; for example, a high-end consumer product equates
to a Level 2.
Product's Published Specs | Category             | Guard Band Limits | Level
0C to 40C                 | Consumer             | -30C to +80C      | 1
0C to +50C                | Hi-end Consumer      | -30C to +100C     | 2
-10C to +50C              | Hi Performance       | -40C to +110C     | 3
-20C to +50C              | Critical Application | -50C to +110C     | 4
-25C to +65C              | Sheltered            | -50C to +110C     | 5
-40C to +85C              | All Outdoor          | -65C to +110C     | 6
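The table lookup above can be sketched in a few lines. This is only an illustration of matching a published spec to a Level and checking HALT operating limits against the Guard Band Limits; the data structure and function names are hypothetical, and the sketch does not reproduce the AFR Estimator itself.

```python
# Level: (category, published spec (low, high) in C, guard band (cold, hot) in C)
GUARD_BANDS = {
    1: ("Consumer", (0, 40), (-30, 80)),
    2: ("Hi-end Consumer", (0, 50), (-30, 100)),
    3: ("Hi Performance", (-10, 50), (-40, 110)),
    4: ("Critical Application", (-20, 50), (-50, 110)),
    5: ("Sheltered", (-25, 65), (-50, 110)),
    6: ("All Outdoor", (-40, 85), (-65, 110)),
}


def level_for_spec(spec_low_c, spec_high_c):
    """Return the Level whose published spec matches exactly, or None."""
    for level, (_category, spec, _guard) in GUARD_BANDS.items():
        if spec == (spec_low_c, spec_high_c):
            return level
    return None


def meets_guard_band(level, col_c, hol_c):
    """True if the HALT cold and hot operating limits reach the guard band."""
    _category, _spec, (guard_cold, guard_hot) = GUARD_BANDS[level]
    return col_c <= guard_cold and hol_c >= guard_hot


if __name__ == "__main__":
    level = level_for_spec(0, 50)                 # high-end consumer product
    print(level, meets_guard_band(level, -35, 105))
```

A product whose spec does not match a row exactly would need an engineering judgment about the closest Level; the sketch simply returns None in that case.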
142. How to Use the Estimator:
Running the Estimator
Once the Value for AFR Estimator column is completed, you
are ready to run the AFR Estimator and determine the
product’s AFR, MTBF, Confidence Limits, and days to detect
shift in AFR if HASS or HASA is being used.
144. Design for Reliability (DfR) Tools
Reliability Assessment, Goal Setting, and Planning
Reliability Modeling and Prediction
Thermal Analysis
Derating Analysis
Failure Modes and Effects Analysis (FMEA)
Fault Tree Analysis (FTA)
Design of Experiments (DoE)
Human Engineering/Human Factors Analysis
Highly Accelerated Life Test (HALT)
Accelerated Life Test (ALT)
RDT and ORT
Highly Accelerated Stress Screen (HASS)
Root Cause Analysis (RCA)
Restriction of Hazardous Substances (RoHS)
Outsourced Engineering and Reliability
Field Data Analysis
Red shows tools we introduced today.
146. Contact Information
Ops A La Carte, LLC Ops A La Carte, LLC
Mike Silverman Vijay Prasad
Managing Partner Program Manager, S. Cal
(408) 472-3889 (858) 349-0443
mikes@opsalacarte.com vijayp@opsalacarte.com
www.opsalacarte.com www.opsalacarte.com
148. Presenter’s Biographical Sketch – Mike Silverman
◈ Mike Silverman is founder and managing partner at Ops A La Carte, a Professional
Consulting Company that has an intense focus on helping customers with end-to-end
reliability. Through Ops A La Carte, Mike has had extensive experience as a consultant
to high-tech companies, and has consulted for over 300 companies including Cisco,
Ciena, Siemens, Abbott Labs, and Applied Materials. He has consulted in a variety of
different industries including power electronics, telecommunications, networking,
medical, semiconductor, semiconductor equipment, consumer electronics, and defense.
◈ Mike has 20 years of reliability and quality experience. He is also an expert in
accelerated reliability techniques, including HALT&HASS (and recently purchased a HALT
Lab), testing over 500 products for 100 companies in 40 different industries. Mike has
authored and published 8 papers on reliability techniques and has presented these
around the world including China, Germany, Canada, Taiwan, Singapore, and Korea.
He has also developed and currently teaches 27 courses on reliability techniques.
◈ Mike has a BS degree in Electrical and Computer Engineering from the University of
Colorado at Boulder, and is both a Certified Reliability Engineer and a course instructor
through the American Society for Quality (ASQ), IEEE, Effective Training Associates, and
Hobbs Engineering. Mike is a member of ASQ, IEEE, SME, ASME, PATCA, and IEEE
Consulting Society and is the current chapter president in the IEEE Reliability Society for
Silicon Valley.
149. We Can Help You Sell to Your
Management
Our main contact often has difficulty
selling reliability within their company. We
have many techniques to help:
1) Detailed Proposals with Case Examples
2) Free Presentations at your site
3) Technical Articles/White Papers
4) Blog Articles covering your situation
5) Articles from our quarterly Newsletter
150. What’s New at Ops?
0) New Book “50 Ways to Improve Your Reliability”
1) A new HALT Calculator
2) A new Reliability Blog
3) Semiconductor Reliability services
4) Software Reliability services
5) RoHS conversion services
6) Warranty analysis services
7) New Accelerated Life Test methodology
8) Quality/6 Sigma Seminars
9) Offices: Singapore, China, Taiwan, UK, India
10) Complete Reliability Solutions
11) Green Reliability Services
151. Reliability Integration Education
- 31 different seminars on reliability -
1) Overall Program Reliability Integration
2) Concept Phase Reliability Tools & Integration
3) Design Phase Reliability Tools & Integration
4) Prototype Phase Reliability Tools & Integration
5) Manufacturing Phase Reliability Tools & Integration
6) Reliability Techniques for Beginners
7) Reliability Statistics
8) FMECA
9) CRE Exam Preparation
10) CQE Exam Preparation
11) Design for Reliability (DfR)
12) Design for Manufacturability (DfM)
13) Design for Testability (DfT)
14) Design for Warranty Cost Reduction (DfW)
15) Design for 6 Sigma (DfSS)
16) Design for Safety
17) Design for ‘X’ (DfX)
18) Mechanical Design for IC Packaging
19) Design of Experiments (DoE)
20) HALT and HASS Application
21) Statistics for 6 Sigma
22) Fundamentals of Climatic Testing
23) Design for Vibration and Shock
24) Software Reliability
25) Root Cause Analysis
26) Reality of Pb-Free Reliability
27) Statistical Process Control
28) Innovative Problem Solving
29) Mechanical Design for Reliability
30) Problem Solving Tools
31) Applied Data Analysis
Red – Part of our yearly symposium
152. Upcoming Seminars
CQE Course – Apr-Jun and Oct-Dec, 2010
CRE Course – Jan-Mar and Aug-Oct, 2010
We offer 31 different courses and seminars in Reliability,
Quality, and Technical Operations.
Please see our Educational Brochure inside your Ops A La
Carte packet for more details.
153. Upcoming Events
ARS – June 2010, Reno
We are a co-sponsor, and we will be exhibiting and
presenting a paper on our new HALT Calculator.
ASTR – October, Denver
We are on the committee and will be exhibiting and
presenting.
RAMS – January 2011, Orlando
We are on the committee and will be exhibiting and
presenting.