Innovation day 2013 2.5 joris vanderschrick (verhaert) - embedded system development
1. 2
When first-time-right embedded system developments need to become cost-effective
Joris Vanderschrick
Business development embedded systems
joris.vanderschrick@verhaert.com
THEME 2: RISK MANAGEMENT IN INNOVATION
3. 4
Risk based development methods
Cut
• Development phases
• Functional subsystems
Tangible
• Visualize
• Simulate
• Test
• Review
• Roadshow
Risk focus
• Criticalities
• Added value
• 360°
Early
• Rapid prototyping
• First time right
Options
• Backup
• Buffer
• Requirements vs. design
Reliability
4. 5
Introduction
Reliability: The measure of a product’s ability to
…perform the specified function
…at the customer (within their use environment)
…over the desired lifetime
7. 8
Objectives of a reliability approach
Objectives:
• Early identification of weak points in design to:
• Limit the risk/cost of modifications in production or deployment phase
• Reduce product failures/returns/recalls during the product lifecycle
• Improve time to market by early detection of weakness and flaws
• Minimize number of dead-on-arrivals
• Increase customer satisfaction
11. 12
Inventory
The definition of the reliability approach starts with an inventory per subsystem:
• System breakdown into subsystems, assemblies, subassemblies and components
• Typical systems: electronic & electrical systems, mechanical, hydraulic, process systems,…
• Which critical topics are relevant (FMECA: failure modes)?
• How will these critical topics be evaluated with respect to lifetime (via which norm or guideline)?
This involves an inventory of the norms or guidelines that are most relevant
for the application or intended purpose.
12. 13
How? Example: FMECA
RPN = Severity x Occurrence x Detection
The RPN can then be used to compare issues within the analysis and to prioritize problems for corrective
action. The ratings are defined by:
• The main published standards for this type of analysis, such as SAE J1739, AIAG FMEA-3 and MIL-STD-1629A.
• Procedures that industries and companies have developed themselves to meet the specific requirements of
their products/processes.
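The RPN ranking described above can be sketched in a few lines. This is only an illustration: the 1-10 rating scale is an assumption (the applicable standard or company procedure defines the actual scales), and the failure modes and ratings below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # assumed scale: 1 (negligible) .. 10 (most severe)
    occurrence: int  # assumed scale: 1 (remote) .. 10 (almost certain)
    detection: int   # assumed scale: 1 (almost always detected) .. 10 (undetectable)

    def rpn(self) -> int:
        """Risk Priority Number = Severity x Occurrence x Detection."""
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings are assumed to lie on a 1-10 scale")
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical entries, for illustration only.
modes = [
    FailureMode("solder-joint crack", severity=8, occurrence=4, detection=6),
    FailureMode("connector corrosion", severity=5, occurrence=3, detection=2),
    FailureMode("firmware watchdog reset", severity=6, occurrence=5, detection=3),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.name}: RPN = {mode.rpn()}")
```

The highest RPN comes first, which gives the order in which corrective actions would be considered.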
13. 14
Why use a FMECA
FMECA/FMEA is useful as a survey method to identify the effects of major failure modes in a system.
It can contribute to improved designs for products and processes, resulting in higher reliability, better
quality, increased safety, enhanced customer satisfaction and reduced costs.
• It avoids time-consuming and costly design changes at a late stage in the development.
• The tool can also be used to establish and optimize maintenance plans, control plans and other
quality assurance procedures.
• In addition, an FMEA or FMECA is often required to comply with safety and quality requirements,
such as ISO 9001, QS 9000, ISO/TS 16949, ISO 13485, FDA,…
Remarks:
• Complex systems and processes make defining a detailed FMEA/FMECA time-consuming.
• It assumes that the causes of problems are all single events (a combination of events is treated as one event).
• The process relies on the right participants, open communication and cooperation.
• Human error is sometimes overlooked.
It is just a tool. Without a follow-up plan and actions, it will not improve the reliability of your system.
14. 15
Scoping
Evaluation & definition of the appropriate calculation methods of the failure rate
• For the defined building blocks (sub-systems) & specific parts, we will analyze which norm or
standard provides the best method for the evaluation & calculation of the failure rate.
Work packages
• 1.1. Preparation with AGFA
• Reliability: electronic components
• Reliability: mechanical parts
• Software reliability
• General approach & study logic with respect to reliability design & production
ECSS-E-ST-33-01C Space Mechanisms
o Scope of the standard: requirements applicable to the concept definition, design, analysis,
development, production, test verification and operation of space mechanisms
to meet the mission performance requirements
16. 17
MTBF, FIT calculations (Prediction Method)
To obtain high product reliability, consideration of reliability issues should be integrated from the very
beginning of the design phase. This leads to the concept of reliability prediction.
• MTBF: Mean Operating Time Between Failures
• The failure rate of the system is calculated by summing the failure rates of each component in
each category (based on probability theory). This assumes that a failure of any
component leads to a system failure.
• Constant failure rate: relevant for the useful lifetime
• Fault is repairable
• MIL-HDBK-217F is probably the most internationally recognized empirical prediction method, by far.
• Two prediction methods: parts count and parts stress
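The summation described above can be illustrated with a minimal sketch. It assumes a series system (any component failure is a system failure) and a constant failure rate, as stated on the slide. Failure rates are expressed in FIT (failures per 10^9 device-hours); the component values are hypothetical and not taken from any handbook.

```python
def system_failure_rate_fit(component_fits):
    """Series system: any component failure fails the system, so the rates add up."""
    return sum(component_fits)


def mtbf_hours(fit):
    """MTBF in hours for a constant failure rate given in FIT (failures per 1e9 hours)."""
    return 1e9 / fit


# Hypothetical component failure rates in FIT, for illustration only.
components = {
    "microcontroller": 50.0,
    "dc-dc converter": 120.0,
    "connector": 30.0,
}

total_fit = system_failure_rate_fit(components.values())
print(f"System failure rate: {total_fit} FIT")
print(f"System MTBF: {mtbf_hours(total_fit):.0f} hours")
```

With these numbers the system failure rate is 200 FIT, which corresponds to an MTBF of 5,000,000 hours.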
18. 19
Simulations
FEA Simulations
FEM analysis (FEA):
FEA consists of a computer model (2D or 3D) of a material or design that is stressed and analyzed for
specific results.
It is used in new product design and in existing product refinement. A company can verify that a proposed
design will perform to the client's specifications prior to manufacturing or construction.
What can you check at an early stage?
Point, pressure, thermal, gravity, and centrifugal static loads
Thermal loads from solution of heat transfer analysis
Enforced displacements
Heat flux and convection
Point, pressure and gravity dynamic loads
Examples:
• Drop/shock
• Bending, load
• Vibration
• Thermal cross points
• …
19. 20
Simulations
DESTECS (Design Support and Tooling for Dependable Embedded Control Software)
• Inspiration
o Use collaborative multidisciplinary design of Embedded Systems
o Rapid construction and evaluation of system models
o Evaluated on industrial applications
• Needed because embedded systems have
o More demanding requirements for reliability and fault tolerance
o Increasingly distributed architectures: more complex design possibilities and more fault scenarios
21. 22
Conclusions
Advantages of empirical methods:
• Easy to use, and a lot of component models exist.
• Relatively good indicators of inherent reliability.
• Provide an approximation of field failure rates.
Disadvantages of empirical methods:
• Based on statistical data & sometimes out-dated
• Not all components from new designs are described in
the Standard.
• Failure of the components is not always due to
component-intrinsic mechanisms but can be caused by
the system design.
Simulations
• Early validation of your system
• More and faster iterations
• Parallel hw & sw development
• Early full system validation and risk
mitigation without hw
• Less real-life testing
(= the poor man’s approach)
23. 24
Not Traditional Testing!!
• Traditional (QA) testing is done before product release but after the design & development phase (ex.
Burn-in test, environmental testing, drop testing, shock & vibration testing,…)
• Many of today's products are capable of operating under extremes of environmental stress and for
thousands of hours without failure. Traditional test methods are no longer sufficient to identify design
weaknesses or validate life predictions.
Disadvantages
• Testing under operating conditions takes too long
• Testing is costly (equipment, time,…)
• It tells you nothing about the reliability during useful life, only about infant failures (DOA)
• Too late in the NPD process: design corrections will be
very expensive
24. 25
Highly accelerated testing
HALT = Highly Accelerated Life Test
What?
• Highly accelerated life testing (HALT) techniques are important in uncovering many of
the weak links of a product DURING THE DESIGN PHASE
• These discovery tests rapidly find weaknesses using accelerated stress conditions
• Stresses are applied in a controlled, incremental fashion while the unit under test is
continuously monitored for failures
Why?
HALT reveals, in a matter of hours or days, product failure modes that
traditional test methods can take weeks or even months to find, if at all.
The purpose of HALT is to determine the operating and destruct limits of a design – why
those limitations exist and what is required to increase those margins. HALT, therefore,
stresses products beyond their design specifications.
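The incremental stress application described above can be sketched as a simple step-stress loop. This is a simplified illustration, not a HALT procedure: the function names, the step size and the simulated unit under test are all hypothetical, and a real test would also monitor the unit continuously and check for recovery.

```python
def step_stress(start, step, max_level, is_operational):
    """Raise the stress level in fixed increments until the unit stops working.

    Returns (last_passing_level, first_failing_level). The second value is
    None if the unit survived every level up to max_level.
    """
    last_pass = None
    level = start
    while level <= max_level:
        if not is_operational(level):
            return last_pass, level
        last_pass = level
        level += step
    return last_pass, None


# Hypothetical unit under test: it stops operating above 105 degrees Celsius.
def simulated_unit(temperature_c):
    return temperature_c <= 105


limits = step_stress(start=60, step=10, max_level=150, is_operational=simulated_unit)
print(limits)
```

The pair of levels brackets the operating limit; the design question is then why the limit exists and what would widen the margin.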
25. 26
Procedure?
• Using a test environment that is more severe than that experienced during normal equipment use.
• Done on early prototypes & different design concepts
Since higher stresses are used, accelerated testing must be approached with caution to avoid introducing
failure modes that will not be encountered in normal use. Accelerating factors used, either singly or in
combination, include:
• More frequent power cycling
• Higher vibration levels
• High humidity
• More severe temperature cycling
• Higher temperatures
‘ It’s not a Pass/Fail test but a discovery process! ’
26. 27
Results
• Structural weaknesses
• Electronic weaknesses
• Component failures
• Component dislocation
• PCB delamination, via-cracking, …
• Solder failure
• Software failures due to component degradation
• Connector problems
• ...
• Information on product limits and product capabilities outside the limits
• Product weaknesses & design errors
27. 28
Goals
HALT gives engineers the opportunity to improve the
product design, increasing its robustness and minimizing
the possibility of costly warranty services and expensive
product recalls after release.
Once the weaknesses of the product are uncovered and corrective actions taken, the limits of the
product are clearly understood and the operating margins have been extended as far as possible.
A much more mature product can be introduced much more quickly with a
higher degree of reliability.
28. 29
Taking it a step further…
• Define the S-N curve for the specific failure mechanisms
• Use test data in a model relating the reliability (or life) measured under high stress conditions to that
which is expected under normal operation to determine length of life
• Accelerated test models relate the failure rate or the life of a component to a given stress such that
measurements taken during accelerated testing can then be extrapolated back to the expected
performance under normal operating conditions
Design for Reliability!!! PoF
30. 32
Thermal cycle vs measurement errors
Goal:
Required lifetime of the product = 10 years
Verify the reliability of measurements with a HALT test setup
Discover design weaknesses, improve & repeat the test
Setup:
• Acceleration: cycle 1x/day => 1x/hour
• Acceleration: min-max temperatures & high transients
• Statistically meaningful number of test samples (one is not enough)
• Identify & measure performance parameter(s)
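The effect of compressing one cycle per day into one cycle per hour can be worked out as follows. This sketch covers only the cycle-frequency acceleration from the setup above; any extra acceleration from more severe temperatures is not modeled, and leap days are ignored.

```python
def cycles_required(lifetime_years, cycles_per_day):
    """Number of thermal cycles seen in the field over the required lifetime."""
    return lifetime_years * 365 * cycles_per_day


def chamber_days(total_cycles, test_cycles_per_day):
    """Test duration in days when the cycles are run at the accelerated rate."""
    return total_cycles / test_cycles_per_day


field_cycles = cycles_required(lifetime_years=10, cycles_per_day=1)
acceleration = 24 / 1  # 1 cycle per hour in test versus 1 cycle per day in the field
days = chamber_days(field_cycles, test_cycles_per_day=24)

print(f"Field cycles over 10 years: {field_cycles}")
print(f"Time-compression factor: {acceleration:.0f}x")
print(f"Test duration: {days:.1f} days")
```

Ten years of daily cycles (3650 cycles) take roughly 152 days at one cycle per hour.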
33. 35
Conclusions
• Upfront definition of evaluation criteria is important.
• Multiple failure modes
• Early failures
• Non-constant (random) failures
• Performance degradation over time: Quality of the measurements will degrade in time.
• Temperature induced (thermo-mechanical stress)
34. 36
HALT vs Field & Traditional testing
Field testing
• Time-consuming
• Network
• Costly installations
• More spread in the test results
• Same test conditions cannot be guaranteed: difficult for quantitative comparison
HALT
• Faster results (accelerated stress)
• Correct & increase design reliability throughout the test procedures
• Control over test conditions
• Main costs: fabrication of samples, test setup, assembly, testing,…
Traditional testing
• Time-consuming (operational stress)
• Expensive setups
• Expensive corrective actions
• Too late in design cycle
• Only for infant failures (DOA)
36. 38
Current approaches = not sufficient?
• Mostly only FMECA is executed. It rarely identifies design issues because of its limited focus on the failure
mechanism
• Incorporation of HALT and failure analysis (HALT is test, not DfR; failure analysis is too late)
• MTBF/MTTF calculations tend to assume that failures are random in nature
o Provides no motivation for failure avoidance
• Easy to manipulate numbers
o Tweaks are made to reach the desired MTBF
o E.g., quality factors for each component are modified
• Often misinterpreted
o 50K hour MTBF does not mean no failures in 50K hours
Source: Loughborough University
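The misinterpretation of MTBF can be made concrete with a short calculation. It assumes a constant failure rate, for which the probability of surviving to time t is exp(-t / MTBF); the function name is illustrative.

```python
import math


def survival_probability(t_hours, mtbf_hours):
    """Probability of no failure up to t_hours, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


mtbf = 50_000.0
at_mtbf = survival_probability(50_000, mtbf)
at_tenth = survival_probability(5_000, mtbf)

print(f"Survival to 50K hours: {at_mtbf:.3f}")
print(f"Survival to 5K hours: {at_tenth:.3f}")
```

Under this assumption only about 37% of units reach 50K hours without a failure, and roughly 10% have already failed after 5K hours.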
Alternative = Physics-of-Failure principle:
The use of science (physics, chemistry, etc.) to capture an understanding of failure mechanisms
and evaluate useful life under actual operating conditions
37. 40
Focus on failure mechanisms
Failure Mode:
o The EFFECT by which a failure is OBSERVED, PERCEIVED or SENSED.
Failure Mechanism:
o The PROCESS (elect., mech., phy., chem. ... etc.) that causes failures.
FMMEA: Add failure mechanisms to FMEA
39. 42
Further break-down to PBA level
Failure site = CBGA IC broken-off from PCB
Failure Mode = Solder-joint fatigue
Failure effect: Solder-Joint crack
Solder joint = surface-mount solder attachment.
It provides the electrical interconnection and mechanical attachment of the electronic
component to the PCB, and also critical heat transfer between them.
40. 43
Example: Solder-joint cracks
Failure Mechanism: Solder-joint fatigue by CTE mismatch
Caused by local thermal mismatches between the material characteristics of the IC, the PCB and
the solder itself = CTE mismatch (CTE = coefficient of thermal expansion).
Result: different thermal expansions due to dissipated thermal energy → stress on the solder joints →
fatigue
Fatigue leads to growth of the grains inside the solder. Result: cracks!
41. 44
S-N curve of solder-joint fatigue
• For each failure mode an S-N curve can be defined
• Solder-joint fatigue = Function of Thermal strain vs N cycles to failure
Established out of:
• Test data
• Statistics
• FE simulation
• Physical modeling
42. 45
Acceleration
Acceleration:
Thermal swings (dT) in the operational environment
→ accelerating the thermal strain
→ accelerating solder-joint fatigue
→ accelerating the failure effect: solder-joint crack
Acceleration test:
Thermal cycling test requirements:
• Heat/cool rate limited (transient)
• Allow for minimal dwell times at extreme temperatures: time is essential.
• Materials set limits to temperature extremes
Establish the accelerating factor = thermal strain (accelerated temperature conditions) / thermal strain
(normal temperature conditions)
Acceleration model:
A mathematical model that extrapolates the number of cycles to failure under accelerated
temperature conditions to the number of cycles to failure under operational temperature conditions
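One possible form of such an acceleration model is a power law in the strain ratio, in the spirit of a Coffin-Manson type fatigue relation. The slides do not name a specific model, so the power-law form, the exponent and all numbers below are assumptions for illustration only.

```python
def strain_ratio(strain_accelerated, strain_normal):
    """Accelerating factor as defined above: ratio of the thermal strains."""
    return strain_accelerated / strain_normal


def cycles_to_failure_in_use(cycles_in_test, ratio, exponent):
    """Power-law extrapolation (assumed model): N_use = N_test * ratio ** exponent."""
    return cycles_in_test * ratio ** exponent


# Hypothetical values: the exponent depends on the material and failure
# mechanism and must come from test data or literature.
ratio = strain_ratio(strain_accelerated=0.004, strain_normal=0.002)
n_use = cycles_to_failure_in_use(cycles_in_test=1_000, ratio=ratio, exponent=2.0)

print(f"Strain ratio: {ratio:.1f}")
print(f"Extrapolated cycles to failure in use: {n_use:.0f}")
```

With a strain ratio of 2 and an assumed exponent of 2, 1000 cycles to failure in test extrapolate to 4000 cycles under operational conditions.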
43. 46
Example: Solder-joint cracks
Establish test failure distribution and predict operational failure distribution
using the acceleration factors and the operational use of the product
Use test data in a model relating the reliability (or life) measured under high
stress conditions to that which is expected under normal operation to
determine length of life
[Figure: plot marking the test point and the operation point]
44. 47
Characteristics, benefits and limitations:
• Physics not statistics.
• The only way to predict long term wearout lifetime.
• Testing is in general done on specially designed test samples, not on the actual product.
• It is input for the design process. It can be established independently of the design cycle. Time-to-market!
• Requires profound understanding of technologies used in the product and the wearout physics
involved.
• Limitation: Establishing the S-N curves and acceleration factors is a tedious, time-consuming and
expensive job with a lot of pitfalls. Therefore, for many relevant failure mechanisms S-N or
acceleration factor information is not available.
• Still subject of scientific research.
49. 52
VERHAERT MASTERS IN INNOVATION®
Headquarters
Hogenakkerhoekstraat 21
9150 Kruibeke (B)
tel +32 (0)3 250 19 00
fax +32 (0)3 254 10 08
ezine@verhaert.com
More at www.verhaert.com
VERHAERT MASTERS IN INNOVATION®
Netherlands
ESIC European Space Innovation Centre
Kapteynstraat 1
2201 BB Noordwijk (NL)
Tel: +31 (0)618 12 19 19
derk.schneemann@verhaert.com
MASTERS IN INNOVATION® is a platform set up by VERHAERT to train, stimulate and incubate
you as an innovator.
We provide an extensive training program with different tracks and covering critical areas of new
products and business innovation.
Furthermore we manage the VERHAERT venturing program and organize our Innovation Day, an
annual conference on best practices and insights on new products & business innovation.
Editor's Notes
learning to fail fast = go looking for
Today I would like to give you some insights into different reliability approaches and how they affect the cost-effectiveness of your product.
The title mentions embedded systems, but let's broaden the scope to products in general.
At the customer = in the intended use environment (operator, stand-alone,…)
Cost efficiency is not the only good driver for defining a reliability approach; it also has a direct link to customer satisfaction.
I've taken an example of a commercial product, because reliability is not only important for mission- or safety-critical devices.
A survey that compares different brands of mobile phones.
Reliability is the feature most appreciated by the customer, just ahead of usability. Other features like design and technical gadgets are considerably less important!
Conclusion: Every product needs a reliability approach to increase customer satisfaction and sales.
This curve is taken from the automotive sector.
Vertical axis: cost of making changes (x = 4K~6K USD per change)
Horizontal axis: development stages
Black curve: the cost of design changes rises throughout the development stages.
Blue curve: typical number of design changes. There are lots of changes throughout the design & development stages.
The important point here is that design changes due to reliability problems in the production phase or, even worse, in the field will cause high costs compared to changes in the design & development stages.
Conclusion: Your reliability approach must strive to address reliability problems in the design & development stages to improve the cost-efficiency of your product and the time-to-market!
At the DFMEA level, it is usually recommended to study each subsystem separately, and each component separately. Their inter-relations can be evaluated in the System FMEA.
The System FMEA examines system deficiencies caused by potential failure modes between the functions of the system. This includes the interactions between the systems and the elements of the systems.
The PFMEA is conducted on a process, whether it be in a manufacturing or a service environment. It is generally recommended to study each machine or sub-process separately. Their inter-relationships can also be studied in a System FMEA. Service FMEAs are usually not preceded by a DFMEA.
In general, FMEA / FMECA requires the identification of the following basic information:
Item(s)
Function(s)
Failure(s)
Effect(s) of Failure
Cause(s) of Failure
Current Control(s)
Recommended Action(s)
Plus other relevant details
Risk Priority Numbers (RPNs)
Criticality Analysis (FMEA with Criticality Analysis = FMECA)
A mature product will be attained much sooner.
The ability to detect and correct defects much earlier in the design and production cycles provides major advantages in terms of time and dollars saved. A high degree of product maturity is realized prior to product shipment as opposed to traditional “life cycle time” in the field.
Production release will be expedited.
Proper use of HALT technology greatly enhances the probability of completing DVT or qualification testing on initial passes. Minimizing “redesigns” and repeated test cycles, and thus cutting weeks or months from the development schedule, translates into significant savings in total program costs.
Warranty costs will be greatly reduced.
HALT & HASS implementation delivers a far more mature product than previous test processes. Early life failures have been minimized, operational margins have expanded, manufacturing defects have been controlled, and overall product reliability has been elevated to new levels. These factors have led to real-world reductions in reliability issues, warranty costs, and NDF (no defect found) situations.
Customer satisfaction will be enhanced.
The ability to consistently deliver reliable, cost-effective product solutions is one of the keys to achieving and maintaining a high level of customer satisfaction. HALT & HASS technology has consistently demonstrated the ability to provide the product quality required to maintain positive relationships with customers. And given the fact that it costs at least five times as much to gain a new customer as it does to maintain a current one, the advantages inherent in the use of these technologies become quite evident.
We will map not only the failure modes, but also the physical process (failure mechanism) that controls or influences the failure mode.
E.g., failure due to metal fatigue = failure mode. It is controlled by, e.g., the thermal stress applied to it = failure mechanism.
2. One test focuses on the nominal stress required to cause a fatigue failure in some
number of cycles. This test results in data presented as a plot of stress (S) against the
number of cycles to failure (N), which is known as an S-N curve.
Through experiments, modeling, simulation and statistical calculations, we can
map the failure mechanism by means of the S-N curve, which is a function of the
stress versus the number of cycles until a certain level of fatigue occurs.
You then see that the lower the stress, the more cycles you can go through before a failure (namely
fatigue) occurs.
3. There are analytical models that can extrapolate accelerated test data
to the normal operational regime. This means that you map the accelerated
measurement data St and Nt to the normal operating time. In this way you can
make predictions about the occurrence of failures.