1
Sean Carter, NASA JSC
Daniel Deans, ManTech SRS Technologies
Constellation Reliability
Engineering Process –
Optimizing CxP Risk
Used with Permission
2
DFRAM Overview
 Why does reliability engineering exist?
 How does it fit within the life cycle?
 Success space vs. failure space
 Partnership on system engineering team
 The value of “designing-out” failure modes
 Where does it fit in the lifecycle?
 What are some of the tools?
 How are they applied?
 Real examples
3
Reliability Engineering Value - Clichés
 Failure is not an option…
 A design engineer does not know what he does not know
 An extra set of eyes and ears is always good
 You have to spend money to make money
 Mr. Murphy tends to rear his ugly head when you are not expecting it…
 What all this means is: You have to work at it – nothing worth accomplishing comes easy
 Reliability engineering is a discipline that adds value to the systems engineering process!
4
Typical System Engineering Lifecycle
5
Reliability Engineering Throughout Project Life
6
The Life Cycle Approach
 Reliability is best designed-in;
it is, for the most part, not:
 Analyzed in
 Tested in
 Operated in
 Successful reliability performance
begins with a diligent, intentional
approach at the very beginning of a project
 Pre-phase A: requirements
 Phase A: allocation; plan; resources
 Phase B: analysis, design input, preliminary design review
 Phase C: detailed design inputs; more analysis; trade studies;
design verification; critical design review
 Phase D: test planning, test readiness, manufacturing, final
validation; flight readiness review
 Phase E/F: ops, growth, disposal and lessons learned
[Figure: system engineering lifecycle diagram.
Labels: System Engineering; Test and Assessment.
Blocks: System Concept Exploration; Preliminary Design; Design Synthesis; Component Fabrication, Assembly, Integrate, & Test; Element Integration & Test; System Integration Test; System Element Data Reduction and Assessment.
Supporting functions: Project Direction and Control; Project Direction, Control, & Planning; System Analysis; Requirements Compliance; Configuration Management; Risk Management.
Callouts: System, Element, Subsystem Models; System Performance Analyses; Specifications; Verification; Management Plan; Budget Development & Control; Project Plan Development; Schedule Development & Control; Design Data Base; Problem/Failure Reports (PFR); Engineering Change Orders; Risk Planning; Risk Assessment; Risk Handling/Mitigation; Risk Monitoring.]
7
Success Space vs. Failure Space
 A design engineer thinks in success space (typically)
 How will the widget work?
 When it is designed, what function will it perform?
 What are the performance requirements?
 A reliability engineer is paid to think in failure space
 How will the widget fail?
 What about the operating environment will cause issues?
 What materials, processes, and tools will accentuate failure modes?
 Is redundancy required?
 Are there operational work-arounds?
 How will faults propagate through the system?
 What are the effects of a failure mode on the mission?
 Superimpose the two processes and you get success!
8
Credibility: Partnership on
System Engineering Team
 Safety and Mission Assurance organization provides
discipline experts to support design teams
 Our job is to serve; not to inhibit
 We help the system engineering teams identify
hazards and failure modes and design them out
 Our sole reason for existing is to ensure
project/program success and to reduce/eliminate
operational risk
 We are partners for success
 The aim in partnership is to duplicate our knowledge
in the collective heads of our design-team partners
9
The Value of “Designing-Out” Failure Modes
 A failure mode is an obstacle to mission success
 Not every failure mode causes mission failure, but any
component failure has the potential to
 In the commercial world, a failure in the field costs 10 times
what it costs to mitigate in the design process
 In the space business, a failure can and will cost the
mission and quite possibly endanger people
 Identifying and designing-out failure modes is important!
10
How Do We Design Out Failure Modes?
 Methodical process; starts in pre-phase A, follows the lifecycle.
 DMEDI – Define, Measure, Explore, Develop, Implement
(12 steps)
 Define requirements
 Allocate requirements
 Plan activities and analysis, including test and verification
 Collect data and develop data sources
 Use RAM simulation, FMEA, FTA, worst case analysis, derating,
proven design practices to drive the design
 Support design reviews and require improvement
 Verify and ensure that design will meet requirements
 Plan and implement thorough testing
 Finalize verification, ascertain flight readiness
 Identify reliability growth opportunities once design is complete
 Investigate and eliminate root causes of anomalies
 Develop lessons learned, provide feedback to future engineering teams
11
Pre-Phase A Concept
Development
 Very important part of process –
DFRAM starts here
 Develop requirements that will
optimize RAM for program/project
 Requirements include availability,
mean time to failure, fault tolerance,
mean time to repair, time to replace
 Import lessons learned from similar
programs/systems
 Collect similar system failure history
data
 Begin development of system model
 Begin development of RAM Plan
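
To show how two of the requirements listed above relate, steady-state (inherent) availability is commonly computed as MTTF / (MTTF + MTTR). The sketch below assumes that textbook relation. The function name and the numbers are hypothetical and are not Constellation requirement values.

```python
def inherent_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a repairable item: uptime / (uptime + downtime)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must be non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical requirement values, chosen only for illustration.
availability = inherent_availability(mttf_hours=5000.0, mttr_hours=50.0)
print(f"Inherent availability: {availability:.4f}")
```

A requirement set written this way makes the trade visible early: the same availability can be bought with a longer time to failure or a shorter time to repair.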
12
Phase A: Preliminary Analysis
 Refine requirements, negotiate
allocations with design elements
 Finalize RAM Plan and educate design
team on process; what role reliability
engineering team will fill
 Continue to develop preliminary model;
begin FMEAs, FTAs, Probabilistic
assessments
 Allocate requirements to lowest
design-to level
 Negotiate failure definitions, failure
budgets with design teams
 Identify initial critical items, compare with
lessons learned from previous systems
 Continue to identify data sources
 Identify critical suppliers; begin to form
partnerships
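
One simple way to picture allocation to the design-to level is equal apportionment for elements in series, with a weighted split of a failure-rate budget as an alternative. The sketch below assumes a purely series system with independent elements. The function names, element names, and weights are hypothetical, and the deck does not say which allocation method the program used.

```python
def equal_apportionment(system_reliability: float, n_elements: int) -> float:
    """Reliability each of n series elements must meet so their product hits the system goal."""
    if not 0.0 < system_reliability <= 1.0 or n_elements < 1:
        raise ValueError("need 0 < system_reliability <= 1 and n_elements >= 1")
    return system_reliability ** (1.0 / n_elements)


def weighted_failure_rate_allocation(system_failure_rate: float, weights: dict) -> dict:
    """Split a series-system failure-rate budget in proportion to weights (e.g. complexity)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: system_failure_rate * w / total_weight for name, w in weights.items()}


# Hypothetical goal and elements, for illustration only.
per_element = equal_apportionment(system_reliability=0.99, n_elements=3)
budget = weighted_failure_rate_allocation(
    system_failure_rate=1.0e-4,  # failures per hour, illustrative
    weights={"propulsion": 5, "avionics": 3, "structure": 2},
)
print(f"Each element must achieve at least {per_element:.5f}")
print(budget)
```

The weighted form is where negotiation with the design teams enters: the weights encode who gets how much of the failure budget.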
13
Phase B – Preliminary Design
 Continue to build simulation (model) and
add more details
 Identify the most effective analysis tools to use
to drive design
 Complete preliminary FMEA, FTA, PRA
 Continue to develop supplier partnerships
 Prepare for preliminary design review
 Perform maintenance task analysis
 Identify design improvement initiatives and
optimize using simulation
 Perform other sensitivity studies based on
fault tolerance requirements
 Begin developing and finalizing FRACAS,
test plans, reliability growth strategy
 Partner with designers to identify failure
modes, design them out
 Support concept of operations optimization
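
A sensitivity study on fault tolerance often starts by comparing redundancy options. The sketch below assumes identical, independent units and uses the standard binomial k-out-of-n expression. The function name and the unit reliability value are hypothetical.

```python
from math import comb


def k_of_n_reliability(k: int, n: int, unit_reliability: float) -> float:
    """Probability that at least k of n identical, independent units survive."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    r = unit_reliability
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


# Hypothetical unit reliability; compare a few redundancy options.
unit = 0.9
for k, n in [(1, 1), (1, 2), (2, 3)]:
    print(f"{k}-of-{n}: {k_of_n_reliability(k, n, unit):.4f}")
```

Real redundancy is rarely this clean: common-cause failures and switching logic break the independence assumption, which is one reason the deck pairs such studies with FMEA and simulation.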
14
Phase C – Detailed Design
 Perform detailed design analysis – PDR recovery
 Focus on Pareto items identified from analyses (Top 10)
 Continue to develop and use RAM simulation, FMEA,
FTA, etc. to design out failure modes
 Use Con-Ops to develop operational work-arounds as
failure mode mitigation
 Finalize test plans; review for reliability success criteria
 Audit suppliers, provide support for reliability
improvement
 Mitigate schedule risks
 Finalize critical items, document for testing
 Begin life testing of components and subsystems as
feasible
 Perform specialized analysis (sneaks, fault propagation)
 Prepare for and support CDR
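
The Pareto focus mentioned in this phase can be sketched as a ranking of contributors by their share of predicted downtime or failures, keeping those that together cover a chosen fraction. The function name, the 80% threshold, and the contributor data below are hypothetical.

```python
def pareto_items(contributions: dict, threshold: float = 0.8) -> list:
    """Return the largest contributors whose cumulative share first reaches the threshold."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive value")
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += value
        if cumulative / total >= threshold:
            break
    return selected


# Hypothetical predicted downtime hours per contributor, for illustration only.
downtime_hours = {"pump": 50, "valve": 30, "sensor": 10, "harness": 6, "controller": 4}
print(pareto_items(downtime_hours))
```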
15
Phase D – Development
 Finalize design - CDR recovery, cut into
manufacturing
 Finalize FMEAs, FTAs, Simulations, CILs
 Support testing, root cause
investigations and corrective action
 Begin collection of failure and
operational history data (upon first
application of power)
 Finalize reliability growth strategy
 Develop and begin implementation of
reliability-centered maintenance
approach
 Make “last minute” improvements based
on test results
 Identify lessons learned and document
 Update Con-Ops with operational work-arounds
for critical items
16
Phase E/F – Ops and Disposal
 Continue to gather data, monitor
operations for anomalies
 Support failure analyses, root cause
investigations
 Implement reliability growth process,
identify areas for growth, design
solutions
 Document lessons learned
 Use simulation to validate reliability
growth strategy, sensitivities
 Update RAM Plan with lessons
learned
 Support system disposal via
identification of reliability challenges
to shutdown
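
The deck does not name a reliability growth model. As one common choice, the sketch below assumes a Duane-style model, in which cumulative MTBF plotted against cumulative operating time is roughly a straight line on log-log axes and the slope is the growth rate. The function name and the failure times are hypothetical.

```python
import math


def fit_duane(failure_times: list) -> tuple:
    """Least-squares fit of log(cumulative MTBF) against log(cumulative time).

    failure_times are cumulative operating hours at each successive failure.
    Returns (growth_slope, intercept) of the fitted log-log line.
    """
    if len(failure_times) < 2:
        raise ValueError("need at least two failure times")
    xs = [math.log(t) for t in failure_times]
    # Cumulative MTBF after the i-th failure is total time divided by failure count.
    ys = [math.log(t / (i + 1)) for i, t in enumerate(failure_times)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical cumulative hours at each failure, for illustration only.
observed = [35.0, 110.0, 250.0, 480.0, 800.0, 1250.0]
growth_slope, intercept = fit_duane(observed)
print(f"Estimated growth slope: {growth_slope:.2f}")
```

A positive slope indicates that corrective actions are lengthening the time between failures. A slope near zero suggests the growth process is not taking hold.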
17
What are the Tools?
 Some of the tools that we use are:
 Requirements allocation
 RAM simulation/probabilistic risk assessment
 FMEA/FMECA
 Fault tree analysis (FTA)/event tree assessment
 Parts stress analysis/derating
 Detailed design analysis
 Worst case analysis
 Redundancy screens
 Extensive testing and verification analysis
 Reliability growth planning and implementation
 Others….
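
As a small illustration of one tool on this list, a fault tree can be quantified by combining basic-event probabilities through AND and OR gates. The sketch below assumes the basic events are independent. The tree structure, event names, and probabilities are hypothetical.

```python
from math import prod


def or_gate(probabilities: list) -> float:
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities: list) -> float:
    """Probability that all independent input events occur."""
    return prod(probabilities)


# Hypothetical top event: loss of flow occurs if both pumps fail, or the controller fails.
p_pump_a = 1.0e-3
p_pump_b = 1.0e-3
p_controller = 1.0e-5
p_top = or_gate([and_gate([p_pump_a, p_pump_b]), p_controller])
print(f"Top event probability: {p_top:.3e}")
```

Even in this toy tree the single-point controller failure dominates the redundant pump pair, which is the kind of finding that feeds a critical items list.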
18
Reliability and Maintainability Simulation
 A very powerful process
 Can help design out failure modes without cutting metal
 Supports application of the Pareto Principle (80/20)
 Gives design team a tool for sensitivity analysis
 Allows for trying many different scenarios
 Helps to optimize the return on investment based on the
cost-to-improve curve
[Figure: cost-to-improve curve plotting Reliability against Cost ($). A region of high rate of return gives way to an area of diminishing return at the KITC point.]
KITC = point on the curve where rise becomes less than run (reliability improvement = rise, cost to improve = run)
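
The KITC definition (read here as the knee of the curve) can be sketched numerically. Because cost and reliability have different units, the sketch assumes both axes are first scaled to the range 0 to 1, so "rise less than run" becomes a normalized slope below 1. That scaling, the function name, and the data points are assumptions, not part of the original slide.

```python
def find_kitc_index(costs: list, reliabilities: list) -> int:
    """Index of the first point after which normalized rise is less than normalized run.

    Points must be sorted by increasing cost. Returns the last index if the
    curve never flattens below a normalized slope of 1.
    """
    if len(costs) != len(reliabilities) or len(costs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    cost_span = costs[-1] - costs[0]
    rel_span = reliabilities[-1] - reliabilities[0]
    if cost_span <= 0 or rel_span <= 0:
        raise ValueError("cost and reliability must both increase overall")
    for i in range(len(costs) - 1):
        run = (costs[i + 1] - costs[i]) / cost_span
        rise = (reliabilities[i + 1] - reliabilities[i]) / rel_span
        if rise < run:
            return i
    return len(costs) - 1


# Hypothetical cost-to-improve curve (cost in $M), for illustration only.
costs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
reliabilities = [0.80, 0.90, 0.95, 0.97, 0.98, 0.985]
knee = find_kitc_index(costs, reliabilities)
print(f"KITC at cost {costs[knee]} with reliability {reliabilities[knee]}")
```

Spending beyond the returned point still improves reliability, but each increment buys less than the one before it.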
19
Simulation Basics
 Simulations are built based on the system architecture
 Model provides for “RAM” characteristics of system
 Input data includes failure rates, repair times, sparing
information, logistics information, operational
work-arounds
 Simulation is run based on mission profiles
 “Monte Carlo” methodology is used
 Typically data is input using statistical distributions
 Outputs are system availability and cutsets (and other
failure “illuminators”)
 Cutsets lead to sensitivity analyses which in turn can
drive improvements (failure mode elimination)
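
The basics above can be sketched as a very small Monte Carlo availability model. It assumes a series system (any component down means the system is down), exponentially distributed times to failure and repair, and no sparing or logistics delays. The function name, component names, and values are hypothetical. A real RAM tool also reports cutsets, which this sketch does not.

```python
import random


def simulate_availability(components: dict, mission_hours: float, n_runs: int, seed: int = 0) -> float:
    """Estimate series-system availability over a mission by Monte Carlo simulation.

    components maps a name to (mttf_hours, mttr_hours).
    """
    rng = random.Random(seed)
    total_up_time = 0.0
    for _ in range(n_runs):
        is_up = {name: True for name in components}
        # Time at which each component next changes state (fails or is repaired).
        next_change = {
            name: rng.expovariate(1.0 / mttf)
            for name, (mttf, _mttr) in components.items()
        }
        now = 0.0
        while now < mission_hours:
            name = min(next_change, key=next_change.get)
            event_time = min(next_change[name], mission_hours)
            if all(is_up.values()):
                total_up_time += event_time - now
            now = event_time
            if now >= mission_hours:
                break
            is_up[name] = not is_up[name]
            mttf, mttr = components[name]
            mean_duration = mttf if is_up[name] else mttr
            next_change[name] = now + rng.expovariate(1.0 / mean_duration)
    return total_up_time / (n_runs * mission_hours)


# Hypothetical two-component system; values are illustrative only.
example_components = {"pump": (1000.0, 10.0), "valve": (2000.0, 20.0)}
estimate = simulate_availability(example_components, mission_hours=10000.0, n_runs=200, seed=1)
print(f"Estimated system availability: {estimate:.4f}")
```

Fixing the seed keeps runs repeatable, which helps when comparing a baseline against a proposed design change.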
20
RAM Simulation Example
 Simulation is dynamic, not static analysis
 Can provide much information about overall availability
of system under many different sets of conditions
 Today’s tools can include operational concepts and
rules, optimization of spares (some automatic)
 Requires specific input data
21
How Results are Used
 Outputs of baseline simulations are verified and
validated using expert elicitation
 Once all agree that the simulation is in the “ballpark,” begin
the sensitivity analyses (do not get wrapped around the axle on
the numbers; it is the gap elimination that provides the most
value)
 Identify opportunities for improvement, plug those back
into the sim, ascertain value of improvements
 Continue this process until gaps are eliminated or at
least reduced.
 This can include block improvement of overall
component failure rates – get the suppliers in on the act
(supplier partnerships)
 Ensure data from simulation is used in the design
process
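
The improve-and-rerun loop described above can be sketched as a one-at-a-time sensitivity study. To stay short, the sketch uses a closed-form series availability as a stand-in for the simulation. The function names, the factor-of-two improvement, and the component data are hypothetical.

```python
def series_availability(components: dict) -> float:
    """Steady-state availability of independent components in series."""
    result = 1.0
    for mttf, mttr in components.values():
        result *= mttf / (mttf + mttr)
    return result


def rank_improvements(components: dict, mttf_gain: float = 2.0) -> list:
    """Rank components by system availability gained when each one's MTTF is improved alone."""
    baseline = series_availability(components)
    gains = {}
    for name, (mttf, mttr) in components.items():
        trial = dict(components)
        trial[name] = (mttf * mttf_gain, mttr)
        gains[name] = series_availability(trial) - baseline
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (MTTF hours, MTTR hours) per component, for illustration only.
baseline_components = {
    "pump": (500.0, 24.0),
    "valve": (2000.0, 8.0),
    "controller": (10000.0, 4.0),
}
for name, gain in rank_improvements(baseline_components):
    print(f"{name}: +{gain:.4f} availability")
```

The ranking matters more than the absolute numbers, consistent with the advice above not to fixate on the figures themselves.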
22
Success Stories: NASA Instrument Design
 Validation of proper installation of sample cup retaining springs
on Sample Manipulation System to preclude workmanship
failures (a single ring failure would result in loss of solid sample
science).
 Use of physics of failure methods to identify and eliminate,
where possible, failure modes of Pyrolysis Oven.
 Implementation of HiPot test for Wide Range Pump motor to
eliminate workmanship related failures.
 Identification of Hall Effect Device on actuators as possible
Radiation Sensitive device. Subsequent testing validated
suitability of device.
 Identification of thermal switch on Gas Trap as Reliability
Issue. Redesign produced higher Reliability solution.
 FMEA of Gas Processing System provided justification for
addition of limited redundancy.
 Improved reliability of instrument by approximately 25% based on
initial predictions.
23
Complex Space Systems Application
 Predicated on effective
requirements
implementation
 Detailed RAM Plan
developed and
implemented at Program
Level
 RAM requirements, RAM
Plan flowed down to
systems, elements of
systems
 System owners
responsible for DFRAM,
but program will facilitate
and audit
 Program level analyses
including simulation, FMEA,
PRA being performed
 Verification and validation
will be program level
functions
 PRA will be part of flight
readiness decision
 Software included in DFRAM
activities (no longer black
box)
 System Engineering
organization partnering with
S&MA organization for RAM
implementation
24
SUMMARY
 Success of a system
predicated on intentional
implementation of DFRAM
 It will not happen
spontaneously
 Must be married with the
system engineering
process
 Program management
must be disciples – will
not work otherwise
 It is always easier and
more cost effective to do
it right the first time
 Implementation requires
people skills and a
service mentality
