How do you know what to monitor in your environment? Failure modes have become so complex that we need a cross-functional view of the system to identify what failure looks like. This talk walks through the FMEA process as applied to monitoring and metrics collection. The process will help you identify your failure points and the risks associated with a particular failure mode.
2. Bio - Jeff Smith
• Manager, Site Reliability Engineering
at Grubhub
• Yes, we are also hiring.
• Yes, there is free food. Yes, it's totally
awesome to work here.
Email: jsmith@grubhub.com
Twitter: @DarkAndNerdy
Blog: http://www.allthingsdork.com
12. FMEA
Failure Mode and Effects Analysis (FMEA) is a step-by-step approach for
identifying the possible ways a process, product or service
might fail. The process is commonly leveraged in quality
organizations across a wide range of industries.
13. FMEA in Software Engineering
We can use FMEA in a number of ways in software to help us
brainstorm, rank and prioritize different actionable bits about
the system. The process will help us
• Identify key metrics that need tracking
• Identify monitoring and/or alerts that need to be created
• Identify necessary feedback loops
15. The Process
1. Examine the process
2. Brainstorm potential failures
3. List potential effects of failure
4. Identify Your Scale
5. Assign Severity ranking
6. Assign Occurrence ranking
7. Assign Detection ranking
17. Brainstorm Potential Failures
• Brainstorming should be fluid. Anything goes.
• Cross-Functional teams should be involved. (Business,
development, operations, design)
18. List Potential Effects of Failure
Think through the impact of failure. The impact might be
something process related, reputation related or technical, just
to name a few. Examples:
• Degraded customer experience
• Order not fulfilled
• Delay in payment to accounts receivable
19. Agree on Risk Level Scales
Technology Industry
• Low severity could be degraded performance
• High severity could be complete site outage
Airline Industry
• Low severity could be departure delay
• High severity could be customer death
20. Assign Severity Ranking
Rank the severity on a scale between 1-10.
• 1 being the severity is inconsequential
• 10 being a catastrophic failure
In some organizations, 9 and 10 are reserved for personal
injury and death.
If a failure mode has more than one effect, select only the most
severe of the effects.
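As a minimal sketch of the "most severe effect" rule, assuming each effect has already been scored on the 1-10 scale: the effect names come from the earlier slide, but the scores and the `severity_ranking` helper are made up for illustration.

```python
# Hypothetical effects of one failure mode, each with a 1-10 severity score.
effects = {
    "Degraded customer experience": 4,
    "Order not fulfilled": 7,
    "Delay in payment to accounts receivable": 5,
}


def severity_ranking(effect_scores):
    """Return the most severe effect and its score for a failure mode."""
    for name, score in effect_scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{name}: severity {score} is outside 1-10")
    worst = max(effect_scores, key=effect_scores.get)
    return worst, effect_scores[worst]


worst_effect, severity = severity_ranking(effects)
print(worst_effect, severity)
```

Only the single worst score carries forward into the ranking, so a failure mode with many mild effects does not outrank one with a single severe effect.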
21. Assign Occurrence Ranking
Rank the likelihood that this condition will occur.
• 1 being extremely unlikely
• 10 being inevitable.
22. Assign Detection Ranking
Rank the likelihood that this condition would be detected if it
occurred. A scenario is only considered "detected" if it is found
before it would impact a customer or user.
• 1 means the control is certain to detect the failure
• 10 means the control is certain not to detect the failure
23. Calculate the Risk Priority Number
The Risk Priority Number (RPN) is a value calculated to rank a
particular failure mode, conventionally the product of its Severity,
Occurrence and Detection rankings. The higher the RPN, the sooner the
failure mode should be addressed.
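A small sketch of the calculation, assuming the conventional FMEA formula (RPN = Severity × Occurrence × Detection, each ranked 1-10). The `FailureMode` class and the example scores are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 = inconsequential, 10 = catastrophic
    occurrence: int  # 1 = extremely unlikely, 10 = inevitable
    detection: int   # 1 = certain to be detected, 10 = certain to be missed

    def __post_init__(self):
        # Reject scores that fall outside the agreed 1-10 scale.
        for field in ("severity", "occurrence", "detection"):
            value = getattr(self, field)
            if not 1 <= value <= 10:
                raise ValueError(f"{field} must be between 1 and 10, got {value}")

    @property
    def rpn(self):
        # Assumed formula: the product of the three rankings.
        return self.severity * self.occurrence * self.detection


# Hypothetical failure mode with made-up scores.
db_failover = FailureMode("Primary database fails over",
                          severity=8, occurrence=3, detection=6)
print(db_failover.rpn)  # 8 * 3 * 6 = 144
```

With this formula the RPN ranges from 1 to 1000. A failure that is hard to detect raises the number just as much as one that is severe or frequent.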
25. Develop an Action Plan
Evaluate the list and develop an action plan to eliminate or
mitigate the items with the highest RPN value first.
• Prioritize solutions that are self-healing and exist within the
system under consideration.
• Develop metrics that help to track the health surrounding a
failure item
• The goal is to reduce the RPN by lowering Severity,
Occurrence or Detection scores
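The prioritization step could be sketched as follows, again assuming the RPN is the product of the three rankings. The failure modes, their scores and the mitigation are invented for illustration.

```python
# Hypothetical failure modes as (name, severity, occurrence, detection).
failure_modes = [
    ("Payment gateway timeout", 8, 4, 3),
    ("Search index goes stale", 4, 6, 7),
    ("Order queue backs up", 7, 5, 8),
]


def rpn(severity, occurrence, detection):
    # Assumes the conventional FMEA product of the three rankings.
    return severity * occurrence * detection


def prioritize(modes):
    """Sort failure modes so the highest RPN comes first."""
    return sorted(modes, key=lambda m: rpn(*m[1:]), reverse=True)


for name, s, o, d in prioritize(failure_modes):
    print(f"{rpn(s, o, d):4d}  {name}")

# Hypothetical mitigation: an alert on queue depth improves the
# Detection score from 8 to 2, which lowers the RPN for that item.
before = rpn(7, 5, 8)
after = rpn(7, 5, 2)
print(before, "->", after)
```

Re-scoring after each mitigation shows whether the action plan actually moved the item down the list.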
26. Ensuring You Have a Feedback Loop
The feedback loop is a constant evaluation of these
measurements and indicators. The feedback loop should give
a strong indicator that the system is working as expected, while
at the same time exposing trends in the environment.
27. Leading and Lagging Indicators
Leading Indicator - A measurable factor that changes before
the system enters a particular state of failure. (Metrics)
Lagging Indicator - A measurable factor that changes after
the system enters a particular state of failure. (Logs/Reporting)
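To make the distinction concrete, here is a hedged sketch using a disk filling up as the failure state. The threshold, the function names and the sample values are assumptions for illustration. The leading indicator is a metric that can fire before the disk is full, and the lagging indicator is a log message that appears only after writes have started failing.

```python
def leading_alert(disk_used_pct, threshold=85.0):
    """Leading indicator: a metric that crosses a threshold before failure."""
    return disk_used_pct >= threshold


def lagging_alert(log_lines, marker="No space left on device"):
    """Lagging indicator: a log entry that appears after failure has begun."""
    return any(marker in line for line in log_lines)


# Made-up samples: usage is high, but no write has failed yet.
print(leading_alert(91.5))
print(lagging_alert(["INFO request served"]))
```

An alert on the leading indicator gives the team time to act before customers are affected, which is what lowers the Detection score.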
28. Recap
• Examine your process, and assemble a cross-functional
team with different views of the system
• Brainstorm all your potential failure modes
• Calculate your RPN
• Develop action plans to reduce risk. Ensure the system is
providing feedback loops to be able to identify the current
state of the system
• Profit
29. Resources
• Quality One FMEA Writeup
• Purdue University FMEA Presentation
• iSixSigma
• Google Docs FMEA Template
• Brainstorming tools: MindNode, FreeMind, XMind