11.analysing
ProblemManagementFoundation
Analysing a major incident
• The causes of a major incident are a problem
• Other problems are highlighted by the manner in which the major
incident is handled
• Refer to the Major Incident Classification Tool in the Appendix
• The tool is used to ensure the correct classification of a major
incident and that all details are captured
Benefits of effective classification
Successful categorization helps in many ways. Here are a few of them:
• Quickly find solutions (workarounds and/or fixes)
• Properly route incidents to the correct support group
• Gather sufficient data to speed diagnosis by the next level of support
• Aid in building and maintaining a knowledge base
• Improve the efficiency of technical and functional groups
• Enhance customer satisfaction
• Increase user productivity
• Build maturity toward more proactive operations
Operational outage classification
Prioritization
Critical - An immediate and sustained effort using all available resources until resolved. On-call
procedures activated, vendor support invoked.
High - Technicians respond immediately, assess the situation, and may interrupt other staff working
on low- or medium-priority jobs for assistance.
Medium - Respond using standard procedures and operating within normal supervisory
management structures.
Low - Respond using standard operating procedures as time allows.
Urgency - Caused by uncontrolled trigger, scheduled/unscheduled change or maintenance
Operations - Effect on business functions and activities and interference with deliverables
Credibility - Areas affected including external/internal
Scope - Quality of customers impacted
Outage analysis: Service period
• Critical - App, server, link (network or voice) unavailable for greater than 4 hours
or degraded for greater than 1 day – negative business delivery for more than 1
month
• Major - App, server, link (network or voice) unavailable for greater than 1 hour or
degraded for greater than 4 hours - negative business delivery for more than 1
week
• Moderate - App, server, link (network or voice) unavailable for greater than 30
minutes or degraded for greater than 1 hour - negative business delivery for
more than 1 day
• Minor - App, server, link (network or voice) unavailable for greater than 5 minutes or
degraded for greater than 30 minutes - negative business delivery for more than
1 hour
• Low - App, server, link (network or voice) unavailable for less than 5 minutes or
degraded for less than 30 minutes - negative business delivery for less than 1
hour
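The service period thresholds above can be sketched as a small lookup. This is only an illustration, not the Major Incident Classification Tool itself: the function and variable names are hypothetical, the "negative business delivery" criterion is left out, and durations that fall exactly on a boundary (which the slide does not define) are assumed to drop to the lower level.

```python
from datetime import timedelta

# (level, unavailable for greater than, degraded for greater than),
# ordered from most to least severe, as listed on the slide.
SERVICE_PERIOD_LEVELS = [
    ("Critical", timedelta(hours=4), timedelta(days=1)),
    ("Major", timedelta(hours=1), timedelta(hours=4)),
    ("Moderate", timedelta(minutes=30), timedelta(hours=1)),
    ("Minor", timedelta(minutes=5), timedelta(minutes=30)),
]


def classify_service_period(unavailable=timedelta(0), degraded=timedelta(0)):
    """Return the first (most severe) level whose threshold is exceeded."""
    for level, unavailable_over, degraded_over in SERVICE_PERIOD_LEVELS:
        if unavailable > unavailable_over or degraded > degraded_over:
            return level
    # Assumption: anything not exceeding the Minor thresholds is Low.
    return "Low"


print(classify_service_period(unavailable=timedelta(minutes=45)))
print(classify_service_period(degraded=timedelta(hours=5)))
```

Checking the levels from most to least severe means an outage that meets several thresholds is reported at the highest one.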
Outage analysis: Service consequence
• Critical - Financial loss, which puts a business unit in a critical position - greater
than R100m or substantial loss of credibility or litigation or prosecution or fatality
or disability.
• Major - Financial loss which severely impacts the profitability of a business unit -
greater than R10m or serious loss of credibility or sanction or impairment
• Moderate - Financial loss which impacts the profitability of the business unit,
greater than R1m or embarrassment or reported to regulator or hospitalization.
• Minor - Financial loss with a visible impact on profitability but no real effect,
greater than R100k or some embarrassment or rule or process breaches or
medical treatment.
• Low - Financial loss with no real effect, less than R50k or irritating or no legal or
regulatory issue or no medical treatment.
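The financial thresholds above can be sketched in the same way. The names below are hypothetical, and only the rand amounts are modelled; credibility, legal and medical criteria would need separate inputs. The slide does not assign a level to losses from R50k up to R100k, so this sketch returns None for that range rather than guess.

```python
# (level, financial loss greater than, in rand), most severe first.
FINANCIAL_LEVELS = [
    ("Critical", 100_000_000),
    ("Major", 10_000_000),
    ("Moderate", 1_000_000),
    ("Minor", 100_000),
]

LOW_BELOW = 50_000  # "Low" is stated as less than R50k


def classify_financial_loss(loss_rand):
    """Return the consequence level for a loss in rand, or None if unassigned."""
    for level, greater_than in FINANCIAL_LEVELS:
        if loss_rand > greater_than:
            return level
    if loss_rand < LOW_BELOW:
        return "Low"
    # R50k to R100k is not covered by the slide; left for human judgement.
    return None


print(classify_financial_loss(2_000_000))
print(classify_financial_loss(75_000))
```

Returning None makes the uncovered range visible so that it can be settled by the people who own the classification scheme.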
Types of incident to use for metrics
Outage                            Usage          Priority
Scrutiny by management            Profit         3
Effect on productivity*           Staff          10
Impact on company’s image         Share price    5
Direct financial impact           Assets         Actual value
Liability or vulnerability        Nominal        1
Limited to internal IT process    Budget         2
An integrated approach
• Problems are latently present in all work environments (not only IT).
• Focus on outlining an integrated approach to working with problems.
• A number of processes and methodologies need to be integrated.
• The approach is mature and analytical, and provides a measured method
for achieving excellence within an organization.
• Major incidents are a special case (caused by underlying problems) –
a detailed methodology is presented for dealing with major incidents
• Not about acquiring skills to solve problems but about assessing impact
VirtualSalt tips for solving problems
• Take time to examine and explore the problem thoroughly before
setting out in search of a solution. Often, to understand the problem
is to solve it.
• Breaking the problem into smaller parts will often make solving it
much easier. Solve each part separately.
• The resources for problem solving are immense and ubiquitous.
• You can always do something.
• A problem is not a punishment; it is an opportunity to increase the
happiness of the world, an opportunity to show how powerful you
really are.
Problem solving tips
• The formulation of a problem determines the range of choices: the
questions you ask determine the answers you receive.
• Be careful not to look for a solution until you understand the
problem, and be careful not to select a solution until you have a
whole range of choices.
• The initial statement of a problem often reflects a preconceived
solution.
• A wide range of choices (ideas, possible solutions) allows you to
choose the best from among many. A choice of one is not a choice.
• People work to implement their own ideas and solutions much more
energetically than they work to implement others' ideas and
solutions.
Problem solving tips
• Remember the critical importance of acceptance in solving problems. A solution
that is technologically brilliant but sociologically stupid is not a good solution.
• When the goal state is clear but the present state is ambiguous, try working
backwards.
• Procrastinators finish last.
• Denying a problem perpetuates it.
• Solve the problem that really exists, not just the symptoms of a problem, not the
problem you already have a solution for, not the problem you wish existed, and not
the problem someone else thinks exists.
• A maker follows a plan; a creator produces a plan.
• Creativity is the construction of somethings new out of somethings old, through
effort and imagination.
lessons learnt
Lessons learnt
• In February 1945, a force of around 70,000 US
Marines invaded Iwo Jima, a significant volcanic
island 840 kilometres south of Tokyo. The island
was defended by over 22,000 Japanese troops, and the
Americans expected it to fall within five
days. Instead the battle lasted more than seven
times longer, with 6,800 US fatalities, 20,000 US
wounded, and the death of 20,700 defenders.
• The significance of this action was that the
Marines resolved to aggregate all the lessons
learned from the costly invasion of Iwo Jima
and to channel these lessons into future
conflicts.
After Action Review (AAR)
• First used after the heavy losses suffered by the Marines in the battle
of Iwo Jima
(watch Iwo Jima video)
• The Marines coined the term After Action Review for their lessons
learnt process
• Similar methodologies are used for other emergencies, such as fires
(watch fire AAR video)
Lessons from Apollo 13
(watch the Apollo 13 video)
• Lessons include
• Technical proficiency
• Teambuilding
• Conflict management
• Decision making and problem solving
• Creativity and innovation
• Effective and efficient communications
What we learn from the major incident timelines
The timelines in the major incident lifecycle need to be analysed
to determine whether they are accurate and optimal. We need
to know, and correct where necessary:
• Was the incident detected quickly and in a suitable time?
• Was the incident diagnosed, and are the diagnosis tools appropriate? If
any checklists were used, do they need to be updated or refined? Would
the checklist assist in quicker diagnosis if it were reordered?
• Was there any delay between diagnosis and the repair being initiated?
Are we dealing with any logistical problems?
• Were there any suitable workarounds, and were they successfully
implemented? Can a new type of workaround be developed for any
similar types of major incidents?
Separately, we need to evaluate and be certain that the
communications during the major incident were smooth. This
includes the correct stakeholders being informed and interactions
between the various tiger teams not being constrained.
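The timeline questions above come down to measuring the gaps between lifecycle stages. The sketch below shows one way to do that. The stage names and the example timestamps are invented for illustration and are not part of the course material.

```python
from datetime import datetime

# Hypothetical stage names, in lifecycle order; rename to match your own records.
STAGES = ["occurred", "detected", "diagnosed", "repair_started", "restored"]


def stage_durations(timeline):
    """Return the elapsed time between each pair of consecutive stages."""
    return {
        f"{earlier} -> {later}": timeline[later] - timeline[earlier]
        for earlier, later in zip(STAGES, STAGES[1:])
    }


def slowest_stage(timeline):
    """Name the transition that took longest, as a starting point for review."""
    durations = stage_durations(timeline)
    return max(durations, key=durations.get)


# Invented timestamps, for illustration only.
example = {
    "occurred": datetime(2024, 1, 10, 8, 0),
    "detected": datetime(2024, 1, 10, 8, 12),
    "diagnosed": datetime(2024, 1, 10, 9, 0),
    "repair_started": datetime(2024, 1, 10, 9, 45),
    "restored": datetime(2024, 1, 10, 11, 0),
}

for step, elapsed in stage_durations(example).items():
    print(f"{step}: {elapsed}")
print("Review first:", slowest_stage(example))
```

A long gap between "diagnosed" and "repair_started", for instance, would point to the logistical delays the third question asks about.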
Lessons learnt
• Some problems have been solved before, and it is a waste of resources to
solve them again from scratch each time.
• In a work environment, different people might work on resolving
different problems at different times.
• If the successful resolutions of problems are pooled into a knowledge
base, then future problems can be dealt with far more
efficiently.
• Thus it is important to not only populate a lessons learnt knowledge
base when a problem is solved but also to reference it when dealing
with a problem to find a potential resolution or even insight into how
to deal with the current problem.
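The populate-and-reference cycle described above can be sketched with a very small in-memory store. This is an assumption-laden illustration: the class, its methods and the keyword matching are hypothetical, and a real knowledge base would normally sit in a service desk or knowledge management tool.

```python
from dataclasses import dataclass, field


@dataclass
class Lesson:
    problem: str
    resolution: str
    keywords: set = field(default_factory=set)


class LessonsLearntBase:
    """Toy knowledge base: record lessons, then look them up by keyword."""

    def __init__(self):
        self._lessons = []

    def record(self, problem, resolution, keywords):
        """Populate the base when a problem has been solved."""
        lesson = Lesson(problem, resolution, {k.lower() for k in keywords})
        self._lessons.append(lesson)
        return lesson

    def search(self, *terms):
        """Reference the base: lessons sharing the most keywords come first."""
        wanted = {t.lower() for t in terms}
        scored = [(len(wanted & lesson.keywords), lesson) for lesson in self._lessons]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [lesson for _, lesson in hits]


# Invented entries, for illustration only.
kb = LessonsLearntBase()
kb.record("Mail queue backed up", "Restart relay and clear stale locks", ["mail", "queue"])
kb.record("Disk full on mail server", "Rotate logs and extend volume", ["mail", "disk"])

for lesson in kb.search("disk", "mail"):
    print(lesson.problem, "=>", lesson.resolution)
```

The search step matters as much as the record step: the base only pays off if it is consulted when a new problem arrives.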

Problem management foundation - Analysing


Editor's Notes

  • #10 Refer to http://www.virtualsalt.com/crebook4.htm
  • #14 The Battle of Iwo Jima was fought between February and March 1945, during World War II. The battle was marked by some of the fiercest fighting of the war and resulted in what is known as an “After Action Review”. This review provided direction for the Marines to change the strategy with which they approached subsequent battles. In essence this is what is known as the lessons learnt phase of crisis management. The experience gained in previous actions, tasks and endeavours must not be lost and needs to be applied. It would be a total lack of due diligence to do otherwise. The primary goal of the Marines was to limit the loss of life and resources. The bruising encounter and disproportionate loss of life have often been put forward as the reason that the United States opted to use destructive nuclear weapons during the closing stages of the war. Information Technology (IT) can benefit from an “After Action Review”, and an appropriate application is the analysis phase of a major incident, an incident with severe negative business consequences.
  • #16 The most famous example of problem solving, which was also a great movie, is the flight of Apollo 13. As a result we became familiar with the term "Failure is not an option," and "Houston, we have a problem" is a common colloquialism. The technicians and astronauts in Ron Howard's epic provide rational leadership during a crisis. In essence the movie is also the perfect example of tiger teams. Gene Kranz (Ed Harris), in charge of flight operations in Houston, and Jim Lovell (Tom Hanks), commander of the lunar mission, are required to use their leadership skills when an explosion occurs on the Apollo 13 craft. Through teamwork, ingenuity, and rational process these leaders solve some near-impossible problems and handle a major incident in a manner which became known as a “successful failure!” The techniques used in the “successful failure” are attributed to Kepner-Tregoe (KT). In the context of the flight, Kranz and Lovell maintain control in a chaotic situation, which inspires confidence among the crew of the spacecraft and Mission Control. Apollo 13 shows that although leaders desire loyalty and passion, it is important to secure their group's confidence first. Below are some of the lessons provided by the flight of Apollo 13:
    Technical proficiency. When the explosion occurs, at the moment of disaster, Kranz is dependent on the skills and expertise of his available technical resources. Lovell needs to initiate actions in the disabled spacecraft and balance this with his other responsibilities, to ensure not only that the immediate requirements to stay alive are met but also that his crew is rescued.
    Teambuilding. After three hours of practicing the docking procedure, Mattingly wants to continue and Lovell agrees. Kranz calls his team into a war room and motivates them to achieve the goal of returning the Apollo 13 spacecraft safely to Earth.
    Conflict management. There is less conflict when teams are busy and focused than when they have time on their hands. As an example, Kranz and Lovell resolve the conflict with the medical team by allowing the sensors to be disconnected.
    Decision making and problem solving. Right after the explosion Kranz asks Mission Control "What do we have on the Space Craft that’s good?" The focus was now on recovery, and concentrating on what caused the problem was not beneficial at that stage. The priority was to rescue the Apollo 13 crew, not to determine immediately why the spacecraft failed.
    Creativity and innovation. Lovell stated, "Thousands of people worked to bring the three of us back home." These people displayed innovation in the solutions provided to rescue the astronauts.
    Effective and efficient communications. While there appears to be complete havoc in Mission Control, Kranz asks his team to "Work the Problem." He then listens to the experts report in on their areas of the mission. He uses effective and efficient communications to make certain that the disabled spacecraft is able to return to Earth.
    The above is a generic list of the aspects of process that need to be addressed during any major incident, and not only the “successful failure” of Apollo 13. The exact actions required would need to be documented in an “After Action Review”, as the Marines did during the latter stages of World War II. Preferably some of these issues would be addressed before a debilitating major incident occurs, as a swift resolution corresponds to an optimal return to service. Often, a simulation needs to be conducted as part of training and building a suitable team to address and handle any future major incidents. Such simulations were used extensively before and during the Apollo programme to assist in the resolution of major incidents. The lessons learnt phase would be conducted by the alpha team, as part of the tiger team hierarchy.
    The lessons learnt phase needs to follow some strict guidelines. Firstly, the timelines in the major incident lifecycle need to be analysed to determine whether they are accurate and optimal. We need to know, and correct where necessary: Was the incident detected quickly and in a suitable time? Was the incident diagnosed, and are the diagnosis tools appropriate? If any checklists were used, do they need to be updated or refined? Would the checklist assist in quicker diagnosis if it were reordered? Was there any delay between diagnosis and the repair being initiated? Are we dealing with any logistical problems? Were there any suitable workarounds, and were they successfully implemented? Can a new type of workaround be developed for any similar types of major incidents?
    Separately, we need to evaluate and be certain that the communications during the major incident were smooth. This includes the correct stakeholders being informed and interactions between the various tiger teams not being constrained.
    Finally, an appropriate risk assessment needs to be conducted. Threats that have been rated need suitable countermeasures, and stakeholders need to accept these mitigations or the lack thereof.
    The lessons learnt phase of the major incident process is an intrinsic part of continuous improvement in IT. The techniques used in continuous improvement are formulated in the Deming wheel or cycle. The concepts and application of lessons learnt fit into the philosophy of making continuous incremental improvement to the services of IT. The lessons learnt are continuously applied to ensure that the services provided develop to be more mature.