IT Crisis and Problem Management
(Technical Escalation and Action Planning)
Even with the growing number of published standards for the management and delivery of IT
services (ITIL, COBIT, et al.), we still often discover that formal IT process control disciplines are not as
mature or robust an IT management science as we would desire. We are learning more and more about
“what needs to be done” but have yet to fully develop well-defined, prescriptive measures to make it
happen effectively.
With ever-increasing economic pressures on business, and the resulting pressure on IT to provide
higher levels of service at reduced cost, some IT organizations are increasing the level of technical risk
within their operations through both conscious and unconscious decisions made within IT
management.
The following are examples of decisions which increase risk for IT and the business:
• Implementation of changes (hardware, software, firmware) directly into a production environment
without prior testing on a development/test configuration
• Multiple single points of failure in hardware configurations
• Reduced or limited service coverage from supporting vendors without committed response and
repair times
• Limited windows of time for implementation of change which then may require multiple changes
to be made in a short timeframe (If a problem occurs, which change caused the failure?)
• Complex multi-vendor solutions which bring together multiple software and hardware products
that have not been certified to operate together, creating the potential for multiple
interoperability issues, etc.
• The outsourcing of specific functions within IT, such as development and maintenance, can
present additional challenges from both a “time to resolution” and a communications perspective.
Outsourcing often brings with it a set of performance standards that differ from those of the IT
organization.
Significant technological advances in underlying hardware platforms are out-pacing the ability of software
providers to keep up with the innovation and develop management tools to control these environments.
This dynamic has led to increased system complexity and a gap in the technical knowledge that IT service
providers need to support those environments effectively.
Those involved in IT crisis (problem) management as a primary role tend to see more clearly the places
where the lack of sound process control exists. Problems occur and re-occur within a technology
environment at an alarming rate, often having significant impact on the ability to produce or improve the
overall efficacy of the environment. A lack of sound problem management strategies can exacerbate
current operational failures of information technology systems, increase their frequency and/or be the
direct cause of future failures. Within IT, it has become more of a matter of “when” your next significant,
critical problem will occur, not “if” one will occur.
An effective discipline around IT crisis management can give business managers control of their IT
service priorities by providing the procedures for quick restoration of impacted services, safeguarding
existing services, and creating a basis for learning from failures so that future failures can be reduced.
Well-defined problem and crisis management procedures will increase overall
service availability and IT efficiency by reducing the number of unnecessary escalations (crises) within the
environment.
Historically, IT professionals have focused on technology as the primary influencer on the overall reliability
within IT. That said, the implementation of effective technical crisis processes will certainly
NOT provide a “silver bullet” for remedial problems and technical crises. While there must be an
underlying set of standards upon which IT services are provisioned, as with the processes defined
through frameworks such as ITIL, effective problem and crisis processes can provide a basis for us to
understand the advent of a problem and help feed the problem management phase of any remedial
action.
According to ITIL
It should be noted that use herein of the terms “problem” and “crisis” management does not strictly adhere
to the definitions as laid out in ITIL. The terms “problem management” and “crisis management” are used
interchangeably, and intentionally so.
In strict ITIL terms, what is being referenced could be more clearly defined as an incident that has
significant impact and visibility within the business. Crisis and problem management are being referenced
in terms more readily used throughout the IT industry.
In summary, and without a full discussion of ITIL service lifecycle management, the ITIL theory of problem
resolution begins with the “Incident,” an unplanned interruption or performance reduction to an IT Service.
The service request is made, and when a cause of one or more incidents is identified, a “Problem” record
is created and made available to Service Desk personnel. Once a root cause or effective workaround
has been identified, it is then acknowledged as a “known error.”
In version three of ITIL, there are additional provisions made for “Continual Service Improvement” that
recommend ongoing trend analysis for all services and IT processes and identify areas for improvement.
Real World Experience
Practical experience shows that finding a root cause can take significant time.
Efforts to restore operation can directly conflict with data gathering and problem analysis.
An effective workaround becomes victim to root cause analysis. SLA targets for incident management
are used to drive quicker problem resolution with limited resources available to do both incident
management and effective problem resolution.
Technical IT resources often focus on any specific problem from a myopic perspective, either focusing on
a small sub-set of the environment that is too small to effectively remediate the broader issues or not fully
understanding the true impact to the business. They can focus on priorities that do not match those of the
business based upon impact or risk.
There can be situations when IT resources continue to solve issues repeatedly without properly
documenting the resolution or determining the effective root cause of the issue. Instead of removing the
underlying problem and reducing the number of overall incidents, one is driven to quicker problem
resolution and moving on to the next critical problem.
Newer, increasingly more complex technologies are being seen in increasingly smaller footprints. IT
environments are becoming more complex while skilled resources to manage those environments
become less and less available due to the breadth of knowledge required to manage those environments.
The speed with which technology changes also presents challenges in maintaining sufficient qualified
resources with the requisite knowledge and skill set.
Service requirements are becoming more demanding with increased interoperability issues between
different vendor products as vendors are under increasing pressure to ship products that may not yet be
ready for “prime time” operation. In a “perfect world,” effective use of people, process and technology
within IT would limit risk to the business.
The processes and strategies required to avoid these problems are generally understood. There are
substantial sources of information currently available that speak to these strategies and techniques. What
is lacking is a clear path to take when significant problems do occur within an IT environment.
What is laid out herein is a framework that defines clear and effective communication techniques in
dealing with the business, a strategy for sound troubleshooting techniques and action planning, a direction
to share knowledge and the effective use of the resources available in support of efforts toward problem
remediation.
Specifics – a How To …
This process must begin with IT involving the business in the incident. Once IT is made aware of the
issue, in most cases through the logging of an incident through a help desk, one must quickly determine
the business’s level of urgency and the impact the problem is causing.
First, listen to what is being said by the customer. This may sound like an intuitive first step, but the reality
is that technical people will often head off to “solve the problem” before there is a clear and full
understanding of just what is the actual problem. Having a clear understanding of the initial problem and
what will constitute a successful remediation is often half of the battle in trying to solve technical incidents.
One must obtain a complete picture of the issues to be solved. Always take the time to get a clear
and complete problem description. If the problem changes during the crisis, we must make sure we take
time to re-evaluate our initial problem statement.
Make sure we understand any timing issues. When does the problem happen? How often and under
what conditions? Have there been any recent changes to accounts or applications impacted by the
failure? You need to test any assumptions you make with reflective dialog. “To make sure I am clear, you
are telling me …”
Take the time to learn what has been tried before for any repeat problems, whether by the business, other IT
professionals or other vendor services that may have been involved earlier in the incident before the
crisis became evident. It may be beneficial to examine all changes made in the environment preceding
the crisis even though the changes may seem unrelated to the failure.
Effective communication with the users throughout the entire incident lifecycle is essential. For example,
changes that the business makes to system loading or to the applications being run can significantly
affect the problem(s) being seen and IT’s efforts to remediate the situation. While efforts are made to
understand the problem, other resources may be working at cross purposes in an effort to limit the impact
to overall production. Without an understanding of what changes are being made across the environment,
remediation efforts become that much more difficult.
Human factors in any problem need to be considered. Has human error caused or exacerbated the
advent of any problem? Have IT professionals been working this problem with inadequate food or rest
periods? Has the overall workload on staff contributed to the crisis? Has the technology advanced
beyond the expertise of those working on the equipment? If outsourcers or contractors are in use within
the IT environment, are there any cultural barriers that are contributing to ineffective communications?
As for the equipment, has it ever performed to expectations; and if so, for how long? Is this a new issue
that has never been seen, a repeat problem or possibly a situation where the equipment had never been
set up properly originally? Are there interoperability issues at play here and have all the component
pieces in the configuration been tested and qualified by the original vendors? Has system or application
performance (capacity) ever had a baseline established? Has additional loading been placed on the
system since establishment of the baseline?
Along with a problem description, make sure you have had some discussions around what constitutes a
solution. For example, if the business has experienced intermittent service interruptions every three days,
how many days of error-free running are required in order to consider the issue resolved? As a basic rule
of thumb, an error-free run of three times the original interval between failures (nine days in this example)
should be expected in order to have some level of confidence that the original problem has been resolved.
Without properly defining what the solution looks like, there may be disagreement between IT and the
business on when the situation can officially be considered resolved.
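The rule of thumb above reduces to simple arithmetic. The sketch below is illustrative only: it reads “three times the original failure rate” as three times the interval between failures, and the function name `required_error_free_days` and its default multiplier are assumptions made for this example, not a formal standard.

```python
def required_error_free_days(failure_interval_days, multiplier=3.0):
    """Return the error-free run to agree on before declaring the issue resolved.

    failure_interval_days: typical number of days between failures before the fix.
    multiplier: rule-of-thumb factor (three, per the guideline in the text).
    """
    if failure_interval_days <= 0:
        raise ValueError("failure interval must be positive")
    return failure_interval_days * multiplier


# Example from the text: interruptions roughly every three days.
print(required_error_free_days(3))  # 9.0
```

Whatever figure is used, the point is to agree on it with the business before remediation begins, so that it can serve as the closure criterion later.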
After a clear problem description has been established, a solution defined and all of the obvious issues
have been addressed, make sure to move beyond the obvious. Check for any recent changes to
hardware or software, or unusual situations that may be in close proximity to the advent of the problem.
Over 80% of all IT failures occur around some change in the environment. One of the most effective
questions that can be asked is, “What has changed recently?” One can focus so closely on being able to
define a root cause to any incident that effort to define effective workarounds as quickly as possible goes
unaddressed.
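Where change records are available in some structured form, the question “What has changed recently?” can be partly answered by filtering them against the start of the incident. The following is a minimal sketch under stated assumptions: the record layout, the sample entries, and the function name `changes_before` are all hypothetical and do not refer to any particular change management tool.

```python
from datetime import datetime, timedelta

# Hypothetical change records: (timestamp, description).
changes = [
    (datetime(2024, 3, 1, 9, 0), "Firmware update on storage array"),
    (datetime(2024, 3, 9, 22, 30), "Operating system patch on database host"),
    (datetime(2024, 3, 10, 1, 15), "New application release deployed"),
]


def changes_before(incident_start, records, window=timedelta(days=7)):
    """Return changes made within `window` before the incident, newest first."""
    earliest = incident_start - window
    recent = [r for r in records if earliest <= r[0] <= incident_start]
    return sorted(recent, key=lambda r: r[0], reverse=True)


incident_start = datetime(2024, 3, 10, 3, 0)
for when, what in changes_before(incident_start, changes, timedelta(days=2)):
    print(when.isoformat(), what)
```

Listing the newest changes first puts the most likely suspects at the top, but older or seemingly unrelated changes should still be reviewed by widening the window.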
Seek multiple perspectives in your remediation efforts and ensure that everyone has a voice at the table or,
more likely, on the conference call. Get input from the business, database administrators, system administrators,
product (hardware and software) vendors, end users, etc. You may want to do this in different forums and
then consolidate notes between the discussions.
Note that it is advisable to separate technical conversations from the necessary business management
updates that may be required. Keeping key stakeholders informed while not impeding an open flow of
technical dialog is critical. Management admonishments or assertions during a crisis event can inhibit
technical brainstorming and an open technical dialog.
Make sure to look for patterns, as might be found in a review of repeat service incidents. Seek to discover
inter-relationships that may be involved, which should help to set the direction toward determining root
cause(s).
It is also critical that all interested parties are kept aware of progress during the incident; otherwise
technical resources that should be focused on remediation will be pulled away into
conversations with non-technical stakeholders who are struggling to keep aware of progress, especially as
higher levels of business management become aware. We must have a well-defined, predetermined
process for the communication that will take place during any remediation effort or crisis situation. Assign a
liaison to provide technical updates to the management team, as well as bring back any management
concerns to the technical team.
In a situation where multiple incidents are open, keep focus on the highest priority incidents, even as
additional issues or incidents are raised. The newest incidents are not necessarily the highest priority ones. We
need to clearly understand that the business’s priorities may change if the frequency or the impact of the
problem changes, or if other issues take priority. We must always be willing to re-evaluate functional
priorities on a continual basis throughout the crisis situation.
There is a common tendency on the part of technical resources to work on incidents that are the most
technically complex, not necessarily focusing on the issues in priority order, defined as those incidents
that have the greatest impact and urgency to the user community and the business.
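Working in priority order, as described above, can be sketched as a sort on impact and urgency that deliberately ignores technical complexity. The `Incident` class, the 1-to-3 scales, and the sample identifiers below are assumptions made for illustration; a real service desk tool will have its own priority scheme.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    ident: str
    impact: int      # 1 = highest business impact, 3 = lowest (assumed scale)
    urgency: int     # 1 = most urgent, 3 = least urgent (assumed scale)
    complexity: int  # technical complexity; deliberately ignored when ranking


def work_order(incidents):
    """Order incidents by business impact, then urgency; complexity plays no part."""
    return sorted(incidents, key=lambda i: (i.impact, i.urgency))


open_incidents = [
    Incident("INC-3", impact=3, urgency=2, complexity=9),
    Incident("INC-1", impact=1, urgency=1, complexity=2),
    Incident("INC-2", impact=1, urgency=2, complexity=5),
]
print([i.ident for i in work_order(open_incidents)])  # ['INC-1', 'INC-2', 'INC-3']
```

Here the most technically complex incident (INC-3) lands last because its business impact is lowest. Re-running the sort whenever the business revises impact or urgency keeps the order current.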
Quickly identify effective workarounds for any problems with significant impact to the business. Once (or
if) a workaround is in place or a solution is found, root cause analysis work can begin. Before this work
proceeds, the question of whether finding a root cause is necessary should be raised. Root cause
analysis can take considerable time and resources. A search for root causes can be counterproductive
and hamper work that should take priority. Don’t be afraid to ask questions related to the
need for root cause identification. Do we really need to define the root cause for this incident considering
the costs required to do so? What is the likelihood that this problem will reoccur? Is the potential impact
to the business worth the knowledge that might be gained in the analysis? Is there potential that our
reputation with the business can be damaged if we fail to find a root cause?
These questions reflect a need for effective risk management within the IT environment. A decision may
be made to continue with root cause analysis, but it should be made as a conscious decision and not an
automatic response to crisis.
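The questions above can be framed, very roughly, as a comparison of expected cost. The sketch below is an illustrative assumption, not a method prescribed by ITIL or by this paper: the function name `rca_worthwhile` and its three inputs are hypothetical, and it deliberately leaves out factors such as reputation with the business that are hard to put a number on.

```python
def rca_worthwhile(recurrence_probability, impact_cost, rca_cost):
    """Return True when the expected cost of a recurrence exceeds the cost of the analysis.

    recurrence_probability: estimated chance (0 to 1) that the problem will reoccur.
    impact_cost: estimated cost to the business if it does reoccur.
    rca_cost: estimated cost of the root cause analysis effort.
    """
    if not 0.0 <= recurrence_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    expected_recurrence_cost = recurrence_probability * impact_cost
    return expected_recurrence_cost > rca_cost


print(rca_worthwhile(0.5, 100_000, 20_000))   # True
print(rca_worthwhile(0.05, 100_000, 20_000))  # False
```

The estimates will always be rough. The value of writing them down is that the decision to pursue or skip root cause analysis becomes explicit rather than automatic.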
Technical Action Planning – in summation
In its simplest definition, the process of technical action planning for any incident is merely a matter of
thinking through how best to attain desired goals given the available resources and time.
Effective incident remediation efforts should include the following:
• Identify triggers and workarounds as quickly as is possible.
• Reduce the impact of incidents to the business.
• Continue to work to reduce the time to resolve incidents, or at least the impact to the end-users.
• Only then should you start thinking about root cause and reducing the number of incidents.
Practical Do’s and Don’ts
Years of escalation experience have produced some best practice guidelines. These should be taken as
intended: as general guidelines for best practice in any remediation effort. The specifics of any incident
may dictate a different course of action:
• Although it may be tempting in a suspected hardware failure to pull parts from a functioning
machine to use in a failed machine in an effort to isolate the specific failure, it often can become
counterproductive and cause more problems. Don’t pull parts from a working machine to
troubleshoot a failed one.
• When presented with multiple systems where one or a smaller number of systems are not failing,
it can be an effective troubleshooting technique to compare the configuration of a failed machine
to a non-failing machine to look for differences in firmware levels, applications being run,
differences in amount of memory available, etc.
• Balance the desire to “shot-gun” a problem (replacing multiple parts at the same time) as a
troubleshooting strategy against the limited time that may be available. If symptoms change
following replacement of multiple parts, one would have no idea whether there was an impact to the
original problem or whether a DOA part was put into the mix. Note, also, that there may come a
point in the remediation process where the cost of the effort to troubleshoot a problem is higher
than the overall replacement cost of that option.
• When changing out hardware, use electro-static discharge precautions when directly handling
equipment, such as static mats and wrist straps. Static electricity can have a
detrimental effect on the smaller and smaller technical footprints with which we work, yet such
precautions seem to be used less and less frequently.
• Label any original parts so that they do not get confused with replacements. Original parts can
quickly become interchanged with replacement parts and the effort of remediation will have
become extremely complex at that point. When a part is replaced and it does not fix the original
problem, restore the original part in the option/system.
• Communicate effectively with all stakeholders during a crisis. Allowing non-technical
resources to drive the technical action planning process, however, is a clear recipe for disaster. Keep
them aware of progress, meet your commitments to the business, and let the feedback they provide
inform your actions. Don’t allow the end-user or business management to drive the technical
action planning process.
• Maintain trust during the crisis period. Keeping people informed of your actions and fulfilling any
commitments will support a general understanding of the progress being made throughout the
crisis.
• Take time to define the “real” problem. Over 50% of the effort required to solve any problem is
properly defining it in the first place.
• Upgrading product firmware during an outage without first understanding how this will impact the
remediation effort can present additional problems. Making random changes within an IT
environment during a failure event can oftentimes exacerbate the original condition.
• Use a thoughtful approach to troubleshooting. During the stress of any significant incident, it can
become very easy to get distracted by unrelated or non-critical tasks. The best way to avoid that
possibility is to have a defined and thoughtful approach that is understood by all involved.
Document a step-by-step process flow, if at all possible. This will aid in any post mortem effort
that may follow.
• Document existing configurations and operational settings before beginning any invasive
troubleshooting or reconfiguration. During an extended crisis, it can be difficult to return a system
to an original configuration if not properly documented beforehand.
• Document the sequence of events throughout your efforts. Difficult issues can span several days
and multiple resources. If the steps taken at the advent of the incident are not well documented,
you may find that the same tasks are performed over and over again with the same results, not
leading to technical resolution, or that the most obvious actions are assumed to have already
taken place when they have not.
• Understand what has changed recently in the system, in loading or within the facility. A significant
number of critical incidents occur as a result of changes in an IT environment. Make sure you
ask at various levels what changes have been made recently that coincide with the advent of the
specific incident. Has anyone updated the operating system, updated firmware or added a new
application?
• Consider a troubleshooting approach that involves minimizing the original configuration and
building back, especially in highly complex environments. Determine the minimum
configuration needed to run, start at that level and eliminate all else. If that minimized configuration
continues to fail, you have less to troubleshoot. If it does not fail, then slowly build back to the
original configuration.
• Ask for help from others when needed throughout the event. Asking for help does not equate to
failure. A desire to know everything and to be the sole source of information is a
dynamic sometimes seen with technical resources. Asking for assistance must be viewed as a sign of
strength, not weakness. With the complexity of today’s technology, no one person can
have all the information needed to solve a problem. Asking for help also affords others the chance to
learn experientially throughout the remediation effort.
• Check for the obvious and then move beyond it. One can too easily get hung up on reviewing
what is believed to be the “obvious” solution to any problem, but if the issue were that obvious,
would it have evolved to the urgency now being experienced? The reverse can also be true, that
the obvious solution should not be overlooked for the sake of the more intricate solution.
• Verify full operation once you think you have fixed any incident. It can hurt the relationship
between IT and the business if success is declared prematurely. Remember to evaluate any
solution against the original description of success as defined early on in any action planning
process.
• Take the learnings from any crisis to help avoid the same crisis at a later time and date.
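One of the guidelines above, comparing the configuration of a failed machine with that of a non-failing machine, lends itself to a small script. The sketch below is a minimal illustration, assuming each machine’s settings have already been collected into a simple dictionary; the function name `config_differences` and the sample settings are hypothetical.

```python
def config_differences(working, failing):
    """Return {setting: (working_value, failing_value)} for every setting that differs."""
    diffs = {}
    for key in sorted(set(working) | set(failing)):
        w, f = working.get(key), failing.get(key)
        if w != f:
            diffs[key] = (w, f)
    return diffs


# Hypothetical inventory data for two otherwise similar machines.
working = {"firmware": "2.4.1", "memory_gb": 64, "os_patch": "2024-02"}
failing = {"firmware": "2.3.9", "memory_gb": 64, "os_patch": "2024-02", "agent": "backup-v7"}

for setting, (good, bad) in config_differences(working, failing).items():
    print(f"{setting}: working={good} failing={bad}")
```

A setting present on only one machine shows up with `None` on the other side, which is often as informative as a version mismatch. Differences found this way are leads to investigate, not proof of cause.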
Even with the increased number of published standards related to the management and delivery of IT
services, the overall maturity of Information Technology and our ability to manage IT environments
continue to evolve.
Formalized IT process control disciplines are not as mature or robust as is required throughout the
industry. The IT industry is learning more and more about “what needs to be done” but has yet to fully
develop well-defined prescriptive measures to make it happen effectively. Hopefully this information has
added to the overall body of knowledge in support of IT crisis management and problem solving.
Technical Action Planning Template
Business Impact Summary
• Provide a brief summary of the key stakeholders, their business and the technical
impact of the problem.
Problem Statement
Detailed problem description – use as much detail as required to clearly communicate the
nature of the problem from a technical perspective, including any specific business
considerations that the technical team needs to understand through the effort.
Closure Criteria
• Define closure criteria: “What constitutes success for this effort?”
Problem Management Team (Who is involved in this effort?)
– Problem Manager – key customer communication interface; focal point for all
communication – technical and business related
– Technical Escalation Resource – understands the technical issues and impact of
the problem and helps to translate technical information between the business and
the technical team
– Technical Escalation Manager – coordinates additional technical resources as
required by the plan, including resources from vendors as may be required
– Business contact focal – primary communication link between key business
stakeholders and the technical remediation efforts, including all key players at the
customer site or service partners
Role | Name | Title | Email | Telephone | Cell
Problem Manager | | | | |
Technical Escalation Resource | | | | |
Technical Escalation Manager | | | | |
Business Communication Contact Focal | | | | |
Logistics (parts), Resource or Tool Requirements:
• This section would be used to define any resources currently not available and
begin the process to source anything that may be required.
Resources needed | Description | Source | Comments / Status
1 | | |
2 | | |
3 | | |
Action Plan
Action Summary
Action Number | Description | Owner(s) & Date Due | Comments / Status
1 | | |
2 | | |
3 | | |
4 | | |
SPECIFIC ACTIONS (Detailed action and owner)
• Action one
– Schedule of events – when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
• Action Two
– Schedule of events – when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
• Action Three… repeated as required.
– Schedule of events – when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
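For teams that prefer to track the plan in a script rather than a document, the action items above could be held in a small data structure. This is only a sketch: the `Action` class, its field names, and the `overdue` helper are assumptions loosely mirroring the template headings, not part of any existing tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Action:
    number: int
    description: str
    owners: List[str]
    due: date
    milestones: List[str] = field(default_factory=list)
    expected_results: str = ""
    status: str = "open"


def overdue(actions, today):
    """Return open actions whose due date has passed, earliest due first."""
    late = [a for a in actions if a.status == "open" and a.due < today]
    return sorted(late, key=lambda a: a.due)


# Hypothetical plan entries.
plan = [
    Action(1, "Replace suspect controller", ["hardware team"], date(2024, 3, 11)),
    Action(2, "Collect system logs for vendor", ["sysadmin"], date(2024, 3, 12), status="done"),
    Action(3, "Confirm closure criteria with business", ["problem manager"], date(2024, 3, 15)),
]
print([a.number for a in overdue(plan, date(2024, 3, 13))])  # [1]
```

A list of overdue open actions is a natural input to the management escalation and communication schedule items in the template.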
Contingency Planning:
“What if” planning – location for contingency planning if initial action plan outlined does not
accomplish effective remediation. For example, if action plan does not result in desired
hardware fix, product replacement would take place. This would also be the point to
discuss the transfer of operations to any predefined disaster recovery site, if available.
Technical Contact List:
Monitoring criteria and timing – Once it is believed that a solution has been provided, how
long is it agreed that the situation will be monitored prior to incident closure and declaration
of success? (Note: This should mirror what was originally defined at the advent of the
incident in Closure Criteria.)
Technical Planning Contact List:
Name | Title | Email | Telephone | Cell
Communications Template
Problem Statement
• Clearly and specifically state the problem for any professional audience to read and
understand.
Business Impact
• Take the time to define details of the operational impact and financial impact of any
problems, if possible.
Closure Criteria
• Define closure criteria: “What constitutes success for this effort?”
New This Update:
• This section would be used on a rotating basis to understand NEW information for
the effort or crisis. It provides a section for those that are familiar with the basics of
the issue to
Action Plan Summary:
Action Number | Description | Owner(s) & Date Due | Comments / Status
1 | | |
2 | | |
3 | | |
4 | | |
Stakeholder Contact List:
Name | Title | Email | Telephone | Cell
Technical Contact List:
Name | Title | Email | Telephone | Cell
Other Contacts:
Name | Title | Email | Telephone | Cell
For more information
http://h20219.www2.hp.com/services/cache/457080-0-0-225-121.html
Author:
Chuck Boutcher, PMP, CISSP, ITIL Expert
Americas Escalation Team
HP Services
charles.boutcher@hp.com
Contributors:
• Mr. Gary Blew
Americas Escalation Team
HP Services
• Mr. Mark Hastings
Americas Escalation Team
HP Services
• Mr. Dan Phalen, PMP
• Professor Jacques Sauvé
Systems and Computing Department
Federal University of Campina Grande
Campina Grande, PB, Brazil
• Mr. Tony Vohsemer
Canadian Escalation Manager
HP Services, Canada