IT Crisis and Problem Management
(Technical Escalation and Action Planning)
Despite the advent of more and more published standards related to the management and delivery of IT
services (ITIL, COBIT, et al.), we still often discover that formal IT process control disciplines are not as
mature or robust an IT management science as we would desire. We are learning more and more about
“what needs to be done” but have yet to fully develop well-defined, prescriptive measures to make it
happen effectively.
With ever-increasing economic pressures on business and the subsequent pressures on IT to provide
higher levels of service at reduced cost, some IT organizations are increasing the level of technical risk
within their operations, through both conscious and unconscious decisions made within IT
management.
The following are examples of decisions which increase risk for IT and the business:
• Implementation of changes (hardware, software, firmware) directly into a production environment
without prior testing on a development/test configuration
• Multiple single points of failure in hardware configurations
• Reduced or limited service coverage from supporting vendors without committed response and
repair times
• Limited windows of time for implementation of change which then may require multiple changes
to be made in a short timeframe (If a problem occurs, which change caused the failure?)
• Complex multi-vendor solutions which bring together multiple software and hardware products
that have not been certified to operate together effectively causing the potential for multiple
interoperability issues, etc.
• The outsourcing of specific functions within IT, such as development and maintenance, can
present additional challenges from both a “time to resolution” and a communications perspective.
Outsourcing often brings with it a set of performance standards which differ from those of the IT
organization.
Significant technological advances in underlying hardware platforms are out-pacing the ability of software
providers to keep up with the innovation and develop management tools to control these environments.
This dynamic has led to increased system complexity and a gap in the technical knowledge IT service
providers need to effectively support those environments.
Those involved in IT crisis (problem) management as a primary role tend to see more clearly the places
where the lack of sound process control exists. Problems occur and re-occur within a technology
environment at an alarming rate, often having significant impact on the ability to produce or improve the
overall efficacy of the environment. A lack of sound problem management strategies can exacerbate
current operational failures of information technology systems, increase their frequency and/or be the
direct cause of future failures. Within IT, it has become more of a matter of “when” your next significant,
critical problem will occur, not “if” one will occur.
An effective discipline around IT crisis management can give business managers control of their IT
service priorities by providing the procedures for quick restoration of impacted services, safeguarding
existing services, and creating a basis for learning from failures to reduce the likelihood of future failures
before they can occur. Well-defined problem and crisis management procedures will increase overall
service availability and IT efficiency by reducing the number of unnecessary escalations (crises) within the
environment.
Historically, IT professionals have focused on technology as the primary influencer on the overall reliability
within IT. With that being said, the implementation of effective technical crisis processes will certainly
NOT provide a “silver bullet” for remedial problems and technical crises. While there must be an
underlying set of standards upon which IT services are provisioned, as with the processes defined
through frameworks such as ITIL, effective problem and crisis processes can provide a basis for us to
understand the advent of a problem and help feed the problem management phase of any remedial
action.
According to ITIL
It should be noted that the use herein of the terms “problem” and “crisis” management does not strictly adhere
to the definitions as laid out in ITIL. The terms problem management and crisis management are used
interchangeably, and intentionally so.
In strict ITIL terms, what is being referenced could be more clearly defined as an incident that has
significant impact and visibility within the business. Crisis and problem management are being referenced
in terms more readily used throughout the IT industry.
In summary, and without a full discussion of ITIL service lifecycle management, the ITIL theory of problem
resolution begins with the “Incident,” an unplanned interruption or performance reduction to an IT Service.
The service request is made, and when a cause of one or more incidents is identified, a “Problem” record
is created and made available to Service Desk personnel. Once a root cause or effective workaround
has been identified, it is then acknowledged as a “known error.”
In version three of ITIL, there are additional provisions made for “Continual Service Improvement” that
recommend ongoing trend analysis for all services and IT processes and identify areas for improvement.
Real World Experience
Practical experience demonstrates that finding a root cause can take significant time.
Efforts to restore operation can directly conflict with data gathering and problem analysis.
An effective workaround can fall victim to root cause analysis. SLA targets for incident management
are used to drive quicker problem resolution, with limited resources available to do both incident
management and effective problem resolution.
Technical IT resources often focus on any specific problem from a myopic perspective, either focusing on
a small sub-set of the environment that is too small to effectively remediate the broader issues or not fully
understanding the true impact to the business. They can focus on priorities that do not match those of the
business based upon impact or risk.
There can be situations when IT resources continue to solve issues repeatedly without properly
documenting the resolution or determining the effective root cause of the issue. Instead of removing the
underlying problem and reducing the number of overall incidents, one is driven to quicker problem
resolution and moving on to the next critical problem.
Newer, increasingly more complex technologies are being seen in increasingly smaller footprints. IT
environments are becoming more complex while skilled resources to manage those environments
become less and less available due to the breadth of knowledge required to manage those environments.
The speed with which technology changes also presents challenges in maintaining sufficient qualified
resources with the requisite knowledge and skill set.
Service requirements are becoming more demanding with increased interoperability issues between
different vendor products as vendors are under increasing pressure to ship products that may not yet be
ready for “prime time” operation. In a “perfect world,” effective use of people, process and technology
within IT would limit risk to the business.
The process and strategies required to avoid these problems are generally understood. There are
substantial sources of information currently available that speak to these strategies and techniques. What
is lacking is a clear path to take when significant problems do occur within an IT environment.
What is laid out herein is a framework that defines clear and effective communication techniques in
dealing with the business, a strategy for sound troubleshooting techniques and action planning, a direction
to share knowledge and the effective use of the resources available in support of efforts toward problem
remediation.
Specifics – a How To …
This process must begin with IT involving the business in the incident. Once IT is made aware of the
issue, in most cases through the logging of an incident through a help desk, one must quickly determine
the business’s level of urgency and the impact the problem is causing.
First, listen to what is being said by the customer. This may sound like an intuitive first step, but the reality
is that technical people will often head off to “solve the problem” before there is a clear and full
understanding of just what is the actual problem. Having a clear understanding of the initial problem and
what will constitute a successful remediation is often half of the battle in trying to solve technical incidents.
One must obtain a complete picture of the issues to be solved. Always take the time to get a clear
and complete problem description. If the problem changes during the crisis, we must make sure we take
time to re-evaluate our initial problem statement.
Make sure we understand any timing issues. When does the problem happen? How often and under
what conditions? Have there been any recent changes to accounts or applications impacted by the
failure? You need to test any assumptions you make with reflective dialog. “To make sure I am clear, you
are telling me …”
Take the time to learn what has been tried before for any repeat problems, either by the business, other IT
professionals or other vendor services which may have been involved earlier in the incident before the
crisis became evident. It may be beneficial to examine all changes made in the environment preceding
the crisis even though the changes may seem unrelated to the failure.
Effective communication with the users throughout the entire incident lifecycle is essential. For example,
changes that the business makes to system loading or to the applications being run can significantly
impact the problem(s) being seen and IT efforts to remediate the situation. While efforts are made to
understand the problem, other resources may be working at cross purposes in an effort to limit the impact
to overall production. Not understanding what changes are being made across the environment makes
remediation efforts that much more problematic.
Human factors in any problem need to be considered. Has human error caused or exacerbated the
advent of any problem? Have IT professionals been working this problem with inadequate food or rest
periods? Has the overall workload on staff contributed to the crisis? Has the technology advanced
beyond the expertise of those working on the equipment? If outsourcers or contractors are in use within
the IT environment, are there any cultural barriers that are contributing to ineffective communications?
As for the equipment, has it ever performed to expectations; and if so, for how long? Is this a new issue
that has never been seen, a repeat problem or possibly a situation where the equipment had never been
set up properly originally? Are there interoperability issues at play here and have all the component
pieces in the configuration been tested and qualified by the original vendors? Has system or application
performance (capacity) ever had a baseline established? Has additional loading been placed on the
system since establishment of the baseline?
Along with a problem description, make sure you have had some discussions around what constitutes a
solution. For example, if the business has experienced intermittent service interruptions every three days,
how many days of error-free running are required in order to consider the issue resolved? As a basic rule
of thumb, an error-free run of three times the original interval between failures (nine days in this example)
should be expected in order to have some level of confidence that the original problem has been resolved.
Without properly defining what the solution looks like, there may be disagreement between IT and the
business on when the situation can officially be considered resolved.
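The rule of thumb above reduces to simple arithmetic. The following is a minimal sketch rather than a prescribed method; the function name `required_error_free_days` and its `multiplier` parameter are hypothetical and exist only for illustration.

```python
def required_error_free_days(failure_interval_days: float, multiplier: float = 3.0) -> float:
    """Return the error-free run length suggested by the rule of thumb.

    failure_interval_days: typical time between failures before the fix.
    multiplier: number of failure intervals that must pass cleanly. The
    default of 3 follows the rule of thumb in the text; the parameter
    itself is an assumption so the value can be agreed with the business.
    """
    if failure_interval_days <= 0:
        raise ValueError("failure interval must be positive")
    return failure_interval_days * multiplier


# Interruptions every three days suggest nine clean days before closure.
print(required_error_free_days(3))
```

Whatever figure is agreed, recording it as part of the closure criteria gives IT and the business one shared number to check against.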
After a clear problem description has been established, a solution defined and all of the obvious issues
have been addressed, make sure to move beyond the obvious. Check for any recent changes to
hardware or software, or unusual situations that may be in close proximity to the advent of the problem.
Over 80% of all IT failures occur around some change in the environment. One of the most effective
questions that can be asked is, “What has changed recently?” One can focus so closely on being able to
define a root cause to any incident that effort to define effective workarounds as quickly as possible goes
unaddressed.
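Where change records are available in some structured form, the question “What has changed recently?” can be partly automated. The sketch below assumes a hypothetical change log held as `(timestamp, description)` tuples and an assumed seven-day look-back window; neither comes from a specific change management tool.

```python
from datetime import datetime, timedelta


def changes_before_incident(changes, incident_start, window_days=7):
    """Return changes made within window_days before the incident began.

    changes: iterable of (timestamp, description) tuples (hypothetical format).
    window_days: an assumed look-back period, not a value from the text.
    """
    earliest = incident_start - timedelta(days=window_days)
    recent = [c for c in changes if earliest <= c[0] <= incident_start]
    # Newest first, so the latest changes are reviewed first.
    return sorted(recent, key=lambda c: c[0], reverse=True)


change_log = [
    (datetime(2024, 3, 1, 9, 0), "Firmware update on storage array"),
    (datetime(2024, 3, 9, 22, 0), "Operating system patch"),
    (datetime(2024, 3, 10, 1, 30), "New application deployed"),
]
incident = datetime(2024, 3, 10, 4, 0)

for when, what in changes_before_incident(change_log, incident):
    print(when.isoformat(), what)
```

A list like this is only a starting point; changes that were never logged still have to be found by asking.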
Multiple perspectives in your remediation efforts ensure that everyone has a voice at the table, or more
likely conference call. Get input from the business, database administrators, system administrators,
product (hardware and software) vendors, end users, etc. You may want to do this in different forums and
then consolidate notes between the discussions.
Note that it is advisable to separate technical conversations from the necessary business management
updates that may be required. Keeping key stakeholders informed while not impeding an open flow of
technical dialog is critical. Management admonishments or assertions during a crisis event can inhibit
technical brainstorming and an open technical dialog.
Make sure to look for patterns, as might be found in a review of repeat service incidents. Seek to discover
inter-relationships that may be involved, which should help to set the direction toward determining root
cause(s).
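A review of repeat service incidents can begin with a simple count per component. This is a sketch under stated assumptions: the incident history is a hypothetical list of `(component, summary)` pairs, and the component names are invented for the example.

```python
from collections import Counter


def repeat_offenders(incidents, min_count=2):
    """Return (component, count) pairs seen at least min_count times."""
    counts = Counter(component for component, _summary in incidents)
    return [(comp, n) for comp, n in counts.most_common() if n >= min_count]


# Hypothetical incident history; names are illustrative only.
history = [
    ("db-server-01", "Connection timeouts"),
    ("san-switch-02", "Port flapping"),
    ("db-server-01", "Slow queries"),
    ("db-server-01", "Connection timeouts"),
    ("san-switch-02", "Port flapping"),
    ("web-03", "Disk full"),
]
print(repeat_offenders(history))
```

Components that appear repeatedly are candidates for a closer look at inter-relationships, not proof of a root cause.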
It is also critical that all interested parties are kept aware of progress during the incident; otherwise
technical resources that should be focused on the remediation effort will be pulled away into
conversations with non-technical stakeholders who are struggling to keep aware of progress, especially as
higher levels of business management become aware. We must have a well-defined, predetermined
process for communication that will take place during any remediation effort or crisis situation. Assign a
liaison to provide technical updates to the management team, as well as bring back any management
concerns to the technical team.
In a situation where multiple incidents are open, keep focus on the highest priority incidents, even as
additional issues or incidents are raised. Newly raised issues may not be the highest priority ones. We
need to clearly understand that the business’s priorities may change if the frequency or the impact of the
problem changes, or if other issues take priority. We must always be willing to re-evaluate functional
priorities on a continual basis throughout the crisis situation.
There is a common tendency on the part of technical resources to work on the incidents that are the most
technically complex rather than working the issues in priority order, defined as those incidents that
have the greatest impact and urgency to the user community and the business.
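One way to keep the work queue honest is to rank open incidents by impact and urgency alone. The sketch below is illustrative: the 1 to 3 scales, the `Incident` class and the product used as a score are assumptions, not a scheme taken from ITIL or from this paper.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    impact: int      # 1 (low) to 3 (high): an assumed scale
    urgency: int     # 1 (low) to 3 (high): an assumed scale
    complexity: int  # recorded, but deliberately ignored when ranking


def work_order(incidents):
    """Order incidents by impact times urgency, highest first."""
    return sorted(incidents, key=lambda i: i.impact * i.urgency, reverse=True)


# Hypothetical open incidents for illustration.
open_incidents = [
    Incident("Intermittent kernel panic on test host", impact=1, urgency=1, complexity=3),
    Incident("Order entry system down", impact=3, urgency=3, complexity=1),
    Incident("Nightly backup running slowly", impact=2, urgency=1, complexity=2),
]
for incident in work_order(open_incidents):
    print(incident.name)
```

The technically interesting kernel panic ends up last because complexity plays no part in the score. Re-running the ranking whenever the business revises impact or urgency keeps the order current.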
Quickly identify effective workarounds for any problems with significant impact to the business. Once (if)
a workaround is in place or a solution is found, root cause analysis work can begin. Before this work
proceeds, the question of whether finding a root cause is necessary should be raised. Root cause
analysis can take considerable time and resources. A search for root causes can be
counterproductive and hamper work which should take priority. Don’t be afraid to ask questions related to the
need for root cause identification. Do we really need to define the root cause for this incident considering
the costs required to do so? What is the likelihood that this problem will reoccur? Is the potential impact
to the business worth the knowledge that might be gained in the analysis? Is there potential that our
reputation with the business can be damaged if we fail to find a root cause?
These questions reflect a need for effective risk management within the IT environment. A decision may
be made to continue with root cause analysis, but it should be made as a conscious decision and not an
automatic response to crisis.
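The cost and likelihood questions above can be framed as a rough expected-loss comparison. This is only a sketch: the function `rca_worthwhile`, its inputs and the figures used are hypothetical, and it leaves out factors such as reputation with the business that cannot easily be priced.

```python
def rca_worthwhile(analysis_cost, recurrence_probability, impact_cost_per_recurrence):
    """Return True when expected recurrence cost exceeds the analysis cost.

    All three inputs are estimates supplied by the team (assumed values),
    so the result supports a conscious decision rather than replacing it.
    """
    if not 0.0 <= recurrence_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    expected_loss = recurrence_probability * impact_cost_per_recurrence
    return expected_loss > analysis_cost


print(rca_worthwhile(20_000, 0.5, 100_000))   # expected loss 50,000 -> True
print(rca_worthwhile(20_000, 0.05, 100_000))  # expected loss 5,000 -> False
```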
Technical Action Planning – in summation
In its simplest definition, the process of technical action planning for any incident is merely a matter of
thinking through how best to attain desired goals given the available resources and time.
Effective incident remediation efforts should include the following:
• Identify triggers and workarounds as quickly as is possible.
• Reduce the impact of incidents to the business.
• Continue to work to reduce the time to resolve incidents, or at least the impact to the end-users.
• Only then should you start thinking about root cause and reducing the number of incidents.
Practical Do’s and Don’ts
The following best practice guidelines have been developed through years of escalation
experience. They should be taken as intended: general guidelines for
best practice in any remediation effort. The specifics of any incident may dictate a different course of
action:
• Although it may be tempting in a suspected hardware failure to pull parts from a functioning
machine to use in a failed machine in an effort to isolate the specific failure, it often can become
counterproductive and cause more problems. Don’t pull parts from a working machine to
troubleshoot a failed one.
• When presented with multiple systems of which only some are failing,
it can be an effective troubleshooting technique to compare the configuration of a failed machine
to a non-failing machine to look for differences in firmware levels, applications being run,
amount of memory available, etc.
• Balance the desire to “shot-gun” a problem (replacing multiple parts at the same time) as a
troubleshooting strategy against the limited time that may be available. If symptoms change
following multiple parts replacement, one would have no idea whether there was an impact on the
original problem or a DOA part was put into the mix. Note, also, that there may come a
point in the remediation process where the cost of the effort to troubleshoot a problem is higher
than the overall replacement cost of that option.
• When changing out hardware, use electro-static discharge precautions when directly handling
equipment, such as static mats and wrist straps. Static electricity can have a
detrimental effect on the smaller and smaller technical footprints with which we work, yet
static electricity precautionary measures seem to be used less frequently.
• Label any original parts so that they do not get confused with replacements. Original parts can
quickly become interchanged with replacement parts and the effort of remediation will have
become extremely complex at that point. When a part is replaced and it does not fix the original
problem, restore the original part in the option/system.
• Communicate effectively with all stakeholders during a crisis, but allowing non-technical
resources to drive the technical action planning process will be a clear recipe for disaster. Keep
them aware of progress, meet your commitments to the business, and let feedback they provide
influence your actions. Don’t allow the end-user or business management to drive the technical
action planning process.
• Maintain trust during the crisis period. Keeping people informed of your actions and fulfilling any
commitments will assist a general understanding of any progress being made throughout the
crisis.
• Take time to define the “real” problem. Over 50% of the effort required to solve any problem is
properly defining it in the first place.
• Upgrading product firmware during an outage without first understanding how this will impact the
remediation effort can present additional problems. Making random changes within an IT
environment during a failure event can oftentimes exacerbate the original condition.
• Use a thoughtful approach to troubleshooting. During the stress of any significant incident, it can
become very easy to get distracted by unrelated or non-critical tasks. The best way to avoid that
possibility is to have a defined and thoughtful approach that is understood by all involved.
Document a step-by-step process flow, if at all possible. This will aid in any post mortem effort
that may follow.
• Document existing configurations and operational settings before beginning any invasive
troubleshooting or reconfiguration. During an extended crisis, it can be difficult to return a system
to an original configuration if not properly documented beforehand.
• Document the sequence of events throughout your efforts. Difficult issues can span several days
and multiple resources. If the steps taken at the advent of the incident are not well documented,
you may find that the same tasks are performed over and over again with the same results, not
leading to technical resolution, or that the most obvious actions are assumed to have already
taken place when they have not.
• Understand what has changed recently - in the system, loading or within the facility. A significant
number of critical incidents occur as a result of changes in an IT environment. Make sure you
ask at various levels what changes have been made recently that coincide with the advent of the
specific incident. Have they updated the Operating System, updated firmware or added a new
Application?
• Consider a troubleshooting approach that involves minimizing the original configuration and
building back, especially in highly complex environments. Determine the minimum
configuration needed to run, start at that level and eliminate all else. If that minimized configuration
continues to fail, you have less to troubleshoot. If it does not fail, then slowly build back to the
original configuration.
• Ask for help from others when needed throughout the event. Asking for help does not equate to
failure. A desire to know everything and to be the sole provider of information is sometimes a
dynamic that is seen with technical resources. Asking for assistance must be viewed as a sign of
strength, not weakness. With the complexity of our technology nowadays, no one person can
have all the required information needed to solve a problem. It also affords others the chance to
learn experientially throughout the remediation effort.
• Check for the obvious and then move beyond it. One can too easily get hung up on reviewing
what is believed to be the “obvious” solution to any problem, but if the issue were that obvious,
would it have evolved to the urgency now being experienced? The reverse can also be true, that
the obvious solution should not be overlooked for the sake of the more intricate solution.
• Verify full operation once you think you have fixed any incident. It can hurt the relationship
between IT and the business if success is declared prematurely. Remember to evaluate any
solution against the original description of success as defined early on in any action planning
process.
• Take the learnings from any crisis to help avoid the same crisis at a later time and date.
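One of the guidelines above suggests comparing the configuration of a failed machine to that of a non-failing one. As a minimal sketch, assuming each configuration has already been collected into a dictionary of setting names and values (the hosts and settings shown are hypothetical), the differences can be listed as follows:

```python
def config_differences(failing, working):
    """Return {setting: (failing_value, working_value)} where they differ.

    A setting missing from one side is reported with None for that side.
    """
    keys = sorted(set(failing) | set(working))
    return {
        key: (failing.get(key), working.get(key))
        for key in keys
        if failing.get(key) != working.get(key)
    }


# Hypothetical inventories gathered from two otherwise similar hosts.
failing_host = {"firmware": "2.1.0", "memory_gb": 64, "app_version": "5.4"}
working_host = {"firmware": "2.3.1", "memory_gb": 64, "app_version": "5.4", "nic_driver": "1.8"}

for setting, (bad, good) in config_differences(failing_host, working_host).items():
    print(f"{setting}: failing={bad} working={good}")
```

The same inventory doubles as the documented baseline that another guideline recommends capturing before any invasive troubleshooting begins.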
Even with the increased number of published standards related to the management and delivery of IT
services, the overall maturity of Information Technology and our ability to manage IT environments
continues to evolve.
Formalized IT process control disciplines are not as mature or robust as is required throughout the
industry. The IT industry is learning more and more about “what needs to be done” but has yet to fully
develop well-defined prescriptive measures to make it happen effectively. Hopefully this information has
added to the overall body of knowledge in support of IT crisis management and problem solving.
Technical Action Planning Template
Business Impact Summary
• Provide a brief summary of the key stakeholders, their business and the technical
impact of problem.
Problem Statement
Detailed problem description – use as much detail as required to clearly communicate the
nature of the problem from a technical perspective, including any specific business
considerations that the technical team needs to understand through the effort.
Closure Criteria
• Define closure criteria: “What constitutes success for this effort?”
Problem Management Team (Who is involved in this effort?)
– Problem Manager – key customer communication interface; focal point for all
communication – technical and business related
– Technical Escalation Resource – understands the technical issues and impact of
the problem and helps to translate technical information between the business and
the technical team
– Technical Escalation Manager – coordinates additional technical resources as
required by the plan, including resources from vendors as may be required
– Business contact focal – primary communication link between key business
stakeholders and the technical remediation efforts, including all key players at the
customer site or service partners
Role | Name | Title | Email | Telephone | Cell
Problem Manager | | | | |
Technical Escalation Resource | | | | |
Technical Escalation Manager | | | | |
Business Communication Contact Focal | | | | |
Logistics (parts), Resource or Tool Requirements:
• This section would be used to define any resources currently not available and
begin the process to source anything that may be required.
Resources needed | Description | Source | Comments / Status
1 | | |
2 | | |
3 | | |
Action Plan
Action Summary
Action Number | Description | Owner(s) & Date Due | Comments / Status
1 | | |
2 | | |
3 | | |
4 | | |
SPECIFIC ACTIONS (Detailed action and owner)
• Action one
– Schedule of events – when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
• Action Two
– Schedule of events – when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
• Action Three… repeated as required.
– Schedule of events – when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
Contingency Planning:
“What if” planning – location for contingency planning if initial action plan outlined does not
accomplish effective remediation. For example, if action plan does not result in desired
hardware fix, product replacement would take place. This would also be the point to
discuss the transfer of operations to any predefined disaster recovery site, if available.
Technical Contact List:
Monitoring criteria and timing – Once it is believed that a solution has been provided, how
long is it agreed that the situation will be monitored prior to incident closure and declaration
of success. (Note: This should mirror what was originally defined at the advent of the
incident in Closure Criteria.)
Technical Planning Contact List:
Name | Title | Email | Telephone | Cell
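For teams that track plans in a tool rather than a document, the core of the template above could be captured as a small data structure. This is a sketch only; the class and field names are hypothetical and simply mirror a few of the template headings, and the sample values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    number: int
    description: str
    owner: str
    due: str              # kept as free text, for example an ISO date
    status: str = "Open"


@dataclass
class TechnicalActionPlan:
    business_impact: str
    problem_statement: str
    closure_criteria: str
    actions: list = field(default_factory=list)
    contingency: str = ""

    def open_actions(self):
        """Return actions whose status is not 'Closed'."""
        return [a for a in self.actions if a.status != "Closed"]


# Illustrative values only.
plan = TechnicalActionPlan(
    business_impact="Order entry unavailable to all branch offices",
    problem_statement="Database cluster node fails over every three days",
    closure_criteria="Nine consecutive days without failover",
)
plan.actions.append(Action(1, "Capture crash dump on next failover", "DBA team", "2024-03-12"))
plan.actions.append(Action(2, "Compare firmware with healthy node", "Vendor", "2024-03-11", "Closed"))
print([a.number for a in plan.open_actions()])
```

Keeping the closure criteria in the same record as the actions makes it harder to declare success against anything other than what was originally agreed.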
Communications Template
Problem Statement
• Clearly and specifically state the problem for any professional audience to read and
understand.
Business Impact
• Take the time to define details of the operational impact and financial impact of any
problems, if possible.
Closure Criteria
• Define closure criteria: “What constitutes success for this effort?”
New This Update:
• This section would be used on a rotating basis to understand NEW information for
the effort or crisis. It provides a section for those that are familiar with the basics of
the issue to
Action Plan Summary:
Action Number | Description | Owner(s) & Date Due | Comments / Status
1 | | |
2 | | |
3 | | |
4 | | |
Stakeholder Contact List:
Name | Title | Email | Telephone | Cell
Technical Contact List:
Name | Title | Email | Telephone | Cell
Other Contacts:
Name | Title | Email | Telephone | Cell
For more information
http://h20219.www2.hp.com/services/cache/457080-0-0-225-121.html
Author:
Chuck Boutcher, PMP, CISSP, ITIL Expert
Americas Escalation Team
HP Services
charles.boutcher@hp.com
Contributors:
• Mr. Gary Blew
Americas Escalation Team
HP Services
• Mr. Mark Hastings
Americas Escalation Team
HP Services
• Mr. Dan Phalen, PMP
• Professor Jacques Sauvé
Systems and Computing Department
Federal University of Campina Grande
Campina Grande, PB, Brazil
• Mr. Tony Vohsemer
Canadian Escalation Manager
HP Services, Canada

Iscope Digital : Integrated IT Service ManagementIscope Digital : Integrated IT Service Management
Iscope Digital : Integrated IT Service Management
 
Business Intelligence Analysis - The key to organisational and business success
Business Intelligence Analysis - The key to organisational and business successBusiness Intelligence Analysis - The key to organisational and business success
Business Intelligence Analysis - The key to organisational and business success
 
Analyzing Project Failure Modes: Lessons learnt from the field
Analyzing Project Failure Modes: Lessons learnt from the fieldAnalyzing Project Failure Modes: Lessons learnt from the field
Analyzing Project Failure Modes: Lessons learnt from the field
 
Dit yvol3iss34
Dit yvol3iss34Dit yvol3iss34
Dit yvol3iss34
 
BDIM - Ferrara
BDIM - FerraraBDIM - Ferrara
BDIM - Ferrara
 

IT_Crisis_Problem_Management_Whitepaper

  • 1. 1 IT Crisis and Problem Management (Technical Escalation and Action Planning) With the advent of more and more published standards related to the management and delivery of IT services (ITIL, cobIT, et. al.), we still often discover that formal IT process control disciplines are not as mature or robust an IT management science as we would desire. We are learning more and more about “what needs to be done” but have yet to fully develop well-defined, prescriptive measures to make it happen effectively. With ever-increasing economic pressures on business and the subsequent pressures on IT to provide higher levels of services at reduced costs, some IT organizations are increasing the level of technical risk within their operations, both through conscious and unconscious decisions that are made with IT management. The following are examples of decisions which increase risk for IT and the business: • Implementation of changes (hardware, software, firmware) directly into a production environment without prior testing on a development/test configuration • Multiple single points of failure in hardware configurations • Reduced or limited service coverage from supporting vendors without committed response and repair times • Limited windows of time for implementation of change which then may require multiple changes to be made in a short timeframe (If a problem occurs, which change caused the failure?) • Complex multi-vendor solutions which bring together multiple software and hardware products that have not been certified to operate together effectively causing the potential for multiple interoperability issues, etc. • The outsourcing of specific functions within IT, such as development and maintenance can present additional challenges from both a “time to resolution” perspective and communications. Outsourcing often brings with it a set of performance standards which differ from that of the IT organization. 
Significant technological advances in underlying hardware platforms are out-pacing the ability of software providers to keep up with the innovation and develop management tools to control these environments.
  • 2. 2 This dynamic has led to increased system complexity and a technical gap in knowledge of the IT service providers to effectively support those environments. Those involved in IT crisis (problem) management as a primary role tend to see more clearly the places where the lack of sound process control exists. Problems occur and re-occur within a technology environment at an alarming rate, often having significant impact on the ability to produce or improve the overall efficacy of the environment. A lack of sound problem management strategies can exacerbate current operational failures of information technology systems, increase their frequency and/or be the direct cause of future failures. Within IT, it has become more of a matter of “when” your next significant, critical problem will occur, not “if” one will occur. An effective discipline around IT crisis management can give business managers control of their IT service priorities by providing the procedures for quick restoration of impacted services, safeguarding existing services, and creating a basis for learning from failures to reduce the advent of future failures before they can occur. Well defined problem and crisis management procedures will increase overall service availability and IT efficiency by reducing the number of unnecessary escalations (crisis) within the environment. Historically, IT professionals have focused on technology as the primary influencer on the overall reliability within IT. With that being said, the implementation of effective technical crisis processes will certainly NOT provide a “silver bullet” for remedial problems and technical crisis. While there must be an underlying set of standards upon which IT services are provisioned, as with the processes defined through frameworks such as ITIL, effective problem and crisis processes can provide a basis for us to understand the advent of a problem and help feed the problem management phase of any remedial action. 
According to ITIL It should be noted that use herein of the term “problem” and “crisis” management does not strictly adhere to the definitions as laid out in ITIL. The terms are interchanged between problem management and crisis management intentionally. In strict ITIL terms, what is being referenced could be more clearly defined as an incident that has significant impact and visibility within the business. Crisis and problem management are being referenced in terms more readily used throughout the IT industry. In summary without a full dialog relative to ITIL service lifecycle management, the ITIL theory of problem resolution begins with the “Incident,” an unplanned interruption or performance reduction to an IT Service. The service request is made, and when a cause of one or more incidents is identified, a “Problem” record is created and made available to Service Desk personnel. Once a root cause or effective work around had been identified, it is then acknowledged as a “known error.” In version three of ITIL, there are additional provisions made for “Continual Service Improvement” that recommend ongoing trend analysis for all services and IT processes and identify areas for improvement. Real World Experience The practicality of experience demonstrates that the finding of root cause can take significant time to determine. Efforts to restore operation can directly conflict with data gathering and problem analysis. An effective work around becomes victim to root cause analysis. SLA targets for incident management are used to drive quicker problem resolution with limited resources available to do both incident management and effective problem resolution. Technical IT resources often focus on any specific problem from a myopic perspective, either focusing on a small sub-set of the environment that is too small to effectively remediate the broader issues or not fully understanding the true impact to the business. 
They can focus on priorities that do not match those of the business based upon impact or risk.
  • 3. 3 There can be situations when IT resources continue to solve issues repeatedly without properly documenting the resolution or determining the effective root cause of the issue. Instead of removing the underlying problem and reducing the number of overall incidents, one is driven to quicker problem resolution and moving on to the next critical problem. Newer, increasingly more complex technologies are being seen in increasingly smaller footprints. IT environments are becoming more complex while skilled resources to manage those environments become less and less available due to the breadth of knowledge required to manage those environments. The speed with which technology changes also presents challenges in maintaining sufficient qualified resources with the requisite knowledge and skill set. Service requirements are becoming more demanding with increased interoperability issues between different vendor products as vendors are under increasing pressure to ship products that may not yet be ready for “prime time” operation. In a “perfect world,” effective use of people, process and technology within IT would limit risk to the business. The process and strategies required to avoid theses problems are generally understood. There are substantial sources of information currently available that speak to these strategies and techniques. What is lacking is a clear path to take when significant problems do occur within an IT environment. What is laid out herein is a framework that defines clear and effective communication techniques in dealing with the business, a strategy for sound troubleshooting techniques and action planning, a direction to share knowledge and the effective use of the resources available in support of efforts toward problem remediation. Specifics – a How To … This process must begin with IT involving the business in the incident. 
Once IT is made aware of the issue, in most cases through the logging of an incident through a help desk, one must quickly determine the business’s level of urgency and the impact the problem is causing. First, listen to what is being said by the customer. This may sound like an intuitive first step, but the reality is that technical people will often head off to “solve the problem” before there is a clear and full understanding of just what is the actual problem. Having a clear understanding of the initial problem and what will constitute a successful remediation is often half of the battle in trying to solve technical incidents. One must obtain a complete picture of the issues trying to be solved. Always take the time to get a clear and complete problem description. If the problem changes during the crisis, we must make sure we take time to re-evaluate our initial problem statement. Make sure we understand any timing issues. When does the problem happen? How often and under what conditions? Have there been any recent changes to accounts or applications impacted by the failure? You need to test any assumptions you make with reflective dialog. “To make sure I am clear, you are telling me …” Take the time to learn what has been tried before for any repeat problems, either by the business, other IT professionals or another vendor services which may have been involved earlier in the incident before the crisis became evident. It may be beneficial to examine all changes made in the environment preceding the crisis even though the changes may seem unrelated to the failure. Effective communication with the users throughout the entire incident lifecycle is essential. One example of why this may be needed can be seen in changes that the business may make to system loading or applications being run can significantly impact the problem(s) being seen and IT efforts to remediate the situation. 
While efforts are made to understand the problem, other resources may be working at cross purposes in an effort to limit the impact to overall production. Without understanding what changes are being made across the environment, it can make remediation efforts that much more problematic. Human factors in any problem need to be considered. Has human error caused or exacerbated the advent of any problem? Have IT professionals been working this problem with inadequate food or rest
  • 4. 4 periods? Has the overall workload on staff contributed to the crisis? Has the technology advanced beyond the expertise of those working on the equipment? If outsourcers or contractors are in use within the IT environment, are there any cultural barriers that are contributing to ineffective communications? As for the equipment, has it ever performed to expectations; and if so, for how long? Is this a new issue that has never been seen, a repeat problem or possibly a situation where the equipment had never been set up properly originally? Are there interoperability issues at play here and have all the component pieces in the configuration been tested and qualified by the original vendors? Has system or application performance (capacity) ever had a baseline established? Has additional loading been placed on the system since establishment of the baseline? Along with a problem description, make sure you have had some discussions around what constitutes a solution. For example, if the business has experienced intermittent service interruptions every three days, how many days of error-free running are required in order to consider the issue resolved. As a basic rule of thumb, an error-free run rate of three times the original failure rate should be expected in order to have some level of confidence that the original problem has been resolved. Without properly defining what the solution looks like, there may be disagreement between IT and the business on when the situation can officially be considered resolved. After a clear problem description has been established, a solution defined and all of the obvious issues have been addressed, make sure to move beyond the obvious. Check for any recent changes to hardware or software, or unusual situations that may be in close proximity to the advent of the problem. Over 80% of all IT failures occur around some change in the environment. 
One of the most effective questions that can be asked is, “What has changed recently?” One can focus so closely on being able to define a root cause to any incident that effort to define effective workarounds as quickly as possible goes unaddressed. Multiple perspectives in your remediation efforts ensure that everyone has a voice at the table, or more likely conference call. Get input from the business, database administrators, system administrators, product (hardware and software) vendors, end users, etc. You may want to do this in different forums and then consolidate notes between the discussions. Note that it is advisable to separate technical conversations from the necessary business management updates that may be required. Keeping key stakeholders informed while not impeding an open flow of technical dialog is critical. Management admonishments or assertions during a crisis event can inhibit technical brainstorming and an open technical dialog. Make sure to look for patterns, as might be found in a review of repeat service incidents. Seek to discover inter-relationships that may be involved, which should help to set the direction toward determining root cause(s). It is also critical that all interested parties are kept aware of progress during the incident; otherwise technical resources that should be focused on remediation of the effort will be pulled away into conversations with non-technical stakeholders that are struggling to keep aware of progress, especially as higher levels of business management become aware. We must have a well-defined, predetermined process for communication that will take place during any remediation effort or crisis situation. Assign a liaison to provide technical updates to the management team, as well as bring back any management concerns to the technical team. In a situation where multiple incidents are open, keep focus on the highest priority incidents, even as additional issues or incidents are raised. 
These may not be the same as the high priority incidents. We need to clearly understand that the business’ priorities may change if frequency or the impact of the problem changes; or if other issues take priority. We must always be willing to evaluate functional priorities on a continual basis throughout the crisis situation. There is a common tendency on the part of technical resources to work on incidents that are the most technically complex, not necessarily focusing on the issues in priority order – defined as those incidents have the greatest impact and urgency to the user community and the business.
  • 5. 5 Quickly identify effective workarounds for any problems with significant impact to the business. Once (If) a work around is in place or a solution is found, root cause analysis work can begin. Before this work proceeds, the question of whether finding a root cause is necessary should be raised. Determining root cause analysis can take considerable time and resources. A search for root causes can be counter- productive and hamper work which should take priority. Don’t be afraid to ask questions related to the need for root cause identification. Do we really need to define the root cause for this incident considering the costs required to do so? What is the likelihood that this problem will reoccur? Is the potential impact to the business worth the knowledge that might be gained in the analysis? Is there potential that our reputation with the business can be damaged if we fail to find a root cause? These questions reflect a need for effective risk management within the IT environment. A decision may be made to continue with root cause analysis, but it should be made as a conscious decision and not an automatic response to crisis. Technical Action Planning – in summation In its simplest definition, the process of technical action planning for any incident is merely a matter of thinking through how best to attain desired goals given the available resources and time. Effective incident remediation efforts should include the following: • Identify triggers and workarounds as quickly as is possible. • Reduce the impact of incidents to the business. • Continue to work to reduce the time to resolve incidents, or at least the impact to the end-users. • Only then should you start thinking about root cause and reducing the number of incidents Practical Do’s and Don’ts From years of experience, some best practice guidelines can be defined that have been developed through years of escalation experience. 
These should be taken as intended, to be general guidelines for best practice in any remediation effort. The specifics of any incident may dictate a different course of action: • Although it may be tempting in a suspected hardware failure to pull parts from a functioning machine to use in a failed machine in an effort to isolate the specific failure, it often can become counterproductive and cause more problems. Don’t pull parts from a working machine to troubleshoot a failed one. • When presented with multiple systems where one or a smaller number of systems are not failing, it can be an effective troubleshooting technique to compare the configuration of a failed machine to a non-failing machine to look for differences in firmware levels, applications being run, differences in amount of memory available, etc • Balance the desire to “shot-gun” a problem (replacing multiple parts at the same time) as a troubleshooting strategy against the limited time that may be available. If symptoms change following multiple parts replacement, one would have no idea if there were an impact to the original problem or there were a DOA part put into the mix. Note, also, that there may come a point in the remediation process where the cost of the effort to troubleshoot a problem is higher then overall replacement cost of that option. • When changing out hardware, use electro-static discharge precautions when directly handling equipment, such as through the use of static mats and wrist straps. Static electricity can have a detrimental affect on the smaller and smaller technical footprints with which we work, yet use of static electricity precautionary measures seems to be utilized less frequently. • Label any original parts so that they do not get confused with replacement. Original parts can quickly become interchanged with replacement parts and the effort of remediation will have
  • 6. 6 become extremely complex at that point. When a part is replaced and it does not fix the original problem, restore the original part in the option/system. • Communicate effectively with all stakeholders during a crisis, but allowing non-technical resources to drive the technical action planning process will be a clear recipe for disaster. Keep them aware of progress, meet your commitments to the business, and let feedback they provide influence your actions. Don’t allow the end-user or business management to drive the technical action planning process. • Maintain trust during the crisis period. Keeping people informed of your actions and fulfilling any commitments will assist a general understanding of any progress being made throughout the crisis • Take time to define the “real” problem. Over 50% of the effort required to solving any problem is properly defining it in the first place • Upgrading product firmware during an outage without first understanding how this will impact the remediation effort can present additional problems. Making random changes within an IT environment during a failure event can oftentimes exacerbate the original condition. • Use a thoughtful approach to troubleshooting. During the stress of any significant incident, it can become very easy to get distracted by unrelated or non-critical tasks. The best way to avoid that possibility is to have a defined and thoughtful approach that is understood by all involved. Document a step-by-step process flow, if at all possible. This will aid in any post mortem effort that may follow. • Document existing configurations and operational settings before beginning any invasive troubleshooting or reconfiguration. During an extended crisis, it can be difficult to return a system to an original configuration if not properly documented beforehand. • Document the sequence of events throughout your efforts. Difficult issues can span several days and multiple resources. 
If the steps taken at the advent of the incident are not well documented, you may find that the same tasks are performed over and over again with the same results, not leading to technical resolution, or that the most obvious actions are assumed to have already taken place when they have not. • Understand what has changed recently - in the system, loading or within facility. A significant number of critical incidents occur as a result of changes in an IT environment. Make sure you ask at various levels what changes have been made recently that coincide with the advent of the specific incident. Have they updated the Operating System, updated firmware or added a new Application? • Consider a troubleshooting approach that involves minimizing the original configuration and building back, especially in highly complex environments. What would be the minimum needed to any configuration to run, start at that level and eliminate all else. If that minimized configuration continues to fail, you have less to troubleshoot. If it does not fail, then slowly build back to the original configuration. • Ask for help from others when needed throughout the event. Asking for help does not equate to failure. A desire to know everything to be the sole source provider of information is sometimes a dynamic that is seen with technical resources. Asking for assistance must be viewed as a sign of strength, not weakness. With the complexity of our technology nowadays, no one person can have all the required information needed to solve a problem. It also affords others the chance to learn experientially throughout the remediation effort. • Check for the obvious and then move beyond it. One can too easily get hung up on reviewing what is believed to be the “obvious” solution to any problem, but if the issue were that obvious,
would it have evolved to the urgency now being experienced? The reverse can also be true: the obvious solution should not be overlooked for the sake of a more intricate one.
• Verify full operation once you think you have fixed an incident. Declaring success prematurely can hurt the relationship between IT and the business. Remember to evaluate any solution against the original description of success as defined early in the action planning process.
• Take the learnings from any crisis to help avoid the same crisis at a later date.

Even with the increased number of published standards related to the management and delivery of IT services, the overall maturity of Information Technology and our ability to manage IT environments continue to evolve. Formalized IT process control disciplines are not as mature or robust as the industry requires. The IT industry is learning more and more about “what needs to be done” but has yet to fully develop well-defined prescriptive measures to make it happen effectively. Hopefully this information has added to the overall body of knowledge in support of IT crisis management and problem solving.
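The guideline above on minimizing a configuration and building back can be sketched in code. The following Python is a minimal illustration and not part of the original paper: the function name `build_back`, the component names, and the `fails` callback (standing in for whatever test reproduces the incident) are all assumptions made for the example.

```python
from typing import Callable, FrozenSet, Optional, Sequence

# Hypothetical sentinel returned when even the minimal configuration fails.
MINIMAL = "<minimal configuration>"


def build_back(
    minimal: Sequence[str],
    optional: Sequence[str],
    fails: Callable[[FrozenSet[str]], bool],
) -> Optional[str]:
    """Start from the minimal configuration and add components one at a time.

    Returns MINIMAL if the minimal configuration itself fails, the name of
    the first component whose addition makes the configuration fail, or
    None if the fully rebuilt configuration runs without failure.
    """
    active = set(minimal)
    if fails(frozenset(active)):
        # Less to troubleshoot: the fault lives in the minimal set.
        return MINIMAL
    for component in optional:
        active.add(component)
        if fails(frozenset(active)):
            return component
    return None


def incident_reproduces(config: FrozenSet[str]) -> bool:
    # Hypothetical fault: it appears only when both components are present.
    return {"san_adapter", "backup_agent"} <= config


suspect = build_back(
    minimal=["os", "boot_disk"],
    optional=["san_adapter", "nic_team", "backup_agent"],
    fails=incident_reproduces,
)
print(suspect)  # backup_agent
```

Note that the component returned is only the one whose addition triggered the failure. As in the example, the real cause may be an interaction with something added earlier, so the result narrows the search rather than ending it.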
Technical Action Planning Template

Business Impact Summary
• Provide a brief summary of the key stakeholders, their business, and the technical impact of the problem.

Problem Statement
Detailed problem description: use as much detail as required to clearly communicate the nature of the problem from a technical perspective, including any specific business considerations that the technical team needs to understand throughout the effort.

Closure Criteria
• Define closure criteria: “What constitutes success for this effort?”

Problem Management Team (Who is involved in this effort?)
– Problem Manager: key customer communication interface; focal point for all communication, both technical and business related
– Technical Escalation Resource: understands the technical issues and impact of the problem and helps to translate technical information between the business and the technical team
– Technical Escalation Manager: coordinates additional technical resources as required by the plan, including resources from vendors as may be required
– Business Contact Focal: primary communication link between key business stakeholders and the technical remediation efforts, including all key players at the customer site or service partners

| Role | Name | Title | Email | Telephone | Cell |
|---|---|---|---|---|---|
| Problem Manager | | | | | |
| Technical Escalation Resource | | | | | |
| Technical Escalation Manager | | | | | |
| Business Communication Contact Focal | | | | | |

Logistics (parts), Resource or Tool Requirements:
• This section would be used to define any resources currently not available and begin the process to source anything that may be required.

| Resources needed | Description | Source | Comments / Status |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
Action Plan

Action Summary

| Action Number | Description | Owner(s) & Date Due | Comments / Status |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |

SPECIFIC ACTIONS (Detailed action and owner)
• Action One
– Schedule of events: when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
• Action Two
– Schedule of events: when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)
• Action Three… repeated as required.
– Schedule of events: when due for completion
– Measurable milestones
– Expected results
– Event triggers
– Identified resources or needs for resources
– Management escalation
– Communication schedule (who, what, where, when)

Contingency Planning:
“What if” planning. This is the location for contingency planning in case the initial action plan does not accomplish effective remediation. For example, if the action plan does not result in the desired hardware fix, product replacement would take place. This would also be the point to discuss the transfer of operations to any predefined disaster recovery site, if available.

Monitoring Criteria and Timing:
Once it is believed that a solution has been provided, how long is it agreed that the situation will be monitored prior to incident closure and declaration of success? (Note: This should mirror what was originally defined at the outset of the incident in Closure Criteria.)

Technical Planning Contact List:

| Name | Title | Email | Telephone | Cell |
|---|---|---|---|---|
| | | | | |
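The monitoring criteria in the Action Plan template above amount to a simple rule: do not close the incident until an agreed period has passed without the problem recurring. A minimal Python sketch of that rule follows. The class name `MonitoringCriteria`, its fields, and the choice to restart the clock at any recurrence are illustrative assumptions, not something the template prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MonitoringCriteria:
    """Hypothetical record of the agreed monitoring period for an incident."""

    fix_applied_at: datetime                    # when the believed fix went in
    window: timedelta                           # agreed monitoring duration
    last_failure_at: Optional[datetime] = None  # most recent recurrence, if any

    def ready_to_close(self, now: datetime) -> bool:
        """True once the agreed window has elapsed with no recurrence.

        Assumption: a recurrence after the fix restarts the monitoring clock.
        """
        clean_since = self.fix_applied_at
        if self.last_failure_at is not None and self.last_failure_at > clean_since:
            clean_since = self.last_failure_at
        return now - clean_since >= self.window


fix_time = datetime(2024, 1, 1, 8, 0)
criteria = MonitoringCriteria(fix_applied_at=fix_time, window=timedelta(hours=48))

print(criteria.ready_to_close(fix_time + timedelta(hours=47)))  # False
print(criteria.ready_to_close(fix_time + timedelta(hours=48)))  # True
```

Recording the window as data at the start of the effort makes it harder to declare success early under pressure, because the closure check is tied to what was agreed in the Closure Criteria rather than to how the situation feels at the time.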
Communications Template

Problem Statement
• Clearly and specifically state the problem for any professional audience to read and understand.

Business Impact
• Take the time to define details of the operational impact and financial impact of any problems, if possible.

Closure Criteria
• Define closure criteria: “What constitutes success for this effort?”

New This Update:
• This section would be used on a rotating basis to understand NEW information for the effort or crisis. It provides a section for those that are familiar with the basics of the issue to

Action Plan Summary:

| Action Number | Description | Owner(s) & Date Due | Comments / Status |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | | | |

Stakeholder Contact List:

| Name | Title | Email | Telephone | Cell |
|---|---|---|---|---|
| | | | | |

Technical Contact List:

| Name | Title | Email | Telephone | Cell |
|---|---|---|---|---|
| | | | | |

Other Contacts:

| Name | Title | Email | Telephone | Cell |
|---|---|---|---|---|
| | | | | |
For more information
http://h20219.www2.hp.com/services/cache/457080-0-0-225-121.html

Author:
Chuck Boutcher, PMP, CISSP, ITIL Expert
Americas Escalation Team, HP Services
charles.boutcher@hp.com

Contributors:
• Mr. Gary Blew, Americas Escalation Team, HP Services
• Mr. Mark Hastings, Americas Escalation Team, HP Services
• Mr. Dan Phalen, PMP
• Professor Jacques Sauvé, Systems and Computing Department, Federal University of Campina Grande, Campina Grande, PB, Brazil
• Mr. Tony Vohsemer, Canadian Escalation Manager, HP Services, Canada