An overview of crisis management
What is crisis management
Entities involved in crisis management
Incidents, problems and Major incidents (in an ITIL context)
Vital Business Functions
2. ProblemManagementFoundation
Objectives
An overview of crisis management
• What is crisis management
• Entities involved in crisis management
• Incidents, problems and Major incidents (in an ITIL context)
• Vital Business Functions
What is crisis management?
• Structured approach to handling a crisis
• Focus on the process of Major Incidents
  • Dealing with a Major Incident
  • Engineering to reduce the impact of a Major Incident
  • Continuous improvement
• What about disasters?
  • The worst case scenario for a Major Incident
  • Trigger for implementation of business continuity (business continuity is a subsection of crisis management)
• For the purposes of this course the focus is on Information and Communications
Technology (ICT) and Data Centres (DCs)
• These are generic principles that can be applied universally
Entities involved in crisis management
Service Desk:
• A Service Desk (SD) is a primary Information Technology (IT) service. It
is part of the discipline of IT service management (ITSM) as defined by
the Information Technology Infrastructure Library (ITIL). It is intended
to provide a Single Point of Contact ("SPOC") to meet the
communication needs of both customers/users and IT employees.
Crisis Management Operations Centre (CMOC):
• A Crisis Management Operations Centre (CMOC) is a central location
from which administrators monitor, manage and control the crisis. The
overall function is to maintain optimal operations across a variety of
platforms, mediums and communications channels such as servers,
storage, networks and data centres.
Business Continuity:
• Business Continuity is defined as the capability of the organisation to
continue delivery of products or services at acceptable predefined
levels following a disruptive incident (source: ISO 22301:2012).
Disruptive incidents include natural, physical and emergency events
(such as terrorism) as well as financial, regulatory and reputational events.
Various functions
• A SD is typically a single point of contact for clients where the
majority of the load is inbound and reactive.
• The CMOC is proactive by nature, even in the context of a
crisis. Usually a CMOC does not interact directly with the primary
client/user.
• A Major Incident being triggered from the SD is not optimal as the
majority of Major Incidents should be triggered from the CMOC.
Ideally a client should not trigger a Major Incident or crisis.
• The disaster recovery or business continuity plan is always triggered
from the Major Incident process and handled and communicated in
a standardised manner.
ITIL’s Incident definition
• An incident is an unplanned disruption or degradation of service.
• A problem is a cause of one or more incidents.
• A Major Incident is an incident with severe negative consequences.
• Incidents are time dependent
• Problems are not necessarily time dependent
• A Major Incident needs to be analysed in the same way as ITIL treats
a Problem
  • Symptom – what you see. These need to be recorded.
  • Causes – what made it happen. These need to be determined.
  • Resolution – how it was fixed. The service is back to normal.
  • Associated identified risks need to be mitigated (countermeasures).
ITIL’s Problem definition
• A problem exists when there is an undiagnosed underlying root
cause of one or more incidents or potential incidents.
• A known error exists when a problem has been identified and its
causation has been determined.
• A workaround is a way of preventing or resolving incidents and
problems. Workarounds can be used to temporarily resolve an
issue or provide guidance to an alternative resolution.
• There is never a single root cause to a problem. Causation
means that a problem has multiple causes and not a
singular root cause. (Root cause analysis is handled later.)
derived from ITIL
ITIL’s Problem definition (cont.)
The following questions need to be answered for any Problem
encountered:
• What is the problem?
• Why is there a problem?
• When did the problem happen?
• How did the problem occur?
• Where did the problem manifest itself?
• Who has been experiencing this problem?
derived from ITIL
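The six questions can be treated as a checklist that a problem record has to complete. The following is a minimal sketch, not something ITIL prescribes: the ProblemRecord class, its field names, the unanswered helper and the sample answers are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProblemRecord:
    """Hypothetical record with one field per question on the slide."""
    what: Optional[str] = None   # What is the problem?
    why: Optional[str] = None    # Why is there a problem?
    when: Optional[str] = None   # When did the problem happen?
    how: Optional[str] = None    # How did the problem occur?
    where: Optional[str] = None  # Where did the problem manifest itself?
    who: Optional[str] = None    # Who has been experiencing this problem?


def unanswered(record: ProblemRecord) -> list[str]:
    """Return the questions (by field name) that still have no answer."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative data only.
record = ProblemRecord(
    what="Nightly backup job fails",
    when="Since the storage firmware upgrade",
    where="Backup server in the primary data centre",
)
print(unanswered(record))
```

A record would only be treated as complete once unanswered() returns an empty list.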
Differentiating Incidents & Problems
• Incident management needs to be solely concerned with returning the service to
an operational state, and should not be over-complicated with the analysis of root
causes.
• Problem management is proactive and is used to combat future incidents. It is not
time dependent.
• Major Incidents have a direct relationship to Problem Management, as the
underlying triggers of the Major Incident are usually Problems.
What is a Major Incident?
• An incident is any event that is not part of the standard operation of a
service and that causes an interruption or a reduction in the quality of
that service.
• A Major Incident is an unplanned or temporary interruption of service
with severe negative consequences.
• Any service outage that does not qualify as a Major Incident should be
categorised as a Moderate, Minor or Normal Incident.
• Major Incident reports are escalated to the Problem Manager for
quality assurance. (Problem Managers are part of the Alpha Tiger
team; refer to chapter 5.)
derived from ITIL
Major Incidents (cont.)
• Dealing with these incidents is crucial as they are potential
showstoppers for the business.
• Major Incidents can have a severe business impact such as:
  • service, system or infrastructure component not functioning adequately to enable a business process
  • total loss of service, system or infrastructure component
• Major Incidents could also be those which do not entirely disrupt the
use of the service, system or infrastructure component such as:
  • continuous slow response
  • general degradation of service
ISO 20000 Major Incident process
ITIL provides a definition but does not define a process for
managing Major Incidents.
ISO 20000 clarifies the process around a Major Incident as follows:
The service provider shall document and agree with the client the
definition of a Major Incident. Major Incidents shall be classified and
managed according to a documented procedure. Top management shall
be informed of Major Incidents. Top management shall ensure that a
designated individual responsible for managing the Major Incident is
appointed. After the agreed service has been restored, Major Incidents
shall be reviewed to identify opportunities for improvement. (ISO/IEC
20000:2011, section 8.1)
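The obligations in the quoted clause can be tracked per Major Incident. The sketch below is one possible reading, not part of the standard: the MajorIncident class, its method names and the wording of the obligations are assumptions. The only ordering it enforces is the one the clause states, namely that the review follows restoration of service.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MajorIncident:
    """Hypothetical tracker for the obligations in the quoted clause."""
    summary: str
    manager: Optional[str] = None           # designated responsible individual
    top_management_informed: bool = False
    restored: bool = False
    reviewed: bool = False
    improvements: list[str] = field(default_factory=list)

    def appoint_manager(self, name: str) -> None:
        self.manager = name

    def inform_top_management(self) -> None:
        self.top_management_informed = True

    def restore_service(self) -> None:
        self.restored = True

    def review(self, improvements: list[str]) -> None:
        # The clause places the review after the agreed service is restored.
        if not self.restored:
            raise RuntimeError("review takes place after service is restored")
        self.improvements = list(improvements)
        self.reviewed = True

    def open_obligations(self) -> list[str]:
        """List the obligations that have not been met yet."""
        checks = [
            (self.top_management_informed, "inform top management"),
            (self.manager is not None, "appoint responsible individual"),
            (self.restored, "restore agreed service"),
            (self.reviewed, "review for improvement opportunities"),
        ]
        return [label for done, label in checks if not done]


# Illustrative usage with made-up values.
incident = MajorIncident("Data centre goes hard down")
incident.inform_top_management()
incident.appoint_manager("Duty Major Incident manager")
print(incident.open_obligations())
```

Keeping the obligations as an explicit list makes it easy to show, at any point in the crisis, what is still outstanding.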
Vital Business Functions
• ITIL defines a Vital Business Function (VBF) as a critical element of a business
process that is underpinned by IT. VBFs are the business functions that are the
most important across all the business processes being supported.
• A failure of a VBF is most likely classified as a Major Incident.
• The business determines what is a VBF, not IT.
• It is important to:
  • Agree on a list of the important aspects of the business on which IT should focus to ensure adequate resource allocation
  • Map the business activities to IT operational activities
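Once the business has agreed the VBF list and it has been mapped to IT services, that mapping can drive incident classification. The following is a minimal sketch under stated assumptions: the VBF names, the service names and both functions are hypothetical, and the result is only a candidate classification, since the slide says a VBF failure is "most likely" a Major Incident.

```python
# Hypothetical list agreed with the business (not decided by IT alone):
# each Vital Business Function maps to the IT services that underpin it.
VITAL_BUSINESS_FUNCTIONS = {
    "card payments": {"payment-gateway", "core-banking-db"},
    "online ordering": {"web-shop", "order-queue"},
}


def affected_vbfs(failed_services: set[str]) -> list[str]:
    """Return the VBFs that depend on at least one failed service."""
    return sorted(
        vbf
        for vbf, services in VITAL_BUSINESS_FUNCTIONS.items()
        if services & failed_services
    )


def classify(failed_services: set[str]) -> str:
    """Flag an incident as a Major Incident candidate if a VBF is hit."""
    if affected_vbfs(failed_services):
        return "Major Incident candidate"
    return "Moderate, Minor or Normal Incident"


print(classify({"payment-gateway"}))
print(classify({"intranet-wiki"}))
```

The mapping has to exist before the crisis; the lookup itself is trivial once the business and IT have agreed on it.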
Examples of Major Incidents
• Backhoe hits fibre going into campus
• Power blackout
• Hailstorm takes out infrastructure
• Flooding
• Operator mistakenly deletes database
• Data centre goes hard down
• CEO is arrested for criminal activity
All handled by the same process!
17. Review
• A significant proportion of Crisis Management involves the Major
Incident process and thus we will deal with the process in depth
Editor's Notes
Overview
Objectives
- What is crisis management
- Entities involved in crisis management
- Incidents, problems and Major incidents
- Vital Business Functions
What is crisis management?
The entities involved
Refer
UCISA Service operations - https://www.ucisa.ac.uk/representation/activities/ITIL/serviceoperation.aspx
Building a network operations center (CMOC) - http://www.n-able.com/resources/_documents/runbook_special.pdf
UCISA Business continuity management and planning - https://www.ucisa.ac.uk/~/media/Files/publications/toolkits/ist/ISTEd3_Section_B%20pdf.ashx
The IT entities
The various functions
Information Technology Infrastructure Library refer https://en.wikipedia.org/wiki/ITIL
ITIL incident and problem management
A problem exists when there is an undiagnosed underlying root cause of one or more incidents or potential incidents.
A known error exists when a problem has been identified and its causation has been determined.
A workaround is a way of preventing or resolving incidents and problems. Workarounds can be used to temporarily resolve an issue or provide guidance to an alternative resolution.
There is never a single root cause to a problem. Causation means that a problem has multiple causes and not a singular root cause. (Root cause analysis is handled later.)
Difference between an incident and a problem: Incident management needs to be solely concerned with returning the service to an operational state, and should not be over-complicated with the analysis of root causes, which slows things down. Problem management is proactive and is used to combat future incidents. It is not time dependent.
Major Incidents have a direct relationship to problem management, as the underlying triggers are usually problems. The following questions need to be answered for any problem encountered:
What is the problem?
Why is there a problem?
When did the problem happen?
How did the problem occur?
Where did the problem manifest itself?
Who has been experiencing this problem?
(watch the videos on incident and problem management)
Move from being reactive to proactive
The battery example. Batteries require continuous testing and maintenance as just leaving them unmonitored will result in degradation and failure.
What is a Major Incident?
Incidents are recorded in a standardized system which is used for documenting and tracking outages and disruptions.
Examples are outages involving core infrastructure equipment or services that affect a significant client base, such as the isolation of a company site. These are considered Major Incidents, or so-called ‘showstoppers’.
A Major Incident is defined as an incident with severe negative business consequences. This could translate to an outage of a vital business function (VBF). To be able to determine this we need a clear understanding of what constitutes a VBF and this needs to be done well in advance as it is not appropriate to establish this during a crisis.
A VBF is determined by first identifying critical business services, those services the business depends upon. These are services without which the business cannot continue. VBF candidates are those services which, if not operational, could cause the business significant loss or penalty, whether legal, financial, environmental or even reputational. These are the business services that must operate continuously or sustain only brief interruptions, typically for maintenance. Crucially, these are the services that are important to the business and not what the Information Technology department (IT) thinks ought to be important to the business.
However, the VBF needs to be isolated from the larger concept and understanding of the critical business service. Business people do not always understand which IT services underlie their business, nor can many unpack the various aspects of a larger service into smaller parts. Thus business and IT need to work together to identify critical business functions, and to uncover those aspects of the critical business function that are vital. Once the VBFs are identified, the IT services that underpin them need to be identified. This is achieved by identifying the essential product, people, and process resources required to maintain the selected VBF. Preferably use the Configuration Management Database (CMDB) to identify related Configuration Items (CIs).
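Identifying the CIs that underpin a VBF amounts to walking the dependency relationships held in the CMDB. The sketch below assumes a simplified export in which each item lists the items it directly depends on; the DEPENDS_ON data, the naming scheme and the underpinning_cis function are hypothetical and do not reflect any particular CMDB product.

```python
from collections import deque

# Hypothetical CMDB extract: each item lists the CIs it directly depends on.
DEPENDS_ON = {
    "vbf:card-payments": ["svc:payment-gateway"],
    "svc:payment-gateway": ["app:gateway-api", "db:core-banking"],
    "app:gateway-api": ["host:app-01"],
    "db:core-banking": ["host:db-01", "san:array-a"],
}


def underpinning_cis(vbf: str) -> list[str]:
    """Return every CI the VBF depends on, directly or indirectly."""
    seen: set[str] = set()
    queue = deque(DEPENDS_ON.get(vbf, []))
    while queue:
        ci = queue.popleft()
        if ci in seen:
            continue  # already visited; also guards against cycles
        seen.add(ci)
        queue.extend(DEPENDS_ON.get(ci, []))
    return sorted(seen)


for ci in underpinning_cis("vbf:card-payments"):
    print(ci)
```

A breadth-first walk is used here only because it is simple; any traversal that visits each CI once would serve the same purpose.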
There should be no silos in a crisis
The different silos of Information Technology (IT) often adopt their own processes and ways of handling a crisis, especially a Major Incident, which is an incident of severe negative consequence to the business. Across all these silos, which include the service desk, development, infrastructure, data centre operations, Information Security and IT Risk, the process for any crisis should be uniform. In reality the silos remain blinkered to the overall goal of transparently providing services to the business. They often react in their own fashion with their own methodologies.
In development a crisis is usually known as a “Showstopper!” This concept is narrated in the excellent book by G. Pascal Zachary. In the context of the book a “Showstopper!” was a bug that would impede or delay the release of Windows NT, which was Microsoft’s big new operating system play. Windows NT was seen as being critical to the aspirations of Microsoft as a business. The processes and activities described in the book are unstructured and potentially subjective. In the intervening years since the development of Windows NT, the development silo hasn’t changed much and remains largely immature. Advancements exist in actual solutions development but this hasn’t translated into mechanisms and methodologies for handling a crisis.
The big problem highlighted by this divergent approach is that developers are blissfully unaware of the crucial contribution that a structured and accurate error handling framework makes towards the quicker resolution of Major Incidents. The target for developers has always been to match the functional requirements of the requested solution and not to provide a framework to manage loss, errors or failures. Typically, adding an extensive error handling framework when developing a solution would put back the release date. This approach is unbalanced, as the crucial minutes lost to poor diagnosis during a Major Incident, once the solution is live, negate the benefit of a faster release.
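To illustrate what a structured error handling framework might give the responders, the sketch below attaches a stable error code, the failing component and diagnostic context to every error and logs it as one machine-readable record. It is an assumption-laden example rather than a recommended design: ServiceError, error_record, the debit function and the "PAY-001" code are all hypothetical.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class ServiceError(Exception):
    """Hypothetical base error carrying the context a responder needs."""

    def __init__(self, code: str, message: str, component: str, **context):
        super().__init__(message)
        self.code = code            # stable code that can be searched for
        self.component = component  # where the failure occurred
        self.context = context      # values needed for diagnosis
        self.error_id = uuid.uuid4().hex  # correlates logs and tickets


def error_record(err: ServiceError) -> dict:
    """Turn an error into a single structured record for logging."""
    return {
        "error_id": err.error_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code": err.code,
        "component": err.component,
        "message": str(err),
        "context": err.context,
    }


logger = logging.getLogger("payments")


def debit(account: str, amount: int, balance: int) -> int:
    """Illustrative business function that fails with full context."""
    if amount > balance:
        raise ServiceError(
            "PAY-001", "insufficient funds", "ledger",
            account=account, amount=amount, balance=balance,
        )
    return balance - amount


try:
    debit("ACC-42", 500, 120)
except ServiceError as err:
    logger.error(json.dumps(error_record(err)))
```

During a Major Incident, a record like this lets the tiger team search by code or component instead of reconstructing the failure from free-text messages.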
A term that developers know well, and one used in the above book that is also useful for dealing with a crisis, is "eating our own dogfood." It first came into prominence in 1988 when Microsoft's Paul Maritz coined the term in an email to a colleague challenging him to increase the internal usage of the company's products. The term exists to this day at Microsoft. Thus in a crisis the teams working on resolution need to have a clear indication of the pain being felt by customers. They need to use and experience the services that are impacted and be expert users of that environment. As an example, if you work for a bank, you need to use that bank’s accounts for your own personal transactions and not rely on a third-party bank. Only by being an embedded user of the corporate product do you have an affinity for the problems and difficulties being experienced. Information Security professionals are often big culprits at not “eating their own dogfood.” They typically work in an enterprise where desktops are Windows but prance around with Apple Macs. They are also quick to make self-righteous comments about using a supposedly more secure environment. All this displays is an inability to find resolutions quickly in a crisis.
Independently, Information Security professionals have developed their own frameworks for handling Information Security related incidents. Examples are the frameworks from NIST and SANS. These frameworks do not consider incidents beyond the limited scope of Information Security. The Information Security silo is typically the most isolated of all the entities in Information Technology, mostly because the focus on incidents of only a security nature provides a distorted view of the IT landscape. The processes should be aligned, and the recommendations for Major Incident handling in this series of articles, of which this article is one, provide a more mature and complete methodology than the ones proposed by NIST or SANS.
In a post-mortem analysis, any lessons learnt from non-security related incidents improve the overall risk profile of IT. This includes aspects of not only technology but processes and people as well, because any efficiencies that can be implemented to restore systems faster from loss, error and failure are beneficial to the security and robustness of solutions.
The worth of the Major Incident process even stretches to business continuity. Business continuity is in reality a workaround where full or alternative resources are utilized to return the business to service. One of the components of business continuity is a disaster recovery plan. The plan will always be triggered during the Major Incident process, hence the close relationship between the two, which cannot be divorced. In an optimal IT environment the teams involved in both Major Incidents and business continuity will use the same resources or alternatively have a very close working relationship. The process and communications channels established during Major Incidents also need to be leveraged for any disaster plan implementation.
Even if a disaster is not declared and the plan is not triggered, there are components of the disaster recovery plan that can be utilized as workarounds to return services to operation. This allows the business to function while IT continues to restore full normal operations.
Thus in a crisis there cannot be silos, and all resources need to align against a single Major Incident process. This is provided by ISO 20000, which is an international standard for service management. ISO 20000, against which an enterprise can measure compliance, provides a more detailed description of a Major Incident than ITIL, which defines it as just an incident with severe negative business consequences. This description is listed in ISO/IEC 20000:2011, section 8.1, as below:
The service provider shall document and agree with the client the definition of a Major Incident. Major Incidents shall be classified and managed according to a documented procedure. Top management shall be informed of Major Incidents. Top management shall ensure that a designated individual responsible for managing the Major Incident is appointed. After the agreed service has been restored, Major Incidents shall be reviewed to identify opportunities for improvement.
Fundamentally, a Major Incident not only has a severe impact but can potentially also be of such a nature that it requires complex coordination of resources to achieve resolution. These resources are typically coordinated in tiger teams, as has been mentioned in a previous article.
When a Major Incident happens, all IT silos need to be singing from the same hymn sheet, as the process requires interaction and resources across all areas of IT.
ISO 20000 Refer https://en.wikipedia.org/wiki/ISO/IEC_20000
Refer Appendix – Vital Business Function Truths
Examples of Major Incidents
Review
A significant proportion of Crisis Management involves the Major Incident process and thus we will deal with the process in depth.