Problem management foundation - Overview

Objectives
An overview of crisis management
• What is crisis management
• Entities involved in crisis management
• Incidents, problems and Major incidents (in an ITIL context)
• Vital Business Functions

What is crisis management?
• Structured approach to handling a crisis
• Focus on the process of Major Incidents
• Dealing with a Major Incident
• Engineering to reduce the impact of a Major Incident
• Continuous improvement
• What about disasters?
• The worst case scenario for a Major Incident
• Trigger for implementation of business continuity (business continuity is a
subsection of crisis management)
• For the purposes of this course the focus is on Information and Communications
Technology (ICT) and Data Centres (DCs)
• These are generic principles that can be applied universally

Entities involved in crisis management
Service desk:
• A Service Desk (SD) is a primary Information Technology (IT) service. It
is part of the discipline of IT service management (ITSM) as defined by
the Information Technology Infrastructure Library (ITIL). It is intended
to provide a Single Point of Contact (SPOC) to meet the
communication needs of both customers/users and IT employees.
Crisis Management Operations Centre (CMOC):
• A Crisis Management Operations Centre (CMOC) is a central location
from which administrators monitor, manage and control the crisis. The
overall function is to maintain optimal operations across a variety of
platforms, mediums and communications channels such as servers,
storage, networks and data centres.
Business Continuity:
• Business Continuity is defined as the capability of the organisation to
continue delivery of products or services at acceptable predefined
levels following a disruptive incident (source: ISO 22301:2012),
including natural, physical and emergency events (such as terrorism)
as well as financial, regulatory and reputational events.

[Diagram of the entities involved: data centre, cloud, Internet, service desk, Crisis Management Operations Centre (CMOC), war rooms, control rooms, Network Operations Centre (NOC), client and services, with reactive and proactive sides]

Various functions
• A SD is typically a single point of contact for clients where the
majority of the load is inbound and reactive.
• The CMOC is proactive by nature, even in the context of a
crisis. A CMOC does not usually interact directly with the primary
client/user.
• A Major Incident being triggered from the SD is not optimal as the
majority of Major Incidents should be triggered from the CMOC.
Ideally a client should not trigger a Major Incident or crisis.
• The disaster recovery or business continuity plan is always triggered
from the Major Incident process and handled and communicated in
a standardised manner.

ITIL’s Incident definition
• An incident is an unplanned disruption or degradation of service.
• A problem is a cause of one or more incidents.
• A Major Incident is an incident with severe negative consequences.
• Incidents are time dependent
• Problems are not necessarily time dependent
• A Major Incident needs to be analysed in the same way as ITIL treats
a Problem
• Symptom – what you see. These need to be recorded.
• Causes – what made it happen. These need to be determined.
• Resolution – how it was fixed. The service is back to normal.
• Associated identified risks need to be mitigated
(countermeasures).
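
The analysis steps above (symptoms, causes, resolution, countermeasures) can be sketched as a simple record. This is only an illustration and not an ITIL-defined schema: the class name `MajorIncidentRecord`, its field names and the sample data are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MajorIncidentRecord:
    """Hypothetical record for analysing a Major Incident like a Problem."""

    title: str
    symptoms: list[str] = field(default_factory=list)         # what you see
    causes: list[str] = field(default_factory=list)           # what made it happen
    resolution: str = ""                                       # how it was fixed
    countermeasures: list[str] = field(default_factory=list)  # risk mitigations

    def missing_sections(self) -> list[str]:
        """Return the names of the analysis sections that are still empty."""
        sections = {
            "symptoms": self.symptoms,
            "causes": self.causes,
            "resolution": self.resolution,
            "countermeasures": self.countermeasures,
        }
        return [name for name, value in sections.items() if not value]


# Illustrative data only: service is restored, but the analysis is unfinished.
record = MajorIncidentRecord(
    title="Data centre power failure",
    symptoms=["All hosted services unreachable"],
    resolution="Power restored and services restarted",
)
print(record.missing_sections())  # ['causes', 'countermeasures']
```

A record like this makes it visible that restoring the service does not close the analysis: the causes and countermeasures still have to be filled in.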

ITIL’s Problem definition
• A problem exists when there is an undiagnosed underlying root
cause of one or more incidents or potential incidents.
• A known error exists when a problem has been identified and its
causation has been determined.
• A workaround is a way of preventing or resolving incidents and
problems. Workarounds can be used to temporarily resolve an
issue or provide guidance to an alternative resolution.
• There is never a single root cause to a problem. There is
causation, which means a problem has multiple causes rather than a
singular root cause. (Root cause analysis is handled later.)
derived from ITIL

ITIL’s Problem definition (cont.)
The following questions need to be answered for any Problem
encountered:
• What is the problem?
• Why is there a problem?
• When did the problem happen?
• How did the problem occur?
• Where did the problem manifest itself?
• Who has been experiencing this problem?
derived from ITIL
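
The six questions for a Problem can be treated as a checklist. The sketch below is a hedged illustration: the function name `unanswered_questions`, the short keys and the sample answers are assumptions for this example and are not ITIL terminology.

```python
# Short keys for the six questions: what, why, when, how, where, who.
PROBLEM_QUESTIONS = ("what", "why", "when", "how", "where", "who")


def unanswered_questions(answers: dict[str, str]) -> list[str]:
    """Return the questions that do not yet have a non-blank answer."""
    return [q for q in PROBLEM_QUESTIONS if not answers.get(q, "").strip()]


# Illustrative, partially completed Problem record.
answers = {
    "what": "Email service unavailable",
    "when": "Monday 08:15",
    "where": "Head office",
    "who": "",
}
print(unanswered_questions(answers))  # ['why', 'how', 'who']
```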

Differentiating Incidents & Problems
• Incident management needs to be solely concerned with returning the service to
an operational state, and should not be over-complicated with the analysis of root
causes.
• Problem management is proactive and is used to combat future incidents. It is not
time dependent.
• Major Incidents have a direct relationship to Problem Management, as the
underlying triggers of the Major Incident are usually Problems.

Move from being reactive to proactive.
Don’t be a loadshedding statistic.

What is a Major Incident?
• An incident is any event that is not part of the standard operation of a
service and that causes an interruption or a reduction in the quality of
that service.
• A Major Incident is an unplanned or temporary interruption of service
with severe negative consequences.
• Any service outage that does not qualify as a Major Incident should be
categorised as a Moderate, Minor or Normal Incident.
• Major Incident reports are escalated to the Problem Manager for
quality assurance. (Problem Managers are part of the Alpha Tiger
team; refer to chapter 5.)
derived from ITIL

Major Incidents (cont.)
• Dealing with these processes is crucial as they are potential
showstoppers for the business.
• Major Incidents can have a severe business impact such as:
• service, system or infrastructure component not functioning
adequately to enable business processes
• total loss of service, system or infrastructure component
• Major Incidents could also be those which do not entirely disrupt the
use of the service, system or infrastructure component such as:
• continuous slow response
• general degradation of service

ISO 20000 Major Incident process
ITIL provides a definition but does not define a process for
managing Major Incidents.
ISO 20000 clarifies the process around a Major Incident as follows:
The service provider shall document and agree with the client the
definition of a Major Incident. Major Incidents shall be classified and
managed according to a documented procedure. Top management shall
be informed of Major Incidents. Top management shall ensure that a
designated individual responsible for managing the Major Incident is
appointed. After the agreed service has been restored, Major Incidents
shall be reviewed to identify opportunities for improvement. (ISO/IEC
20000:2011, section 8.1)

Vital Business Functions
• ITIL defines a Vital Business Function (VBF) as a critical element of a business
process that is underpinned by IT. VBFs are the business functions that are the
most important across all the business processes being supported.
• A failure of a VBF is most likely classified as a Major Incident.
• The business determines what is a VBF, not IT.
• It is important to:
• Agree on a list of the important aspects of business on which IT should focus
to ensure adequate resource allocation
• Map the business activities to IT operational activities

Examples of Major Incidents
• Backhoe hits fibre going into campus
• Power blackout
• Hailstorm takes out infrastructure
• Flooding
• Operator mistakenly deletes database
• Data centre goes hard down
• CEO is arrested for criminal activity
All handled by the same process!

Review
• A significant proportion of Crisis Management involves the Major
Incident process, so we will deal with that process in depth.
