Every cloud has a silver lining

EVERY CLOUD HAS
A SILVER LINING
A WHITEPAPER ON ITSM INCIDENT MANAGEMENT PROCESS FOR CLOUD
ENVIRONMENT
Cloud Computing has changed the dynamics of the IT Services business, but organizations have
not been able to foresee the changes required in ITSM Processes and Procedures to adopt
Cloud Computing. In this publication, I have tried to explore the procedure and process
level changes needed in the ITIL Incident Management process for it to work smoothly in a Cloud
Environment.
Published by: Aditya Dashora

© Conceptualized and Published by Aditya Dashora
About the Author
Aditya Dashora, a senior consultant from Infosys Limited, is an IT enthusiast with around 9 years of experience in delivering IT Service Management consulting projects for large enterprises across the globe.
Aditya is quite passionate about helping CIOs and CTOs in improving their IT Strategy to meet the current and future demands. Also, he is instrumental in exploring and defining new ways of working for the organizations by leveraging technology. Aditya is based out of Bangalore, India.
Contact Information:
adydashora@gmail.com
https://www.linkedin.com/in/adityadashora

CONTENTS
1. Executive Summary
2. A sneak peek into the world of “Cloud”
3. Incident Management process for Cloud
4. Procedural Level Changes
5. Key Performance Indicators
6. Key Policies
7. Technology Considerations
References

1. EXECUTIVE SUMMARY
With its rapidly growing adoption rate, Cloud Computing is widely expected to change the rules of the game for successful IT Service Providers across the world within the next 5-6 years. Firms doing business in the IT Infrastructure space have started feeling nervous about the growing acceptance of IaaS and PaaS services provided by Cloud Vendors. IT Service Management, the instrument used by IT Service Providers and IT Support Organizations to address the challenges of delivering IT Services to customers, and often treated as a style statement within the IT Service Industry, is going to play a vital role in the Cloud IT Shop. However, ITSM concepts will require some restructuring and renovation in order to support a Cloud based IT Shop.
In this article, I have tried to explain the operational level changes needed in a traditional Incident Management process to ensure accurate and speedy reaction to the Incidents/Issues/Events in a Cloud Environment.

2. A SNEAK PEEK INTO THE WORLD OF “CLOUD”
2.1. CLOUD ENVIRONMENT OVERVIEW
NIST definition of Cloud says that Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.
The three Cloud service models are 1) Software as a Service, 2) Platform as a Service and 3) Infrastructure as a Service. Similarly, there are four Cloud deployment models: Private Cloud, Public Cloud, Community Cloud and Hybrid Cloud.
The five essential characteristics of Cloud Computing defined by NIST are: 1) On-demand self-service, 2) Broad network access, 3) Resource Pooling, 4) Rapid Elasticity and 5) Measured Service.
Traditionally in an IT organization, the IT support function (including managed service providers) is responsible for procurement, implementation, support and maintenance of IT Services and components such as Critical Business Applications, Enterprise/Corporate Applications, Messaging Services, Databases, Batch Processing Services, Servers, Middleware, Storage and Back-up infrastructure, Network, and IT Security Management. In a cloud implementation, some of these IT components are provided and supported by a Cloud Service Provider on a pay-per-use basis. In the case of a Private Cloud, the Technology Management function becomes the Cloud Provider, while in a Public Cloud organizations avail services from providers like AWS, Rackspace, Google Compute, MS Azure, Salesforce etc.
Focus of any Cloud implementation is to reduce cost of IT and ensure high availability and in order to achieve that, it is important to identify and analyze “IT Services” & “Critical Business Applications” and define a Cloud implementation strategy.
Some organizations choose to retain some of their critical IT Service components on-premise and move the remainder to the cloud. For example, a manufacturing company can choose to retain its “Order Management System” applications and supporting infrastructure on-premise and offload supporting services like Collaboration Portal, Messaging, CRM, HR Portal etc. to the cloud. This setup is commonly known as Hybrid Cloud or IT Mix.
A common Cloud adoption approach is to move the entire non-production environment into the cloud, which can yield significant cost savings. Applications which require unpredictable capacity during peak load hours are also good candidates for cloud services.

2.2. HYBRID CLOUD – THE REALITY OF THE FUTURE
Hybrid Cloud Environment is said to be the reality of the future of Cloud Computing. In the cloud adoption journey, enterprises will on one hand transform their data centers into a private cloud and, on the other, engage multiple Cloud Providers to enjoy the benefits of Public Cloud. For this white paper, I have considered the case of a big enterprise with a hybrid cloud environment that uses SaaS and IaaS from Public Cloud along with its Private Cloud. In the next section, I have elaborated the changes required in the Incident Management process to manage a hybrid cloud environment.
The reason for choosing this scenario is that the majority of organizations will opt for this path. Organizations have already invested heavily in their IT environments and own IT Assets worth millions of dollars. Also, many organizations would choose to retain some of the IT Services related to their critical business processes. The Hybrid Cloud deployment model provides enough control, governance and flexibility for enterprises to enjoy the best of both worlds.

3. INCIDENT MANAGEMENT PROCESS FOR CLOUD
3.1. INCIDENT MANAGEMENT PROCESS OVERVIEW
ITIL defines an Incident as “An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.”
“Incident Management (referred to as IM hereafter) is the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools.”
These definitions remain relevant in a cloud environment. The only update is that the definition can be made more specific to cover all the dependencies of IT Services: “An unplanned interruption to an IT service or reduction in the quality of an IT service, or service degradation/failure of a configuration item or any enabler technology, e.g. orchestration, hypervisor, self-service module, monitoring platform.”
Another important aspect to keep in mind is the dynamic nature of a cloud environment; because of it, many “Incidents” would end up becoming minor Change Requests. So, one needs to be specific about what qualifies as an Incident in a cloud environment.
INCIDENT MANAGEMENT (IM) HIGH LEVEL PROCESS DIAGRAM:
Figure 1
This process has been working well for IT support functions and would be instrumental in a cloud environment as well. Some activities may need more emphasis than others. Also, cloud includes a significant amount of automation and self-service, and therefore some of the procedures or activities would be performed automatically or would overlap with subsequent activities (red dotted circles).

SUGGESTED INCIDENT MANAGEMENT PROCESS FOR CLOUD:
Figure 2

4. PROCEDURAL LEVEL CHANGES
4.1. INCIDENT IDENTIFICATION
Incident Identification is performed in two ways: 1) identification through the Event Management platform and 2) User Reported Incidents. In a cloud environment, there would be a higher degree of dependency on the Event Monitoring systems, and therefore a mature Event Management process is a prerequisite. Incidents related to enabler technologies like Hypervisors, Orchestration Engine, Load Balancers, Network or Domain Controllers will be identified by the monitoring tools based on pre-defined thresholds. IT Security related Incidents will be identified by the IT Security Monitoring tools, which will share the information with the central Event Management system or Manager of Managers (MoM) layer. It is important to understand that in Public Cloud there is a potential risk of data leakage or security breach. Therefore, an IT organization must pay close attention to the security risk measures taken by Public Cloud vendors and should try to establish real-time monitoring of security issues.
For all user reported Incidents, it is important to determine the single point of failure using the network topology or the CMDB (Configuration Management Database). Traditionally, this activity was performed by L1/L2 support teams, but in a cloud environment the Service Desk or first point of contact should be able to do so in most cases. A cloud implementation brings a good amount of automation and transparency, which enables support staff to determine the single point of failure.
A mature CMDB can provide CI dependency details, but in a cloud environment it might not be relevant to identify faulty CIs because of the dynamic nature of the technology. Rather, it would be more meaningful to trace the failed service, its dependencies on other services and its failover plans. Also, the orchestration engine and self-service module can be configured to display ongoing major incidents to users, which can prevent a queue of duplicate incidents.
In the case of Incidents related to public cloud, most of the issues will be identified by the Cloud Vendor and the remainder would be reported by business users/end users. Ideally, the orchestration engine and service management platform should be capable of fetching real-time data from the public cloud vendor and displaying ongoing outages/Incidents. That would suppress the related Incidents. Apart from that, issues related to Network Connectivity, Application Functionality (SaaS) and Application Deployment (PaaS) would be identified and reported by users as usual.
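The suppression idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `Outage` record and the `find_parent_outage` helper are hypothetical names, and a real integration would read the outage list from the vendor's status feed rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """An ongoing outage published by a cloud vendor (hypothetical shape)."""
    provider: str
    service: str


def find_parent_outage(provider, service, ongoing_outages):
    """Return the ongoing outage that matches a new report, or None."""
    for outage in ongoing_outages:
        if outage.provider == provider and outage.service == service:
            return outage
    return None


# Assumed sample data standing in for a vendor status feed.
ongoing = [Outage("AWS", "Storage Gateway")]

# A report matching a known outage is linked to it instead of being queued.
match = find_parent_outage("AWS", "Storage Gateway", ongoing)
# A report with no matching outage is logged as a new Incident.
no_match = find_parent_outage("Salesforce", "CRM", ongoing)
```

Matching on provider and service only is a deliberate simplification; a production rule would likely also consider region and time window.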
In the Hybrid Cloud model, issues related to the Storage Gateway or connector between on-premise infrastructure and Public Cloud will be identified and logged by both the cloud service consumer and the vendor (e.g. AWS). However, for better governance, the ownership of the ticket must remain with Service Desk/L1/L2 support and not with the Cloud Vendor. The same policy would apply to tickets created by monitoring tools at the vendor.
4.2. INCIDENT LOGGING
Incident Logging is the second activity in the IM lifecycle and it is as relevant in a cloud environment as in a traditional IT setup. As mentioned in the ITIL v3 Service Operation book, “All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained.”
Popular Service Management tools like ServiceNow, Remedy, HPSM etc. provide multiple fields for logging an Incident ticket, and all of them remain applicable in a cloud environment. In addition, a few extra fields would be required to support accurate classification and uniqueness of an Incident ticket in a cloud environment.
For example:
- A field identifying the cloud provider would be very helpful in reducing overall ticket handling time. It can be a dropdown with values like private cloud, public cloud etc. or specifically NJ Datacenter, Singapore Datacenter, AWS, Rackspace etc.
- A field for associated hardware location/country can be helpful in case of security issues. (tip: every country has different laws for data security)
- A field for affected Services or business processes would be helpful in communication
- A check-box for hypervisor related issues
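As a rough sketch, the additional fields listed above could be modelled as follows. The class and attribute names are illustrative assumptions and are not taken from ServiceNow, Remedy, HPSM or any other tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CloudIncidentTicket:
    """Hypothetical ticket record carrying the extra cloud-specific fields."""
    summary: str
    cloud_provider: str        # e.g. "AWS", "Rackspace", "NJ Datacenter"
    hardware_country: str      # location of the associated hardware
    affected_services: List[str] = field(default_factory=list)
    hypervisor_related: bool = False   # the check-box from the list above


ticket = CloudIncidentTicket(
    summary="CRM portal not accessible",
    cloud_provider="AWS",
    hardware_country="Singapore",
    affected_services=["CRM"],
)
```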
In the case of a hardware failure that can impact multiple services and thousands of users, Incident Logging becomes a crucial activity for triggering the resolution and recovery work. Such a hardware failure must be treated as a Sev-1 or Critical Incident, and all dependent service owners/business process owners must be notified in real time. Therefore, the Incident Ticket is expected to provide information about all the upstream and downstream dependencies of the failed CI.
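One way to derive the dependent items of a failed CI is a breadth-first walk over a dependency map. The sketch below assumes a simple in-memory dictionary with made-up CI and service names; in practice the relationships would come from the CMDB or service map.

```python
from collections import deque

# Hypothetical dependency map: each key is depended upon by the listed items.
DEPENDENTS = {
    "hypervisor-01": ["vm-exchange", "vm-crm"],
    "vm-exchange": ["Messaging Service"],
    "vm-crm": ["CRM Service"],
}


def impacted_by(failed_ci, dependents):
    """Breadth-first walk returning everything that depends on failed_ci."""
    seen = {failed_ci}
    queue = deque([failed_ci])
    order = []
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, []):
            if item not in seen:
                seen.add(item)
                order.append(item)
                queue.append(item)
    return order


impacted = impacted_by("hypervisor-01", DEPENDENTS)
```

The owners of every item in `impacted` are the stakeholders to notify when the ticket is logged. The `seen` set keeps the walk from looping if the map contains cycles.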
In the case of incidents related to Public Cloud, the information flow from the vendor’s monitoring and ticketing tools to the host systems is essential, and therefore automation and integration tools will play a critical role.
4.3. INCIDENT CATEGORIZATION
The Incident Categorization activity is performed by the Service Desk staff/IT Support Staff to ensure that appropriate categorization codes are assigned to each Incident. With the help of automation, Event Monitoring tools can also populate Incident Categorization codes while creating an Incident ticket from an Event.
In a cloud environment, the Incident Categorization activity overlaps with Incident Logging; even so, Incident Categorization metadata must be designed to yield meaningful information for rapid routing of Incidents, Problem Identification and Supplier Management.
A traditional multilevel categorization example:
Category | Tier-1   | Tier-2    | Tier-3
Incident | Hardware | Server    | Memory Board
Incident | Software | Microsoft | Exchange
Table 1
Another popular approach is to categorize by CI and by Service. Example:

CI Name: NN150B12Win2k8A01
Service: Collaboration Service
In a cloud environment, we need to ensure that Incident Categorization provides details on service provider, service, name of the application/service/server, criticality index etc. For example:
AWS -> Infra -> ABCAWSUSEC001 -> Criticality Index: 1 -> Not Accessible
Salesforce -> Application -> CRM -> Criticality Index: 2 -> Functionality Issue
Private Cloud -> Application -> Exchange Server -> Criticality Index: 2 -> Slow Response
Private Cloud -> Intranet -> Connectivity -> Not Accessible
ATT -> Internet -> Connectivity
AWS -> Security -> Unauthorized Access
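Category paths of this kind can be assembled and read programmatically. The following is a minimal sketch; `build_category` and `routing_key` are hypothetical helper names, and the choice of provider plus area as the routing key is an assumption made for illustration.

```python
def build_category(provider, area, item=None, criticality=None, symptom=None):
    """Join the supplied levels into a single category path."""
    parts = [provider, area]
    if item:
        parts.append(item)
    if criticality is not None:
        parts.append(f"Criticality Index: {criticality}")
    if symptom:
        parts.append(symptom)
    return " -> ".join(parts)


def routing_key(category):
    """Return (provider, area), the first two levels of a category path."""
    provider, area = category.split(" -> ")[:2]
    return provider, area


full = build_category("AWS", "Infra", "ABCAWSUSEC001", 1, "Not Accessible")
short = build_category("ATT", "Internet", "Connectivity")
```

Keeping the provider as the first level means a ticket can be routed to the right vendor queue before the deeper levels are filled in.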

4.4. INCIDENT PRIORITIZATION
Incident Prioritization is one of the most critical aspects of not only IM process but the whole lifecycle of IT Services. Incident Prioritization means allocating appropriate priority to an Incident based on pre-defined criteria. Allocated priority codes will help support staff to give appropriate attention to the Incident. Most of the IT Outsourcing Contracts are driven by the SLAs which are defined based on Incident Priority Guidelines.
In a cloud environment, Incident Prioritization becomes all the more important because a) there are multiple service providers who may have to work towards Incident Resolution, b) a single hardware or hypervisor failure can affect multiple users and services and c) due to the heavy dependency on the Network (WAN & LAN), any network related issue must be treated as high priority.
Typically, the priority of an Incident is determined by two factors, “Impact” and “Urgency”, where Impact is how much damage is caused by an Incident and Urgency is how quickly it needs to be resolved. Some organizations use a questionnaire to determine the impact and urgency. In the case of user reported Incidents, the user can be asked to provide inputs for determining the urgency.
Incident Priority data or logs are analyzed further for defining and negotiating SLAs (Service Level Agreements), OLAs (Operational Level Agreements) and UCs (Underpinning Contracts). Therefore, in a cloud environment, where there is significant dependency on vendors/service providers, proper Incident Prioritization would certainly play a major role in SLA Definition and Negotiation activities. It will also help in determining good candidates (Apps or Infra) for migration to public cloud based on impact/urgency analysis.
An example of Incident Prioritization in Cloud Environment:
Urgency Determination Questionnaire (example):
- Revenue Generating Service/Application?
- Brand Exposure?
- Safety Exposure?
- Business Hours?
- CIA Rating of the Service/Application?
- VIP User Profile?
- Orchestration Engine related?
Impact Determination Questionnaire (example):
- Number of Instances/virtual devices?
- Number of Services/Applications?
- Number of Geographical locations?
- BCP Available?
- Network Issue?
- Number of Users?
Priority matrix (rows: Impact, columns: Urgency):
Impact               | Urgency: High | Urgency: Medium | Urgency: Low
Extensive/Widespread | Critical      | High            | Medium
Significant/Large    | High          | High            | Medium
Moderate/Medium      | Medium        | Medium          | Medium
Localized/Minor      | Medium        | Low             | Low
Table 2
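The matrix in Table 2 can be expressed as a simple lookup. This sketch copies the priority values from the table; the dictionary layout and the `priority` function name are illustrative choices, and how questionnaire answers translate into an impact or urgency level is left out because the paper does not define it.

```python
# Impact level -> urgency level -> priority, as given in Table 2.
PRIORITY_MATRIX = {
    "Extensive/Widespread": {"High": "Critical", "Medium": "High", "Low": "Medium"},
    "Significant/Large": {"High": "High", "Medium": "High", "Low": "Medium"},
    "Moderate/Medium": {"High": "Medium", "Medium": "Medium", "Low": "Medium"},
    "Localized/Minor": {"High": "Medium", "Medium": "Low", "Low": "Low"},
}


def priority(impact, urgency):
    """Look up the priority for an impact/urgency pair."""
    return PRIORITY_MATRIX[impact][urgency]


widespread_high = priority("Extensive/Widespread", "High")
minor_medium = priority("Localized/Minor", "Medium")
```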

4.5. INCIDENT ESCALATION
In traditional IM process, there are two types of Incident Escalation procedures: 1) Functional Escalation & 2) Hierarchical Escalation. Functional Escalation defines inter-groups/teams routing model. Example: Service Desk to Wintel Support; Wintel to DBA; DBA to Network; Network to Third Party and so on. On the other hand, Hierarchical Escalation provides a mechanism to involve senior management or leadership team in case of a Sev-1 incident or any challenging situation like ambiguity on Incident Ownership, involving third party on warranty issues, customer dissatisfaction etc.
In a cloud environment, there are multiple parties involved or associated with a Service, and therefore any Service degradation (Incident) would require all the stakeholders to come together in an online forum. For that purpose, Functional and Hierarchical Escalations should run hand in hand. The only difference is that the business might not be interested in knowing the details of Incidents, but would be interested in knowing the impact on their work. So, the communication has to be designed in such a way that it sends relevant details to each stakeholder.
In the suggested Incident Escalation model for cloud, an Incident should be assigned to a support group and, at the same time, other groups that have any relationship with the Incident should also be notified. Later on, after the Incident resolution activity, one of the affected support groups may be engaged to give a sign-off. Social Networking features in the Service Management tool can play a role in this kind of escalation. In the case of vendor related Incidents, the vendor must be informed at the beginning of the Incident lifecycle. Once the Incident is assigned to the vendor, a parallel communication must be sent to the Problem Manager, IT Manager, Vendor Manager and Account Manager (vendor).
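The parallel notification described above can be sketched as a function that builds the recipient list for an Incident. The function name and the example group names are hypothetical; the four vendor-related roles are the ones named in the text.

```python
# Roles to copy once an Incident is assigned to a vendor, as named in the text.
VENDOR_ROLES = [
    "Problem Manager",
    "IT Manager",
    "Vendor Manager",
    "Account Manager (vendor)",
]


def notification_list(assigned_group, related_groups, vendor_involved):
    """Return the assigned group, related groups and, if needed, vendor roles."""
    recipients = [assigned_group]
    for group in related_groups:
        if group not in recipients:   # avoid notifying a group twice
            recipients.append(group)
    if vendor_involved:
        recipients.extend(VENDOR_ROLES)
    return recipients


internal = notification_list("Wintel Support", ["DBA", "Network"], False)
with_vendor = notification_list("Cloud Vendor", ["Network"], True)
```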
SLA BREACH NOTIFICATIONS
In the case of an SLA breach warning, a notification must be sent to the group manager, IT manager, IT Director etc. In an actual SLA breach situation, apart from the IT leadership team, stakeholders from the business and finance must be involved. Some vendors have service based SLAs (non-negotiable), and in that case clear expectations must be set with the business. During the Service Design phase, the business should get the option to choose components from the catalog based on SLA vs. Cost analysis. Example:
Server Type           | Baseline SLA (turn-around) | Hourly Downtime Cost (post the Baseline SLA)
HPC Windows (Private) | 2 Hours                    | $7000
HPC Unix (Private)    | 2 Hours                    | $6000
HPC Windows (Public)  | Best Efforts               | $1500
HPC Unix (Public)     | Best Efforts               | $1100
Table 3
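To make the SLA vs. Cost comparison concrete, the sketch below reads the third column of Table 3 as a cost charged for each hour of downtime beyond the baseline SLA. That reading is an interpretation of the column header, and the function name is hypothetical. The "Best Efforts" rows have no numeric baseline, so they are left out rather than guessed at.

```python
def downtime_cost(outage_hours, baseline_hours, hourly_cost):
    """Cost of the hours that exceed the baseline SLA (assumed reading)."""
    billable_hours = max(0, outage_hours - baseline_hours)
    return billable_hours * hourly_cost


# HPC Windows (Private): 2 hour baseline, $7000 per hour after that.
five_hour_outage = downtime_cost(5, 2, 7000)
# An outage resolved within the baseline carries no downtime cost.
one_hour_outage = downtime_cost(1, 2, 7000)
```

Under this reading, a five hour outage on HPC Windows (Private) costs three billable hours, i.e. $21000.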

ROLE OF SERVICE DESK
In a traditional enterprise, Service Desks are responsible for determining Incident Category followed by performing initial investigation based on knowledge base or Runbook and finally escalating the ticket to the appropriate support group. Considering the complexity and nature of the Incidents in cloud environment, there are chances that traditional service desk function might not be able to do initial diagnosis and they may end up routing it to wrong support group. Hence it becomes important to upgrade the traditional service desk by marrying it to monitoring teams or command center. Combining two teams will form a function known as integrated command center (ICC) or IT Operations Center (ITOC), which will have good technical competency to perform initial investigation and escalation in cloud environment.
We have to keep in mind that the majority of common Incidents related to availability, accessibility, device failure etc. will be eliminated in a cloud environment because of the high performance compute design. Hence, it makes absolute sense to combine Service Desk and Command Center and enhance productivity.
4.6. INVESTIGATION, RESOLUTION AND RECOVERY
In traditional IM lifecycle, Incident Investigation & Incident Resolution are defined as sequential activities. In cloud environment, we should go a step further and combine them for faster turnaround. It would be a logical step because in the previous section, I proposed to merge Service Desk and Monitoring teams for better initial investigation and diagnosis. Therefore, unwanted Incident hopping (escalation to wrong groups) should be eliminated and resolution and recovery should come right after the escalation.
Incident Resolution in cloud should be faster and better than in a traditional IT environment. There must be a higher degree of proactive detection, fault tolerance, redundancy to avoid downtime, auto-correction and intelligent systems to analyze and detect Incidents proactively.
In a white paper published by VMware on “Proactive Incident and Problem Management”, three Cloud Capability Levels are defined: 1) Reactive, 2) Proactive and 3) Innovative, where Reactive is the lowest maturity level for a cloud provider and Innovative is the highest. The Reactive model is the natural approach, but it is not sustainable in a cloud environment for various reasons including virtualization, orchestration and lack of clarity on assets/CIs/managed objects. So, it becomes important to develop intelligent systems that analyze event monitoring data, historical ticket data, maintenance tasks, business growth patterns, the IT needs of a business process and other IT drivers in order to move from Reactive capability to Innovative capability.
Incidents in a cloud environment would require highly skilled professionals but at the same time, cloud environment provides enough redundancy to avoid/reduce downtime. So initially there might be some limitations in establishing SOP/Run-book (Standard Operating Procedure) based approach but in a longer run, cloud can provide enough opportunities to reduce Incidents and automate resolution tasks. In a cloud environment, IT support staff should work towards ensuring that repetitive Incidents do not occur in the environment.
Once the Incident is resolved, it can be owned by support team itself or passed to other group for validation/sign-off. In case of user reported Incidents, a user sign-off must be taken.

4.7. INCIDENT CLOSURE
Once the Incident is resolved, it enters into the ultimate activity of its lifecycle which is Incident Closure. Incident Closure is an important activity for ensuring that required solution has been provided and implemented.
In the Incident Escalation section, I mentioned the Incident or Service Failure notification sent to all the stakeholders. Likewise, before closing an Incident, the system needs to ensure that all the stakeholders have given their sign-off on it. This task can be automated by making it a time-bound forced closure. In the case of public cloud, the closure must be performed only after obtaining the required confirmation from the Cloud Providers.
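The closure rule can be sketched as a small decision function. The three-day window and the `can_close` name are assumptions made for illustration; the paper does not specify how long the forced-closure window should be.

```python
from datetime import datetime, timedelta


def can_close(resolved_at, now, signoffs, provider_confirmed=True,
              window=timedelta(days=3)):
    """Decide whether a resolved Incident may be closed.

    signoffs maps each stakeholder to True once they have signed off.
    provider_confirmed should be False while a public cloud provider's
    confirmation is still outstanding.
    """
    if not provider_confirmed:
        return False              # never close without provider confirmation
    if all(signoffs.values()):
        return True               # every stakeholder has signed off
    return now - resolved_at >= window   # time-bound forced closure


resolved = datetime(2024, 1, 1, 9, 0)
pending = {"business": True, "it": False}

early = can_close(resolved, datetime(2024, 1, 2, 9, 0), pending)
forced = can_close(resolved, datetime(2024, 1, 5, 9, 0), pending)
```

Checking the provider confirmation first means the time-bound closure can never override the public cloud rule.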
Most of the Service Management tools provide Closure Categorization Codes (Similar to Incident Categorization) and it would be helpful in Cloud Environment to use those codes properly.
If solution provided by support groups doesn’t completely solve the issue, then stakeholders or end-user may choose to Re-open the incident. Any re-opened Incident would trigger hierarchical escalation and involve senior management into the lifecycle for better governance.

5. KEY PERFORMANCE INDICATORS
Key Performance Indicators (KPIs) are also known as process performance measurement criteria. As the name indicates, the purpose of KPIs is to evaluate process performance against process goals and objectives. Some mature organizations have tightly coupled KPIs with Business CSFs (Critical Success Factors).
As illustrated in the ITIL v3 Guidelines, “A KPI refers to a specific, agreed level of performance that will be used to measure the effectiveness of an organization or process.”
The standard way to define KPIs is known as the GQM approach, where G is Goals, Q is Questions and M is Metrics. The goal is very clear here: to ensure that Incidents are resolved at the earliest. The questions we may ask are: “What does it take to achieve rapid incident resolution?”; “What can cause delay?”; “What are the dependencies?”
When we start thinking on these lines, we come across multiple KPIs related to Incident Management process. Most of the KPIs are already being used in the industry. In this section, we will try to explore the needs to revise the existing KPIs for Cloud Environment.
Let’s take a look at some of the KPIs:
- Percentage Reduction in number of Incidents (month-on-month)
- Percentage Reduction in Weekly Incident Backlog (weekly)
- Percentage Increment in SLA compliance (daily/weekly)
- Percentage reduction in incorrectly assigned Incidents (weekly/monthly)
In case of Cloud, we need to consider the performance of the “vendor” or partner. Therefore there is a need to have additional KPIs to ensure required coverage.
Some examples of additional KPIs for Cloud Incident Management Process:
- Ratio of auto generated tickets and user reported tickets
- Percentage reduction in issues escalated to Cloud Service Provider
- Percentage reduction in incorrect escalations to Cloud Service Provider
- Percentage reduction in the Incident Diagnosis time
- Percentage reduction in incorrectly categorized incidents
- Percentage reduction in number of major Incidents
- Percentage reduction in average turn-around time from vendor
- Increase in proactive detection rate
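Most of the KPIs listed above reduce to two small calculations: a percentage reduction between two periods and a ratio between two counts. The sketch below uses hypothetical function names and made-up sample counts.

```python
def percentage_reduction(previous, current):
    """Percentage by which a count fell from the previous period."""
    if previous == 0:
        raise ValueError("previous period count must be non-zero")
    return (previous - current) / previous * 100


def auto_to_user_ratio(auto_generated, user_reported):
    """Ratio of auto generated tickets to user reported tickets."""
    if user_reported == 0:
        raise ValueError("user reported count must be non-zero")
    return auto_generated / user_reported


# Made-up sample: 200 vendor escalations last month, 150 this month.
escalation_reduction = percentage_reduction(200, 150)
# Made-up sample: 300 auto generated tickets against 100 user reported.
ticket_ratio = auto_to_user_ratio(300, 100)
```

A negative result from `percentage_reduction` indicates that the count increased.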

6. KEY POLICIES
Ticket Ownership Policy
Ticket ownership should always remain with the cloud consumer. Having said that, we must account for certain situations that are controlled internally by the cloud vendor, in which the cloud consumer has no role to play. For those instances, we can consider joint ownership and ensure that the cloud consumer gets real-time updates on the issues.
Escalation Policy
Any escalation to the cloud vendor must be approved or supervised by the L3 support team or the Incident Manager. The team must ensure that incorrect escalations to the cloud vendor are kept to a minimum. For issues related to internal infrastructure or applications, the escalation guidelines are the same as those described in the ITIL book.

7. TECHNOLOGY CONSIDERATIONS
As mentioned earlier, technology is going to play a critical role in supporting and managing a cloud environment, and therefore the ITSM Processes must be integrated and orchestrated in such a way that they enable a seamless information flow between processes, tools and teams. There are four key technology considerations that are critical for running the Incident Management process in Cloud:
- Service Catalog
- Self Service
- Orchestration
- Analytics
Below is a reference high level architecture of Integrated ITSM Processes to support future technology:
Figure 3

REFERENCES
1. ITIL 2011 Guidelines (https://www.axelos.com/itil)
2. Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing)
3. ServiceNow (http://www.servicenow.com)
4. NIST Cloud Definition
