EVERY CLOUD HAS 
A SILVER LINING 
A WHITEPAPER ON ITSM INCIDENT MANAGEMENT PROCESS FOR CLOUD 
ENVIRONMENT 
Cloud Computing has changed the dynamics of the IT Services business, but organizations have 
not been able to foresee the changes required in ITSM processes and procedures to adopt 
Cloud Computing. In this publication, I have tried to explore the procedure- and process-level 
changes needed in the ITIL Incident Management process for it to work smoothly in a Cloud 
Environment. 
Published by: Aditya Dashora
© Conceptualized and Published by Aditya Dashora 
About the Author 
Aditya Dashora, a senior consultant at Infosys Limited, is an IT enthusiast with around 9 years of experience in delivering IT Service Management consulting projects for large enterprises across the globe. 
Aditya is passionate about helping CIOs and CTOs improve their IT Strategy to meet current and future demands. He is also instrumental in exploring and defining new ways of working for organizations by leveraging technology. Aditya is based out of Bangalore, India. 
Contact Information: 
adydashora@gmail.com 
https://www.linkedin.com/in/adityadashora
CONTENTS 
1. Executive Summary 
2. A sneak peek into the world of “Cloud” 
3. Incident Management process for Cloud 
4. Procedural Level Changes 
5. Key Performance Indicators 
6. Key Policies 
7. Technology Considerations 
References
1. EXECUTIVE SUMMARY 
With adoption growing rapidly, it is widely expected that within the next 5-6 years Cloud Computing will change the rules of the game played by successful IT Service Providers across the world. Firms doing business in the IT Infrastructure space have started to feel nervous about the growing acceptance of IaaS and PaaS services provided by Cloud Vendors. IT Service Management, the instrument IT Service Providers and IT Support Organizations use to meet the challenges of delivering IT Services to customers, and often regarded as a style statement within the IT Service Industry, is going to play a vital role in the Cloud IT Shop. However, ITSM concepts will require some restructuring and renovation to gain the capabilities needed to support a Cloud-based IT Shop. 
In this article, I have tried to explain the operational-level changes needed in a traditional Incident Management process to ensure an accurate and speedy reaction to Incidents/Issues/Events in a Cloud Environment.
2. A SNEAK PEEK INTO THE WORLD OF “CLOUD” 
2.1. CLOUD ENVIRONMENT OVERVIEW 
The NIST definition states that Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models. 
The three Cloud service models are 1) Software as a Service, 2) Platform as a Service and 3) Infrastructure as a Service. Similarly, there are four Cloud deployment models: Private Cloud, Public Cloud, Community Cloud and Hybrid Cloud. 
The five essential characteristics of Cloud Computing defined by NIST are: 1) On-demand self-service, 2) Broad network access, 3) Resource Pooling, 4) Rapid Elasticity and 5) Measured Service. 
Traditionally in an IT organization, the IT support function (including managed service providers) is responsible for the procurement, implementation, support and maintenance of IT Services and components such as Critical Business Applications, Enterprise/Corporate Applications, Messaging Services, Databases, Batch Processing Services, Servers, Middleware, Storage and Back-up infrastructure, Network, and IT Security Management. In a cloud implementation, some of these IT components are provided and supported by a Cloud Service Provider on a pay-per-use basis. In the case of a Private Cloud, the Technology Management function becomes the Cloud Provider, while in a Public Cloud organizations avail services from providers like AWS, Rackspace, Google Compute, MS Azure, Salesforce etc. 
The focus of any Cloud implementation is to reduce the cost of IT and ensure high availability. To achieve that, it is important to identify and analyze “IT Services” & “Critical Business Applications” and define a Cloud implementation strategy. 
Some organizations choose to retain some of their critical IT Service components on-premise and move the remainder to the cloud. For example, a manufacturing company can choose to retain its “Order Management System” applications and supporting infrastructure on-premise and offload supporting services like Collaboration Portal, Messaging, CRM, HR Portal etc. to the cloud. This setup is commonly known as Hybrid Cloud or IT Mix. 
A common Cloud adoption approach is to move the entire non-production environment into the cloud, which can yield significant cost savings. Applications which require unpredictable capacity during peak load hours are also good candidates for cloud services.
2.2. HYBRID CLOUD – THE REALITY OF THE FUTURE 
A Hybrid Cloud Environment is said to be the reality of the future of Cloud Computing. In the cloud adoption journey, enterprises will on one hand transform their data centers into a private cloud and, on the other, engage multiple Cloud Providers to enjoy the benefits of Public Cloud. For this white paper, I have considered the case of a big enterprise with a hybrid cloud environment that uses SaaS and IaaS from Public Cloud along with its Private Cloud. In the next section, I elaborate on the changes required in the Incident Management process to manage a hybrid cloud environment. 
The reason for choosing this scenario is that the majority of organizations will opt for this path. Organizations have already invested a lot in their IT environments and own IT Assets worth millions of dollars. Also, many organizations would choose to retain some of the IT Services related to their critical business processes. The Hybrid Cloud deployment model provides enough control, governance and flexibility for enterprises to enjoy the best of both worlds.
3. INCIDENT MANAGEMENT PROCESS FOR CLOUD 
3.1. INCIDENT MANAGEMENT PROCESS OVERVIEW 
ITIL defines an Incident as “An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.” 
It describes Incident Management (referred to as IM hereafter) as “the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools.” 
These definitions remain relevant in a cloud environment. The only update is that the Incident definition can be made more specific to cover all the dependencies of IT Services: “An unplanned interruption to an IT service or reduction in the quality of an IT service, or service degradation/failure of a configuration item or any enabler technology, e.g. orchestration, hypervisor, self-service module, monitoring platform.” 
Another important aspect to keep in mind is the dynamic nature of a cloud environment; because of it, many “Incidents” would end up becoming minor Change Requests. So one needs to be specific about what qualifies as an Incident in a cloud environment. 
INCIDENT MANAGEMENT (IM) HIGH LEVEL PROCESS DIAGRAM: 
Figure 1 
This process has worked well for IT support functions and would be instrumental in a cloud environment as well. There may be a need to emphasize some activities more than others. Also, cloud includes a significant amount of automation and self-service, and therefore some of the procedures or activities would be performed automatically or would overlap with subsequent activities (indicated by the red dotted circles in the diagram).
SUGGESTED INCIDENT MANAGEMENT PROCESS FOR CLOUD: 
Figure 2
4. PROCEDURAL LEVEL CHANGES 
4.1. INCIDENT IDENTIFICATION 
Incident Identification is performed in two ways: 1) identification through the Event Management platform and 2) User Reported Incidents. In a cloud environment, there would be a higher degree of dependency on the Event Monitoring systems, and therefore a mature Event Management process is a pre-requisite. Incidents related to enabler technologies like Hypervisors, the Orchestration Engine, Load Balancers, Network or Domain Controllers will be identified by the monitoring tools based on pre-defined thresholds. IT Security related Incidents will be identified by the IT Security Monitoring tools, which will share the information with the central Event Management system or Manager of Managers (MoM) layer. It is important to understand that in Public Cloud, there is a potential risk of data leakage or security breach. Therefore, an IT organization must be sensitive to the security risk measures taken by Public Cloud vendors, and should try to establish real-time monitoring of security issues. 
For all user reported Incidents, it is important to determine the single point of failure using the network topology or the CMDB (Configuration Management Database). Traditionally, this activity was performed by L1/L2 support teams, but in a cloud environment the Service Desk or first point of contact should be able to determine it in most cases. A cloud implementation brings a good amount of automation and transparency, which enables support staff to determine the single point of failure. 
A mature CMDB can provide CI dependency details, but in a cloud environment it might not be relevant to identify faulty CIs because of the dynamic nature of the technology. Rather, it would be more meaningful to trace the failed service, its dependencies on other services and its failover plans. Also, the orchestration engine and self-service module can be configured to display ongoing major incidents to users, which can help keep the incident queue from building up. 
In the case of Incidents related to public cloud, most of the issues will be identified by the Cloud Vendor and the remainder would be reported by business users/end users. Ideally, the orchestration engine and service management platform should be capable of fetching real-time data from the public cloud vendor and displaying ongoing outages/Incidents. That would suppress the related Incidents. Apart from that, issues related to Network Connectivity, Application Functionality (SaaS) and Application Deployment (PaaS) would be identified and reported by users as usual. 
In a Hybrid Cloud model, issues related to the Storage Gateway or the connector between on-premise infrastructure and Public Cloud will be identified and logged by both the cloud service consumer and the vendor (e.g. AWS). However, for better governance, the ownership of the ticket must remain with Service Desk/L1/L2 support and not with the Cloud Vendor. The same policy would apply to tickets created by monitoring tools at the vendor. 
4.2. INCIDENT LOGGING 
Incident Logging is the second activity in the IM lifecycle, and it is as relevant in a cloud environment as in a traditional IT setup. As the ITIL v3 Service Operation book puts it, “All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained.” 
Popular Service Management tools like ServiceNow, Remedy, HPSM etc. provide multiple fields for logging an Incident ticket, and all of them remain applicable in a cloud environment. In addition, a cloud environment would require a few additional fields to support accurate classification and uniqueness of an Incident ticket. 
For example: 
- A field identifying the cloud provider would be very helpful in reducing overall ticket handling time. It can be a dropdown with values like private cloud, public cloud etc. or, more specifically, NJ Datacenter, Singapore Datacenter, AWS, Rackspace etc. 
- A field for the associated hardware location/country can be helpful in case of security issues. (Tip: every country has different laws for data security.) 
- A field for affected Services or business processes would be helpful in communication. 
- A check-box for hypervisor related issues. 
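The additional fields listed above could be modelled roughly as follows. This is only an illustrative sketch: the class and field names are hypothetical and do not correspond to the schema of ServiceNow, Remedy, HPSM or any other tool, and the dropdown values are simply the examples given in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CloudProvider(Enum):
    # Example dropdown values from the text; a real list is organization-specific.
    NJ_DATACENTER = "NJ Datacenter"
    SINGAPORE_DATACENTER = "Singapore Datacenter"
    AWS = "AWS"
    RACKSPACE = "Rackspace"


@dataclass
class CloudIncidentFields:
    """Extra ticket fields suggested for a cloud environment (hypothetical names)."""
    cloud_provider: CloudProvider
    hardware_country: Optional[str]   # location of the underlying hardware, if known
    affected_services: List[str]      # affected services or business processes
    hypervisor_related: bool = False  # the check-box for hypervisor related issues


ticket_extension = CloudIncidentFields(
    cloud_provider=CloudProvider.AWS,
    hardware_country="US",
    affected_services=["Collaboration Service"],
)
```

Using an enumeration for the provider keeps the values consistent across tickets, which matters later when tickets are routed or reported on by provider.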
In the case of a hardware failure that can impact multiple services and thousands of users, Incident Logging becomes a crucial activity for triggering the resolution and recovery work. Such a hardware failure must be treated as a Sev-1 or Critical Incident, and all dependent service owners/business process owners must be notified in real time. Therefore, the Incident Ticket is expected to provide information about all the upstream and downstream dependencies of the failed CI. 
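One way to derive the upstream and downstream dependencies of a failed CI is a simple graph walk over dependency data. The sketch below assumes a hypothetical in-memory dependency map (in practice the data would come from a CMDB or service map), and it assumes that "upstream" means what the failed CI relies on while "downstream" means what relies on it; the CI names are invented for illustration.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each key depends on the items in its list.
DEPENDS_ON: Dict[str, List[str]] = {
    "Collaboration Service": ["Exchange Server"],
    "Exchange Server": ["Hypervisor-01"],
    "CRM": ["Hypervisor-01"],
    "Hypervisor-01": ["Storage-Array-A"],
}


def _walk(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk returning every node reachable from start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def upstream(ci: str) -> Set[str]:
    """Everything the failed CI itself relies on."""
    return _walk(ci, DEPENDS_ON)


def downstream(ci: str) -> Set[str]:
    """Everything that relies, directly or indirectly, on the failed CI."""
    reverse: Dict[str, List[str]] = {}
    for item, deps in DEPENDS_ON.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(item)
    return _walk(ci, reverse)
```

With this sample data, `downstream("Hypervisor-01")` yields the Exchange Server, CRM and Collaboration Service, which is the list of owners who would need real-time notification.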
In the case of incidents related to Public Cloud, the information flow from the vendor’s monitoring and ticketing tools to the host systems is essential, and therefore automation and integration tools will play a critical role. 
4.3. INCIDENT CATEGORIZATION 
Incident Categorization is performed by the Service Desk staff/IT Support Staff to ensure that appropriate categorization codes are assigned to each Incident. With the help of automation, Event Monitoring tools can also populate Incident Categorization codes while creating an Incident ticket from an Event. 
In a cloud environment, Incident Categorization overlaps with Incident Logging. Even so, the Incident Categorization metadata must be designed to yield meaningful information for rapid routing of Incidents, Problem Identification and Supplier Management. 
A traditional multilevel categorization example: 
Category | Tier-1   | Tier-2    | Tier-3 
Incident | Hardware | Server    | Memory Board 
Incident | Software | Microsoft | Exchange 
Table 1 
Another popular approach is to categorize by CI and by Service. Example:
CI Name: NN150B12Win2k8A01 
Service: Collaboration Service 
In a cloud environment, we need to ensure that Incident Categorization provides details on service provider, service, name of the application/service/server, criticality index etc. For example: 
AWS -> Infra -> ABCAWSUSEC001 -> Criticality Index: 1 -> Not Accessible 
Salesforce -> Application -> CRM -> Criticality Index: 2 -> Functionality Issue 
Private Cloud -> Application -> Exchange Server -> Criticality Index: 2 -> Slow Response 
Private Cloud -> Intranet -> Connectivity -> Not Accessible 
ATT -> Internet -> Connectivity 
AWS -> Security -> Unauthorized Access
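Category paths like the ones above can be turned into structured data for routing and reporting. The following is a minimal sketch, assuming the path layout "Provider -> Area -> item -> Criticality Index -> symptom" with the later parts optional; the class and function names are hypothetical. Note that for a short path with only three parts, the third part is stored as the item even where it reads more like a symptom.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentCategory:
    provider: str
    area: str
    item: Optional[str] = None
    criticality: Optional[int] = None
    symptom: Optional[str] = None


def parse_category(path: str) -> IncidentCategory:
    """Parse a 'Provider -> Area -> ...' path into a structured category."""
    parts = [p.strip() for p in path.split("->")]
    if len(parts) < 2:
        raise ValueError("a category needs at least a provider and an area")
    provider, area, rest = parts[0], parts[1], parts[2:]
    item = criticality = symptom = None
    for part in rest:
        if part.lower().startswith("criticality index:"):
            # Text after the colon is the numeric criticality index.
            criticality = int(part.split(":", 1)[1])
        elif item is None:
            item = part
        else:
            symptom = part
    return IncidentCategory(provider, area, item, criticality, symptom)


category = parse_category(
    "AWS -> Infra -> ABCAWSUSEC001 -> Criticality Index: 1 -> Not Accessible"
)
```

Once parsed, the provider field can drive supplier reporting and the criticality index can feed prioritization, without anyone having to read the free-text path.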
4.4. INCIDENT PRIORITIZATION 
Incident Prioritization is one of the most critical aspects not only of the IM process but of the whole lifecycle of IT Services. Incident Prioritization means allocating an appropriate priority to an Incident based on pre-defined criteria. The allocated priority codes help support staff give appropriate attention to the Incident. Most IT Outsourcing Contracts are driven by SLAs that are defined based on Incident Priority Guidelines. 
In a cloud environment, Incident Prioritization becomes all the more important because a) there are multiple service providers who may have to work towards Incident Resolution, b) a single hardware or hypervisor failure can affect multiple users and services, and c) due to the heavy dependency on the Network (WAN & LAN), any network related issue must be treated as high priority. 
Typically, the priority of an Incident is determined by two factors, “Impact” and “Urgency”, where Impact is how much damage is caused by an Incident and Urgency is how quickly it needs to be resolved. Some organizations use a questionnaire to determine the impact and urgency. In the case of user reported Incidents, the user can be asked to provide inputs for determining the urgency. 
Incident Priority data or logs are analyzed further for defining and negotiating SLAs (Service Level Agreements), OLAs (Operational Level Agreements) and UCs (Underpinning Contracts). Therefore, in a cloud environment, where there is significant dependency on vendors/service providers, proper Incident Prioritization would play a major role in SLA definition and negotiation activities. It will also help in determining good candidates (Apps or Infra) for migration to public cloud based on impact/urgency analysis. 
An example of Incident Prioritization in a Cloud Environment: 
Impact \ Urgency     | High     | Medium | Low 
Extensive/Widespread | Critical | High   | Medium 
Significant/Large    | High     | High   | Medium 
Moderate/Medium      | Medium   | Medium | Medium 
Localized/Minor      | Medium   | Low    | Low 
Urgency Determination Questionnaire (example): 
- Revenue Generating Service/Application? 
- Brand Exposure? 
- Safety Exposure? 
- Business Hours? 
- CIA Rating of the Service/Application? 
- VIP User Profile? 
- Orchestration Engine related? 
Impact Determination Questionnaire (example): 
- Number of Instances/virtual devices? 
- Number of Services/Applications? 
- Number of Geographical locations? 
- BCP Available? 
- Network Issue? 
- Number of Users? 
Table 2
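The priority matrix in Table 2 translates directly into a lookup. The sketch below uses the impact and urgency labels and the resulting priorities from the table; the function name is hypothetical, and how the questionnaire answers are scored into an impact or urgency level is left out because the paper does not define it.

```python
from typing import Dict

# Priority by impact (rows) and urgency (columns), as given in Table 2.
PRIORITY_MATRIX: Dict[str, Dict[str, str]] = {
    "Extensive/Widespread": {"High": "Critical", "Medium": "High", "Low": "Medium"},
    "Significant/Large": {"High": "High", "Medium": "High", "Low": "Medium"},
    "Moderate/Medium": {"High": "Medium", "Medium": "Medium", "Low": "Medium"},
    "Localized/Minor": {"High": "Medium", "Medium": "Low", "Low": "Low"},
}


def incident_priority(impact: str, urgency: str) -> str:
    """Return the priority for an impact/urgency pair, or raise on unknown labels."""
    try:
        return PRIORITY_MATRIX[impact][urgency]
    except KeyError as exc:
        raise ValueError(f"unknown impact or urgency: {exc}") from None


print(incident_priority("Extensive/Widespread", "High"))  # Critical
print(incident_priority("Localized/Minor", "Medium"))     # Low
```

Keeping the matrix as data rather than as nested conditions makes it easy to renegotiate individual cells during SLA discussions without touching the logic.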
4.5. INCIDENT ESCALATION 
In the traditional IM process, there are two types of Incident Escalation procedures: 1) Functional Escalation and 2) Hierarchical Escalation. Functional Escalation defines the routing model between groups/teams. Example: Service Desk to Wintel Support; Wintel to DBA; DBA to Network; Network to Third Party and so on. Hierarchical Escalation, on the other hand, provides a mechanism to involve senior management or the leadership team in case of a Sev-1 incident or any challenging situation, such as ambiguity over Incident Ownership, involving a third party on warranty issues, customer dissatisfaction etc. 
In a cloud environment, there are multiple parties involved or associated with a Service, and therefore any Service degradation (Incident) would require all the stakeholders to come together in an online forum. For that purpose, Functional and Hierarchical Escalations should run hand in hand. The only difference is that the business might not be interested in knowing the details of Incidents, while it would be interested in knowing the impact on its work. So the communication has to be designed in such a way that it sends out relevant details to each stakeholder. 
In a suggested Incident Escalation model for cloud, an Incident should be assigned to a support group and, at the same time, other groups that have any relationship with the Incident should also be notified. Later on, after the Incident resolution activity, one of the affected support groups may be engaged to give a sign-off. Social Networking features in the Service Management tool can play a role in this kind of escalation. In the case of vendor related Incidents, the vendor must be notified at the beginning of the Incident lifecycle. Once the Incident is assigned to the vendor, a parallel communication must be sent to the Problem Manager, IT Manager, Vendor Manager and Account Manager (vendor). 
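The notification rule described above can be sketched as a small function that builds the recipient list for an Incident. The role names for the parallel vendor communication come from the text; the function name, parameters and group names in the example are hypothetical.

```python
from typing import Iterable, List

# Roles that receive a parallel communication once an Incident is assigned to a vendor.
VENDOR_PARALLEL_CONTACTS = [
    "Problem Manager",
    "IT Manager",
    "Vendor Manager",
    "Account Manager (vendor)",
]


def notification_list(
    assigned_group: str,
    related_groups: Iterable[str],
    assigned_to_vendor: bool,
) -> List[str]:
    """Return recipients in order: assignee, related groups, then vendor contacts."""
    recipients = [assigned_group]
    for group in related_groups:
        if group not in recipients:
            recipients.append(group)
    if assigned_to_vendor:
        for contact in VENDOR_PARALLEL_CONTACTS:
            if contact not in recipients:
                recipients.append(contact)
    return recipients


recipients = notification_list("AWS Support", ["Wintel Support", "Network"], True)
```

The related groups would typically be derived from the dependencies of the affected service, so that everyone with a relationship to the Incident is informed at assignment time rather than after a wrong hop.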
SLA BREACH NOTIFICATIONS 
In the case of an SLA breach warning, a communication/notification must be sent out to the group manager, IT manager, IT Director etc. In an actual SLA breach situation, stakeholders from the business and finance must be involved in addition to the IT leadership team. Some vendors have service based (non-negotiable) SLAs, and in that case clear expectations must be set with the business. During the Service Design phase, the business should get the option to choose components from the catalog based on an SLA vs. Cost analysis. Example: 
Server Type           | Baseline SLA (turn-around) | Hourly Downtime Cost (post the Baseline SLA) 
HPC Windows (Private) | 2 Hours                    | $7000 
HPC Unix (Private)    | 2 Hours                    | $6000 
HPC Windows (Public)  | Best efforts               | $1500 
HPC Unix (Public)     | Best efforts               | $1100 
Table 3
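To make the SLA vs. Cost analysis concrete, the figures in Table 3 can be used to estimate the downtime cost of an outage. This is a sketch under stated assumptions: the hourly cost is taken to accrue only for the hours beyond the baseline SLA, and a "Best efforts" entry is treated as having no baseline, so cost accrues from the start of the outage. Neither interpretation is spelled out in the table, and the function name is hypothetical.

```python
from typing import Dict, Optional, Tuple

# (baseline SLA in hours or None for "Best efforts", hourly downtime cost in $), from Table 3.
CATALOG: Dict[str, Tuple[Optional[float], float]] = {
    "HPC Windows (Private)": (2.0, 7000.0),
    "HPC Unix (Private)": (2.0, 6000.0),
    "HPC Windows (Public)": (None, 1500.0),
    "HPC Unix (Public)": (None, 1100.0),
}


def downtime_cost(server_type: str, outage_hours: float) -> float:
    """Estimate downtime cost for an outage of the given length."""
    baseline, hourly_cost = CATALOG[server_type]
    if baseline is None:
        # Assumption: with "Best efforts" there is no baseline window to subtract.
        baseline = 0.0
    billable_hours = max(0.0, outage_hours - baseline)
    return billable_hours * hourly_cost


print(downtime_cost("HPC Windows (Private)", 5))  # 21000.0
print(downtime_cost("HPC Windows (Public)", 5))   # 7500.0
```

Under these assumptions a five-hour outage costs $21,000 on the private Windows option and $7,500 on the public one, which is the kind of comparison the business would weigh against the stronger turn-around commitment.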
ROLE OF SERVICE DESK 
In a traditional enterprise, Service Desks are responsible for determining the Incident Category, then performing an initial investigation based on the knowledge base or Runbook, and finally escalating the ticket to the appropriate support group. Considering the complexity and nature of Incidents in a cloud environment, there is a chance that a traditional service desk function might not be able to do the initial diagnosis and may end up routing the ticket to the wrong support group. Hence it becomes important to upgrade the traditional service desk by merging it with the monitoring teams or command center. Combining the two teams will form a function known as an integrated command center (ICC) or IT Operations Center (ITOC), which will have good technical competency to perform initial investigation and escalation in a cloud environment. 
We have to keep in mind that the majority of common Incidents related to availability, accessibility, device failure etc. are expected to be eliminated in a cloud environment because of the high performance compute design. Hence, it makes sense to combine the Service Desk and Command Center and enhance productivity. 
4.6. INVESTIGATION, RESOLUTION AND RECOVERY 
In the traditional IM lifecycle, Incident Investigation and Incident Resolution are defined as sequential activities. In a cloud environment, we should go a step further and combine them for faster turnaround. This is a logical step because, in the previous section, I proposed merging the Service Desk and Monitoring teams for better initial investigation and diagnosis. Unwanted Incident hopping (escalation to wrong groups) should therefore be eliminated, and resolution and recovery should come right after the escalation. 
Incident Resolution in cloud should be faster and better than in a traditional IT environment. There must be a higher degree of proactive detection, fault tolerance, redundancy to avoid downtime, auto-correction and intelligent systems that analyze and detect Incidents proactively. 
A white paper published by VMware on “Proactive Incident and Problem Management” defines three Cloud Capability Levels: 1) Reactive, 2) Proactive and 3) Innovative, where Reactive is the lowest maturity level for a cloud provider and Innovative is the highest. The Reactive model is the natural approach, but it is not sustainable in a cloud environment for various reasons, including virtualization, orchestration and the lack of clarity on assets/CIs/managed objects. So it becomes important to develop intelligent systems that analyze event monitoring data, historical ticket data, maintenance tasks, business growth patterns, the IT needs of a business process and other IT drivers, and to move from Reactive capability to Innovative capability. 
Incidents in a cloud environment would require highly skilled professionals, but at the same time a cloud environment provides enough redundancy to avoid or reduce downtime. So initially there might be some limitations in establishing an SOP (Standard Operating Procedure)/Run-book based approach, but in the longer run cloud can provide enough opportunities to reduce Incidents and automate resolution tasks. In a cloud environment, IT support staff should work towards ensuring that repetitive Incidents do not recur. 
Once the Incident is resolved, it can be owned by the support team itself or passed to another group for validation/sign-off. In the case of user reported Incidents, a user sign-off must be obtained.
4.7. INCIDENT CLOSURE 
Once the Incident is resolved, it enters the final activity of its lifecycle, which is Incident Closure. Incident Closure is an important activity for ensuring that the required solution has been provided and implemented. 
In the Incident Escalation section, I mentioned the Incident or Service Failure notification sent to all the stakeholders. Likewise, before closing an Incident, the system needs to ensure that all the stakeholders have given their sign-off on it. This task can be automated by making it a time-bound forced closure. In the case of public cloud, the closure must be performed only after obtaining the required confirmation from the Cloud Providers. 
Most Service Management tools provide Closure Categorization Codes (similar to Incident Categorization), and it would be helpful in a Cloud Environment to use those codes properly. 
If the solution provided by the support groups does not completely solve the issue, stakeholders or the end user may choose to re-open the Incident. Any re-opened Incident would trigger hierarchical escalation and involve senior management in the lifecycle for better governance.
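The closure rules in this section (stakeholder sign-offs, time-bound forced closure, and mandatory provider confirmation for public cloud) can be combined into a single check. The sketch below is one possible reading: it assumes forced closure applies after a configurable waiting period and that a missing provider confirmation blocks closure even when that period has elapsed. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def can_close(
    signoffs: Dict[str, bool],
    resolved_at: datetime,
    now: datetime,
    force_after: timedelta,
    provider_confirmed: Optional[bool] = None,
) -> bool:
    """Decide whether a resolved Incident may be closed.

    provider_confirmed is None for Incidents that do not involve a public cloud
    provider, and True/False where a provider confirmation is required.
    """
    if provider_confirmed is False:
        # Public cloud: never close without the provider's confirmation.
        return False
    if all(signoffs.values()):
        return True
    # Time-bound forced closure when some stakeholders have not responded.
    return now - resolved_at >= force_after


resolved = datetime(2024, 1, 1, 9, 0)
pending = {"Service Desk": True, "Business Owner": False}
print(can_close(pending, resolved, datetime(2024, 1, 2, 9, 0), timedelta(days=3)))  # False
print(can_close(pending, resolved, datetime(2024, 1, 5, 9, 0), timedelta(days=3)))  # True
```

The length of the waiting period is a policy decision; three days is used here purely as an example value.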
5. KEY PERFORMANCE INDICATORS 
Key Performance Indicators (KPIs) are also known as process performance measurement criteria. As the name indicates, the purpose of KPIs is to evaluate process performance against process goals and objectives. Some mature organizations have tightly coupled their KPIs with Business CSFs (Critical Success Factors). 
As stated in the ITIL v3 Guidelines, “A KPI refers to a specific, agreed level of performance that will be used to measure the effectiveness of an organization or process.” 
A standard way to define KPIs is the GQM approach, where G stands for Goal, Q for Question and M for Metric. The goal is very clear here: to ensure that Incidents are resolved at the earliest. The questions we may ask are “what does it take to resolve incidents rapidly?”, “what can cause delay?” and “what are the dependencies?” 
When we start thinking along these lines, we come across multiple KPIs related to the Incident Management process. Most of these KPIs are already being used in the industry. In this section, we explore the need to revise the existing KPIs for a Cloud Environment. 
Let’s take a look at some of the KPIs: 
- Percentage reduction in the number of Incidents (month-on-month) 
- Percentage reduction in Incident backlog (weekly) 
- Percentage increase in SLA compliance (daily/weekly) 
- Percentage reduction in incorrectly assigned Incidents (weekly/monthly) 
In the case of Cloud, we need to consider the performance of the “vendor” or partner. Therefore, additional KPIs are needed to ensure the required coverage. 
Some examples of additional KPIs for Cloud Incident Management Process: 
- Ratio of auto generated tickets and user reported tickets 
- Percentage reduction in issues escalated to Cloud Service Provider 
- Percentage reduction in incorrect escalations to Cloud Service Provider 
- Percentage reduction in the Incident Diagnosis time 
- Percentage reduction in incorrectly categorized incidents 
- Percentage reduction in number of major Incidents 
- Percentage reduction in average turn-around time from vendor 
- Increase in proactive detection rate
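Most of the KPIs listed above reduce to two simple calculations: a percentage reduction between two reporting periods and a ratio between two ticket counts. A minimal sketch follows; the function names and sample numbers are invented for illustration and do not come from any real dataset.

```python
def percentage_reduction(previous: float, current: float) -> float:
    """Percentage by which a measure fell from the previous period to the current one.

    A negative result means the measure increased.
    """
    if previous == 0:
        raise ValueError("previous period value must be non-zero")
    return (previous - current) / previous * 100.0


def auto_to_user_ratio(auto_generated: int, user_reported: int) -> float:
    """Ratio of auto generated tickets to user reported tickets."""
    if user_reported == 0:
        raise ValueError("no user reported tickets in the period")
    return auto_generated / user_reported


# Illustrative numbers only.
print(percentage_reduction(200, 150))  # 25.0
print(auto_to_user_ratio(300, 100))    # 3.0
```

A rising auto-to-user ratio suggests that monitoring is detecting issues before users report them, which ties in with the proactive detection KPI.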
6. KEY POLICIES 
Ticket Ownership Policy 
Ticket ownership should always remain with the cloud consumer. Having said that, we must account for certain situations that are controlled internally by the cloud vendor, in which the cloud consumer has no role to play. For those instances, we can consider joint ownership and ensure that the cloud consumer gets real-time updates on the issues. 
Escalation Policy 
Any escalation to the cloud vendor must be approved or supervised by the L3 support team or the Incident Manager. The team must ensure that incorrect escalations to the cloud vendor are kept to a minimum. For issues related to internal infrastructure or applications, the escalation guidelines are the same as those described in the ITIL guidance.
7. TECHNOLOGY CONSIDERATIONS 
As mentioned earlier, technology is going to play a critical role in supporting and managing a cloud environment, and therefore the ITSM Processes must be integrated and orchestrated in such a way that they enable a seamless information flow between the processes, tools and teams. There are four key technology considerations that are critical for running the Incident Management process in Cloud: 
Service Catalog 
Self Service 
Orchestration 
Analytics 
Below is a reference high level architecture of Integrated ITSM Processes to support future technology: 
Figure 3
REFERENCES 
1. ITIL 2011 Guidelines (https://www.axelos.com/itil) 
2. Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing) 
3. ServiceNow (http://www.servicenow.com) 
4. NIST Cloud Definition

1. EXECUTIVE SUMMARY
With its rapidly growing adoption rate, Cloud Computing is widely expected to change, within the next 5-6 years, the rules of the game played by successful IT Service Providers across the world. Firms doing business in the IT Infrastructure space have started to feel nervous about the growing acceptance of IaaS and PaaS services provided by Cloud Vendors.
IT Service Management, the instrument used by IT Service Providers and IT Support Organizations to meet the challenges of delivering IT Services to customers, and also considered a style statement within the IT Service Industry, is going to play a vital role in the Cloud IT Shop. However, the concepts of ITSM will require some restructuring and renovation in order to support the Cloud-based IT Shop.
In this article, I have tried to explain the operational-level changes needed in a traditional Incident Management process to ensure an accurate and speedy reaction to Incidents/Issues/Events in a Cloud Environment.
2. A SNEAK PEEK INTO THE WORLD OF “CLOUD”
2.1. CLOUD ENVIRONMENT OVERVIEW
The NIST definition says that Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.
The three Cloud service models are 1) Software as a Service, 2) Platform as a Service and 3) Infrastructure as a Service. Similarly, there are four Cloud deployment models: Private Cloud, Public Cloud, Community Cloud and Hybrid Cloud.
The five essential characteristics of Cloud Computing defined by NIST are 1) On-demand self-service, 2) Broad network access, 3) Resource pooling, 4) Rapid elasticity and 5) Measured service.
Traditionally in an IT organization, the IT support function (including managed service providers) is responsible for procurement, implementation, support and maintenance of IT Services and components such as Critical Business Applications, Enterprise/Corporate Applications, Messaging Services, Databases, Batch Processing Services, Servers, Middleware, Storage and Back-up infrastructure, Network, and IT Security Management. In a cloud implementation, some of these IT components are provided and supported by a Cloud Service Provider on a pay-per-use basis. In the case of a Private Cloud, the Technology Management function becomes the Cloud Provider, while in a Public Cloud organizations avail services from providers like AWS, Rackspace, Google Compute, MS Azure, Salesforce etc.
The focus of any Cloud implementation is to reduce the cost of IT and ensure high availability. To achieve that, it is important to identify and analyze “IT Services” and “Critical Business Applications” and define a Cloud implementation strategy. Some organizations choose to retain some of their critical IT Service components on-premise and move the remainder to the cloud. For example, a manufacturing company can choose to retain its “Order Management System” applications and supporting infrastructure on-premise and offload supporting services like Collaboration Portal, Messaging, CRM, HR Portal etc. to the cloud. This setup is commonly known as Hybrid Cloud or IT Mix. A common Cloud adoption approach is to move the entire non-production environment into the cloud, which can yield significant cost savings. Applications which require unpredictable capacity during peak load hours are also good candidates for cloud services.
2.2. HYBRID CLOUD – THE REALITY OF THE FUTURE
The Hybrid Cloud Environment is said to be the reality of the future of Cloud Computing. In the cloud adoption journey, enterprises will on one hand transform their data centers into a private cloud, and on the other hand engage multiple Cloud Providers to enjoy the benefits of Public Cloud.
For this white paper, I have considered the case of a big enterprise with a hybrid cloud environment. It uses SaaS and IaaS from the Public Cloud along with its Private Cloud. In the next section, I have elaborated the changes required in the Incident Management process to manage a hybrid cloud environment.
The reason for choosing this scenario is that the majority of organizations will opt to walk this path. Organizations have already invested a lot in their IT environment and own IT Assets worth millions of dollars. Also, many organizations would choose to retain some of the IT Services related to their critical business processes. The Hybrid Cloud deployment model therefore provides enough control, governance and flexibility for enterprises to enjoy the best of both worlds.
3. INCIDENT MANAGEMENT PROCESS FOR CLOUD
3.1. INCIDENT MANAGEMENT PROCESS OVERVIEW
ITIL defines an Incident as “An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.”
“Incident Management (referred to as IM hereafter) is the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools.”
These definitions remain relevant in a cloud environment. The only update is that the definition can be made more specific to cover all the dependencies of IT Services:
“An unplanned interruption to an IT service or reduction in the quality of an IT service, service degradation/failure of a configuration item or any enabler technology i.e. orchestration, hypervisor, self-service module, monitoring platform.”
Another important aspect to keep in mind is the dynamic nature of a cloud environment, because of which many of the “Incidents” would end up becoming minor Change Requests. So one needs to be specific about what qualifies as an Incident in a cloud environment.
INCIDENT MANAGEMENT (IM) HIGH LEVEL PROCESS DIAGRAM:
Figure 1
This process has been working seamlessly for any IT support function and would be instrumental in a cloud environment as well. There may be a need to emphasize some activities more than others. Also, cloud includes a significant amount of automation and self-service, and therefore some of the procedures or activities would be performed automatically or would overlap with subsequent activities (marked with red dotted circles).
SUGGESTED INCIDENT MANAGEMENT PROCESS FOR CLOUD:
Figure 2
4. PROCEDURAL LEVEL CHANGES
4.1. INCIDENT IDENTIFICATION
Incident Identification is performed in two ways: 1) identification through the Event Management platform and 2) User Reported Incidents.
In a cloud environment, there would be a higher degree of dependency on the Event Monitoring systems, and therefore a mature Event Management process is a prerequisite. Incidents related to enabler technologies like Hypervisors, Orchestration Engine, Load Balancers, Network or Domain Controllers will be identified by the monitoring tools based on pre-defined thresholds. IT Security related Incidents will be identified by the IT Security Monitoring tools, which will share the information with the central Event Management system or Manager of Managers (MoM) layer. It is important to understand that in a Public Cloud there is a potential risk of data leakage or security breach. Therefore, an IT organization must be sensitive to the security risk measures taken by Public Cloud vendors, and should try to establish real-time monitoring of security issues.
For all user reported Incidents, it is important to determine the single point of failure using the network topology or the CMDB (Configuration Management Database). Traditionally, this activity was performed by L1/L2 support teams, but in a cloud environment the Service Desk or first point of contact should be able to detect it in most cases. A cloud implementation ensures a good amount of automation and transparency, which enables support staff to determine the single point of failure. A mature CMDB can provide CI dependency details, but in a cloud environment it might not be relevant to identify faulty CIs due to the dynamic nature of the technology. Rather, it would be more meaningful to trace the failed service, its dependencies on other services and its failover plans.
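The idea of tracing a failed service through its dependencies can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the service names and the `DEPENDS_ON` map are hypothetical, and a real implementation would read the relationships from the CMDB or the orchestration engine rather than from a hard-coded dictionary.

```python
from collections import deque

# Hypothetical dependency map (assumption for illustration only):
# each service lists the services it depends on.
DEPENDS_ON = {
    "Collaboration Portal": ["Identity Service", "Storage Gateway"],
    "CRM": ["Identity Service"],
    "Identity Service": ["Domain Controller"],
    "Storage Gateway": [],
    "Domain Controller": [],
}


def impacted_services(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a failed service to its consumers.
    consumers = {}
    for service, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, []).append(service)

    # Breadth-first walk over the consumers of the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


if __name__ == "__main__":
    print(impacted_services("Domain Controller", DEPENDS_ON))
```

With a map like this, the Service Desk could see at a glance which services sit downstream of a failed component, without having to identify the individual faulty CI first.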
Also, the orchestration engine and self-service module can be configured to display ongoing major incidents to the users, which can prevent a queue of duplicate incidents.
In the case of Incidents related to the public cloud, most of the issues will be identified by the Cloud Vendor and the remainder would be reported by business users/end users. Ideally, the orchestration engine and service management platform should be capable of fetching real-time data from the public cloud vendor and displaying ongoing outages/Incidents. That would suppress the related Incidents. Apart from that, issues related to Network Connectivity, Application Functionality (SaaS) and Application Deployment (PaaS) would be identified and reported by users as usual.
In the Hybrid Cloud model, issues related to the Storage Gateway or the connector between on-premise infrastructure and the Public Cloud will be identified and logged by both the cloud service consumer and the vendor (e.g. AWS). However, for better governance, the ownership of the ticket must remain with the Service Desk/L1/L2 support and not with the Cloud Vendor. The same policy would be applicable for tickets created by monitoring tools at the vendor.
4.2. INCIDENT LOGGING
Incident Logging is the second activity in the IM lifecycle, and it holds the same relevance in a cloud environment as in a traditional IT setup. As mentioned in the ITIL v3 Service Operation book, “All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained.”
Popular Service Management tools like ServiceNow, Remedy, HPSM etc. provide multiple fields for logging an Incident ticket, and all of them are applicable in a cloud environment. In addition, a few extra fields would be required to support accurate classification and uniqueness of an Incident ticket in a cloud environment. For example:
- A field for identification of the cloud provider would be very helpful in reducing overall ticket handling time. It can be a dropdown with values like private cloud, public cloud etc., or more specifically NJ Datacenter, Singapore Datacenter, AWS, Rackspace etc.
- A field for the associated hardware location/country can be helpful in case of security issues. (Tip: every country has different laws for data security.)
- A field for affected Services or business processes would be helpful in communication.
- A check-box for hypervisor related issues.
In the case of a hardware failure that can impact multiple services and thousands of users, Incident Logging becomes a crucial activity to trigger the resolution and recovery work. A hardware failure must be treated as a Sev-1 or Critical Incident, and all dependent service owners/business process owners must be notified in real time. Therefore, the Incident Ticket is expected to provide information about all the upstream and downstream dependencies of the failed CI.
In the case of incidents related to the Public Cloud, the information flow from the vendor’s monitoring and ticketing tools to the host systems is essential, and therefore automation and integration tools will play a critical role.
4.3. INCIDENT CATEGORIZATION
Incident Categorization is performed by the Service Desk staff/IT Support Staff to ensure that appropriate categorization codes are assigned to each Incident. With the help of automation, Event Monitoring tools can also populate Incident Categorization codes while creating an Incident ticket from an Event.
In a cloud environment, although the Incident Categorization activity overlaps with Incident Logging, the Incident Categorization metadata must be designed to provide meaningful information for rapid routing of Incidents, Problem Identification and Supplier Management.
A traditional multilevel categorization example is:
Category | Tier-1 | Tier-2 | Tier-3
Incident | Hardware | Server | Memory Board
Incident | Software | Microsoft | Exchange
Table 1
Another popular approach is to categorize by CI Category and Service Category. Example:
CI Name: NN150B12Win2k8A01
Service: Collaboration Service
In a cloud environment, we need to ensure that Incident Categorization provides details on the service provider, the service, the name of the application/service/server, the criticality index etc. For example:
AWS -> Infra -> ABCAWSUSEC001 -> Criticality Index: 1 -> Not Accessible
Salesforce -> Application -> CRM -> Criticality Index: 2 -> Functionality Issue
Private Cloud -> Application -> Exchange Server -> Criticality Index: 2 -> Slow Response
Private Cloud -> Intranet -> Connectivity -> Not Accessible
ATT -> Internet -> Connectivity
AWS -> Security -> Unauthorized Access
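The categorization paths above can be modeled as a small record that a ticketing integration fills in. The following is a hedged sketch, not the schema of any particular Service Management tool: the class name `IncidentCategory` and its field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentCategory:
    """Hypothetical categorization record; field names are illustrative."""
    provider: str                            # e.g. "AWS", "Salesforce", "Private Cloud"
    service_type: str                        # e.g. "Infra", "Application", "Security"
    item: Optional[str] = None               # application, service or server name
    criticality_index: Optional[int] = None  # 1 = most critical (assumption)
    symptom: Optional[str] = None            # e.g. "Not Accessible"

    def path(self) -> str:
        """Render the category as a 'Provider -> Type -> ...' string."""
        parts = [self.provider, self.service_type]
        if self.item:
            parts.append(self.item)
        if self.criticality_index is not None:
            parts.append(f"Criticality Index: {self.criticality_index}")
        if self.symptom:
            parts.append(self.symptom)
        return " -> ".join(parts)


if __name__ == "__main__":
    category = IncidentCategory("AWS", "Infra", "ABCAWSUSEC001", 1, "Not Accessible")
    print(category.path())
```

Keeping the provider as the first element means routing rules and supplier reports can group tickets by vendor without parsing free text.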
4.4. INCIDENT PRIORITIZATION

Incident Prioritization is one of the most critical aspects of not only the IM process but the whole lifecycle of IT Services. Incident Prioritization means allocating an appropriate priority to an Incident based on pre-defined criteria. The allocated priority codes help support staff give appropriate attention to the Incident. Most IT Outsourcing contracts are driven by SLAs, which are defined based on Incident Priority Guidelines.

In a cloud environment, Incident Prioritization becomes all the more important because a) there are multiple service providers who may have to work towards Incident Resolution, b) a single hardware or hypervisor failure can affect multiple users and services, and c) due to the heavy dependency on the network (WAN & LAN), any network-related issue must be treated as high priority.

Typically, the priority of an Incident is determined by two factors, namely “Impact” and “Urgency”, where Impact is how much damage is caused by an Incident and Urgency is how quickly it needs to be resolved. Some organizations use a questionnaire to determine the impact and urgency. In case of user-reported Incidents, the user can be asked to provide inputs for determining the urgency.

Incident Priority data or logs are analyzed further for defining and negotiating SLAs (Service Level Agreements), OLAs (Operational Level Agreements) and UCs (Underpinning Contracts). Therefore, in a cloud environment, where there is significant dependency on vendors/service providers, proper Incident Prioritization plays a major role in SLA definition and negotiation activities. It also helps in determining good candidates (Apps or Infra) for migration to public cloud based on impact/urgency analysis.
An example of Incident Prioritization in a Cloud Environment:

Impact \ Urgency        High        Medium      Low
Extensive/Widespread    Critical    High        Medium
Significant/Large       High        High        Medium
Moderate/Medium         Medium      Medium      Medium
Localized/Minor         Medium      Low         Low

Urgency Determination Questionnaire (example):
- Revenue Generating Service/Application?
- Brand Exposure?
- Safety Exposure?
- Business Hours?
- CIA Rating of the Service/Application?
- VIP User Profile?
- Orchestration Engine related?

Impact Determination Questionnaire (example):
- Number of Instances/virtual devices?
- Number of Services/Applications?
- Number of Geographical locations?
- BCP Available?
- Network Issue?
- Number of Users?

Table 2
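The impact/urgency matrix in Table 2 can be expressed as a simple lookup. The sketch below copies the cell values from the table; the function name `incident_priority` is an assumption, and turning questionnaire answers into an impact or urgency level is deliberately left out because the paper does not define a scoring rule.

```python
# Priority matrix from Table 2: PRIORITY_MATRIX[impact][urgency] -> priority.
PRIORITY_MATRIX = {
    "Extensive/Widespread": {"High": "Critical", "Medium": "High", "Low": "Medium"},
    "Significant/Large": {"High": "High", "Medium": "High", "Low": "Medium"},
    "Moderate/Medium": {"High": "Medium", "Medium": "Medium", "Low": "Medium"},
    "Localized/Minor": {"High": "Medium", "Medium": "Low", "Low": "Low"},
}


def incident_priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair (hypothetical helper)."""
    try:
        return PRIORITY_MATRIX[impact][urgency]
    except KeyError:
        raise ValueError(f"Unknown impact/urgency: {impact!r}, {urgency!r}") from None


print(incident_priority("Extensive/Widespread", "High"))
print(incident_priority("Localized/Minor", "Medium"))
```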
4.5. INCIDENT ESCALATION

In the traditional IM process, there are two types of Incident Escalation procedures: 1) Functional Escalation and 2) Hierarchical Escalation. Functional Escalation defines the routing model between groups/teams. Example: Service Desk to Wintel Support; Wintel to DBA; DBA to Network; Network to Third Party and so on. Hierarchical Escalation, on the other hand, provides a mechanism to involve senior management or the leadership team in case of a Sev-1 incident or any challenging situation such as ambiguity on Incident Ownership, involving a third party on warranty issues, customer dissatisfaction etc.

In a cloud environment, there are multiple parties involved or associated with a Service, and therefore any Service degradation (Incident) requires all the stakeholders to come together in an online forum. For that purpose, Functional and Hierarchical Escalations should run hand in hand. The only difference is that the business might not be interested in knowing the details of Incidents, while they would be interested in knowing the impact on their work. So, the communication has to be designed in such a way that it sends out the relevant details to each group of stakeholders.

In a suggested Incident Escalation model for cloud, an Incident should be assigned to a support group, and at the same time other groups that have any relationship with the Incident should also be notified. Later on, after the Incident resolution activity, one of the affected support groups may be engaged to give a sign-off. Social Networking features in the Service Management tool can play a role in this kind of escalation.

In case of vendor-related Incidents, the vendor must be notified at the beginning of the Incident lifecycle. Once the Incident is assigned to the vendor, a parallel communication must be sent to the Problem Manager, IT Manager, Vendor Manager and Account Manager (vendor).
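One way to picture the suggested escalation model is as a fan-out of notifications in which technical groups receive the Incident details and business stakeholders receive only the impact. The sketch below is an illustration under that reading; the function `build_notifications`, the message prefixes and the sample group names are assumptions, not features of any specific tool.

```python
from typing import Dict, List

# Roles copied in parallel once an Incident is assigned to a vendor.
VENDOR_PARALLEL_ROLES = [
    "Problem Manager",
    "IT Manager",
    "Vendor Manager",
    "Account Manager (vendor)",
]


def build_notifications(
    assigned_group: str,
    related_groups: List[str],
    business_stakeholders: List[str],
    technical_details: str,
    business_impact: str,
    vendor_related: bool = False,
) -> Dict[str, str]:
    """Return a recipient -> message map for one Incident (hypothetical helper)."""
    messages: Dict[str, str] = {assigned_group: f"ASSIGNED: {technical_details}"}
    # Related support groups are informed at the same time as the assignee.
    for group in related_groups:
        messages.setdefault(group, f"NOTIFY: {technical_details}")
    # Business stakeholders only see the impact on their work.
    for stakeholder in business_stakeholders:
        messages.setdefault(stakeholder, f"IMPACT: {business_impact}")
    if vendor_related:
        for role in VENDOR_PARALLEL_ROLES:
            messages.setdefault(role, f"NOTIFY: {technical_details}")
    return messages


notes = build_notifications(
    assigned_group="AWS Support",
    related_groups=["Network", "DBA"],
    business_stakeholders=["CRM Service Owner"],
    technical_details="ABCAWSUSEC001 not accessible",
    business_impact="CRM is unavailable",
    vendor_related=True,
)
for recipient, message in notes.items():
    print(f"{recipient}: {message}")
```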
SLA BREACH NOTIFICATIONS

In case of an SLA breach warning, a communication/notification must be sent out to the group manager, IT manager, IT Director etc. In an SLA breach situation, apart from the IT leadership team, stakeholders from the business and finance must be involved. Some vendors have service-based (non-negotiable) SLAs, and in that case clear expectations must be set with the business. During the Service Design phase, the business should get the option to choose components from the catalog based on an SLA vs. cost analysis. Example:

Server Type              Baseline SLA (turn-around)    Hourly Downtime Cost (post the Baseline SLA)
HPC Windows (Private)    2 Hours                       $7000
HPC Unix (Private)       2 Hours                       $6000
HPC Windows (Public)     Best efforts                  $1500
HPC Unix (Public)        Best efforts                  $1100

Table 3
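The SLA vs. cost comparison in Table 3 can be turned into a small calculation. The sketch below assumes that the hourly downtime cost starts accruing only after the baseline SLA has elapsed, which is one reading of the column header. For the "Best efforts" rows there is no committed turn-around, so the caller has to supply an assumed baseline. The names `CATALOG` and `downtime_cost` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (baseline SLA in hours or None for "Best efforts", hourly downtime cost in $)
CATALOG: Dict[str, Tuple[Optional[float], float]] = {
    "HPC Windows (Private)": (2.0, 7000.0),
    "HPC Unix (Private)": (2.0, 6000.0),
    "HPC Windows (Public)": (None, 1500.0),
    "HPC Unix (Public)": (None, 1100.0),
}


def downtime_cost(
    server_type: str,
    outage_hours: float,
    assumed_baseline_hours: Optional[float] = None,
) -> float:
    """Cost of an outage, charged only for hours beyond the baseline SLA."""
    baseline, hourly_cost = CATALOG[server_type]
    if baseline is None:
        # "Best efforts": no committed turn-around, so a baseline must be assumed.
        if assumed_baseline_hours is None:
            raise ValueError("Best-efforts SLA: pass assumed_baseline_hours")
        baseline = assumed_baseline_hours
    billable_hours = max(0.0, outage_hours - baseline)
    return billable_hours * hourly_cost


# A 5-hour outage on a private Windows HPC server: 3 billable hours.
print(downtime_cost("HPC Windows (Private)", 5))
```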
ROLE OF SERVICE DESK

In a traditional enterprise, the Service Desk is responsible for determining the Incident Category, performing the initial investigation based on the knowledge base or Runbook, and finally escalating the ticket to the appropriate support group. Considering the complexity and nature of Incidents in a cloud environment, a traditional service desk function might not be able to perform the initial diagnosis and may end up routing the ticket to the wrong support group. Hence it becomes important to upgrade the traditional service desk by merging it with the monitoring teams or command center. Combining the two teams forms a function known as an Integrated Command Center (ICC) or IT Operations Center (ITOC), which has the technical competency to perform initial investigation and escalation in a cloud environment. We have to keep in mind that the majority of common Incidents related to availability, accessibility, device failure etc. will be eliminated in a cloud environment because of the high performance compute design. Hence, it makes sense to combine the Service Desk and Command Center and enhance productivity.

4.6. INVESTIGATION, RESOLUTION AND RECOVERY

In the traditional IM lifecycle, Incident Investigation and Incident Resolution are defined as sequential activities. In a cloud environment, we should go a step further and combine them for faster turnaround. This is a logical step because, in the previous section, I proposed merging the Service Desk and Monitoring teams for better initial investigation and diagnosis. Therefore, unwanted Incident hopping (escalation to wrong groups) should be eliminated, and resolution and recovery should come right after the escalation. Incident Resolution in cloud should be faster and better than in a traditional IT environment.
There must be a higher degree of proactive detection, fault tolerance, redundancy to avoid downtime, auto-correction and intelligent systems to analyze and detect Incidents proactively.

In a white paper published by VMware on “Proactive Incident and Problem Management”, three Cloud Capability Levels are defined: 1) Reactive, 2) Proactive and 3) Innovative, where Reactive is the lowest maturity level for a cloud provider and Innovative is the highest. The Reactive model is the natural approach, but it is not sustainable in a cloud environment for various reasons, including virtualization, orchestration and lack of clarity on assets/CIs/managed objects. So, it becomes important to develop intelligent systems that analyze event monitoring data, historical ticket data, maintenance tasks, business growth patterns, the IT needs of a business process and other IT drivers, and to move from Reactive capability to Innovative capability.

Incidents in a cloud environment require highly skilled professionals, but at the same time a cloud environment provides enough redundancy to avoid or reduce downtime. So initially there might be some limitations in establishing an SOP (Standard Operating Procedure)/Run-book based approach, but in the longer run cloud provides enough opportunities to reduce Incidents and automate resolution tasks. In a cloud environment, IT support staff should work towards ensuring that repetitive Incidents do not occur.

Once the Incident is resolved, it can be owned by the support team itself or passed to another group for validation/sign-off. In case of user-reported Incidents, a user sign-off must be obtained.
4.7. INCIDENT CLOSURE

Once the Incident is resolved, it enters the final activity of its lifecycle, which is Incident Closure. Incident Closure is an important activity for ensuring that the required solution has been provided and implemented. In the Incident Escalation section, I mentioned the Incident or Service Failure notification to all the stakeholders. Likewise, before closing an Incident, the system needs to ensure that all the stakeholders have given their sign-off on the Incident. This task can be automated by making it a time-bound forced closure. In case of public cloud, the closure must be performed only after obtaining the required confirmation from the Cloud Provider.

Most Service Management tools provide Closure Categorization Codes (similar to Incident Categorization), and it would be helpful in a Cloud Environment to use those codes properly. If the solution provided by the support groups doesn’t completely solve the issue, the stakeholders or end user may choose to re-open the Incident. Any re-opened Incident triggers hierarchical escalation and involves senior management in the lifecycle for better governance.
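The closure rules above can be summarized as a small check. This sketch makes two assumptions that the text does not spell out: the forced closure applies once a fixed window has elapsed since resolution even if some sign-offs are missing, and the cloud provider's confirmation is never bypassed by the timer. The function name `can_close` and the 72-hour window are illustrative only.

```python
from datetime import datetime, timedelta
from typing import Dict


def can_close(
    signoffs: Dict[str, bool],
    resolved_at: datetime,
    now: datetime,
    force_close_after: timedelta,
    public_cloud: bool = False,
    provider_confirmed: bool = False,
) -> bool:
    """Decide whether a resolved Incident may be closed (hypothetical helper)."""
    # Public cloud Incidents always need the provider's confirmation first.
    if public_cloud and not provider_confirmed:
        return False
    # Normal path: every stakeholder has signed off.
    if all(signoffs.values()):
        return True
    # Time-bound forced closure once the sign-off window has elapsed.
    return now - resolved_at >= force_close_after


resolved = datetime(2024, 1, 1, 9, 0)
window = timedelta(hours=72)  # assumed sign-off window
pending = {"Service Owner": True, "End User": False}

print(can_close(pending, resolved, resolved + timedelta(hours=1), window))
print(can_close(pending, resolved, resolved + timedelta(hours=73), window))
```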
5. KEY PERFORMANCE INDICATORS

Key Performance Indicators (KPIs) are also known as process performance measurement criteria. As the name indicates, the purpose of KPIs is to evaluate process performance against process goals and objectives. Some mature organizations have tightly coupled their KPIs with Business CSFs (Critical Success Factors). As stated in the ITIL v3 Guidelines, “A KPI refers to a specific, agreed level of performance that will be used to measure the effectiveness of an organization or process”

A standard way to define KPIs is the GQM approach, where G is Goal, Q is Question and M is Metric. The goal is very clear here: to ensure that Incidents are resolved at the earliest. The questions we may ask are “what does it take to do rapid incident resolution?”, “what can cause the delay?” and “what are the dependencies?” When we start thinking along these lines, we come across multiple KPIs related to the Incident Management process.

Most of the KPIs are already being used in the industry. In this section, we explore the need to revise the existing KPIs for a Cloud Environment. Let’s take a look at some of the KPIs:
- Percentage reduction in the number of Incidents (month-on-month)
- Percentage reduction in the Incident backlog (weekly)
- Percentage increase in SLA compliance (daily/weekly)
- Percentage reduction in incorrectly assigned Incidents (weekly/monthly)

In case of Cloud, we need to consider the performance of the “vendor” or partner. Therefore, additional KPIs are needed to ensure the required coverage.
Some examples of additional KPIs for the Cloud Incident Management process:
- Ratio of auto-generated tickets to user-reported tickets
- Percentage reduction in issues escalated to the Cloud Service Provider
- Percentage reduction in incorrect escalations to the Cloud Service Provider
- Percentage reduction in the Incident Diagnosis time
- Percentage reduction in incorrectly categorized Incidents
- Percentage reduction in the number of major Incidents
- Percentage reduction in average turn-around time from the vendor
- Increase in proactive detection rate
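Most of the KPIs listed above are either a percentage reduction between two periods or a simple ratio. The following sketch shows how such values might be computed. The function names and the sample figures are made up for illustration and are not data from this paper.

```python
def percentage_reduction(previous: float, current: float) -> float:
    """Percentage reduction from the previous period to the current one.

    A negative result means the value increased.
    """
    if previous <= 0:
        raise ValueError("previous period value must be positive")
    return (previous - current) / previous * 100.0


def auto_to_user_ratio(auto_generated: int, user_reported: int) -> float:
    """Ratio of auto-generated tickets to user-reported tickets."""
    if user_reported == 0:
        raise ValueError("no user-reported tickets in this period")
    return auto_generated / user_reported


# Illustrative figures only: 200 vendor escalations last month, 150 this month.
print(percentage_reduction(200, 150))
# Illustrative figures only: 300 auto-generated vs. 100 user-reported tickets.
print(auto_to_user_ratio(300, 100))
```

A rising ratio of auto-generated to user-reported tickets would suggest that monitoring is catching issues before users do, which ties in with the proactive detection KPI.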
6. KEY POLICIES

Ticket Ownership Policy

Ticket ownership should always remain with the cloud consumer. Having said that, we must account for certain situations that are controlled internally by the cloud vendor and in which the cloud consumer has no role to play. For those instances, we can consider joint ownership and ensure that the cloud consumer gets real-time updates on the issues.

Escalation Policy

Any escalation to the cloud vendor must be approved or supervised by the L3 support team or the Incident Manager. The team must ensure that incorrect escalations to the cloud vendor are kept to a minimum. In case of issues related to internal infrastructure or applications, the escalation guidelines are the same as those in the ITIL book.
7. TECHNOLOGY CONSIDERATIONS

As mentioned earlier, technology is going to play a critical role in supporting and managing a cloud environment, and therefore the ITSM processes must be integrated and orchestrated in such a way that they enable a seamless information flow between the processes, tools and teams. There are four key technology considerations that are critical for running the Incident Management process in Cloud:
- Service Catalog
- Self Service
- Orchestration
- Analytics

Below is a reference high-level architecture of integrated ITSM processes to support future technology:

Figure 3
REFERENCES

1. ITIL 2011 Guidelines (https://www.axelos.com/itil)
2. Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing)
3. ServiceNow (http://www.servicenow.com)
4. NIST Cloud Definition