Deal With Production Issues - The ITIL Way

Introduces the Incident Management and Problem Management concepts of ITIL and describes how to manage production issues using ideas from ITIL.

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Deal with Production Issues: Suggestions from ITIL
    • Problems to solve
      • Long resolution time
      • Neglected issues
        • Issues we lose track of until our users remind us
      • Recurring issues
      • Inconsistency in response time
      • Developers are constantly distracted by requests to resolve issues
    • Goal
      • Manage issues in a consistent manner
      • Fast resolution
      • Reduce client impact
      • Proactively resolve issues before they impact clients
    • Basic Concepts
      • Incidents
        • Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
      • Problems
        • A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.
      • Known Errors
        • A known error is a condition identified by successful diagnosis of the root cause of a problem, and subsequent development of a workaround
    • Relationship of the three
      • A Problem is the root cause of Incidents
      • An Incident is the manifestation of an underlying Problem
      • One Problem can cause many Incidents
      • A Known Error is a Problem with a known root cause and a known workaround
    • Manage Incident vs. Manage Problem
      • Different goals
        • Incident Management focuses on restoring the service operation as quickly as possible
        • Problem Management focuses on finding and eliminating the root cause
      • Different actions
        • Incident Management applies workarounds or temporary fixes to quickly restore the services
        • Problem Management issues a change to fundamentally eliminate the root cause
      • Incident Management is reactive and Problem Management is proactive
      • Incident Management emphasizes speed and Problem Management emphasizes quality
    • Common mistakes
      • Spending a great deal of time and effort on finding the root cause before the service level is restored
      • Stopping the investigation after an incident is fixed by a workaround
      • Letting the same incident occur repeatedly without understanding the root cause
    • Solutions from ITIL
      • Separate out Incident Management and Problem Management into two independent but related processes
      • Handle incidents (restore service) as quickly as possible
      • Proactively and independently work on resolving problems
      • Wisely manage Known Errors
    • Incident Management
      • Always remember the goal is to “Restore service level as quickly as possible”
      • How to go fast?
        • Classification
        • Match known errors and known workarounds
        • Appropriate escalation
      • Go fast, but don’t go crazy. Don’t skip:
        • Record
        • Prioritize
        • Follow up
    • Incident Management Process
    • Acceptance And Record
      • Benefits of recording
        • Helps diagnose new incidents based on known incidents
        • Helps Problem Management find the root cause
        • Makes it easy to determine the impact
        • Makes it possible to track and control the issue resolution
      • Incident Reporting Channels
        • User
        • System Monitor/Alert
        • IT person
    • Incident Record
      • Unique ID
      • Basic diagnosis info
        • Timestamp
        • Symptoms
        • User info (name, contact info)
        • Who’s responsible
      • Additional information
        • Screenshots
        • Logs
      • Status
        • New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
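The fields on this slide map naturally onto a small record type. The following is a minimal sketch, not the schema of any particular tool: the class name `IncidentRecord`, the field names, the in-memory ID counter and the sample values are all assumptions made for illustration. Only the list of fields and the eight status values come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from itertools import count


class Status(Enum):
    """The eight status values listed on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID source; a real tracking tool would issue the unique ID.
_next_id = count(1)


@dataclass
class IncidentRecord:
    symptoms: str
    user_name: str
    user_contact: str
    responsible: str
    timestamp: datetime = field(default_factory=datetime.now)
    attachments: list = field(default_factory=list)  # screenshots, logs
    status: Status = Status.NEW
    incident_id: int = field(default_factory=lambda: next(_next_id))


# Sample values below are invented for the example.
incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="Jane Doe",
    user_contact="jane@example.com",
    responsible="Service Desk",
)
incident.attachments.append("app-server.log")
incident.status = Status.ACCEPTED
```

Keeping the status as an enumeration rather than free text makes it harder for an incident to end up in a state that the process does not define.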
    • Classification
      • Classification
        • Possible reasons (application, network, database, business logic, etc.)
        • Supporting group (application group, database group, infrastructure group, network group, etc.)
      • Prioritize
        • Priority = Impact × Urgency
        • Determine resolution timeline (resolve within X hours) based on Service Level Agreement
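The priority rule can be sketched in a few lines. The slide only states that priority is the product of impact and urgency; the 1 to 3 scoring scale and the thresholds that turn the product into a priority level are assumptions for illustration and would normally come from the Service Level Agreement. The 2, 4, 6 and 8 hour resolution timelines are the ones shown on the Escalation by Priorities slide.

```python
# Resolution timeline in hours per priority level (from the escalation slide).
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: int, urgency: int) -> int:
    """Return a priority level from 1 (highest) to 4 (lowest).

    Assumption: impact and urgency are each scored 1 (low) to 3 (high),
    and the thresholds below are illustrative, not taken from ITIL.
    """
    for value in (impact, urgency):
        if value not in (1, 2, 3):
            raise ValueError("impact and urgency must be 1, 2 or 3")
    score = impact * urgency  # Priority = Impact x Urgency
    if score >= 9:
        return 1
    if score >= 6:
        return 2
    if score >= 3:
        return 3
    return 4


level = priority(impact=3, urgency=2)
hours_to_resolve = RESOLUTION_HOURS[level]
```

Whatever scale is chosen, writing it down as a function like this is one way to meet the later suggestion to clearly define how priorities are calculated.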
    • Preliminary Support
      • Preliminary Response
        • Acknowledge acceptance
        • Collect basic info
        • Provide basic help to the user
      • Service Requests
        • A Service Request is a standard service such as checking a status or resetting a password
        • Go through standard procedure to handle service requests
    • Match
      • Match known errors
        • Known solution
        • Known workaround
        • Known resolution procedure
      • Match existing incidents
        • Link the new incident with the existing incidents
        • Increase the impact level of the existing incident
        • If the existing one is already being worked on, inform the responsible person/group
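Matching a new incident against known errors can be as simple as comparing symptom keywords. The sketch below is one possible approach, not a method prescribed by ITIL: the `KnownError` type, the keyword-overlap scoring, the `min_overlap` parameter and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    keywords: frozenset  # lower-case symptom keywords
    workaround: str


def tokenize(text: str) -> set:
    """Split free-text symptoms into lower-case words without punctuation."""
    return {word.strip(".,:;!?").lower() for word in text.split()} - {""}


def match_known_errors(symptoms: str, known_errors, min_overlap: int = 2):
    """Return known errors sharing at least min_overlap keywords, best first."""
    words = tokenize(symptoms)
    scored = [(len(words & ke.keywords), ke) for ke in known_errors]
    hits = [(score, ke) for score, ke in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in hits]


# Invented sample entries for the example.
known_errors = [
    KnownError("KE-1", frozenset({"login", "timeout", "ldap"}),
               "Restart the LDAP connector"),
    KnownError("KE-2", frozenset({"report", "export", "blank"}),
               "Clear the report cache"),
]

matches = match_known_errors(
    "Login fails with LDAP timeout after 30 seconds.", known_errors
)
```

A match gives the Service Desk a known workaround to apply immediately; no match means the incident moves on to investigation and diagnosis.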
    • Investigation and Diagnosis
      • Escalation
        • Functional escalation (Technical escalation) : Involve more technical experts, involve teams in other functional group, or involve external suppliers
        • Hierarchical escalation (Management escalation): Escalate to higher level management team
    • Escalation by Priorities
      • A (Service Desk)
      • B (Second Line)
      • C (Third Line, Supplier)
      • D (Incident Manager)
      • E (Division Management)
      • F (Corporate Management)
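      • Escalation matrix (column headings: Priority, Resolution timeline, 0 Minute, 10 Minute, 30% timeline, 60% timeline, 100% timeline; escalation levels listed in order)
        • Priority 1, resolution timeline 2 hr: A, B, C and D, E and F
        • Priority 2, resolution timeline 4 hr: A, B, C, D, E and F
        • Priority 3, resolution timeline 6 hr: A, B, C, D
        • Priority 4, resolution timeline 8 hr: A, B, C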
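The checkpoint headings in the matrix (0 minutes, 10 minutes, then 30%, 60% and 100% of the resolution timeline) can be turned into concrete clock times once an incident's priority is known. This is a minimal sketch under that reading of the headings; the function name and the sample date are assumptions, and it deliberately does not assign escalation levels to checkpoints.

```python
from datetime import datetime, timedelta

# Resolution timeline in hours per priority level (from the matrix).
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def escalation_checkpoints(opened_at: datetime, priority: int) -> list:
    """Return the five checkpoint times for an incident opened at opened_at."""
    total = timedelta(hours=RESOLUTION_HOURS[priority])
    return [
        opened_at,                          # 0 minutes
        opened_at + timedelta(minutes=10),  # 10 minutes
        opened_at + total * 3 / 10,         # 30% of the timeline
        opened_at + total * 6 / 10,         # 60% of the timeline
        opened_at + total,                  # 100% of the timeline
    ]


opened = datetime(2024, 1, 1, 9, 0)  # invented sample timestamp
checkpoints = escalation_checkpoints(opened, priority=1)
```

For a priority 1 incident opened at 09:00, the checkpoints fall at 09:00, 09:10, 09:36, 10:12 and 11:00. A scheduler or a periodic job could compare the current time against this list to enforce time-bound escalation.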
    • Investigation Activities
      • Assign dedicated support person
      • Collect basic info
      • Query historical data
        • Recent releases
        • Recent changes
        • Workload trend
      • Analyze
      • Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible!
    • Resolve and recover
      • Resolution (workarounds or permanent fix)
        • Create a Request For Change (RFC)
        • Approve RFC
        • Implement Change.
      • Record the analysis, the root cause, the workaround and the solution
      • Leave the incident in Open status when a resolution hasn’t been found
    • Termination
      • Contact the user to confirm incident is resolved
      • Change the Incident status to “Closed”
      • Update the Incident record to reflect the final priority, impact, user and root cause
    • Track and Monitor
      • Assign an owner to each incident. Usually it’s the Service Desk person.
      • Provide feedback to the users after a change
      • Enforce the escalation based on the priority
    • Problem Management
      • Problem Control
        • Find the root cause of a problem
        • Turn a problem into a Known Error
      • Error Control
        • Control and Monitor the Known Errors until they are appropriately handled
      • Proactive Problem Management
        • Resolve problems before they cause any incidents
    • Problem Control
    • Identify Problems
      • Analyze the trends of incidents
        • Likely to reoccur
        • Likely more will occur
        • Likely to have larger impact
      • Analyze the weaknesses of the infrastructure
        • Availability
        • Capability
      • A significant incident (outage)
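Trend analysis of incidents can start with a simple count per category. The sketch below flags categories that recur often enough to be treated as problem candidates; the category labels, the threshold of three and the function name are assumptions for illustration, and a real analysis would also weigh impact and the time window.

```python
from collections import Counter


def problem_candidates(incident_categories, threshold: int = 3) -> list:
    """Return categories seen at least threshold times, most frequent first."""
    counts = Counter(incident_categories)
    recurring = [cat for cat, n in counts.items() if n >= threshold]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(recurring, key=lambda cat: (-counts[cat], cat))


# Invented sample data: one category label per recorded incident.
recent_incidents = [
    "database", "network", "database", "application",
    "database", "network", "network", "network",
]
candidates = problem_candidates(recent_incidents)
```

Here `candidates` is `["network", "database"]`: network incidents occurred four times and database incidents three times, while the single application incident stays below the threshold.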
    • Diagnosis
      • Recreate incident in testing environment
      • Link the modules with incidents
      • Review the latest changes
      • After the root cause of a problem is found, this problem becomes a Known Error
    • Temporary Fixes
      • It’s important to find a temporary fix if the problem causes significant incidents
      • If a temporary fix involves changes to the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause)
      • For urgent problems, the Emergency Change Request process should be initiated.
    • Error Control
    • Identify and Record Known Error
      • Identify
        • Find the root cause of a problem
        • Link a problem with a known error
      • Record
        • Assign an ID
        • Symptoms
        • Root cause
        • Status
      • Notification
        • Notify the Incident Management team so that they can associate new incidents with known errors
    • Determine the solution
      • Evaluate based on
        • Service Level Agreement
        • Impact and Urgency
        • Cost and benefit
      • Possible solutions
        • Temporary fixes
        • Permanent fixes
        • No fix (cost is greater than benefits)
      • Record the decision in the Problem Database
    • Known Errors from other environments
      • Known errors from development environment
        • We may choose to release with some minor known issues
      • Known errors from suppliers
        • Usually reported in the release notes
      • Record, Monitor and Track those known errors
      • Relate problems to those known errors
    • PIR (Post Implementation Review)
      • Normal problems
        • Confirm all the related incidents are closed
        • Verify that the problem record is complete (symptoms, root cause and solutions)
        • Change the problem status to Resolved
      • Significant problems
        • What went well?
        • What went wrong?
        • How to do better next time?
        • How to prevent similar issues from happening again?
    • Track and Monitor
      • Track the full lifecycle of each known error
        • Reevaluate impact and urgency. Adjust the priorities accordingly.
        • Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
    • Proactive Problem Management
      • Focus on the quality of the service and the infrastructure
      • Analyze operational trends
      • Detect the potential incidents and prevent them from happening
      • Find out the weak points of the infrastructure or the overloaded components
    • Ideas to improve our Production Support process
      • Idea 1: Create an independent Problem Management Team.
      • Idea 2: Create a Problem Database
      • Idea 3: Define the Production Support Procedure
      • Idea 4: Review and revise the procedures of using TeamTrack
      • Idea 5: Enforce Post Implementation Review
      • Idea 6: Proactively manage problems
      • Idea 7 (optional): Acquire Service Desk software to facilitate the process
    • Create an independent Problem Management Team.
      • Can be a full time team or a part time team
      • Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.
      • Responsible for managing all the production problems (not incidents) for multiple applications
        • Identify problems
        • Record problems
        • Find and evaluate solutions
        • Track the progress till closure
      • Work closely with the existing Production Support team.
    • Create a Problem Database
      • An easy-to-search knowledge database
      • Include problems and known errors
      • Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions
      • Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
      • Maintained by the Problem Management Team
      • Used by the Production Support team to match incidents against known errors and resolve them quickly
    • Define the Production Support Procedure (Work Instructions)
      • Create a formal and detailed document. Train the Production Support team to follow the new procedure
      • Start with ITIL Incident Management Process. Adjust it to our own situation and tools
      • Clearly define how to calculate priorities
      • Clearly define the time-bound escalation procedure
      • Clearly define the monitoring and tracking steps
    • Review and define the procedure for using TeamTrack
      • TeamTrack is our existing Incident Tracking system
        • Review the functions of TeamTrack
        • Redefine the incident escalation process according to ITIL suggestions
      • Define the interface between PC Support and IT Production Support Team
        • Communication channel
        • Roles and responsibilities
        • Escalation
        • Track and Control
        • Knowledge sharing
    • Enforce PIR
      • Contact each user to confirm all the incidents are closed
      • Make sure the Problem record is complete and useful
      • Identify issues in the Incident and Problem Management process. Add those to the Problem Database.
    • Proactively Manage Problems
      • Responsibility of the Problem Management Team.
      • Perform the following activities:
        • Analyze incidents to find the trend
        • Analyze infrastructure to identify possible bottlenecks
        • Run fail-over and stress tests
        • Apply a problem solution across multiple related applications
        • Establish and maintain the Production Monitor System to proactively detect system anomalies
      • Evaluate how many problems are proactively identified and resolved
    • Service Desk Software
      • Evaluate the existing TeamTrack software and see if it covers our needs
      • Other popular options
        • HP OpenView Service Desk
        • Remedy Strategic Service Suite
        • CA Unicenter Service Desk