Deal With Production Issues - The ITIL Way
Introduces the Incident Management and Problem Management concepts of ITIL and describes how to manage production issues with ideas from ITIL.



  • 1. Deal with Production Issues: Suggestions from ITIL
  • 2. Problems to solve
    • Long resolution time
    • Neglected issues
      • Issues we lose track of until our users remind us
    • Recurring issues
    • Inconsistency in response time
    • Developers are constantly distracted by having to resolve issues
  • 3. Goal
    • Manage issues in a consistent manner
    • Fast resolution
    • Reduce client impact
    • Proactively resolve issues before they impact clients
  • 4. Basic Concepts
    • Incidents
      • Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
    • Problems
      • A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.
    • Known Errors
      • A known error is a condition identified by successful diagnosis of the root cause of a problem, and subsequent development of a workaround
  • 5. Relationship of the three
    • A Problem is the root cause of its Incidents
    • An Incident is the manifestation of an underlying Problem
    • One Problem can cause many Incidents
    • A Known Error is a Problem with a known root cause and a known workaround
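
The relationship between the three concepts can be sketched as a small data model. This is only an illustration: the class names, field names, IDs, and sample symptoms below are hypothetical and do not come from ITIL or from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    incident_id: str
    symptom: str


@dataclass
class Problem:
    problem_id: str
    # One Problem can cause many Incidents.
    incidents: List[Incident] = field(default_factory=list)
    root_cause: Optional[str] = None
    workaround: Optional[str] = None

    def is_known_error(self) -> bool:
        # A Known Error is a Problem whose root cause and workaround are both known.
        return self.root_cause is not None and self.workaround is not None


# Hypothetical sample data.
problem = Problem("PRB-1")
problem.incidents.append(Incident("INC-1", "login page times out"))
problem.incidents.append(Incident("INC-2", "login page times out"))
before_diagnosis = problem.is_known_error()

problem.root_cause = "connection pool exhausted"
problem.workaround = "restart the application server"
after_diagnosis = problem.is_known_error()
```

Before diagnosis the problem is not yet a Known Error; once both the root cause and the workaround are recorded, it is.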
  • 6. Manage Incident vs. Manage Problem
    • Different goals
      • Incident Management focuses on restoring the service operation as quickly as possible
      • Problem Management focuses on finding and eliminating the root cause
    • Different actions
      • Incident Management applies workarounds or temporary fixes to quickly restore the services
      • Problem Management issues a change to fundamentally eliminate the root cause
    • Incident Management is reactive and Problem Management is proactive
    • Incident Management emphasizes speed and Problem Management emphasizes quality
  • 7. Common mistakes
    • Spending tremendous time and effort on finding the root cause before the service level is restored
    • Stopping the investigation after an incident is fixed by a workaround
    • Letting the same incident occur repeatedly without understanding the root cause
  • 8. Solutions from ITIL
    • Separate out Incident Management and Problem Management into two independent but related processes
    • Handle incidents (restore service) as quickly as possible
    • Proactively and independently work on resolving problems
    • Wisely manage Known Errors
  • 9. Incident Management
    • Always remember the goal is to “Restore service level as quickly as possible”
    • How to go fast?
      • Classification
      • Match known errors and known workarounds
      • Appropriate escalation
    • Go fast, but don’t go crazy. Don’t skip:
      • Record
      • Prioritize
      • Follow up
  • 10. Incident Management Process
  • 11. Acceptance And Record
    • Benefits of recording
      • Helps diagnose new incidents based on known incidents
      • Helps Problem Management find the root cause
      • Makes it easy to determine the impact
      • Makes it possible to track and control the issue resolution
    • Incident Reporting Channels
      • User
      • System Monitor/Alert
      • IT person
  • 12. Incident Record
    • Unique ID
    • Basic diagnosis info
      • Timestamp
      • Symptoms
      • User info (name, contact info)
      • Who’s responsible
    • Additional information
      • Screenshots
      • Logs
    • Status
      • New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
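
A minimal sketch of an incident record with the fields and statuses listed on this slide. The class and field names, the use of a UUID as the unique ID, and the sample values are assumptions made for illustration; a real tracking system defines its own schema.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List


class Status(Enum):
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    responsible: str
    # Unique ID (a random UUID is an assumption; any unique scheme works)
    incident_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(default_factory=datetime.now)
    # Additional information, stored here as file paths
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    status: Status = Status.NEW


# Hypothetical sample record.
record = IncidentRecord(
    symptoms="login page times out",
    user_name="A. User",
    user_contact="a.user@example.com",
    responsible="Service Desk",
)
record.logs.append("app.log")
record.status = Status.ACCEPTED
```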
  • 13. Classification
    • Classification
      • Possible reasons (application, network, database, business logic, etc.)
      • Supporting group (application group, database group, infrastructure group, network group, etc.)
    • Prioritize
      • Priority = Impact × Urgency
      • Determine resolution timeline (resolve within X hours) based on Service Level Agreement
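
As a rough sketch, the priority calculation could look like the following. The three-point scales for impact and urgency (1 = highest) and the score thresholds are assumptions for illustration; the resolution timelines per priority (2, 4, 6 and 8 hours) are the ones shown on the escalation slide later in this deck, and a real SLA would supply its own values.

```python
# Resolution timeline in hours for each priority, taken from the escalation slide.
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: int, urgency: int) -> int:
    """Map impact and urgency (1 = high, 3 = low; an assumed scale) to priority 1-4."""
    for level in (impact, urgency):
        if level not in (1, 2, 3):
            raise ValueError("impact and urgency must be 1, 2 or 3")
    score = impact * urgency  # Priority = Impact x Urgency
    # Assumed thresholds: a lower score means a more important incident.
    if score == 1:
        return 1
    if score == 2:
        return 2
    if score <= 4:
        return 3
    return 4


p = priority(impact=1, urgency=2)
deadline_hours = RESOLUTION_HOURS[p]
```

With these assumed thresholds, a high-impact incident of medium urgency gets priority 2 and must be resolved within 4 hours.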
  • 14. Preliminary Support
    • Preliminary Response
      • Acknowledge acceptance
      • Collect basic info
      • Provide basic help to the user
    • Service Requests
      • A Service Request is a standard service such as checking a status or resetting a password
      • Go through standard procedure to handle service requests
  • 15. Match
    • Match known errors
      • Known solution
      • Known workaround
      • Known resolution procedure
    • Match existing incidents
      • Link the new incident with the existing incidents
      • Increase the impact level of the existing incident
      • If the existing one is already being worked on, inform the responsible person/group
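
One simple way to match a new incident against known errors is keyword overlap on the symptom text. The sketch below assumes known errors are stored as dictionaries with `id`, `symptoms` and `workaround` keys; the function names, the `min_overlap` parameter, and the sample entries are hypothetical. A production knowledge base would more likely use a proper search index.

```python
from typing import Dict, List, Optional, Set


def tokens(text: str) -> Set[str]:
    # Lowercase words with surrounding punctuation stripped.
    return {word.strip(".,;:!?").lower() for word in text.split()} - {""}


def match_known_error(
    symptom: str,
    known_errors: List[Dict[str, str]],
    min_overlap: int = 2,
) -> Optional[Dict[str, str]]:
    """Return the known error sharing the most words with the symptom, if any."""
    words = tokens(symptom)
    best, best_score = None, 0
    for error in known_errors:
        score = len(words & tokens(error["symptoms"]))
        if score > best_score:
            best, best_score = error, score
    return best if best_score >= min_overlap else None


# Hypothetical known-error entries.
KNOWN_ERRORS = [
    {"id": "KE-1", "symptoms": "login page times out",
     "workaround": "restart the application server"},
    {"id": "KE-2", "symptoms": "report export fails with empty file",
     "workaround": "export as CSV instead"},
]

hit = match_known_error("User says the login page times out after submit", KNOWN_ERRORS)
miss = match_known_error("Printer is jammed", KNOWN_ERRORS)
```

Here `hit` is the `KE-1` entry, so its workaround can be applied immediately, and `miss` is `None`, which means the incident goes on to investigation.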
  • 16. Investigation and Diagnosis
    • Escalation
      • Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers
      • Hierarchical escalation (management escalation): escalate to a higher-level management team
  • 17. Escalation by Priorities
    • A (Service Desk)
    • B (Second Line)
    • C (Third Line, Supplier)
    • D (Incident Manager)
    • E (Division Management)
    • F (Corporate Management)
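    • Escalation checkpoints: 0 minutes, 10 minutes, 30% of timeline, 60% of timeline, 100% of timeline
    • Escalation order by priority (resolution timeline)
      • Priority 1 (2 hr): A, B, C and D, E and F
      • Priority 2 (4 hr): A, B, C, D, E and F
      • Priority 3 (6 hr): A, B, C, D
      • Priority 4 (8 hr): A, B, C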
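
The time-bound escalation can be sketched as a lookup from priority and elapsed time to the support level that should currently own the incident. The letters and resolution timelines come from this slide. The alignment of levels to checkpoints is an assumption: the sketch fills the checkpoints (0 minutes, 10 minutes, 30%, 60%, 100% of the timeline) in order from the first one, which may not match the original table exactly. All names are hypothetical.

```python
from typing import List

# Resolution timeline per priority, in minutes (2, 4, 6 and 8 hours).
RESOLUTION_MINUTES = {1: 120, 2: 240, 3: 360, 4: 480}

# Escalation order per priority. ASSUMPTION: each entry lines up with the
# checkpoints below in order, starting from the first checkpoint.
ESCALATION_LEVELS = {
    1: [["A"], ["B"], ["C", "D"], ["E", "F"]],
    2: [["A"], ["B"], ["C"], ["D"], ["E", "F"]],
    3: [["A"], ["B"], ["C"], ["D"]],
    4: [["A"], ["B"], ["C"]],
}


def checkpoints(priority: int) -> List[int]:
    """Minutes after acceptance at which each escalation step starts."""
    total = RESOLUTION_MINUTES[priority]
    return [0, 10, total * 30 // 100, total * 60 // 100, total]


def current_levels(priority: int, elapsed_minutes: int) -> List[str]:
    """Return the levels the incident should be escalated to by now."""
    steps = zip(ESCALATION_LEVELS[priority], checkpoints(priority))
    reached = [levels for levels, start in steps if elapsed_minutes >= start]
    return reached[-1] if reached else []


p1_after_40_min = current_levels(1, 40)
p4_after_300_min = current_levels(4, 300)
```

Under these assumptions, a priority 1 incident that is 40 minutes old has passed its 30% checkpoint (36 minutes) and should be with C and D, while a priority 4 incident never escalates beyond C.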
  • 18. Investigation Activities
    • Assign dedicated support person
    • Collect basic info
    • Query historical data
      • Recent releases
      • Recent changes
      • Workload trend
    • Analyze
    • Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible!
  • 19. Resolve and recover
    • Resolution (workarounds or permanent fix)
      • Create a Request For Change (RFC)
      • Approve RFC
      • Implement Change.
    • Record the analysis, the root cause, the workaround and the solution
    • Leave the incident in Open status when a resolution hasn’t been found
  • 20. Termination
    • Contact the user to confirm the incident is resolved
    • Change the Incident status to “Closed”
    • Update the Incident record to reflect the final priority, impact, user and root cause
  • 21. Track and Monitor
    • Assign an owner to each incident. Usually it’s the Service Desk person.
    • Provide feedback to the users after a change
    • Enforce the escalation based on the priority
  • 22. Problem Management
    • Problem Control
      • Find the root cause of a problem
      • Turn a problem into a Known Error
    • Error Control
      • Control and Monitor the Known Errors until they are appropriately handled
    • Proactive Problem Management
      • Resolve problems before they cause any incidents
  • 23. Problem Control
  • 24. Identify Problems
    • Analyze the trends of incidents
      • Likely to reoccur
      • Likely more will occur
      • Likely to have larger impact
    • Analyze the weakness of the infrastructure
      • Availability
      • Capability
    • A significant incident (outage)
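
Trend analysis can start very simply, by counting how often each kind of incident recurs and flagging the frequent ones as problem candidates. In this sketch the category labels, the function name, and the threshold of three occurrences are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def problem_candidates(
    incident_categories: Iterable[str],
    threshold: int = 3,
) -> List[Tuple[str, int]]:
    """Return (category, count) pairs for categories seen at least `threshold` times."""
    counts = Counter(incident_categories)
    # most_common() sorts by count, so the most frequent candidates come first.
    return [(category, n) for category, n in counts.most_common() if n >= threshold]


# Hypothetical categories of recent incidents.
recent = [
    "database timeout",
    "login failure",
    "database timeout",
    "disk full",
    "database timeout",
    "login failure",
]
candidates = problem_candidates(recent)
```

With this sample data only "database timeout" reaches the threshold, so it is the one recurring incident that would be raised as a Problem.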
  • 25. Diagnosis
    • Recreate incident in testing environment
    • Link the modules with incidents
    • Review the latest changes
    • After the root cause of a problem is found, this problem becomes a Known Error
  • 26. Temporary Fixes
    • It’s important to find a temporary fix if the problem causes a significant incident
    • If a temporary fix involves changes in the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause)
    • For urgent problems, the Emergency Change Request process should be initiated.
  • 27. Error Control
  • 28. Identify and Record Known Error
    • Identify
      • Find the root cause of a problem
      • Link a problem with a known error
    • Record
      • Assign an ID
      • Symptoms
      • Root cause
      • Status
    • Notification
      • Notify the Incident Management team so they can associate new incidents with known errors
  • 29. Determine the solution
    • Evaluate based on
      • Service Level Agreement
      • Impact and Urgency
      • Cost and benefit
    • Possible solutions
      • Temporary fixes
      • Permanent fixes
      • No fix (cost is greater than benefits)
    • Record the decision in the Problem Database
  • 30. Known Errors from other environments
    • Known errors from development environment
      • We may choose to release with some minor known issues
    • Known errors from suppliers
      • Usually reported in the release notes
    • Record, Monitor and Track those known errors
    • Relate problems with those known errors
  • 31. PIR (Post Implementation Review)
    • Normal problems
      • Confirm all the related incidents are closed
      • Verify that the problem record is complete (symptoms, root cause and solutions)
      • Change the problem status to Resolved
    • Significant problems
      • What went well?
      • What went wrong?
      • How to do better next time?
      • How to prevent similar issues from happening again?
  • 32. Track and Monitor
    • Track the full lifecycle of each known error
      • Reevaluate impact and urgency. Adjust the priorities accordingly.
      • Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
  • 33. Proactive Problem Management
    • Focus on the quality of the service and the infrastructure
    • Analyze operational trends
    • Detect the potential incidents and prevent them from happening
    • Find out the weak points of the infrastructure or the overloaded components
  • 34. Ideas to improve our Production Support process
    • Idea 1: Create an independent Problem Management Team.
    • Idea 2: Create a Problem Database
    • Idea 3: Define the Production Support Procedure
    • Idea 4: Review and revise the procedures for using TeamTrack
    • Idea 5: Enforce Post Implementation Review
    • Idea 6: Proactively manage problems
    • Idea 7 (optional): Acquire Service Desk software to facilitate the process
  • 35. Create an independent Problem Management Team.
    • Can be a full time team or a part time team
    • Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.
    • Responsible for managing all the production problems (not incidents) for multiple applications
      • Identify problems
      • Record problems
      • Find and evaluate solutions
      • Track the progress till closure
    • Work closely with the existing Production Support team.
  • 36. Create a Problem Database
    • An easy-to-search knowledge database
    • Include problems and known errors
    • Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions
    • Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
    • Maintained by the Problem Management Team
    • Will be used by the Production Support team to match and quickly resolve incidents
  • 37. Define the Production Support Procedure (Work Instructions)
    • Create a formal and detailed document. Train the Production Support Team to follow the new procedure
    • Start with the ITIL Incident Management Process. Adjust it to our own situation and tools
    • Clearly define how to calculate priorities
    • Clearly define the time-bound escalation procedure
    • Clearly define the monitoring and tracking steps
  • 38. Review and define the procedure for using TeamTrack
    • TeamTrack is our existing Incident Tracking system
      • Review the functions of TeamTrack
      • Redefine the incident escalation process according to ITIL suggestions
    • Define the interface between PC Support and IT Production Support Team
      • Communication channel
      • Roles and responsibilities
      • Escalation
      • Track and Control
      • Knowledge sharing
  • 39. Enforce PIR
    • Contact each user to confirm all the incidents are closed
    • Make sure the Problem record is complete and useful
    • Identify issues in the Incident and Problem Management process. Add those to the Problem Database.
  • 40. Proactively Manage Problems
    • Responsibility of the Problem Management Team.
    • Perform the following activities:
      • Analyze incidents to find the trend
      • Analyze infrastructure to identify possible bottlenecks
      • Run fail-over and stress tests
      • Apply a problem solution across multiple related applications
      • Establish and maintain the Production Monitor System to proactively detect system anomalies
    • Evaluate how many problems are proactively identified and resolved
  • 41. Service Desk Software
    • Evaluate the existing TeamTrack software and see if it covers our needs
    • Other popular options
      • HP OpenView Service Desk
      • Remedy Strategic Service Suite
      • CA Unicenter Service Desk