Deal With Production Issues - The ITIL Way


Introduces the Incident Management and Problem Management concepts of ITIL and describes how to manage production issues with ideas from ITIL.


  1. 1. Deal with Production Issues: Suggestions from ITIL
  2. 2. Problems to solve <ul><li>Long resolution time </li></ul><ul><li>Neglected issues </li></ul><ul><ul><li>Issues we lose track of until our users remind us </li></ul></ul><ul><li>Recurring issues </li></ul><ul><li>Inconsistent response times </li></ul><ul><li>Developers are constantly pulled away from their work to resolve issues </li></ul>
  3. 3. Goal <ul><li>Manage issues in a consistent manner </li></ul><ul><li>Fast resolution </li></ul><ul><li>Reduce client impact </li></ul><ul><li>Proactively resolve issues before they impact clients </li></ul>
  4. 4. Basic Concepts <ul><li>Incidents </li></ul><ul><ul><li>Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms. </li></ul></ul><ul><li>Known Errors </li></ul><ul><ul><li>A known error is a condition identified by successful diagnosis of the root cause of a problem, and subsequent development of a workaround </li></ul></ul>
  5. 5. Relationship of the three <ul><li>A Problem is the root cause of Incidents </li></ul><ul><li>An Incident is the manifestation of an underlying Problem </li></ul><ul><li>One Problem can cause many Incidents </li></ul><ul><li>A Known Error is a Problem with a known root cause and a known workaround </li></ul>
  6. 6. Manage Incident vs. Manage Problem <ul><li>Different goals </li></ul><ul><ul><li>Incident Management focuses on restoring the service operation as quickly as possible </li></ul></ul><ul><ul><li>Problem Management focuses on finding and eliminating the root cause </li></ul></ul><ul><li>Different actions </li></ul><ul><ul><li>Incident Management applies workarounds or temporary fixes to quickly restore the services </li></ul></ul><ul><ul><li>Problem Management issues a change to eliminate the root cause </li></ul></ul><ul><li>Incident Management is reactive and Problem Management is proactive </li></ul><ul><li>Incident Management emphasizes speed and Problem Management emphasizes quality </li></ul>
  7. 7. Common mistakes <ul><li>Spending a great deal of time and effort looking for the root cause before the service level is restored </li></ul><ul><li>Stopping the investigation once an incident is fixed by a workaround </li></ul><ul><li>Letting the same incident occur repeatedly without understanding the root cause </li></ul>
  8. 8. Solutions from ITIL <ul><li>Separate out Incident Management and Problem Management into two independent but related processes </li></ul><ul><li>Handle incidents (restore service) as quickly as possible </li></ul><ul><li>Proactively and independently work on resolving problems </li></ul><ul><li>Wisely manage Known Errors </li></ul>
  9. 9. Incident Management <ul><li>Always remember the goal is to “Restore service level as quickly as possible” </li></ul><ul><li>How to go fast? </li></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Match known errors and known workarounds </li></ul></ul><ul><ul><li>Appropriate escalation </li></ul></ul><ul><li>Go fast, but don’t go crazy. Don’t skip: </li></ul><ul><ul><li>Record </li></ul></ul><ul><ul><li>Prioritize </li></ul></ul><ul><ul><li>Follow up </li></ul></ul>
  10. 10. Incident Management Process
  11. 11. Acceptance And Record <ul><li>Benefits of recording </li></ul><ul><ul><li>Helps diagnose new incidents based on known incidents </li></ul></ul><ul><ul><li>Helps Problem Management find the root cause </li></ul></ul><ul><ul><li>Makes it easy to determine the impact </li></ul></ul><ul><ul><li>Makes it possible to track and control the issue resolution </li></ul></ul><ul><li>Incident Reporting Channels </li></ul><ul><ul><li>User </li></ul></ul><ul><ul><li>System Monitor/Alert </li></ul></ul><ul><ul><li>IT person </li></ul></ul>
  12. 12. Incident Record <ul><li>Unique ID </li></ul><ul><li>Basic diagnosis info </li></ul><ul><ul><li>Timestamp </li></ul></ul><ul><ul><li>Symptoms </li></ul></ul><ul><ul><li>User info (name, contact info) </li></ul></ul><ul><ul><li>Who’s responsible </li></ul></ul><ul><li>Additional information </li></ul><ul><ul><li>Screenshots </li></ul></ul><ul><ul><li>Logs </li></ul></ul><ul><li>Status </li></ul><ul><ul><li>New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated </li></ul></ul>
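The record fields listed on this slide could be captured in a small data structure. The following is a minimal Python sketch, not a prescribed ITIL schema: the class name `IncidentRecord`, the field names, and the use of a random UUID as the unique ID are all assumptions made for illustration. Only the list of statuses is taken from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
import uuid


class Status(Enum):
    """Incident statuses as listed on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


@dataclass
class IncidentRecord:
    """Hypothetical incident record; field names are illustrative."""
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible for the incident
    # Unique ID: a random UUID is an assumption, any unique key would do.
    incident_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Additional information such as paths to screenshots or logs.
    attachments: list[str] = field(default_factory=list)
    status: Status = Status.NEW


# Example: a newly reported incident starts in the "New" status.
incident = IncidentRecord(
    symptoms="Order page returns an error after submit",
    user_name="Example User",
    user_contact="user@example.com",
    owner="Service Desk",
)
incident.status = Status.ACCEPTED
```

Keeping the status as an enumeration rather than free text makes it harder for records to drift into inconsistent states, which supports the tracking and follow-up steps described later in the deck.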
  13. 13. Classification <ul><li>Classification </li></ul><ul><ul><li>Possible reasons (application, network, database, business logic, etc.) </li></ul></ul><ul><ul><li>Supporting group (application group, database group, infrastructure group, network group, etc.) </li></ul></ul><ul><li>Prioritize </li></ul><ul><ul><li>Priority = Impact × Urgency </li></ul></ul><ul><ul><li>Determine resolution timeline (resolve within X hours) based on Service Level Agreement </li></ul></ul>
  14. 14. Preliminary Support <ul><li>Preliminary Response </li></ul><ul><ul><li>Acknowledge acceptance </li></ul></ul><ul><ul><li>Collect basic info </li></ul></ul><ul><ul><li>Provide basic help to the user </li></ul></ul><ul><li>Service Requests </li></ul><ul><ul><li>A Service Request is a standard service such as checking a status or resetting a password </li></ul></ul><ul><ul><li>Follow the standard procedure to handle service requests </li></ul></ul>
  15. 15. Match <ul><li>Match known errors </li></ul><ul><ul><li>Known solution </li></ul></ul><ul><ul><li>Known workaround </li></ul></ul><ul><ul><li>Known resolution procedure </li></ul></ul><ul><li>Match existing incidents </li></ul><ul><ul><li>Link the new incident with the existing incidents </li></ul></ul><ul><ul><li>Increase the impact level of the existing incident </li></ul></ul><ul><ul><li>If the existing one is already being worked on, inform the responsible person/group </li></ul></ul>
  16. 16. Investigation and Diagnosis <ul><li>Escalation </li></ul><ul><ul><li>Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers </li></ul></ul><ul><ul><li>Hierarchical escalation (management escalation): escalate to a higher-level management team </li></ul></ul>
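The priority formula on this slide can be made concrete with a small calculation. The sketch below is illustrative only. The 1 to 3 scales for impact and urgency (1 being the highest) and the score bands are assumptions, since the slide does not define them. The resolution timelines of 2, 4, 6, and 8 hours for priorities 1 to 4 are the ones shown on the "Escalation by Priorities" slide; in practice they would come from the Service Level Agreement.

```python
# Resolution timeline in hours per priority, as shown on the
# "Escalation by Priorities" slide.
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: int, urgency: int) -> int:
    """Return a priority from 1 (highest) to 4 (lowest).

    Assumed scale: impact and urgency are each rated 1 (high),
    2 (medium) or 3 (low). The banding of the product into four
    priorities is a hypothetical choice for this example.
    """
    for name, value in (("impact", impact), ("urgency", urgency)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2 or 3, got {value!r}")

    score = impact * urgency  # possible values: 1, 2, 3, 4, 6, 9
    if score == 1:
        return 1
    if score <= 3:
        return 2
    if score <= 6:
        return 3
    return 4


# Example: high impact, medium urgency.
p = priority(impact=1, urgency=2)
hours = RESOLUTION_HOURS[p]
```

Whatever scale is chosen, the point of the slide holds: the calculation should be written down once so that every incident is prioritized the same way.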
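Matching a new incident against known errors can be as simple as comparing symptom keywords. The sketch below is one possible approach under stated assumptions, not a method the deck prescribes: the `KnownError` structure, the keyword-overlap scoring, and the `min_overlap` threshold are all hypothetical. A real service desk tool would typically offer its own search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    """Hypothetical known-error entry with searchable keywords."""
    error_id: str
    keywords: frozenset[str]
    workaround: str


def tokenize(text: str) -> set[str]:
    """Lower-case the words of a symptom description, dropping punctuation."""
    return {w.strip(".,:;!?()").lower() for w in text.split()} - {""}


def match_known_errors(symptoms, known_errors, min_overlap=2):
    """Return known errors sharing at least `min_overlap` keywords
    with the reported symptoms, best match first."""
    words = tokenize(symptoms)
    scored = [(len(words & ke.keywords), ke) for ke in known_errors]
    hits = [(score, ke) for score, ke in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in hits]


# Example data, invented for illustration.
catalog = [
    KnownError("KE-1", frozenset({"login", "timeout", "ldap"}),
               "Restart the directory connector"),
    KnownError("KE-2", frozenset({"report", "export", "timeout"}),
               "Export in smaller date ranges"),
]
matches = match_known_errors(
    "Login page shows timeout after password submit.", catalog
)
```

When a match is found, the known workaround can be applied immediately, which is the main way this step shortens resolution time.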
  17. 17. Escalation by Priorities <ul><li>A (Service Desk) </li></ul><ul><li>B (Second Line) </li></ul><ul><li>C (Third Line, Supplier) </li></ul><ul><li>D (Incident Manager) </li></ul><ul><li>E (Division Management) </li></ul><ul><li>F (Corporate Management) </li></ul>
      <table>
      <tr><th>Priority</th><th>Resolution timeline</th><th>0 Minute</th><th>10 Minute</th><th>30% timeline</th><th>60% timeline</th><th>100% timeline</th></tr>
      <tr><td>1</td><td>2 hr</td><td>A</td><td>B</td><td>C, D</td><td>E, F</td><td></td></tr>
      <tr><td>2</td><td>4 hr</td><td>A</td><td>B</td><td>C</td><td>D</td><td>E, F</td></tr>
      <tr><td>3</td><td>6 hr</td><td>A</td><td>B</td><td>C</td><td>D</td><td></td></tr>
      <tr><td>4</td><td>8 hr</td><td>A</td><td>B</td><td>C</td><td></td><td></td></tr>
      </table>
  18. 18. Investigation Activities <ul><li>Assign dedicated support person </li></ul><ul><li>Collect basic info </li></ul><ul><li>Query historical data </li></ul><ul><ul><li>Recent releases </li></ul></ul><ul><ul><li>Recent changes </li></ul></ul><ul><ul><li>Workload trend </li></ul></ul><ul><li>Analyze </li></ul><ul><li>Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible! </li></ul>
  19. 19. Resolve and recover <ul><li>Resolution (workarounds or permanent fix) </li></ul><ul><ul><li>Create a Request For Change (RFC) </li></ul></ul><ul><ul><li>Approve RFC </li></ul></ul><ul><ul><li>Implement Change. </li></ul></ul><ul><li>Record the analysis, the root cause, the workaround and the solution </li></ul><ul><li>Leave the incident in Open status when a resolution hasn’t been found </li></ul>
  20. 20. Termination <ul><li>Contact the user to confirm incident is resolved </li></ul><ul><li>Change the Incident status to “Closed” </li></ul><ul><li>Update the Incident record to reflect the final priority, impact, user and root cause </li></ul>
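A time-bound escalation matrix like the one on this slide lends itself to a simple lookup. The sketch below is illustrative and rests on one reading of the matrix: checkpoints at 0 minutes, 10 minutes, and 30%, 60%, and 100% of the resolution timeline, with each row filled from the left. That reading, the function name `levels_involved`, and the data layout are assumptions.

```python
from datetime import timedelta

# Resolution timeline in hours for priorities 1 to 4 (from the slide).
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}

# Checkpoints: two fixed offsets, then fractions of the timeline.
CHECKPOINTS = [timedelta(minutes=0), timedelta(minutes=10), 0.30, 0.60, 1.00]

# Escalation levels added at each checkpoint, per priority.
# A = Service Desk, B = Second Line, C = Third Line/Supplier,
# D = Incident Manager, E = Division Mgmt, F = Corporate Mgmt.
ESCALATION = {
    1: ["A", "B", "CD", "EF", ""],
    2: ["A", "B", "C", "D", "EF"],
    3: ["A", "B", "C", "D", ""],
    4: ["A", "B", "C", "", ""],
}


def levels_involved(priority: int, elapsed: timedelta) -> list[str]:
    """Return the escalation levels that should be involved once
    `elapsed` time has passed since the incident was accepted."""
    timeline = timedelta(hours=RESOLUTION_HOURS[priority])
    involved: list[str] = []
    for checkpoint, levels in zip(CHECKPOINTS, ESCALATION[priority]):
        if isinstance(checkpoint, timedelta):
            due = checkpoint
        else:
            due = timeline * checkpoint  # fraction of the timeline
        if elapsed >= due:
            involved.extend(levels)  # "CD" adds "C" and "D"
    return involved


# Example: a priority 1 incident that has been open for 40 minutes.
current = levels_involved(1, timedelta(minutes=40))
```

A scheduled job could call such a function for every open incident and notify any level that has not yet been informed, which is one way to enforce escalation rather than rely on memory.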
  21. 21. Track and Monitor <ul><li>Assign an owner to each incident. Usually it’s the Service Desk person. </li></ul><ul><li>Provide feedback to the users after a change </li></ul><ul><li>Enforce the escalation based on the priority </li></ul>
  22. 22. Problem Management <ul><li>Problem Control </li></ul><ul><ul><li>Find the root cause of a problem </li></ul></ul><ul><ul><li>Turn a problem into a Known Error </li></ul></ul><ul><li>Error Control </li></ul><ul><ul><li>Control and Monitor the Known Errors until they are appropriately handled </li></ul></ul><ul><li>Proactive Problem Management </li></ul><ul><ul><li>Resolve problems before they cause any incidents </li></ul></ul>
  23. 23. Problem Control
  24. 24. Identify Problems <ul><li>Analyze the trends of incidents </li></ul><ul><ul><li>Likely to reoccur </li></ul></ul><ul><ul><li>Likely more will occur </li></ul></ul><ul><ul><li>Likely to have larger impact </li></ul></ul><ul><li>Analyze the weaknesses of the infrastructure </li></ul><ul><ul><li>Availability </li></ul></ul><ul><ul><li>Capability </li></ul></ul><ul><li>A significant incident (outage) </li></ul>
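Trend analysis of incidents can start with something as basic as counting how often each classification recurs. The sketch below is a minimal illustration. The function name `problem_candidates`, the category labels, and the threshold of three occurrences are assumptions, and a real analysis would also weigh impact and time windows.

```python
from collections import Counter


def problem_candidates(incident_categories, min_count=3):
    """Return (category, count) pairs for categories that recur at
    least `min_count` times, most frequent first."""
    counts = Counter(incident_categories)
    return [
        (category, n)
        for category, n in counts.most_common()
        if n >= min_count
    ]


# Example: classifications of recent incidents, invented for illustration.
recent = [
    "database", "network", "database", "application",
    "database", "network", "network",
]
candidates = problem_candidates(recent)
```

Categories that cross the threshold are candidates for a Problem record, so that the root cause is investigated instead of the same workaround being applied again.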
  25. 25. Diagnosis <ul><li>Recreate incident in testing environment </li></ul><ul><li>Link the modules with incidents </li></ul><ul><li>Review the latest changes </li></ul><ul><li>After the root cause of a problem is found, this problem becomes a Known Error </li></ul>
  26. 26. Temporary Fixes <ul><li>It’s important to find a temporary fix if the problem causes a significant incident </li></ul><ul><li>If the temporary fix involves changes to the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause) </li></ul><ul><li>For urgent problems, the Emergency Change Request Process should be initiated. </li></ul>
  27. 27. Error Control
  28. 28. Identify and Record Known Error <ul><li>Identify </li></ul><ul><ul><li>Find the root cause of a problem </li></ul></ul><ul><ul><li>Link a problem with a known error </li></ul></ul><ul><li>Record </li></ul><ul><ul><li>Assign an ID </li></ul></ul><ul><ul><li>Symptoms </li></ul></ul><ul><ul><li>Root cause </li></ul></ul><ul><ul><li>Status </li></ul></ul><ul><li>Notification </li></ul><ul><ul><li>Notify incident management team. They can associate new incidents with known errors </li></ul></ul>
  29. 29. Determine the solution <ul><li>Evaluate based on </li></ul><ul><ul><li>Service Level Agreement </li></ul></ul><ul><ul><li>Impact and Urgency </li></ul></ul><ul><ul><li>Cost and benefit </li></ul></ul><ul><li>Possible solutions </li></ul><ul><ul><li>Temporary fixes </li></ul></ul><ul><ul><li>Permanent fixes </li></ul></ul><ul><ul><li>No fix (cost is greater than benefits) </li></ul></ul><ul><li>Record the decision in Problem Database </li></ul>
  30. 30. Known Errors from other environments <ul><li>Known errors from development environment </li></ul><ul><ul><li>We may choose to release with some minor known issues </li></ul></ul><ul><li>Known errors from suppliers </li></ul><ul><ul><li>Usually reported in the release notes </li></ul></ul><ul><li>Record, Monitor and Track those known errors </li></ul><ul><li>Relate problems with those known errors </li></ul>
  31. 31. PIR (Post Implementation Review) <ul><li>Normal problems </li></ul><ul><ul><li>Confirm all the related incidents are closed </li></ul></ul><ul><ul><li>Verify that the problem record is complete (symptoms, root cause and solutions) </li></ul></ul><ul><ul><li>Change the problem status to Resolved </li></ul></ul><ul><li>Significant problems </li></ul><ul><ul><li>What went well? </li></ul></ul><ul><ul><li>What went wrong? </li></ul></ul><ul><ul><li>How to do better next time? </li></ul></ul><ul><ul><li>How to prevent similar issues from happening again? </li></ul></ul>
  32. 32. Track and Monitor <ul><li>Track the full lifecycle of each known error </li></ul><ul><ul><li>Reevaluate impact and urgency. Adjust the priorities accordingly. </li></ul></ul><ul><ul><li>Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC. </li></ul></ul>
  33. 33. Proactive Problem Management <ul><li>Focus on the quality of the service and the infrastructure </li></ul><ul><li>Analyze operational trends </li></ul><ul><li>Detect the potential incidents and prevent them from happening </li></ul><ul><li>Find out the weak points of the infrastructure or the overloaded components </li></ul>
  34. 34. Ideas to improve our Production Support process <ul><li>Idea 1: Create an independent Problem Management Team. </li></ul><ul><li>Idea 2: Create a Problem Database </li></ul><ul><li>Idea 3: Define the Production Support Procedure </li></ul><ul><li>Idea 4: Review and revise the procedures of using TeamTrack </li></ul><ul><li>Idea 5: Enforce Post Implementation Review </li></ul><ul><li>Idea 6: Proactively manage problems </li></ul><ul><li>Idea 7 (optional): Acquire Service Desk software to facilitate the process </li></ul>
  35. 35. Create an independent Problem Management Team. <ul><li>Can be a full time team or a part time team </li></ul><ul><li>Appoint a Problem Management Manager, who must not be the Production Support Manager: their goals, schedules and requirements are different. </li></ul><ul><li>Responsible for managing all the production problems (not incidents) for multiple applications </li></ul><ul><ul><li>Identify problems </li></ul></ul><ul><ul><li>Record problems </li></ul></ul><ul><ul><li>Find and evaluate solutions </li></ul></ul><ul><ul><li>Track the progress till closure </li></ul></ul><ul><li>Work closely with the existing Production Support team. </li></ul>
  36. 36. Create a Problem Database <ul><li>An easy-to-search knowledge database </li></ul><ul><li>Include problems and known errors </li></ul><ul><li>Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions </li></ul><ul><li>Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments </li></ul><ul><li>Maintained by the Problem Management Team </li></ul><ul><li>Will be used by the Production Support team for matching and fast resolution of incidents </li></ul>
  37. 37. Define the Production Support Procedure (Work Instructions) <ul><li>Create a formal and detailed document. Train Production Support Team to follow the new procedure </li></ul><ul><li>Start with ITIL Incident Management Process. Adjust it to our own situation and tools </li></ul><ul><li>Clearly define how to calculate priorities </li></ul><ul><li>Clearly define the time-bound escalation procedure </li></ul><ul><li>Clearly define the monitoring and tracking steps </li></ul>
  38. 38. Review and define the procedure of using TeamTrack <ul><li>TeamTrack is our existing Incident Tracking system </li></ul><ul><ul><li>Review the functions of TeamTrack </li></ul></ul><ul><ul><li>Redefine the incident escalation process according to ITIL suggestions </li></ul></ul><ul><li>Define the interface between PC Support and IT Production Support Team </li></ul><ul><ul><li>Communication channel </li></ul></ul><ul><ul><li>Roles and responsibilities </li></ul></ul><ul><ul><li>Escalation </li></ul></ul><ul><ul><li>Track and Control </li></ul></ul><ul><ul><li>Knowledge sharing </li></ul></ul>
  39. 39. Enforce PIR <ul><li>Contact each user to confirm all the incidents are closed </li></ul><ul><li>Make sure the Problem record is complete and useful </li></ul><ul><li>Identify issues in the Incident and Problem Management process. Add those to Problem database. </li></ul>
  40. 40. Proactively Manage Problems <ul><li>Responsibility of the Problem Management Team. </li></ul><ul><li>Perform the following activities: </li></ul><ul><ul><li>Analyze incidents to find the trend </li></ul></ul><ul><ul><li>Analyze infrastructure to identify possible bottlenecks </li></ul></ul><ul><ul><li>Run fail-over and stress tests </li></ul></ul><ul><ul><li>Apply a problem solution across multiple related applications </li></ul></ul><ul><ul><li>Establish and maintain the Production Monitor System to proactively detect system anomalies </li></ul></ul><ul><li>Evaluate how many problems are proactively identified and resolved </li></ul>
  41. 41. Service Desk Software <ul><li>Evaluate the existing TeamTrack software and see if it covers our needs </li></ul><ul><li>Other popular options </li></ul><ul><ul><li>HP Openview Service Desk </li></ul></ul><ul><ul><li>Remedy Strategic Service Suite </li></ul></ul><ul><ul><li>CA Unicenter Service Desk </li></ul></ul>