Deal With Production Issues - The ITIL Way
Introduces the Incident Management and Problem Management concepts of ITIL and describes how to manage production issues using ideas from ITIL.

Transcript

  • 1. Deal with Production Issues: Suggestions from ITIL
  • 2. Problems to solve
    • Long resolution time
    • Neglected issues
      • Issues we lose track of until our users remind us
    • Recurring issues
    • Inconsistency in response time
    • Developers are constantly pulled away from their work to resolve issues
  • 3. Goal
    • Manage issues in a consistent manner
    • Fast resolution
    • Reduce client impact
    • Proactively resolve issues before they impact clients
  • 4. Basic Concepts
    • Incidents
      • Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
    • Problems
      • A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.
    • Known Errors
      • A known error is a condition identified by successful diagnosis of the root cause of a problem and the subsequent development of a workaround
  • 5. Relationship of the three
    • A Problem is the root cause of one or more Incidents
    • An Incident is the manifestation of an underlying Problem
    • One Problem can cause many Incidents
    • A Known Error is a Problem with a known root cause and a known workaround
  • 6. Manage Incident vs. Manage Problem
    • Different goals
      • Incident Management focuses on restoring service operation as quickly as possible
      • Problem Management focuses on finding and eliminating the root cause
    • Different actions
      • Incident Management applies workarounds or temporary fixes to quickly restore services
      • Problem Management issues a change to eliminate the root cause permanently
    • Incident Management is reactive; Problem Management is proactive
    • Incident Management emphasizes speed; Problem Management emphasizes quality
  • 7. Common mistakes
    • Spending a tremendous amount of time and effort looking for the root cause before the service level is restored
    • Stopping the investigation once an incident is fixed by a workaround
    • Letting the same incident recur without ever understanding its root cause
  • 8. Solutions from ITIL
    • Separate out Incident Management and Problem Management into two independent but related processes
    • Handle incidents (restore service) as quickly as possible
    • Proactively and independently work on resolving problems
    • Wisely manage Known Errors
  • 9. Incident Management
    • Always remember that the goal is to “Restore service level as quickly as possible”
    • How to go fast?
      • Classification
      • Match known errors and known workarounds
      • Appropriate escalation
    • Go fast, but don’t go crazy. Don’t skip:
      • Record
      • Prioritize
      • Follow up
  • 10. Incident Management Process
  • 11. Acceptance And Record
    • Benefits of recording
      • Helps diagnose new incidents based on known incidents
      • Helps Problem Management find the root cause
      • Makes it easy to determine the impact
      • Makes it possible to track and control issue resolution
    • Incident Reporting Channels
      • User
      • System Monitor/Alert
      • IT person
  • 12. Incident Record
    • Unique ID
    • Basic diagnosis info
      • Timestamp
      • Symptoms
      • User info (name, contact info)
      • Who’s responsible
    • Additional information
      • Screenshots
      • Logs
    • Status
      • New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
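The fields on this slide map naturally onto a small record type. The following is a minimal sketch, not a definitive schema: the class name `IncidentRecord`, the field names, and the in-memory ID counter are all assumptions made for illustration; only the field list and the status values come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from itertools import count


class Status(Enum):
    """Status values listed on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical in-memory ID source; a real tracker would assign IDs itself.
_next_id = count(1)


@dataclass
class IncidentRecord:
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible
    timestamp: datetime = field(default_factory=datetime.now)
    attachments: list[str] = field(default_factory=list)  # screenshots, logs
    status: Status = Status.NEW
    incident_id: int = field(default_factory=lambda: next(_next_id))


# Example values below are invented for illustration.
incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="Jane Doe",
    user_contact="jane.doe@example.com",
    owner="service-desk",
)
incident.attachments.append("app-server.log")
incident.status = Status.ACCEPTED
```

Keeping the status as an enumeration rather than free text makes it harder for records to drift into values the process does not define.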
  • 13. Classification
    • Classification
      • Possible reasons (application, network, database, business logic, etc.)
      • Supporting group (application group, database group, infrastructure group, network group, etc.)
    • Prioritize
      • Priority = Impact × Urgency
      • Determine resolution timeline (resolve within X hours) based on Service Level Agreement
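As a rough illustration of the priority rule, the sketch below multiplies impact by urgency and maps the product onto four priority levels. The 1-to-3 rating scales and the band boundaries are assumptions, not part of the deck; the resolution hours per priority (2, 4, 6, 8) are the ones shown in the escalation matrix on slide 17. In practice these values would come from the Service Level Agreement.

```python
# Assumed scales: impact and urgency are each rated 1 (highest) to 3 (lowest).
# Resolution hours per priority are taken from the matrix on slide 17.
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: int, urgency: int) -> int:
    """Map Impact x Urgency onto priority 1 (most urgent) to 4."""
    if not (1 <= impact <= 3 and 1 <= urgency <= 3):
        raise ValueError("impact and urgency must be between 1 and 3")
    score = impact * urgency  # possible products: 1, 2, 3, 4, 6, 9
    # Hypothetical band boundaries; adjust to the SLA.
    if score == 1:
        return 1
    if score == 2:
        return 2
    if score <= 4:
        return 3
    return 4


def resolution_hours(impact: int, urgency: int) -> int:
    """Return the resolution timeline in hours for the computed priority."""
    return RESOLUTION_HOURS[priority(impact, urgency)]
```

For example, under these assumptions an incident with the highest impact and medium urgency gets priority 2 and a four-hour resolution timeline.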
  • 14. Preliminary Support
    • Preliminary Response
      • Acknowledge acceptance of the incident
      • Collect basic info
      • Provide basic help to the user
    • Service Requests
      • A Service Request is a standard service such as checking a status or resetting a password
      • Follow the standard procedure to handle service requests
  • 15. Match
    • Match known errors
      • Known solution
      • Known workaround
      • Known resolution procedure
    • Match existing incidents
      • Link the new incident with the existing incidents
      • Increase the impact level of the existing incident
      • If the existing incident is already being worked on, inform the responsible person or group
  • 16. Investigate and Diagnose
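One simple way to match a new incident against known errors is keyword overlap between symptom descriptions. The sketch below is only an illustration of the idea: `KnownError`, `match_known_errors`, the four-letter word filter, and the `min_overlap` threshold are all hypothetical, and a real service desk tool would likely use its own search.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    error_id: str
    symptoms: str
    workaround: str


def _words(text: str) -> set[str]:
    """Lowercased words longer than three letters, punctuation stripped."""
    cleaned = (w.strip(".,:;()").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 3}


def match_known_errors(symptoms, known_errors, min_overlap=2):
    """Return known errors sharing at least min_overlap keywords, best first."""
    incident_words = _words(symptoms)
    scored = []
    for ke in known_errors:
        overlap = len(incident_words & _words(ke.symptoms))
        if overlap >= min_overlap:
            scored.append((overlap, ke))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in scored]


# Invented sample data for illustration.
known = [
    KnownError("KE-1", "Report export times out for large date ranges",
               "Export one month at a time"),
    KnownError("KE-2", "Login fails after password reset",
               "Clear the session cache"),
]
matches = match_known_errors("User says report export times out", known)
```

A match lets the service desk apply the recorded workaround immediately instead of starting a fresh investigation.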
    • Escalation
      • Functional escalation (Technical escalation) : Involve more technical experts, involve teams in other functional group, or involve external suppliers
      • Hierarchical escalation (Management escalation): Escalate to higher level management team
  • 17. Escalation by Priorities
    • A (Service Desk)
    • B (Second Line)
    • C (Third Line, Supplier)
    • D (Incident Manager)
    • E (Division Management)
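    • F (Corporate Management)
    • Escalation checkpoints: 0 minutes, 10 minutes, then 30%, 60%, and 100% of the resolution timeline
    • Escalation order by priority
      • Priority 1 (resolution timeline 2 hr): A, B, C/D, E/F
      • Priority 2 (resolution timeline 4 hr): A, B, C, D, E/F
      • Priority 3 (resolution timeline 6 hr): A, B, C, D
      • Priority 4 (resolution timeline 8 hr): A, B, C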
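The escalation checkpoints in this matrix (0 minutes, 10 minutes, then 30%, 60%, and 100% of the resolution timeline) can be turned into concrete times. The sketch below is a hedged illustration: the function names are hypothetical, and which support level is engaged at each checkpoint still depends on the priority.

```python
from datetime import timedelta

# Two fixed offsets, then percentages of the resolution timeline.
FIXED_CHECKPOINTS = [timedelta(minutes=0), timedelta(minutes=10)]
TIMELINE_PERCENTS = [30, 60, 100]


def escalation_checkpoints(resolution_hours: int) -> list[timedelta]:
    """Return the elapsed-time offsets at which escalation is reviewed."""
    timeline = timedelta(hours=resolution_hours)
    return FIXED_CHECKPOINTS + [timeline * p / 100 for p in TIMELINE_PERCENTS]


def checkpoints_reached(resolution_hours: int, elapsed: timedelta) -> int:
    """Count how many checkpoints have passed after `elapsed` time."""
    return sum(1 for cp in escalation_checkpoints(resolution_hours)
               if elapsed >= cp)
```

For a four-hour timeline the checkpoints fall at 0, 10, 72, 144, and 240 minutes, so an incident that is 80 minutes old has passed three of them.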
  • 18. Investigation Activities
    • Assign dedicated support person
    • Collect basic info
    • Query historical data
      • Recent releases
      • Recent changes
      • Workload trend
    • Analyze
    • Again, don’t spend too much time looking for the root cause. Find a workaround as soon as possible!
  • 19. Resolve and recover
    • Resolution (workarounds or permanent fix)
      • Create a Request For Change (RFC)
      • Approve RFC
      • Implement Change.
    • Record the analysis, the root cause, the workaround and the solution
    • Leave the incident in Open status if no resolution has been found
  • 20. Termination
    • Contact the user to confirm that the incident is resolved
    • Change the Incident status to “Closed”
    • Update the Incident record to reflect the final priority, impact, user and root cause
  • 21. Track and Monitor
    • Assign an owner to each incident. Usually it’s the Service Desk person.
    • Provide feedback to the users after a change
    • Enforce the escalation based on the priority
  • 22. Problem Management
    • Problem Control
      • Find the root cause of a problem
      • Turn a problem into a Known Error
    • Error Control
      • Control and Monitor the Known Errors until they are appropriately handled
    • Proactive Problem Management
      • Resolve problems before they cause any incidents
  • 23. Problem Control
  • 24. Identify Problems
    • Analyze the trends of incidents
      • Likely to reoccur
      • Likely more will occur
      • Likely to have larger impact
    • Analyze the weakness of the infrastructure
      • Availability
      • Capability
    • A significant incident (outage)
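Trend analysis can start as simply as counting recent incidents per category and flagging the categories that keep recurring. This is a minimal sketch under stated assumptions: incidents are represented as `(date, category)` pairs, and the function name and the threshold of three are hypothetical.

```python
from collections import Counter
from datetime import date


def recurring_categories(incidents, since, threshold=3):
    """Return categories with at least `threshold` incidents on or after `since`."""
    counts = Counter(cat for day, cat in incidents if day >= since)
    return {cat: n for cat, n in counts.items() if n >= threshold}


# Invented sample data for illustration.
incidents = [
    (date(2024, 1, 3), "database"),
    (date(2024, 1, 9), "database"),
    (date(2024, 1, 15), "network"),
    (date(2024, 1, 20), "database"),
    (date(2023, 12, 1), "network"),
]
candidates = recurring_categories(incidents, since=date(2024, 1, 1))
```

Categories returned here are candidates for a Problem record, since repeated incidents with common symptoms usually point to a shared underlying cause.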
  • 25. Diagnosis
    • Recreate incident in testing environment
    • Link the modules with incidents
    • Review the latest changes
    • After the root cause of a problem is found, this problem becomes a Known Error
  • 26. Temporary Fixes
    • It’s important to find a temporary fix if the problem causes significant incidents
    • If the temporary fix involves changes to the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause.)
    • For urgent problems, the Emergency Change Request Process should be initiated.
  • 27. Error Control
  • 28. Identify and Record Known Error
    • Identify
      • Find the root cause of a problem
      • Link a problem with a known error
    • Record
      • Assign an ID
      • Symptoms
      • Root cause
      • Status
    • Notification
      • Notify incident management team. They can associate new incidents with known errors
  • 29. Determine the solution
    • Evaluate based on
      • Service Level Agreement
      • Impact and Urgency
      • Cost and benefit
    • Possible solutions
      • Temporary fixes
      • Permanent fixes
      • No fix (cost is greater than benefits)
    • Record the decision in Problem Database
  • 30. Known Errors from other environments
    • Known errors from development environment
      • We may choose to release with some minor known issues
    • Known errors from suppliers
      • Usually reported in the release notes
    • Record, Monitor and Track those known errors
    • Relate problems with those known errors
  • 31. PIR (Post Implementation Review)
    • Normal problems
      • Confirm all the related incidents are closed
      • Verify that the problem record is complete (symptoms, root cause and solutions)
      • Change the problem status to Resolved
    • Significant problems
      • What went well?
      • What went wrong?
      • How to do better next time?
      • How to prevent similar issues from happening again?
  • 32. Track and Monitor
    • Track the full lifecycle of each known error
      • Reevaluate impact and urgency. Adjust the priorities accordingly.
      • Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
  • 33. Proactive Problem Management
    • Focus on the quality of the service and the infrastructure
    • Analyze operational trends
    • Detect the potential incidents and prevent them from happening
    • Find out the weak points of the infrastructure or the overloaded components
  • 34. Ideas to improve our Production Support process
    • Idea 1: Create an independent Problem Management Team.
    • Idea 2: Create a Problem Database
    • Idea 3: Define the Production Support Procedure
    • Idea 4: Review and revise the procedures for using TeamTrack
    • Idea 5: Enforce Post Implementation Review
    • Idea 6: Proactively manage problems
    • Idea 7 (optional): Acquire Service Desk software to facilitate the process
  • 35. Create an independent Problem Management Team.
    • Can be a full time team or a part time team
    • Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.
    • Responsible for managing all the production problems (not incidents) for multiple applications
      • Identify problems
      • Record problems
      • Find and evaluate solutions
      • Track the progress till closure
    • Work closely with the existing Production Support team.
  • 36. Create a Problem Database
    • An easy-to-search knowledge database
    • Include problems and known errors
    • Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions
    • Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
    • Maintained by the Problem Management Team
    • Will be used by the Production Support team for matching and fast resolution of incidents
  • 37. Define the Production Support Procedure (Work Instructions)
    • Create a formal and detailed document. Train Production Support Team to follow the new procedure
    • Start with ITIL Incident Management Process. Adjust it to our own situation and tools
    • Clearly define how to calculate priorities
    • Clearly define the time-bound escalation procedure
    • Clearly define the monitoring and tracking steps
  • 38. Review and define the procedure for using TeamTrack
    • TeamTrack is our existing Incident Tracking system
      • Review the functions of TeamTrack
      • Redefine the incident escalation process according to ITIL suggestions
    • Define the interface between PC Support and IT Production Support Team
      • Communication channel
      • Roles and responsibilities
      • Escalation
      • Track and Control
      • Knowledge sharing
  • 39. Enforce PIR
    • Contact each user to confirm all the incidents are closed
    • Make sure the Problem record is complete and useful
    • Identify issues in the Incident and Problem Management process. Add those to Problem database.
  • 40. Proactively Manage Problems
    • Responsibility of the Problem Management Team.
    • Perform the following activities:
      • Analyze incidents to find the trend
      • Analyze infrastructure to identify possible bottlenecks
      • Run fail-over and stress tests
      • Apply a problem solution across multiple related applications
      • Establish and maintain the Production Monitor System to proactively detect system anomalies
    • Evaluate how many problems are proactively identified and resolved
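One possible way to evaluate this is the share of resolved problems that were found before they caused any incident. The sketch below assumes each problem is a dictionary with hypothetical `resolved` and `incident_count` keys; neither the record layout nor the metric definition comes from the deck.

```python
def proactive_ratio(problems):
    """Share of resolved problems that never caused an incident."""
    resolved = [p for p in problems if p["resolved"]]
    if not resolved:
        return 0.0
    proactive = sum(1 for p in resolved if p["incident_count"] == 0)
    return proactive / len(resolved)


# Invented sample data for illustration.
problems = [
    {"id": "P-1", "resolved": True, "incident_count": 0},
    {"id": "P-2", "resolved": True, "incident_count": 4},
    {"id": "P-3", "resolved": False, "incident_count": 0},
]
ratio = proactive_ratio(problems)
```

Tracking this ratio over time gives the Problem Management Team a simple indicator of whether its proactive work is paying off.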
  • 41. Service Desk Software
    • Evaluate the existing TeamTrack software and see if it covers our needs
    • Other popular options
      • HP OpenView Service Desk
      • Remedy Strategic Service Suite
      • CA Unicenter Service Desk
