
incident analysis - procedure and approach




  1. How to walk away from your Outage looking like a HERO. Teresa Dietrich, Vice President Technology; Derek Chang, Director Site Reliability Engineering
  2. Who we are and why we are here
     • Teresa Dietrich: VP of Technical Operations @ WebMD, previously with AOL, @teresadg (Twitter)
     • Derek Chang: Director of Site Reliability Engineering (SRE) @ WebMD, with experience in Development, WebOps and CMS
     • We are passionate about Outages, Process & Procedures and always making new mistakes!
  3. About WebMD
     • Most recognized and trusted brand of health information
     • Serves consumers, physicians, other healthcare professionals, employers and health plans
     • 107 million visitors/month on both desktop and mobile platforms
     • 2.5 billion page views/month
  4. What is an Outage?
     • Service is unavailable to users or to a subset of users
     • Service is unable to function as designed and implemented
     • Degradation of service to the point the resource is unusable (defined SLAs)
  5. Why do Outages happen?
     • Bugs in OS, middleware, and application
     • Hardware failure
     • Infrastructure failure (Network, SAN)
     • Environment failures (Power, Cooling)
     • Human error
     • Demand exceeds capacity
     • Malicious attacks
  6. How are Outages exacerbated?
     • Monitoring takes too long to catch the issue
     • Monitoring does not catch the issue; humans eventually do
     • Too long to alert the appropriate people
     • Too long for people to respond to alerts
     • Too long to find the cause or source of the issue
     • Too long to resolve the issue
     • Lack of communication to internal and external customers
     • Multiple-failure scenarios
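Most of the delays on this slide are intervals between timestamps on an incident timeline. The following is a minimal Python sketch of that idea, not the presenters' tooling: the milestone names and the example timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical milestone names, in the order they are expected to occur.
MILESTONES = ["started", "detected", "alerted", "responded", "diagnosed", "resolved"]


def phase_durations(timeline):
    """Return the time spent between each pair of consecutive milestones."""
    durations = {}
    for earlier, later in zip(MILESTONES, MILESTONES[1:]):
        durations[f"{earlier} -> {later}"] = timeline[later] - timeline[earlier]
    return durations


def slowest_phase(timeline):
    """Return the label of the phase that consumed the most time."""
    durations = phase_durations(timeline)
    return max(durations, key=durations.get)


# Made-up timestamps, for illustration only.
example = {
    "started": datetime(2000, 1, 1, 9, 0),
    "detected": datetime(2000, 1, 1, 9, 25),
    "alerted": datetime(2000, 1, 1, 9, 30),
    "responded": datetime(2000, 1, 1, 9, 45),
    "diagnosed": datetime(2000, 1, 1, 11, 0),
    "resolved": datetime(2000, 1, 1, 11, 20),
}

for phase, spent in phase_durations(example).items():
    print(f"{phase}: {spent}")
print("Slowest phase:", slowest_phase(example))
```

Breaking the outage into phases this way shows which delay dominated, which is where a recommendation is likely to pay off most.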
  7. A different way to do a Post Mortem
     • Focus on improving processes and systems for the future, not on assigning responsibility for the outage.
     • Structure, structure, structure! Discover, Analyze and Review.
     • Analysis is done by a third-party engineer with DevOps experience @ WebMD.
     • Data is collected in a prescribed and orderly fashion, using a template.
     • Recommendations for improvement are owned, assigned and tracked through resolution.
  8. Incident Analysis Template 1. You can download the template @
  9. Incident Analysis Template 2. You can download the template @
  10. Incident 1 – background info
  11. Incident 1 – outage resolution
  12. Incident 1 – timeline analysis
  13. Incident 1 – timeline analysis
  14. Incident 1 – recent application builds, changes and maintenance
  15. Incident 1 – log analysis
  16. Incident 1 – log analysis
  17. Incident 1 – monitoring correlation
  18. Incident 1 – monitoring correlation
  19. Incident 1 – root cause analysis
  20. Incident 1 – root cause analysis
  21. Incident 1 – root cause analysis
      • It is caused by a known Oracle bug, 5181800, specifically on Oracle version [not captured].
      • LNS: The LNS (log-write network-server) and ARCH (archiver) processes running on the primary database (PHX1) select archived redo logs and send them to the standby database (IAD1), where the RFS (remote file server) background process within the Oracle instance receives them.
  22. Incident 1 – review and recommendation
      • RR01 (Type: Process)
        Review: no ON clear was sent after outage is cleared
        Description: update 4 was the last communication
        Recommendation: 1. Better process for outage communication 2. firstaid NMS - notification management system
      • RR03 (Type: Monitoring)
        Review: inadequate monitoring on oracle infrastructure
        Description: Currently oracle relies on a home-grown detection script to monitor the oracle event queue and send email upon errors. The fact that the IAD1 RAC problem (which is the origin of the control file lock in PHX1) didn't catch our attention made the troubleshooting a more difficult and longer process.
        Recommendation: We should look to a third party monitoring tool at hand (e.g. Zenoss) to monitor oracle components and implement oracle GRID control to provide additional monitoring.
      • RR04 (Type: Monitor alert)
        Review: inadequate monitoring on user experience
        Description: no alert was sent before/during outage from Gomez and Truesight.
        Recommendation: We should set up alert from Gomez and Truesight.
      • RR05 (Type: Development request)
        Review: excessive errors in the application log make it extremely difficult to troubleshoot by log and in turn impact the recovery time
        Description: 15000 errors on 1/25, 28000 errors on 1/26 and 10000 errors on 1/27 on a single tomcat server
        Recommendation: 1. review current logging implementation 2. log clean up 3. operations should review log and provide report with engineering regularly (bi-weekly or monthly)
      • RR06 (Type: Ops request)
        Review: potential log rotation problem on tomcat server (Medscape www backend farm)
        Description: several logs are only 1 kilobytes in size
        Recommendation: review/correct log setting and rotation script.
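RR05 and RR06 both rest on simple log measurements: errors counted per day, and log files that are suspiciously small. Below is a minimal Python sketch of how such a regular report might be produced. It assumes a hypothetical log line layout of `<date> <time> <LEVEL> <message>`, which is not taken from the deck, and the function names and sample data are illustrative only.

```python
from collections import Counter


def errors_per_day(lines, level="ERROR"):
    """Count log lines at the given level, keyed by the leading date field.

    Assumes each line looks like "<date> <time> <LEVEL> <message>";
    lines that do not fit that layout are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[2] == level:
            counts[parts[0]] += 1
    return counts


def small_logs(sizes_in_bytes, threshold=1024):
    """Return the names of log files at or below the size threshold, sorted.

    A cluster of tiny files can hint at a rotation problem.
    """
    return sorted(name for name, size in sizes_in_bytes.items() if size <= threshold)


# Made-up sample data, for illustration only.
sample_lines = [
    "2000-01-25 10:00:01 ERROR NullPointerException in view",
    "2000-01-25 10:00:02 INFO request served",
    "2000-01-26 08:15:00 ERROR timeout calling backend",
    "2000-01-26 08:15:01 ERROR timeout calling backend",
    "malformed line",
]
sample_sizes = {"app.log": 5_000_000, "app.log.1": 1024, "app.log.2": 980}

for day, count in sorted(errors_per_day(sample_lines).items()):
    print(day, count)
print("Suspiciously small:", small_logs(sample_sizes))
```

In practice the sizes could come from `os.path.getsize` on each file in the log directory; the mapping is passed in here so the sketch runs without touching the filesystem.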
  23. Investigation Procedures
  24. Investigation Procedures
  25. Investigation Procedures
  26. Incident 2 – background information
  27. Incident 2 – Timeline analysis and application profiling
  28. Incident 2 – root cause
  29. Incident 2 - resolution
  30. Incident 2 – Resolution rollout
      • Research: Further research revealed that JSP compilation metadata is stored in the JVM only when the Tomcat Jasper engine runs in development mode.
      • Potential business impact: The teams agreed to turn off development mode, on the assumption that there is no business impact: PJSP updates will still function properly.
      • POC: A brief POC test showed that non-development mode does reduce the memory footprint (memory usage dropped from 196.2 MB to 61.3 MB and total objects in memory dropped from 2.6 million to 876k), and all PJSP updates are recompiled and ready to serve within a short time.
      • Deployment: The Zenoss JMX chart showed memory dropping back close to its initial consumption (0.2-0.3 GB) after each GC cycle. With development mode on, memory inflated to 1 GB within a couple of days, GC could not reclaim the space, and Tomcat needed to be restarted.
  31. Incident 2 – Resolution rollout
      • Fix verification: The fix was applied to the whole production farm. Since then the results have been good: no more restarts due to running out of memory, and view-article performance is more than 30% better in Truesight (avg. 109.5 ms compared to 155.9 ms before).
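The before/after figures quoted for Incident 2 can be turned into percentages with two small formulas. This is a worked arithmetic sketch in Python using only the numbers on the slides; the function names are illustrative, and which definition of "better" the presenters had in mind is an assumption.

```python
def percent_reduction(before, after):
    """Percentage by which a value shrank relative to its starting point."""
    return (before - after) / before * 100


def percent_speedup(before_ms, after_ms):
    """Percentage gain in speed (work per unit time) when latency drops."""
    return (before_ms / after_ms - 1) * 100


# Figures from the POC and fix-verification slides.
print(f"Memory usage:      {percent_reduction(196.2, 61.3):.1f}% lower")         # 68.8% lower
print(f"Objects in memory: {percent_reduction(2_600_000, 876_000):.1f}% fewer")  # 66.3% fewer
print(f"Average latency:   {percent_reduction(155.9, 109.5):.1f}% lower")        # 29.8% lower
print(f"Speed:             {percent_speedup(155.9, 109.5):.1f}% faster")         # 42.4% faster
```

Measured as a drop in average response time the improvement is about 29.8%; measured as a gain in speed it is about 42.4%. The "more than 30%" figure fits the second reading, though the slides do not say which one was intended.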
  32. Incident 2 – review and recommendation
  33. Change people’s reaction to “Post Mortem”
      • Removing the emotion and blame from the Post Mortem process helps minimize the dread and lack of participation.
      • Standard procedures and templates shape people’s expectations and perceptions of the Post Mortem process.
      • With the lead engineer of the investigation having no day-to-day responsibility for the product in question, we can greatly reduce the defensiveness and political stances of those involved.
  34. Ensure the lessons are learned
      • Publishing the results first to the teams involved and then to the entire technology organization helps with education, openness about the process and accountability for the changes recommended.
      • Take the recommendations, once agreed and approved, and turn them into actionable items: Dev Change Requests, Ops Tickets, Process Update and Communication, Monitoring Change.
      • A single person should own turning the recommendations into action items and be responsible for seeing them through to completion. Don’t let them fall by the wayside.
      • During the next outage, try to highlight how the lessons from the previous one improved the response. Do your own PR for your process.
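One way to keep recommendations from falling by the wayside is to record each one as an action item with a type, an owner and a status. The following is a minimal Python sketch under that assumption, not the presenters' tracking system: the class and field names, the owner names and the mapping of RR04 and RR06 to action types are hypothetical, while the four action types are the ones named on this slide.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    DEV_CHANGE_REQUEST = "Dev Change Request"
    OPS_TICKET = "Ops Ticket"
    PROCESS_UPDATE = "Process Update and Communication"
    MONITORING_CHANGE = "Monitoring Change"


@dataclass
class ActionItem:
    rec_id: str              # recommendation ID from the review, e.g. "RR04"
    action_type: ActionType
    summary: str
    owner: str               # hypothetical owner name
    done: bool = False


def open_items(items):
    """Return the action items that have not been completed yet."""
    return [item for item in items if not item.done]


# Illustrative entries; owners and type assignments are made up.
items = [
    ActionItem("RR04", ActionType.MONITORING_CHANGE,
               "Set up alerts from Gomez and Truesight", owner="alice", done=True),
    ActionItem("RR06", ActionType.OPS_TICKET,
               "Review/correct log setting and rotation script", owner="bob"),
]

for item in open_items(items):
    print(f"{item.rec_id} [{item.action_type.value}] owner={item.owner}: {item.summary}")
```

Listing the open items at each review is a cheap way to show whether the previous Post Mortem's recommendations were actually carried through.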
  35. Questions: time permitting, OR office hours Tuesday June 26 @ 1pm
  36. Appendix - Investigation Procedures
      1. Collect background information
         – Scope of impact
         – Information about the product(s) impacted
         – Interview personnel involved
      2. Initial interpretation
         – Type of incident: outage, service degradation
         – Expectation from senior management
         – Depth and scope of investigation
         – Resource planning
  37. Appendix - Investigation Procedures
      3. In-depth analysis
         – Timeline analysis
         – Change analysis
         – Log analysis
         – Monitoring data correlation
      4. Research
         – Vendor documentation and white paper
         – Architecture review
         – Code review and application profiling
         – Infrastructure review
      5. Resolution and recommendation