
What To Do When It All Goes So Wrong



As IT professionals, we will inevitably face situations where everything goes wrong. Sometimes we are lucky and the result is only diminished functionality or a slow system. Other times our organization is temporarily out of business. Regardless of the scope of the issue, how we react directly affects how quickly things return to normal. This session covers how to communicate issues, including what to say, who to say it to, and when to say it. Part of managing communication is getting everyone into a room and forcing them to talk, so time will be spent on designing an effective war room. The session also covers how setting out to prove that an issue is ours gets us to a root cause more quickly.

Published in: Technology


  1. What To Do When It All Goes So Wrong
       • David Levy
       • SQL Saturday #67 Chicago
  2. About Me
       • More than 11 years in IT
       • SQL Server DBA for over 3 years
       • Previous Life as Developer
       • Blogger (syndicated)
       • @dave_levy on Twitter
  3. EMERGENCY!
       • Peak Time of Peak Sales Day
       • Typical Hourly Sales $100K/HR
       • Order Entry Screen is Locked Up
       • Users Report Slowness Initially
       • Now the “Sales Center” Application is Just “Clocking”
  4. Communicate
       • Let Everyone Know There is a Problem
       • Prevent Duplicated Efforts
       • Allows Others to Speak Up
       • Recent Changes
       • Related Issues
  5. Communicate
       • Send Up a Flare
       • Send to an IT Only Distribution Group
       • Keep the Subject Line General
       • Provide Broad Overview Including:
       • Systems Impacted
       • Major Symptoms Including Error Messages
       • Number of People Impacted
       • Any Location Specific Information
  6. Communicate
       • What Resources Do You Need?
       • Subject Matter Experts
       • Specialized Equipment
  7. Communicate
       • Never Assign Blame
       • Only State Facts
  8. Communicate
       • To: IT Emergencies
       • Subject: Sales Center Issues
       • Sales Center Users are reporting that the Order Entry screen has quit responding. We are currently investigating the issue with the Sales Center Development Team. We will provide updates as we know more.
  10. Collect
       • What Are the Symptoms?
       • What Locations are Involved?
  11. Collect
       • What Systems are Involved?
       • SQL Server
       • AS400
       • Mainframe
       • Web Farm
       • Major Network Components like Load Balancers
  12. Collect
       • What Has Changed?
       • Look at Change Control Calendar
       • Talk to Primary On-Calls for Related Systems
  13. Collect
       • Anything in the Logs?
       • Windows Logs
       • Application Specific Logs
       • Custom Exception Handling Systems
  14. Collect
       • What are Performance Indicators Showing?
       • Perfmon
       • SQL Wait Stats
       • Third-party tools
  15. Process
       • Analyze Collected Information
       • Are There Any Obvious Signs of Trouble?
       • Can the Problem be Linked to a Change?
       • Can Any Patterns be Identified?
  16. Process
       • Prove It Is Your Issue
       • Shows Humility
       • Shows Respect for Everyone Else’s Time
       • Avoid Appearing Arrogant
  17. Process
       • Prove It Is Your Issue
       • Construct Tests to Prove Theories in Order of Likelihood Until the Problem Is Proven or Theories Are Exhausted
       • Faster than arguing about what it is not
       • How can you know it is not your issue?
  18. Process
       • List Potential Actions
       • Rank by effort, confidence, level of risk
       • Develop action plans for best options and re-rank
       • Each potential action should have a rollback plan
  19. Process
       • Define Measures
       • What will indicate things have gotten better?
       • Adding this index will reduce Disk IO by 10 million reads per second
       • The execution time of query x will drop from 6 minutes to 50 milliseconds
  20. Process
       • Define Measures
       • What will indicate things have gotten worse?
       • Disk IO may go up
       • The execution time of query x may go up
       • Adding this index may slow inserts from the order upload process
  21. Respond
       • Communicate Your Intentions
       • Make the Change
       • Follow a written plan
       • Make a single change
       • A single person should make the change
       • Document any additional steps taken
       • Start Over by Collecting More Data
  22. The War Room
       • Signs You Need to Convene A War Room
       • Having Trouble Finding Anything Wrong
       • 30 Minutes Without Progress
       • An Issue Appears to Span Multiple Systems
       • Having Difficulty Getting People Engaged
  23. The War Room
       • Get Everyone in a Room
       • No Changes Made Outside the Room
       • No Heroes
       • Watch out for people doing a lot of typing
       • Avoid changes that take more than a few minutes
       • Have a Call-in Number for Remote Coworkers
  24. The War Room
       • Have a Technology Kit
       • Old Switch
       • Patch Cords
       • Mice + Mouse Pads
       • Power Strips
  25. The War Room
       • Monitor Your Guest List
       • 1-2 Representatives From Each Team
       • Try to Keep Management Out
       • Watch for Disruptive People
  26. Communicate
       • To: IT Emergencies
       • Subject: Sales Center Issues
       • We are convening a war room for the Sales Center issue. Everyone working on the issue please meet in the North Conference Room. Remote/WFH coworkers should dial into the conference bridge 888-888-1234, participant code: 1234.
  28. The War Room
       • Whiteboard the Issue
       • Every System Gets Its Own Column
       • Write All Facts on the Whiteboard
       • Closed Items Get Crossed Out, Not Erased
       • Include a Resolution for Each Closed Item
  29. The War Room
       • Share the Floor
       • Likely Issue Owner Has the Lead
       • Make Sure Everyone is Heard
       • Contributing Often Involves Staying Out of the Way
       • Don’t Be Afraid to Fade Back and Run The Whiteboard
  30. The War Room
       • Never Call “Not-It” and Leave
       • Not Helpful
       • You May be Wrong
       • Appears Arrogant
  31. The War Room
       • Keep an Eye On Time
       • Provide Regular Updates to Management
       • Bring in Food Around Meal Times
       • Raises Spirits
       • Brings in More People to Help
  32. Communicate
       • To: IT Emergencies
       • Subject: Sales Center Issues Update
       • The Sales Center war room is still going. We are currently looking into a driver issue with IBM. All necessary resources have been engaged.
  33. The War Room
       • Keep People in Reserve
       • Each Team Should Divide up the Day
       • Rotate People In and Out
       • Send Someone Home Early to Come in Early
  34. The War Room
       • Closing Out
       • Communicate Resolution
       • Capture Contents of Whiteboard
       • Clean Up Room
  35. Communicate
       • To: IT Emergencies
       • Subject: Sales Center Issues Resolved
       • The Sales Center issue has been resolved. The issue was caused by a patch that was applied over the weekend. Now that it has been backed out everything has returned to normal.
  36. Questions?
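
The ranking step on slide 18 (rank potential actions by effort, confidence, and level of risk, and give each one a rollback plan) could be sketched in Python roughly as follows. This is a minimal illustration, not something from the deck: the `Action` class, the 1-to-5 scales, the equal weighting in `score`, and the sample candidates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One candidate fix. Field names and 1-5 scales are hypothetical."""
    name: str
    effort: int        # 1 (trivial) to 5 (large)
    confidence: int    # 1 (a guess) to 5 (near certain it helps)
    risk: int          # 1 (safe) to 5 (could make things worse)
    rollback_plan: str = ""


def score(action: Action) -> int:
    # Higher is better: reward confidence, penalize effort and risk.
    # Equal weighting is an assumption; adjust it for your environment.
    return action.confidence - action.effort - action.risk


def rank_actions(actions: list[Action]) -> list[Action]:
    # Slide 18 says each potential action should have a rollback plan,
    # so actions without one are held back until a plan is written.
    ready = [a for a in actions if a.rollback_plan.strip()]
    return sorted(ready, key=score, reverse=True)


# Made-up candidates, loosely themed on the deck's Sales Center scenario.
candidates = [
    Action("Add missing index", effort=2, confidence=4, risk=2,
           rollback_plan="Drop the index"),
    Action("Back out weekend patch", effort=3, confidence=3, risk=3,
           rollback_plan="Reapply the patch"),
    Action("Restart the database server", effort=1, confidence=2, risk=4),
]

ranked = rank_actions(candidates)
for a in ranked:
    print(f"{score(a):>3}  {a.name}")
```

Re-running `rank_actions` after each action plan is fleshed out mirrors the "develop action plans for best options and re-rank" step. The restart candidate is left out of the ranking only because it has no rollback plan yet.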