
Respond to and Troubleshoot Production Incidents Like an SA


So it's 4 AM and you just got a call from a panicked executive that the system is down! Oh noes! What do you do? Troubleshoot LIKE AN SA. I know "Systems Administrator" is not the cool industry term anymore, but that mentality for fixing the big live problem, like RIGHT NOW can still help today.

You're probably in the job you're in because you're AWESOME at figuring out what's wrong and fixing problems. But your projects have grown, your team has grown, and the expectations have grown with them. How do you deal with these newfound responsibilities? LIKE AN SA. There are some simple processes you can put in place to make your life easier. We'll discuss a framework for incident response, a step-by-step guide for troubleshooting production issues, and how to learn from these outages to prevent problems from happening again.

Published in: Technology


  1. 1. Respond to and Troubleshoot Production Incidents Like an SA
  2. 2. whoami • Tom Cudd • 2004 University of Nebraska-Lincoln graduate, B.S. in Computer Engineering • Work at VML in Kansas City, MO
  3. 3. dig +noall +answer • Colgate • Premier League • Kellogg’s • Korean Air • US Soccer • Wendy’s • BridgeStone • Sprint • Ford
  4. 4. It’s 3 A.M. Buzz Buzz
  5. 5. How You Know It’s Broken • Support Line service • “Pager”/Text message • Direct phone call from coworkers • Direct phone call from clients • Email alerts • Tickets from a system
  6. 6. The System is Down!
  7. 7. The Old Way • Roll out of bed • Start fixing
  8. 8. What You Think You’re Doing
  9. 9. What You’re Really Doing
  10. 10. Now What? • Start a process shaped from many incidents, finely tuned with learnings gleaned from years of production support experiences
  11. 11. Just Kidding! • Sort of
  12. 12. What You Should Do Instead • Acknowledge • Inspect • Determine Severity • Create Tracking Method • Create Communication Channel • Inform • Escalate • Identify Roles • Communicate Outside Team • Your Larger Organization • Your Account Manager/Client • Go To After Action Process
  13. 13. Communicate Briefly! • This is the first step of any troubleshooting process • People need to know what is broken and if someone is working on it
  14. 14. Let’s Start Working
  15. 15. Rule #1: Don’t Panic • Nobody likes a whiny admin
  16. 16. Rules for Troubleshooting
  17. 17. Don’t immediately assume you know what’s wrong
  18. 18. Don’t assume the problem description is accurate
  19. 19. Understand the architecture (or find someone who does)
  20. 20. Other Rules for Troubleshooting • Identify the scope of the problem (site vs. page) • Dev + QA + Ops (Involve the whole team) • Evaluate recent changes to the environment
  21. 21. Black Box Abstraction • “In science, computing, and engineering, a black box is a device, system or object which can be viewed in terms of its inputs and outputs (or transfer characteristics), without any knowledge of its internal workings. Its implementation is "opaque" (black). Almost anything might be referred to as a black box: a transistor, an algorithm, or the human brain.” • https://en.wikipedia.org/wiki/Black_box
  22. 22. Where to Start? • Top down approach • Useful for when nothing loads at all • Work your way down the stack • Bottom up • Errors or inconsistencies • Work your way up the stack
  23. 23. Layers to Review • DNS (“It’s always DNS” – James Glenn) • Content Delivery Network (CDN) • Load balancer • Caching layers (varnish, dispatcher, memcache) • Web servers • Application servers • Data layer (database, service, API)
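The top-down walk through these layers can be sketched in a few lines of Python. This is a minimal sketch, not the author's tooling: the function names (`resolve`, `tcp_open`, `walk_layers`) and the layer list in the usage note are hypothetical, and it only covers name resolution and TCP reachability. It does not inspect CDN or cache behavior.

```python
import socket

def resolve(hostname):
    """Return the sorted IP addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})

def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def walk_layers(layers):
    """Probe (name, host, port) tuples in order; return the first failure or None."""
    for name, host, port in layers:
        if not resolve(host):
            return f"{name}: DNS"
        if not tcp_open(host, port):
            return f"{name}: TCP {port}"
    return None
```

A call might look like `walk_layers([("CDN", "www.example.com", 443), ("origin", "origin.example.com", 443)])`, with the hostnames replaced by your own. The first layer that fails is where to start digging.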
  24. 24. Things to Check • Services • Applications • Operating Systems • System Resources • Infrastructure
  25. 25. Services • Are they running? • Are PID files available? • Are logs rolling?
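Those three questions translate fairly directly into checks. The sketch below assumes a POSIX host, a service that writes a plain-text PID file, and a log that should be written to at least every few minutes. The helper names and the 300-second default are illustrative assumptions, not something from the talk.

```python
import os
import time

def pid_from_file(pid_path):
    """Read a PID from a PID file; return None if the file is missing or malformed."""
    try:
        with open(pid_path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def process_alive(pid):
    """Return True if a process with this PID exists (POSIX only).

    Signal 0 performs the existence and permission check without sending a signal.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def log_is_rolling(log_path, max_age_seconds=300):
    """Return True if the log file was modified within the last max_age_seconds."""
    try:
        age = time.time() - os.path.getmtime(log_path)
    except OSError:
        return False
    return age <= max_age_seconds
```

A PID file that exists while `process_alive` returns False usually means the service died without cleaning up, which is worth noting in the incident ticket.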
  26. 26. Applications • What response codes are we getting? • Are we getting any response at all? • TCP Connection? • Timing out? • Connection closed/denied?
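One way to answer these questions quickly is to classify how a single request ends. The following is a sketch using only the Python standard library. `classify_response` and its labels are hypothetical names chosen for illustration, and the 5-second timeout is an assumption.

```python
import socket
import urllib.error
import urllib.request

def classify_response(url, timeout=5.0):
    """Fetch url and return a short label describing how the request ended."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        return f"HTTP {err.code}"  # the server answered with a 4xx/5xx status
    except urllib.error.URLError as err:
        reason = err.reason
        if isinstance(reason, socket.timeout):
            return "timeout"
        if isinstance(reason, ConnectionRefusedError):
            return "connection refused"
        if isinstance(reason, socket.gaierror):
            return "DNS failure"
        return f"no response ({reason})"
    except socket.timeout:
        return "timeout"  # connected, but the response did not arrive in time
    except ConnectionError:
        return "connection closed"  # the peer dropped the connection mid-request
```

Each label points at a different part of the stack: a status code means the application answered, a refusal or timeout points further down toward the network or a dead listener.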
  27. 27. Operating System • File handles • inodes • File system (read only, corruption, etc.)
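These operating system checks can also be scripted. The sketch below assumes a POSIX host for `os.statvfs` and a Linux host for `/proc/sys/fs/file-nr`. The function names are hypothetical, and the file-handle parser takes the file's text as input so it can be fed from wherever you read it.

```python
import os

def inode_usage(path="/"):
    """Return the fraction of inodes in use on the filesystem containing path."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems do not report inode counts
        return 0.0
    return 1.0 - st.f_ffree / st.f_files

def is_read_only(path="/"):
    """Return True if the filesystem containing path is mounted read-only."""
    st = os.statvfs(path)
    return bool(st.f_flag & os.ST_RDONLY)

def parse_file_nr(text):
    """Parse Linux /proc/sys/fs/file-nr ('allocated unused max') into (allocated, max)."""
    allocated, _unused, maximum = (int(x) for x in text.split())
    return allocated, maximum
```

On Linux, `parse_file_nr(open("/proc/sys/fs/file-nr").read())` gives the allocated and maximum file handle counts. A disk with free space but `inode_usage` near 1.0 will still refuse to create new files.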
  28. 28. System Resources • CPU • Memory • Disk • I/O
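For a first look at CPU, memory, and disk, the standard library is often enough. This is a rough sketch assuming a Unix-like host (`os.getloadavg` is not available on Windows) and Linux-style `/proc/meminfo` text. The helper names are illustrative, and I/O statistics are left to dedicated tools.

```python
import os
import shutil

def disk_percent_used(path="/"):
    """Return the percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total

def load_per_cpu():
    """Return the 1-minute load average divided by the number of CPUs."""
    one_minute, _five, _fifteen = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values mostly in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values
```

A sustained `load_per_cpu()` well above 1.0 suggests the box is CPU- or I/O-bound, which narrows down which of the other checks to run next.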
  29. 29. Infrastructure • I/O • Bandwidth • Networking • Firewalls • Upstream providers
  30. 30. Tools We Use
      Problem | Easy | Medium | Advanced
      Capture end user experience | Browsers (all of them) | Firebug, Chrome developer tools | Fiddler, APM transactional
      Finding failure in stack | wget | PowerShell, curl | nmap, Wireshark, tcpdump, NewRelic
      DNS issues | nslookup | dig | dig
      Accessing logs | VPN, SSH, RDP | SumoLogic, Splunk | ELK Stack
  31. 31. The #1 Troubleshooting Resource
  32. 32. The Response Process • Acknowledge • Inspect • Determine Severity • Create Tracking Method • Create Communication Channel • Inform • Escalate • Identify Roles • Communicate Outside Team • Your Larger Organization • Your Account Manager/Client • Go To After Action Process
  33. 33. Acknowledge • No more alerts • No other resources initially work on it • Initial response SLA
  34. 34. Inspect, Determine Severity • Use the information from the alert or notification • Check from your phone or something quick on first look • SWAG the severity level • It’s OK to change severity level during the process
  35. 35. Severity Level Guidelines • Sev 1 (High) • Sev 2 (Medium) • Sev 3 (Low)
  36. 36. Sev 1 • This incident level is attained when the following types of conditions are met: • A complete outage of the website or critical service/infrastructure • A recurring temporary outage of the website or critical service/infrastructure • Loss of data • Any Security Incident
  37. 37. Sev 2 • This incident level is attained when the following types of conditions are met: • A significant degradation of the website performance such as failure to render pages within a reasonable or typical timeframe (e.g. >15 sec or to point of request timing out) • Recent modifications to the system cause website or services to operate in a way that is materially different from those described in the functional specifications for essential features • This may also include: End user performance of the site outside pre-defined SLAs, portions of site functionality missing, broken, or non-functional, errors, inconsistencies, CMS performance, or a client-reported issue that is not Severity 1
  38. 38. Sev 3 • This incident level is attained when the following types of conditions are met: • A minor degradation of the service delivery occurs (e.g. content feed not being updated regularly) • Recent modifications to the system cause website or services to operate in a way that is materially different from those described in the functional specification for non-essential features • Improper functionality • This may also include any: Backend issues, Security team reported, QA team reported, or Legal (depending on issue may need to be treated as Severity 1 or Severity 2)
  39. 39. Initial Response/ Confirmation Goal • Sev 1: Confirmation within 15 minutes via email or phone • Sev 2: Confirmation within 30 minutes via email or phone • Sev 3: Confirmation during regular business hours within 1 business day
  40. 40. Subsequent Communication Frequency • Sev 1: Every 60 minutes or as agreed upon during incident • Sev 2: Every 2 hours or as agreed upon during incident • Sev 3: As agreed upon per previously assigned client standards or end of every business day
  41. 41. Resolution Target • Sev 1: 1-4 hours • Sev 2: 4-24 hours • Sev 3: 1-3 business days or as agreed upon per incident
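The confirmation, update, and resolution targets above can be captured as data so the incident manager does not have to do clock math at 3 A.M. This is a sketch under stated assumptions: `SeverityPolicy`, `POLICIES`, and `deadlines` are hypothetical names, the resolution value is the upper bound of each range, and Sev 3 is left out because its targets are expressed in business days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class SeverityPolicy:
    confirm_within: timedelta   # initial response/confirmation goal
    update_every: timedelta     # subsequent communication frequency
    resolve_within: timedelta   # upper bound of the resolution target

# Values taken from the Sev 1 and Sev 2 guidelines in the slides.
POLICIES = {
    1: SeverityPolicy(timedelta(minutes=15), timedelta(minutes=60), timedelta(hours=4)),
    2: SeverityPolicy(timedelta(minutes=30), timedelta(hours=2), timedelta(hours=24)),
}

def deadlines(severity, opened_at, last_update_at):
    """Return the confirmation, next-update, and resolution deadlines for an incident."""
    policy = POLICIES[severity]
    return {
        "confirm_by": opened_at + policy.confirm_within,
        "next_update_by": last_update_at + policy.update_every,
        "resolve_by": opened_at + policy.resolve_within,
    }
```

Because severity can change during the incident, calling `deadlines` again with the new level and the original `opened_at` recomputes the targets without losing the start time.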
  42. 42. Communication Methods • Sev 1: Incident manager should create a Request, then create a HipChat room directly from the ticketing interface for the incident. Invite all necessary parties. If a conference bridge is required, this is the responsibility of the incident manager • Sev 2: HipChat or Skype for Business for group communications. JIRA and Email for status updates • Sev 3: Communication at the discretion of interested parties. JIRA for status updates
  43. 43. Create Tracking Method • If you have a ticketing system that allows for comments to be added, track action steps here • Not the play by play, but the high level bullet points • For example: “Rebooted server”
  44. 44. Create Communication Channel • Slack/HipChat – separate engineering channel for unique incidents • Keep “noise” out of main channels • Conference bridge set up if not using IM tools • Have a separate bridge handy for engineering/development and main one for non-technical communications
  45. 45. Inform • Call your boss, or your boss’s boss, or your boss’s(n) boss • @all • DL-EveryoneImportant
  46. 46. Escalate • Be OK not knowing everything • Find the right resource to work on the issue • Don’t bail, learn from the incident
  47. 47. Identify Roles • Incident Manager • Incident Leader • Supporting Roles
  48. 48. Incident Manager • Handles communications • May assist in making decisions • SHOULD NOT be the person doing the work • Assist in gathering information, weighing options during the resolution process • Using the communication methods and the expected frequency per severity level
  49. 49. Incident Leader • The identified subject matter expert to the project, site, or service with the problem to be resolved • The direct person attempting to fix the issue (developer, DBA, Ops, whatever) • Ops would move to supporting role if developer/architect is incident leader
  50. 50. Supporting Roles • Any other identified technical resource that may be required to assist in resolving the issue • Ops, DBA, or Developer • The Incident manager should help in pulling these resources into an incident resolution • Moral support is also a role!
  51. 51. Communications Outside the Team • If it requires client communications, and if you have account managers, pull them into the process • If it requires vendor communications, find out who the responsible account contact is • If it’s to your internal org (and important people), ask your boss or boss’s boss to do it
  52. 52. After Action Process • Document and Summarize Issue • Timeline • Stabilization Actions • Impact • Resolution • Learning Takeaways • Actionable Takeaways
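The after action sections lend themselves to a simple template that flags what has not been written yet. The sketch below is one possible shape, with a hypothetical `AfterActionReport` class whose fields mirror the bullets on the slide. It is not a format the talk prescribes.

```python
from dataclasses import dataclass, field

@dataclass
class AfterActionReport:
    summary: str = ""
    timeline: list = field(default_factory=list)               # (time, event) pairs
    stabilization_actions: list = field(default_factory=list)
    impact: str = ""
    resolution: str = ""
    learning_takeaways: list = field(default_factory=list)
    actionable_takeaways: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that are still empty, in template order."""
        return [name for name, value in vars(self).items() if not value]
```

Checking `missing_sections()` before closing the ticket is a cheap way to make sure the learning and actionable takeaways are not skipped once the pressure is off.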
  53. 53. Thanks! • Twitter: @tomcudd • Website: https://tomcudd.com/
