Lost art of troubleshooting

There are a lot of great things about the cloud, but the "destroy and rebuild" philosophy, which is really good for building a continuous delivery pipeline, really sucks when applied to troubleshooting production problems. When your application goes haywire, the most valuable engineering skill is not the ability to bring up a copy of your system or even the knowledge of your technology stack (although it doesn't hurt). It is the skill of understanding and solving problems.

Finding the root cause of an issue and mitigating it with minimal disruption in production is a must-have skill for engineers responsible for managing and maintaining production systems, which nowadays includes ops, DBAs and devs alike. In this talk I will discuss the skills required to troubleshoot complex systems, the traits that prevent engineers from being successful at troubleshooting, and some techniques, tips and tricks for troubleshooting complex systems in production.

Published in: Technology

Lost art of troubleshooting

  1. http://fayerplay.com lost art of troubleshooting Leon Fayer @papa_fire
  2. {me} 20+ years breaking & fixing; dev, architect, [DevOps]; vp @ OmniTI; fix other people’s
  3. why troubleshooting?
  4. cloud ruined everything (it really did)
  5. Most reliable way to fix Windows problems (1997): when in doubt - reboot. DevOps mantra for managing cloud-based systems (2017): destroy and rebuild
  6. old McDonald had a farm
  7. old McDonald lost a farm due to mad cow disease
  8. troubleshooting - a form of problem solving
  9. problem solving - ability to fix things that you know nothing about
  10. why is problem solving important?
  11. … because systems are complex
  12. … because of Murphy’s law
  13. … because someone is always watching
  14. {disclaimer}
  15. (no text)
  16. wishful thinking
  17. reality
  18. where to begin?
  19. replicate
  20. isolate
  21. fix?
  22. what’s the problem? it’s broken!
  23. understanding
  24. understand problem
  25. “we can’t support 100s req/min, we need to scale better!”
  26. “we can’t support 100s req/min, we need to scale better!” improve performance
  27. performance problem
  28. perceived problem
  29. actual problem
  30. understand business
  31. “I don’t give a **** if the datacenter is on fire as long as I am still making money”
  32. what does it mean to you?
  33. (no text)
  34. sales
  35. (no text)
  36. content
  37. content, ad revenue
  38. every technical decision powers a business need
  39. understand impact
  40. (no text)
  41. is there a lesser of two evils?
  42. sometimes breaking = fixing
  43. 80% now > 100% tomorrow
  44. incremental improvements
  45. anatomy of a problem
  46. anatomy of a problem [diagram labels: problem, norm, norm]
  47. anatomy of a problem [diagram labels: problem, norm, acceptable norm]
  48. anatomy of a problem [diagram labels: problem, norm, acceptable norm, fix, fix, fix, fix]
  49. what have we learned? understanding of: what’s important, cause and effect, largest impact, acceptable risk
  50. what not to do
  51. don’t assume
  52. (no text)
  53. I didn’t build it; it’s not documented; it passed the tests; works in dev; everything looks right
  54. (no text)
  55. don’t feed your ego, solve the problem
  56. ask for help
  57. tools
  58. logging, monitoring, profiling
  59. logging: actionable, concise, parsable
  60. full log of one job:

        [2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
        [2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
        [2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
        [2017-02-01 18:57:02] Parsed args
        [2017-02-01 18:57:02] Posting to API
        [2017-02-01 18:57:02] Initializing args
        [2017-02-01 18:57:02] Loading reservation_form_data
        [2017-02-01 18:57:03] Reservation Form Data loaded successfully
        [2017-02-01 18:57:03] Appending campaign info
        [2017-02-01 18:57:03] Reservation name: [Some very very very long name]
        [2017-02-01 18:57:03] Using code = SUPERHERO:20171007 for this instance.
        [2017-02-01 18:57:03] Setting currency to US Dollar
        [2017-02-01 18:57:03] Appending marketing info
        [2017-02-01 18:57:03] Have a non-sku source_code
        [2017-02-01 18:57:03] Marketing: setting campaignid = GOOGLE
        [2017-02-01 18:57:03] Appending match rule = Match Rule
        [2017-02-01 18:57:03] Appending user info
        [2017-02-01 18:57:03] Appending order info
        [2017-02-01 18:57:03] Fetching cost range for item_id = 975, sku_id = 4871
        [2017-02-01 18:57:03] Determining actual cost table
        [2017-02-01 18:57:03] Appending comment notes
        [2017-02-01 18:57:03] Appending abandoned flag
        [2017-02-01 18:57:03] API GET data: {
            token = gEcre26reWrAdEnufE3HesVupRepahuDumapHuyap2evufreWrufraBebre7u4a6
            contact.address1_city = New York
            contact.address1_country = USA
            contact.address1_line1 = 123 test lane
            contact.address1_line2 =
            contact.address1_postalcode = 12345
            contact.address1_stateorprovince =
            contact.emailaddress1 = joe.smith@gmail.com
            contact.firstname = joe
            contact.lastname = smith
            contact.mobilephone = 1234567890
            ...
        }
        [2017-02-01 19:04:03] Post complete, took 420 seconds
        [2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
        [2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

  61. same log, with the useful information called out:

        [2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
        [2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
        [2017-02-01 18:57:03] API GET data:
        [2017-02-01 19:04:03] Post complete, took 420 seconds
        [2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

  62. same log, with the information I need called out:

        [2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
        [2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)

  63. monitoring: all inclusive, business-first, correlatable
  64. what’s the problem? it’s broken!
  65. revenue
  66. revenue
  67. revenue, user performance
  68. revenue, database load, user performance
  69. revenue, database load, decline rate, user performance
  70. profiling
  71. when you have the “what” but still have no idea “why”
  72. DTrace script and its output:

        #!/usr/sbin/dtrace -s

        #pragma D option quiet

        ::ap_process_request:process-request-entry
        /zonename == "www4"/
        {
            self->uri = copyinstr(arg1);
            self->runtime = 0;
            self->waittime = 0;
            self->oncpu = timestamp;
        }

        sched:::off-cpu
        /self->uri != 0/
        {
            self->runtime += timestamp - self->oncpu;
            self->offcpu = timestamp;
        }

        sched:::on-cpu
        /self->uri != 0/
        {
            self->oncpu = timestamp;
            self->waittime += timestamp - self->offcpu;
        }

        ::ap_process_request:process-request-return
        /self->uri != 0/
        {
            @duration[self->uri] = sum(self->runtime);
            @waiting[self->uri] = sum(self->waittime);
            @count[self->uri] = count();
        }

        :::tick-5min
        {
            printf("\n%Y\n", walltimestamp);
            printf("\nTOTAL TIME SPENT ON CPU BY ALL HITS ON THIS URL\n");
            trunc(@duration,10); printa(@duration); trunc(@duration);
            printf("\n\nNUMBER OF HITS\n");
            trunc(@count,10); printa(@count); trunc(@count);
            printf("\n\nTOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL\n");
            trunc(@waiting,10); printa(@waiting); trunc(@waiting);
        }

        TOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL
          /directory                          6850049
          /api/map_search                     7341249
          /api/mobile/get_all_items           7980925
          /m/                                 9124747
          /m/directory                        9175345
          /api/mobile/get_profile            11729556
          /api/holiday_feed                  12603853
          /api/mobile/get_all_widgets        15043481
          /api/get_item/60693                19773404
          /m/events/all                      26165132
          /api/all_items                     27362330
          /api/mobile/get_all_events        368584344

  73. same script and output, with one line called out:

          /api/mobile/get_all_events        368584344

  74. same script and output, with one line called out:

          /api/get_item/60693                19773404

  75. down the rabbit hole
  76. troubleshooting is … required skill, educational, iterative, frustrating, rewarding
  77. (no text)
  78. questions?

      credits:
        https://www.track5media.com/wp-content/uploads/2016/06/workers-gathered-around-comuputer-screen.jpg
        http://more-sky.com/data/out/10/IMG_379964.jpg
        https://ruwix.com/pics/trolls/9-rubix-cube-neversolved.jpg
        http://blog.cartif.com/wp-content/uploads/2016/02/evolucion.png
        https://cdn-images-1.medium.com/max/2000/1*t-yZUIXuaXo97yiqYtpC5A.jpeg
        http://www.6speedonline.com/forums/attachment.php?attachmentid=286232&stc=1&d=1380726388
        http://www.wallpapers.faketrix.com/content/animal/feathered/page-2/1024/Ostrich-non-flying-winged-animals.jpg
        http://oldmanyellsat.cloud/oldman.jpg
        http://cdn.wccftech.com/wp-content/uploads/2016/05/4195797-windows-7-alternate-blue.jpg
        https://www.poweradmin.com/blog/wp-content/uploads/2015/10/amazon-aws.png
        https://supportingcmu.org/image/Herd.png
        http://www.publicdomainpictures.net/pictures/30000/velka/green-fields-1351063140pg3.jpg
        https://hurtigruten.global.ssl.fastly.net/assets/48dee2/globalassets/photos/voyages/explorer-voyages/2017-18/ms-fram-antarctica/the-frozen-land-of-the-penguins/2500x1250_r739816dominicbarrington.jpg?width=1600&height=800&transform=DownFill
        https://www.thegeneralistit.com/wp-content/uploads/2015/11/dreamstime_xxl_38819851-Business-woman-eliminate-problem-and-find-solution.jpg
        http://paperzip.co.uk/wp-content/uploads/2016/01/word-of-the-day-newspaper.jpg
        http://vignette3.wikia.nocookie.net/starwars/images/7/72/DeathStar1-SWE.png/revision/latest?cb=20150121020639
        https://lcarsgfx.files.wordpress.com/2014/10/prometheus1.png
        https://cdn.meme.am/cache/instances/folder699/400x/65194699.jpg
        http://blog.weespring.com/wp-content/uploads/2014/06/baby-safety-manual-5.jpg
        https://4.bp.blogspot.com/-2fGfDw-sohs/V9_CAwCcnaI/AAAAAAAACos/zrARBywD2qAZOphkQMC7WZGdV3vMY5nTACLcB/s1600/Stop%2Bwhining.jpg
        https://ih0.redbubble.net/image.14163956.5143/raf,750x1000,075,t,black_white.u4.jpg
        http://www.inspireddad.org/wp-content/uploads/uploads/2013/02/ducttape_0930a8_3926013.jpg
        https://katieleigh.files.wordpress.com/2014/10/img_0683.jpg
        http://pre02.deviantart.net/020c/th/pre/i/2016/094/8/0/down_the_rabbit_hole_by_irenhorrors-d7hgsr3.jpg
        http://i1-linux.softpedia-static.com/screenshots/Valgrind_1.png
        http://i.imgur.com/m6Rkbdx.gif
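
The logging slides (59 to 62) cut a forty-line job log down to the two lines that matter. A minimal Python sketch of that reduction follows. It is not from the talk: the sample lines are an abridged copy of the log on slide 60, while the choice of prefixes to keep (`KEEP_PREFIXES`) and all function names are assumptions made for illustration.

```python
import re
from datetime import datetime

# Abridged copy of the log on slide 60; each line is "[timestamp] message".
SAMPLE_LOG = """\
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Parsed args
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
"""

LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.*)$")

# Hypothetical definition of "actionable": the job start and any error.
KEEP_PREFIXES = ("Processing UUID:", "ERROR:")


def parse(log_text):
    """Yield (timestamp, message) for every line that matches the format."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            yield stamp, match.group(2)


def actionable(log_text):
    """Keep only entries whose message starts with one of KEEP_PREFIXES."""
    return [(ts, msg) for ts, msg in parse(log_text)
            if msg.startswith(KEEP_PREFIXES)]


def seconds_to_first_error(entries):
    """Seconds between the first kept entry and the first ERROR, or None."""
    errors = [ts for ts, msg in entries if msg.startswith("ERROR:")]
    if not entries or not errors:
        return None
    return int((errors[0] - entries[0][0]).total_seconds())


if __name__ == "__main__":
    kept = actionable(SAMPLE_LOG)
    for ts, msg in kept:
        print(f"[{ts}] {msg}")
    print("seconds from start to error:", seconds_to_first_error(kept))
```

The filter only works because every line shares one timestamp format, which is the "parsable" property slide 59 asks for. A log that mixed formats would need a parser per format before any filtering could happen.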

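
The profiling slides (72 to 74) end with a table of off-CPU totals per URL, and the talk points at one entry that dwarfs the rest. As a rough sketch of how that reading could be automated, the Python below parses the table from slide 72 and flags URLs whose total is far above the median. The threshold (`factor=10`) and the function names are assumptions for illustration, not something the talk prescribes.

```python
from statistics import median

# Off-CPU totals per URL, copied from the DTrace output on slide 72.
DTRACE_OUTPUT = """\
/directory 6850049
/api/map_search 7341249
/api/mobile/get_all_items 7980925
/m/ 9124747
/m/directory 9175345
/api/mobile/get_profile 11729556
/api/holiday_feed 12603853
/api/mobile/get_all_widgets 15043481
/api/get_item/60693 19773404
/m/events/all 26165132
/api/all_items 27362330
/api/mobile/get_all_events 368584344
"""


def parse_totals(text):
    """Return {url: total} for every 'url number' line in the output."""
    totals = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            totals[parts[0]] = int(parts[1])
    return totals


def outliers(totals, factor=10):
    """Return (url, ratio_to_median) pairs above factor x median, largest first.

    The factor is a hypothetical threshold chosen for this sketch.
    """
    if not totals:
        return []
    mid = median(totals.values())
    flagged = [(url, value / mid) for url, value in totals.items()
               if value > factor * mid]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for url, ratio in outliers(parse_totals(DTRACE_OUTPUT)):
        print(f"{url}: {ratio:.1f}x the median off-CPU total")
```

On this data the threshold singles out `/api/mobile/get_all_events`, the line called out on slide 73. It does not flag `/api/get_item/60693`, the line called out on slide 74, so a ratio test like this narrows the search but does not replace reading the output.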