10 Step Guide to Troubleshooting like an Expert

We often find ourselves in reactive situations, with staff, managers, and team-leads breathing down our necks while we try to figure out what happened and get it corrected STAT. It’s stressful. We are fighting the feeling of panic and struggling with whether to fight or flee. We’ve had better times. Read our 10 step guide to troubleshooting like an Expert.

Written by Experts Exchange Expert, Matt Minor. View his profile at: http://www.experts-exchange.com/members/MattMinor86.html

See his original article here: http://www.experts-exchange.com/articles/25099/Troubleshoot-like-an-Expert-A-10-Step-Guide.html

Published in: Business

  1. 10 Step Guide to Troubleshooting like an Expert. Written by Matt Minor
  2. Captain Jack Sparrow’s infinite wisdom never ceases to amaze. His famous quote illustrated in the image above has really impacted the way I interpret “problem situations”. In this field, we often find ourselves in reactive situations -- with staff, managers, and team-leads breathing down our necks while we try to figure out what happened and get it corrected STAT. It’s stressful. We are fighting the feeling of panic and struggling with whether to fight or flee -- we’ve had better times. Here are ten steps to help better organize your troubleshooting procedure.
  3. Step 1: Prepare. This COULD mean getting required tools ready or pulling out product documentation. Though valuable, these are not my primary focus in a troubleshooting scenario. The preparation stage for me is mental -- getting my attitude in check and shifting to a CCC mindset (cool, calm, collected). There IS a solution and I CAN fix this. Jumping straight into a situation without any preparation can be catastrophic. Rebooting a server the second the internet goes down isn’t the best approach. Get yourself prepared to tackle the situation first, and you just might realize that your network cable was unplugged.
  4. Step 2: Outline a Damage Control Plan. We’re often under huge amounts of pressure to just get things back up and running. It’s a mad dash to the server room and anyone in our way is simply getting trampled. Kick open the door and hear the fire alarm going off... grab the water bucket as outlined by the disaster-recovery steering committee and toss it into the server rack! Success! Just saved some lives. What you failed to realize was that it was just a fire drill, and now the backup tapes from last week are waterlogged. A little extreme -- but the point is, critical data needs to be considered before jumping into corrective action. We know our networks, and we know what matters most. Consider the worst-case scenario for your actions, and have a backup plan.
  5. Step 3: Get the Symptom Description. It might sound trivial, in that of course we need to know what’s happening -- otherwise we don’t know there’s a problem! Though that may be true, it is also true that in our field many issues have symptoms similar to those of another problem, and we need to carefully distinguish them. Get as much information as possible. This is where a “script” or “flow-chart” comes in handy for the front-line staff who are taking problem calls. The quality of the information passed to the people doing the troubleshooting has a direct effect on the quality of the way the incident is handled. This area is critical to the troubleshooting process; the original article covers it in full. In summary, the types of questions that need to be asked are as follows:
     • When did it start happening?
     • What else happened around that time?
     • Any installations or configuration changes done around that time?
     • The who/what/where/when and why
  6. Step 4: Reproduce the Symptom. This one is simple, but still vital. You can’t begin to implement corrective measures if you don’t have a full understanding of the problem. Using the information from Step 3, try to recreate the issue so it can be witnessed first-hand. This isn’t always possible, but the alternative is to see it from the end-user perspective. Whether that’s a walk to their office or a remote assistance session, you should see for yourself what you’re working with. Many people, by nature, develop a more solid understanding of concepts and ideas through visual exposure than through reading or hearing. Having this first-hand knowledge of the problem will greatly aid in carrying out Step 5.
  7. Step 5: Take Corrective Action. Steps 1-4 have you collecting appropriate information, developing your contingency/back-out/backup plans, and generally just getting ready to tackle the task at hand -- solving the problem. A lot of the time, IT administrators and technicians jump straight to this step, setting themselves up for a world of potential new issues, not to mention lost time. Based on the sound, detailed information you now have, this is where you make your best-informed decision about what the issue is and what actions to take towards resolution. Here is where actions to resolve the problem are taken.
  8. Step 6: Narrow it Down (Isolate the Root-Cause). During this step, perform final validation of what the problem was by reviewing the pertinent Event Logs, specific application logs, device logs, and so on, to pin down exactly what caused the issue. In the ITIL world, this step is critical to the Service Operation stage, as its findings are brought to the table for review and root-cause analysis. The general idea here is to iron out a plan to: A) Stop this from occurring again. B) If it does happen again, because technology is unpredictable, handle it better. This analysis is not done during this step, of course, because we still have work to do!
  9. Step 7: Replace or Repair Defective Equipment. The ugly truth of working in IT is that sometimes we don’t have any clue of an issue until it explodes in our face. This stage is about making sure you don’t get blindsided again. If faulty equipment or a bad configuration was the issue, correct that or install a replacement device -- whatever you need to do to decrease the chance of a repeat problem.
  10. Step 8: Test. Once the fire is out and the smoke has settled, it’s time to reflect on the incident and verify that the correct response to the problem was taken. Ask the following questions: Did the symptom go away? Did the right symptom go away? Did I fix the right cause? Did I create any other problems? Having just dealt with a crisis, we’re starting to feel the relief and users are back to work as usual. We don’t want any surprises surfacing as a result of the incident, so asking ourselves these questions and performing any corresponding validation will help keep those users happy, and will ultimately help prevent a relapse.
  11. Step 9: Take Pride. Though not directly linked to the troubleshooting process itself, this step is, indeed, vital. You’ve just been involved in a stressful situation, with people coming at you from every which way looking for updates and ETAs. Now that things are back up and running, you need to take some time to talk about the incident. Tell your co-workers/managers/team-leads the process you went through to arrive at the solution. Brag to your teammates, respectfully of course. In IT, burnout is very real. This step is a great way to help prevent it from happening to you -- a chance to gloat, a chance to feel great about getting to the bottom of things. Always take this “debrief” period for your own sanity -- these situations are nerve-racking.
  12. Step 10: Prevent Future Occurrence. This stage is all about communication and documentation. Document the issue, including initial symptoms, affected areas, affected systems, and any other pertinent details. Document your corrective measures and your root-cause analysis. Meet with your team and discuss the findings so that everyone is on the same page. Make sure there is enough supporting detail that, should this issue recur, your colleagues would have a much easier time with diagnosis and resolution.
  13. The 10 steps outlined above were adapted from The Universal Troubleshooting Process, which is the methodology I have used throughout my career, and also throughout my adult life. I have held many roles in the IT field since graduating from college – and in each of those roles I have experienced “cart-before-horse” troubleshooting VERY often. The audience for this article is unrestricted. It does not matter if you are an expert, a novice, intermediate, or a plumber. These steps can be applied to any industry or field, which is one of the things I like most.
  14. About the Author. Matt Minor (MattMinor86) is a product of a small Northern Ontario town in Canada where he grew up, went to school, got his first IT job and even met his wife. Matt has a beautiful four-year-old boy named Jackson who means everything to him. When Matt isn’t helping out on Experts-Exchange, he can typically be found roughhousing with his son. Matt graduated at the top of his Computer Systems Technology - Networking class at Canadore College in 2004. He has certifications in Microsoft, Cisco, A+ Hardware and ITIL v3 Foundations. He is also a Microsoft Certified Systems Administrator, a Cisco Certified Network Associate, an Apple Certified Macintosh Technician and an Apple Certified Support Professional. Matt has been a Technical Specialist at the area hospital for nearly 7 years. In his current role, he is exposed to many facets of the IT atmosphere, from the end-user desktop right up to infrastructure-level operations and equipment.
  15. Experts Exchange is the network for technology professionals. Stop Searching, Start Solving.
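
The “script” for front-line staff in Step 3, the verification questions in Step 8, and the documentation in Step 10 could all be captured in one small structured record. The following is only a minimal Python sketch of that idea, not something from the original article: the names `IncidentRecord`, `INTAKE_QUESTIONS`, `VERIFICATION_QUESTIONS`, `missing_intake` and `is_verified` are hypothetical, and the expected yes/no answers to the Step 8 questions are an assumed reading of the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Step 3 intake questions, keyed by hypothetical field names.
INTAKE_QUESTIONS: Dict[str, str] = {
    "started_at": "When did it start happening?",
    "concurrent_events": "What else happened around that time?",
    "recent_changes": "Any installations or configuration changes done around that time?",
    "who_what_where": "The who/what/where/when and why",
}

# Step 8 questions mapped to the answer that indicates a clean fix
# (assumption: the last question should be answered "no").
VERIFICATION_QUESTIONS: Dict[str, bool] = {
    "Did the symptom go away?": True,
    "Did the right symptom go away?": True,
    "Did I fix the right cause?": True,
    "Did I create any other problems?": False,
}


@dataclass
class IncidentRecord:
    """Hypothetical record of one incident, filled in as the steps proceed."""
    summary: str
    answers: Dict[str, str] = field(default_factory=dict)        # Step 3
    corrective_actions: List[str] = field(default_factory=list)  # Step 5
    root_cause: str = ""                                         # Step 6
    verification: Dict[str, bool] = field(default_factory=dict)  # Step 8

    def missing_intake(self) -> List[str]:
        """Return the Step 3 questions that have no answer yet."""
        return [
            question
            for key, question in INTAKE_QUESTIONS.items()
            if not self.answers.get(key, "").strip()
        ]

    def is_verified(self) -> bool:
        """True only if every Step 8 question has the expected answer."""
        return all(
            self.verification.get(question) == expected
            for question, expected in VERIFICATION_QUESTIONS.items()
        )


if __name__ == "__main__":
    record = IncidentRecord(summary="Users cannot reach the intranet")
    record.answers["started_at"] = "Shortly after 09:00"
    # Questions the front-line staff still need to ask the caller.
    for question in record.missing_intake():
        print("Still to ask:", question)
```

A record like this can be attached to the Step 10 write-up, so that colleagues facing a recurrence see the original symptoms, the corrective actions, and the root cause in one place.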