We often find ourselves in reactive situations, with staff, managers, and team leads breathing down our necks while we try to figure out what happened and get it corrected STAT. It’s stressful. We are fighting the panic, torn between fight and flight; we’ve had better times. Read our 10-step guide to troubleshooting like an Expert.
Written by Experts Exchange Expert, Matt Minor. View his profile at: http://www.experts-exchange.com/members/MattMinor86.html
See his original article here: http://www.experts-exchange.com/articles/25099/Troubleshoot-like-an-Expert-A-10-Step-Guide.html
Troubleshoot like an Expert: A 10-Step Guide
Captain Jack Sparrow’s infinite wisdom never ceases to amaze. His
famous quote illustrated in the image above has really impacted the way
I interpret “problem situations”. In this field, we often find ourselves in
reactive situations, with staff, managers, and team leads breathing down
our necks while we try to figure out what happened and get it
corrected STAT. It’s stressful. We are fighting the panic, torn
between fight and flight. We’ve had better times.
Here are ten steps to help better organize your troubleshooting procedure.
This COULD mean getting required tools ready or
pulling out product documentation. Though valuable,
these are not my primary focus in a troubleshooting
scenario. The preparation stage for me is mental
-- getting my attitude in check and shifting to a CCC
mindset (cool, calm, collected). There IS a solution and
I CAN fix this. Jumping straight into a situation without
any preparation can be catastrophic. Rebooting a
server the second the internet goes down isn’t the best
approach. Get yourself prepared to tackle the situation
first, and you just might realize that your network cable
We’re often under huge amounts of pressure to just get things
It’s a mad dash to the server room and anyone in our way is
simply getting trampled. Kick open the door and hear the fire
alarm going off... grab the water-bucket as outlined by the
disaster-recovery steering committee and toss it into the
server rack! Success! Just saved some lives.
What you failed to realize was that it was just a fire-drill, and
now the backup tapes from last week are water-logged. A
little extreme -- but the point is, critical data needs to be
considered before jumping into corrective action. We know
our networks, and we know what matters most. Consider the
worst-case scenario for your actions, and have a backup plan.
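As one small illustration of a backup plan, the sketch below copies a file (say, a device configuration) to a timestamped backup before any change is made to it. It is written in Python; the function name backup_before_change, the ".bak" naming scheme, and the directory layout are assumptions made for this example and are not part of the original guide.

```python
import shutil
import time
from pathlib import Path


def backup_before_change(path: str, backup_dir: str) -> Path:
    """Copy a file into backup_dir with a timestamp suffix; return the copy's path.

    Hypothetical helper for illustration only.
    """
    source = Path(path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Timestamp in the name keeps earlier backups from being overwritten
    # (as long as two backups are not taken within the same second).
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.name}.{stamp}.bak"

    # copy2 copies the contents and tries to preserve file metadata.
    shutil.copy2(source, target)
    return target
```

Taking the copy first means that if the corrective action makes things worse, there is a known-good file to roll back to.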
Outline a Damage
It might sound trivial, in that of course we need to know what’s
happening -- otherwise we don’t know there’s a problem!
That may be true, but in our field
many of the issues we deal with have symptoms similar
to those of another problem, and we need to carefully distinguish
Get as much information as possible. This is where a
“script” or “flow-chart” comes in handy for the front-line
staff who take problem calls. The quality of the
information passed to the people doing the
troubleshooting directly affects how well the
incident is handled.
This area is critical to the troubleshooting process.
In summary, the types of questions that
need to be asked are as follows:
• When did it start happening?
• What else happened around that time?
• Any installations or configuration
changes done around that time?
• The who/what/where/when and why
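The questions above could be turned into a simple intake "script" for front-line staff. The following Python sketch is one possible shape for it: a record with one field per question and a helper that lists what has not been answered yet. The class name ProblemReport and all field names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemReport:
    """Answers collected when a problem call comes in (illustrative fields)."""
    reported_by: str = ""     # who
    what_happened: str = ""   # what
    where: str = ""           # where
    started_at: str = ""      # when did it start happening
    other_events: str = ""    # what else happened around that time
    recent_changes: str = ""  # installations or configuration changes

    def missing_answers(self) -> List[str]:
        """Return the names of the fields that are still blank."""
        return [name for name, value in vars(self).items() if not value.strip()]


report = ProblemReport(
    reported_by="front desk",
    what_happened="cannot print",
    started_at="Monday 09:00",
)
print(report.missing_answers())
```

A record like this gives the person doing the troubleshooting a quick view of which questions still need to be asked before work begins.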
Get the Symptom
This one is simple, but still vital. You can’t begin to implement
corrective measures if you don’t have a full understanding of
Using Step 3, try and recreate the issue so it can be witnessed
first-hand. This isn’t always possible, but the
alternative is to see it from the end-user perspective. Whether
that’s a walk to their office or by using remote assistance, you
should see for yourself what you’re working with.
Many people develop a more solid
understanding of concepts and ideas through visual exposure
than through reading or hearing. Having this first-hand
knowledge of the problem will greatly aid in carrying out Step 5.
Steps 1-4 have you collecting appropriate information,
developing your contingency/back-out/backup plans, and
generally getting ready to tackle the task at hand: solving
the problem. A lot of the time, IT administrators and technicians
jump straight to this step, setting themselves up for a world of
potential new issues, not to mention lost time.
Based on the sound, detailed information you now have, this is
where you make your best-informed decision about what
the issue is and what actions to take towards resolution, and
then take those actions.
During this step, perform final validation of what the
problem was by reviewing the pertinent Event Logs, specific
application logs, device logs, and so on to pin down
exactly what caused the issue. In the ITIL world, this step
is critical to the Service Operation stage, because its findings are
brought to the table for review and root-cause analysis.
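When reviewing logs, it often helps to look only at entries close to the time the problem started. The Python sketch below shows one way this could be done. It assumes, purely for illustration, that each log line begins with an ISO 8601 timestamp without a time zone, followed by a space; real event logs and device logs use many different formats, so the parsing would need to be adapted. The function name lines_near is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def lines_near(lines: Iterable[str], incident: datetime, minutes: int = 15) -> List[str]:
    """Return the log lines stamped within +/- `minutes` of the incident time.

    Assumes each line starts with a naive ISO 8601 timestamp, e.g.
    "2024-03-01T09:02:10 link down on port 4".
    """
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        stamp, _, _message = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a parseable timestamp
        if abs(when - incident) <= window:
            selected.append(line)
    return selected


sample = [
    "2024-03-01T08:30:00 backup finished",
    "2024-03-01T09:02:10 link down on port 4",
    "not a timestamped line",
    "2024-03-01T09:40:00 user login",
]
print(lines_near(sample, datetime(2024, 3, 1, 9, 0, 0)))
```

The sample entries are invented for the example. Narrowing the window this way makes it easier to line up the log entries with the "what else happened around that time?" answers gathered earlier.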
The general idea here is to iron out a plan that answers two questions:
A) How do we stop this from occurring again?
B) If it does happen again, because technology is unpredictable,
how can we handle it better?
This analysis is not done during this step, of course, because we
still have work to do!
Narrow it Down
(Isolate the Root-Cause)
The ugly truth of working in IT is that sometimes we don’t have
any clue of an issue until it explodes in our face. This stage is
about making sure you don’t get blindsided again.
If faulty equipment or a bad configuration was the issue, correct
that or install a replacement device -- whatever you need to do
to decrease the chance of a repeat problem.
Replace or Repair
Once the fire is out and the smoke has settled, it’s time to
reflect on the incident and verify that the correct response to
the problem was taken. Ask the following questions:
• Did the symptom go away?
• Did the right symptom go away?
• Did I fix the right cause?
• Did I create any other problems?
Having just dealt with a crisis, we’re starting to feel the relief and
users are back to work as usual. We don’t want any unexpected
surprises surfacing as a result of the incident, so asking
ourselves these questions and performing any corresponding
validation will help keep those users happy, and will ultimately
help prevent a relapse.
Though not directly linked to the troubleshooting process itself,
this step is, indeed, vital. You’ve just been involved in a stressful
situation, with people coming at you from every which way
looking for updates and ETAs. Now that things are back up and
running, you need to take some time to talk about the incident.
Tell your co-workers/managers/team-leads the process
you went through to arrive at the solution. Brag with your
teammates, respectfully of course. In IT, the concept of burnout
is very real. This step is a great way to help prevent this from
happening to you -- a chance to gloat, a chance to feel great
about getting to the bottom of things.
Always take this “debrief” period for your own mental
sanity -- these situations are nerve-racking.
This stage is all about communication and documentation.
Document the issue including initial symptoms, affected areas,
affected systems, and any other pertinent details. Document
your corrective measures and your root-cause analysis.
Meet with your team and discuss the findings so that everyone
is on the same page. Make sure that there is plenty of
supporting detail, enough that you are confident that should
this issue recur, your colleagues would have a much easier time
with diagnosis and resolution.
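One way to keep this documentation consistent from incident to incident is a fixed template. The Python sketch below captures the items named above (initial symptoms, affected areas, affected systems, corrective measures, and root cause) and renders them as plain text. The class name IncidentRecord, its field names, and the output layout are assumptions made for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentRecord:
    """Plain-text incident write-up (hypothetical template)."""
    title: str
    initial_symptoms: List[str]
    affected_areas: List[str]
    affected_systems: List[str]
    corrective_measures: List[str]
    root_cause: str

    def to_text(self) -> str:
        """Render the record as a short report, one section per list."""
        sections = [
            ("Initial symptoms", self.initial_symptoms),
            ("Affected areas", self.affected_areas),
            ("Affected systems", self.affected_systems),
            ("Corrective measures", self.corrective_measures),
        ]
        lines = [self.title]
        for heading, items in sections:
            lines.append(f"{heading}:")
            lines.extend(f"  - {item}" for item in items)
        lines.append(f"Root cause: {self.root_cause}")
        return "\n".join(lines)
```

Because every write-up has the same sections, a colleague facing a similar issue later can scan past reports quickly for matching symptoms.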
The 10 steps outlined above were adapted from The Universal Troubleshooting
Process, which is the methodology I have used throughout my career and
throughout my adult life. I have held many roles in the IT field since graduating
from college, and in each of those roles I have seen “cart-before-horse”
troubleshooting VERY often. The audience for this article is unrestricted. It does
not matter if you are an expert, an intermediate, a novice, or a plumber. These steps
can be applied to any industry or field, which is one of the things I like most.
About the Author
Matt Minor (MattMinor86) is a product of a small Northern-Ontario town
in Canada where he grew up, went to school, got his first IT job and even met
his wife. Matt has a beautiful four-year-old boy named Jackson who means
everything to him. When Matt isn’t helping out on Experts-Exchange, he can
typically be found roughhousing with his son.
Matt graduated at the top of his Computer Systems Technology - Networking
class at Canadore College in 2004. He has certifications in Microsoft, Cisco,
A+ Hardware and ITIL v3 Foundations. He is also a Microsoft Certified Systems
Administrator, a Cisco Certified Network Associate, an Apple Certified
Macintosh Technician and an Apple Certified Support Professional.
Matt has been a Technical Specialist at the area hospital for nearly 7 years. In his
current role, he is exposed to many facets of the IT environment, from the
end-user desktop right up to infrastructure-level operations and equipment.
Experts Exchange is the network for technology professionals.
Stop Searching, Start Solving.