Reducing service interruption duration

Erik Giles
Command Center Manager / Outage Event Coordinator

Are you in the right place?
• Have you ever…
– Spent hours sitting on a group conference call waiting for services to be restored?
– Arrived at work to find your own services down?
– Discovered that your services have been going up and down and no one told you?
– Been woken up in the middle of the night to address a service interruption?
– Had your own service go down because of an upstream infrastructure outage?
– Found out your system or service was down from your own customers?
• OR BY THE DEAN?
– Been unhappy with how your institution manages outages and/or service interruptions?
Why are you here?

What I am going to do
• Decompose the elements of operational recovery
• Apply a maturity model to these elements
• Evaluate your current environment
• Have fun playing a game where we role-play outages against one another
Timing: 10 – 15 minutes for the first three items, then the whole rest of the time for the game!

What you will get from this… hopefully.
• A deeper understanding of things that contribute to outage duration.
• At least one or two really easy, cheap things you can do to reduce outage
duration in less than 30 days.
• The six stages of an outage paradigm
– Also a convenient way to create a metric model to measure outage duration
• Have some fun!

Who am I
• Currently the Command Center Manager and Outage Coordinator for the
University of Chicago.
• Have both an MBA and an MS in Systems Engineering (from USC)
• Led teams in Boeing’s research division on technology that went into military
and government NOCs.
• Ran the Command Center for the Orbitz travel company, which managed over
1,200,000 real-time metrics on ~3,300 hosts with a 24/7 staff of 25
• Before all of this, spent 10 years as a system administrator, architect,
and implementer of ERP solutions.
• Find me here: http://www.linkedin.com/in/erikgiles/

Take out your Worksheet: Part I
• You are going to score your own organization (as well as you understand it)
as we go and fill in your answers on the worksheet.
• We have 10 elements to get through, so we only have about a minute for
each element.
• I will read the descriptor and “Level 0” (because it is the most fun) but I’m not
going to read the whole thing to you.
– Please ask questions if it is unclear
• Fill in your points number as you go.
– Notice that Metrics and Process are scored in reverse from the others: higher levels earn more points

Tools
Do you have tools dedicated to detecting service interruptions or outages used
by your staff to inform you of issues BEFORE customers call to tell you?
• Level 0: There probably are, but they aren’t centrally managed or even really understood, so when a service
goes down we just call that team and ask them to look into it. (8 points)
• Level 1: Our teams use some of the vendor tools but no one else has access to them. We don’t have
any contact management tools or common/documented gathering solutions like chat. (5 points)
• Level 2: We have vendor monitoring tools for most of our systems (but not services) and we use things
like Outlook to manage our contacts along with a group chat room for outages. (4 points)
• Level 3: We have a central monitoring solution used by some of the teams but manage our contacts,
documentation, and collaboration using just standard tools. (3 points)
• Level 4: Our central monitoring solution sees most infrastructure and we have dedicated tools for our
contact and document management but our services themselves are not monitored. (2 points)
• Level 5: Our central monitoring solution sees all of our infrastructure and tracks service uptime along
with a dedicated set of tools for contact and document management. (1 point)

Breadth
Do you centrally monitor all, some or very little of the various infrastructure and
services for which your organization is accountable?
• Level 0: The team that does most of our outage management all comes from the same small group and
they have no real understanding of the other areas in the organization’s domain. (8 points)
• Level 1: One of the major groups (servers OR network) handles most of the monitoring and outage
management and they have a limited understanding of the other areas as well. (5 points)
• Level 2: Several of the key teams all manage the primary infrastructure but other areas such as phones
or data center equipment are still siloed and not well understood. (4 points)
• Level 3: Most of the key infrastructure is monitored centrally, including dissimilar types of technology,
but services themselves are only monitored based on their underlying hardware and OSes. (3 points)
• Level 4: The infrastructure is all visible to the outage management groups and some of the big common
services (email, wireless, financial systems) are monitored centrally. (2 points)
• Level 5: All infrastructure and services are visible to the central command center / outage management
team and this team can see and understand their current state. (1 point)

Coverage
Do you have a dedicated monitoring/event response team and what is their
coverage window?
• Level 0: There is no team responsible for addressing outages. When something goes down or wrong
anyone who can help just jumps on a call or meets in a room. (8 points)
• Level 1: There is a team who coordinates outages and watches monitoring tools but they have other
duties and are only available during their personal work schedule. (5 points)
• Level 2: There is a dedicated team of people for monitoring and outage management but they work a
business hours schedule. (4 points)
• Level 3: There is a dedicated team of people who do monitoring and outage management and they
work extended hours with some staff redundancy. (3 points)
• Level 4: There is a dedicated team of people who do monitoring and outage management 24/7/365
with limited or no redundancy. (2 points)
• Level 5: There is a dedicated team of people who do monitoring and outage management 24/7/365
with at least two (or more) staff for most of the schedule (1 point)

Skills
Does your command center team have technical skills such that certain staff can
be a resource to the service teams in some technical areas (Unix, network, etc.)?
• Level 0: The staff that support outages just take messages and call people and do not have any skills
to investigate or fix or even do maintenance on any of the production systems or services. (8 points)
• Level 1: The command center or outage management team comes from one of the teams and they can
address issues in that area only, but everything else they have to escalate. (5 points)
• Level 2: The team has skills in one area and can take limited action from documentation and access in
areas to do read only investigation to see if a service or system is up, down, or impacted. (4 points)
• Level 3: The team has skills in one area and can do fairly comprehensive activities to do read only
investigation to see if things are up, down, or impacted. (3 points)
• Level 4: The team has multiple skills represented and can do lots of read-only activity, including limited
direct action to fix or address certain kinds of outages independently. (2 points)
• Level 5: The team has most of the major skill areas covered and is capable of any kind of read-only
investigation along with the ability to do level one maintenance and support tasks. (1 point)

Training
Do you have processes in place to train outage management staff on newly
introduced systems or services and what to do during an outage?
• Level 0: We rely on word of mouth for folks to know about new services and systems, and consider it
their responsibility to reach out to those folks to learn how to support them during an outage. (8 points)
• Level 1: When a new service or system is released or upgraded, someone calls or drops by and
explains it to a member of the outage staff; sometimes an email goes out. (5 points)
• Level 2: For a new service (or major change to a service) there is a requirement that it be documented
and that documentation shared with the support group before it goes live. (4 points)
• Level 3: For a new service there is a requirement that formal training be conducted with the support
team, including documentation. (3 points)
• Level 4: The command center / outage management team has an internally tracked training plan to
ensure that all staff are trained on all systems they support, including testing. (2 points)
• Level 5: The command center team has a formal training plan that includes all systems and testing and
also has a refresh cycle and on-boarding process for all new staff. (1 point)

Data
Do you keep an accurate list of who owns and manages what systems, what
vendors are used, and standard architectural documentation?
• Level 0: We rely entirely on tribal knowledge and people’s personal notes when it comes to who
manages what, what vendors we use, and how our systems and architecture are laid out. (8 points)
• Level 1: We use Outlook to find people during an outage and use individual collaboration spaces to
figure out how their service or system works, if they publish or share it. (5 points)
• Level 2: We keep a list of who to call when something is broken and it includes most of their contact
information including home/cell numbers, along with space for some system documentation. (4 points)
• Level 3: We maintain a list of technical contacts’ and service owners’ contact info along with an escalation
tree if they don’t answer, along with our central space for system documentation. (3 points)
• Level 4: We maintain a list of technical contacts’, service owners’, and vendors’ contact data along with
an escalation path, and keep our system documentation in a document management system. (2 points)
• Level 5: We keep all our contact data in a CRM tool with all relevant types of contacts and contact
channels along with documentation in a document management system. (1 point)

Space
Do you locate the staff responsible for outages together and near the people
whose systems they watch as much as possible?
• Level 0: Our staff that support outages sit wherever they were, including remote and work-from-home
staff, and may or may not know where the rest of the team sits. (8 points)
• Level 1: Staff all sit on site in cubes and/or offices near one another but have no common screens or
central phone area from which to manage outages. (5 points)
• Level 2: Staff sit near one another in cubes but have a set of monitor screens they can all see that
show the health of the systems and services. (4 points)
• Level 3: Staff all sit near one another in cubes where they can see common screens and have access
to a conference room they can use for outages. (3 points)
• Level 4: Staff all sit in a common “war room” arranged to face the area of screens showing the health of
the systems with a conference room phone for conducting outages. (2 points)
• Level 5: Staff all sit in and use the common “command center” with screens and phone, and this space is
used by other staff to come and work collaboratively during outages. (1 point)

Access
Does your command center team have access to the management interface of
any of your systems such that they can address simple issues on their own?
• Level 0: Our command center staff do not have access to any of the tools used to fix or even see what
is going on with the systems or services during an outage or service interruption. (8 points)
• Level 1: The staff have limited read-only access to a few systems based not on priority but on who
wanted to share with that team. (5 points)
• Level 2: The staff have limited read-only access and some write access to a few systems, but not all of
the key systems, just those of the teams with whom they happen to work well. (4 points)
• Level 3: The staff have read-only access to most of the key systems but only limited access to anything
else and not consistently. (3 points)
• Level 4: The staff have read-only access to all monitored systems and access to key systems to make
simple fixes via documentation, training, or specific direction. (2 points)
• Level 5: The staff have full access to all key and most other systems and are routinely asked to perform
simple fixes at the behest of the technical teams. (1 point)

Process
Do you have managed processes that you use to manage service interruption
and reduce outage duration?
• Level 0: When there are outages we all just get together and try and figure it out as quickly as we can
letting the situation determine how we respond. (0 points)
• Level 1: All processes are done via emails to the team on what to do, kept in a folder or printed for a
binder, and not reviewed or evaluated for completeness or effectiveness. (1 point)
• Level 2: Processes for outage and support activities are written up formally but none of the processes
for restoring service or prioritizing work are documented and no reviews are done. (2 points)
• Level 3: Templates are used for processes, and they are done for support work and for the systems and
services, but they are not required; some review is done but there is no formal process. (3 points)
• Level 4: Process lifecycle is formalized with templates, reviews, and required updates when documents
expire but still not mandatory for all services and systems. (4 points)
• Level 5: Process lifecycle is formalized and mandatory for all services and systems. (5 points)

Metrics
Do you have a standard set of metrics or KPIs that you use to track outage
management (duration, response time, frequency, etc.)?
• Level 0: We don’t have any metrics or KPIs and go by instinct when something is getting bad or needs
fixing and let staff tell us when the work volume is getting too much. (0 points)
• Level 1: When we need metrics we have someone who will go in and review emails or requests and
then we count them up and make charts for management. (1 point)
• Level 2: We have a set of metrics that we use, but we don’t keep the data current (or our system doesn’t
do it well), so no one really uses or trusts them. (2 points)
• Level 3: We have current and up to date metrics but because the tools don’t work very well someone
has to go in by hand and count them up and create each chart and graph by hand. (3 points)
• Level 4: We have solid metrics with data visualization but the data isn’t shared and management does
not really look at them to make decisions because they aren’t counting the important things. (4 points)
• Level 5: We have solid metrics that are automatically generated and shared publicly and it drives
decision making and resource management on a daily basis. (5 points)
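The ten rubrics above share one scoring pattern, which can be sketched in a few lines of Python. This is only an illustration for filling in the worksheet: the names `INVERSE_POINTS`, `points_for`, and `score_worksheet` are made up for this sketch, and the example levels are arbitrary.

```python
# Points per maturity level for the eight "lower is better" elements,
# copied from the rubrics above.
INVERSE_POINTS = {0: 8, 1: 5, 2: 4, 3: 3, 4: 2, 5: 1}

# Process and Metrics are scored the other way around: points equal the level.
DIRECT_ELEMENTS = {"Process", "Metrics"}

ELEMENTS = ["Tools", "Breadth", "Coverage", "Skills", "Training",
            "Data", "Space", "Access", "Process", "Metrics"]


def points_for(element: str, level: int) -> int:
    """Return the worksheet points for one element at a maturity level (0-5)."""
    if element not in ELEMENTS:
        raise ValueError(f"unknown element: {element}")
    if level not in INVERSE_POINTS:
        raise ValueError(f"level must be 0-5, got {level}")
    return level if element in DIRECT_ELEMENTS else INVERSE_POINTS[level]


def score_worksheet(levels: dict[str, int]) -> dict[str, int]:
    """Convert self-assessed levels into worksheet points, element by element."""
    return {element: points_for(element, level) for element, level in levels.items()}


# Arbitrary example levels, not a real assessment.
example_levels = {"Tools": 2, "Coverage": 0, "Process": 1, "Metrics": 5}
print(score_worksheet(example_levels))
# {'Tools': 4, 'Coverage': 8, 'Process': 1, 'Metrics': 5}
```

Low points are good for the first eight elements, while high points are good for Process and Metrics, because those two are subtracted later in the worksheet.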

How are we going to make this fun?

How we play the game
• We are going to go through a series of “outage scenarios” and see who can
“resolve” the issue the fastest.
• We will combine your current maturity levels with a roll of dice to get through
the six stages of an outage.

Phases of Outage
1) Detect – the period of time from when an event happens to when tools or systems log or
detect the event within the system.
2) See – the period of time from when the event is logged or detected in a system to the time
when a key staff member knows about it.
3) Engage – the period of time from when the first key staff member sees the event to the
time when the “right” staff who can fix it are fully engaged.
4) Determine – the period of time from when the event is fully resourced until there is an
agreed-to cause of the issue (this one can iterate).
5) Plan – the period of time from when you know what the issue is to when staff know
exactly how to address and mitigate it.
6) Mitigate – the period of time from when you know what to do (or think you do) to when the
issue itself is no longer having an impact.
Note: at the end of Mitigate, if the issue isn’t resolved, go back to Step 4 (Determine).
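One way to turn the six phases into a metric model is to record a timestamp at each phase boundary and subtract. The sketch below assumes seven timestamps per outage and a single pass through the phases; the function name `phase_durations` and the sample times are invented for illustration.

```python
from datetime import datetime, timedelta

PHASES = ["Detect", "See", "Engage", "Determine", "Plan", "Mitigate"]


def phase_durations(marks: list[datetime]) -> dict[str, timedelta]:
    """Split one outage into the six phases.

    `marks` holds seven timestamps in order: event start, detected, seen,
    right staff engaged, cause agreed, plan agreed, impact ended.
    """
    if len(marks) != len(PHASES) + 1:
        raise ValueError("expected seven timestamps")
    if any(later < earlier for earlier, later in zip(marks, marks[1:])):
        raise ValueError("timestamps must be in chronological order")
    # Each phase runs from one boundary timestamp to the next.
    return {phase: end - start for phase, start, end in zip(PHASES, marks, marks[1:])}


# Made-up boundary times (minutes after the event), not from a real outage.
start = datetime(2024, 1, 8, 9, 0)
marks = [start + timedelta(minutes=m) for m in (0, 4, 10, 25, 55, 65, 95)]

durations = phase_durations(marks)
for phase, spent in durations.items():
    print(f"{phase:<10}{spent}")
print("Total     ", sum(durations.values(), timedelta()))
```

If Mitigate fails and the outage loops back to Determine, this simple version would need one extra set of timestamps per loop; it does not handle that case.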

Worksheet Part II
• Detect = (Tools + Breadth) * “chance” – (Metrics + Process)
• See = (Coverage + Skills + Training) * “chance” – (Metrics + Process)
• Engage = (Coverage + Data + Space) * “chance” – (Metrics + Process)
• Determine = (Access + Breadth) * “chance” – (Metrics + Process)
• Plan = (Access + Skills) * “chance” – (Metrics + Process)
• Mitigate = (Tools + Training) * “chance” – (Metrics + Process)
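The six worksheet formulas can be sketched in Python as follows. The names `PHASE_INPUTS` and `phase_score` are illustrative, the example assumes every element sits at Level 2, and adding the six phases into one total is an assumption about how the worksheet is tallied.

```python
# Which worksheet elements feed each phase, taken from the formulas above.
PHASE_INPUTS = {
    "Detect": ("Tools", "Breadth"),
    "See": ("Coverage", "Skills", "Training"),
    "Engage": ("Coverage", "Data", "Space"),
    "Determine": ("Access", "Breadth"),
    "Plan": ("Access", "Skills"),
    "Mitigate": ("Tools", "Training"),
}


def phase_score(phase: str, points: dict[str, int], chance: int) -> int:
    """(sum of the phase's element points) * chance - (Metrics + Process)."""
    base = sum(points[name] for name in PHASE_INPUTS[phase])
    return base * chance - (points["Metrics"] + points["Process"])


# Example: every element at Level 2 (4 points each; Metrics and Process 2 each).
points = {name: 4 for names in PHASE_INPUTS.values() for name in names}
points.update({"Metrics": 2, "Process": 2})

chance = 3  # assumed dice result
scores = {phase: phase_score(phase, points, chance) for phase in PHASE_INPUTS}
print(scores["Detect"])      # (4 + 4) * 3 - (2 + 2) = 20
print(scores["See"])         # (4 + 4 + 4) * 3 - (2 + 2) = 32
print(sum(scores.values()))  # 144, if the six phases are simply added up
```

With strong Metrics and Process scores and a low roll, a phase can come out negative. The worksheet does not say how to treat that case, so the sketch leaves the value as computed.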

Scenario One: one die
After the first three days of the new term without any major incidents you are
starting to feel like everyone is really getting a handle on this outage
management thing. Then you get the call.
EMAIL IS DOWN!
Using your “Round One” section see how long before you can use email again.

Scenario Two: two dice
After a long hard year, you decide to take the week from Christmas to New
Year’s off and really unwind. It is all going great and you are just getting ready
for your first New Year’s out on the town in years when you get the email.
YOUR PRIMARY DATA CENTER IS DARK!
Using your “Round Two” section see how long before you can start celebrating
the New Year.

Scenario Three: three dice
It’s that season again… no, not Academy Awards season but Nobel season. You were
awoken this morning to find out that one of your faculty has just won a prize and
everyone is on their way to campus for the big press release and speech. You
are about to walk over to the pavilion when you get the call.
ALL CAMPUS WIRELESS IS OFFLINE!
Using your “Round Three” section see if you can get your wireless up before
the press arrives on campus.

Scenario Four: four dice
Finally the term is over and summer is here. You’ve got some summer classes
and a few projects to get through but the campus is mostly quiet. It is a nice
relaxing Friday afternoon and you are thinking about heading out a little early.
Then you get the call… from your security officer.
YOU’RE UNDER A DENIAL OF SERVICE ATTACK!
Using your “Round Four” section see how long before you get to go home.
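If you want to replay the four rounds without physical dice, the “chance” factor can be simulated. This sketch assumes six-sided dice whose faces are summed, which the scenarios above do not spell out; `roll_chance` and `expected_chance` are names invented here.

```python
import random


def roll_chance(n_dice: int, rng: random.Random) -> int:
    """Sum of `n_dice` six-sided dice (an assumed reading of "chance")."""
    if n_dice < 1:
        raise ValueError("need at least one die")
    return sum(rng.randint(1, 6) for _ in range(n_dice))


def expected_chance(n_dice: int) -> float:
    """Average total for `n_dice` fair six-sided dice: 3.5 per die."""
    return 3.5 * n_dice


rng = random.Random(2024)  # fixed seed so a replay gives the same rolls
for round_number in range(1, 5):
    rolled = roll_chance(round_number, rng)
    print(f"Round {round_number}: rolled {rolled}, average {expected_chance(round_number)}")
```

Under that assumption the multiplier grows with each round, so the later scenarios tend to run longer for the same maturity scores.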

Summary
• There is no one thing you can do that will reduce or eliminate outages.
• Every outage management strategy can be different and there isn’t any one
answer that will work for everyone.
• Often the simplest steps can have a huge impact on reducing outage duration.
– Write stuff down (process)
– Count stuff (metrics)

Final Question
How many people have already identified one
simple change they could easily and cheaply
make based on what they saw here?