Reducing Service Interruption Duration
Erik Giles
Command Center Manager / Outage Event Coordinator
Are you in the right place?
• Have you ever…
– Spent hours sitting on a group conference call waiting for services to be restored?
– Arrived at work to find your own services down?
– Discovered that your services have been going up and down and no one told you?
– Been woken up in the middle of the night to address a service interruption?
– Had your own service go down because of an upstream infrastructure outage?
– Found out your system or service was down from your own customers?
• OR BY THE DEAN?
– Been unhappy with how your institution manages outages and/or service interruptions?
Why are you here?
What I am going to do
• Decompose the elements of operational recovery
• Apply a maturity model to these elements
• Evaluate your current environment
  (10 – 15 minutes for the three items above)
• Have fun playing a game where we role-play outages against one another
  (the whole rest of the time!)
What you will get from this… hopefully.
• A deeper understanding of things that contribute to outage duration.
• At least one or two really easy, cheap things you can do to reduce outage
duration in less than 30 days.
• The six stages of an outage paradigm
– Also a convenient way to create a metric model to measure outage duration
• Have some fun!
Who am I
• Currently the Command Center Manager and Outage Coordinator for the
University of Chicago.
• Have both an MBA and an MS in Systems Engineering (from USC)
• Led teams in Boeing’s research division on technology that went into military
and government NOCs.
• Ran the Command Center for the Orbitz Travel company that managed over
1,200,000 real-time metrics on ~3300 hosts with a 24/7 staff of 25
• Before all of this, 10 years of experience as a system administrator and
architect, and as an implementer of ERP solutions.
• Find me here: http://www.linkedin.com/in/erikgiles/
Take out your Worksheet: Part I
• You are going to score your own organization (as well as you understand it)
as we go and fill in your answers on the worksheet.
• We have 10 elements to get through, so we only have about a minute for
each one.
• I will read the descriptor and “Level 0” (because it is the most fun) but I’m not
going to read the whole thing to you.
– Please ask questions if it is unclear
• Fill in your points number as you go.
– Notice that Metrics and Process are scored in reverse from the others (a higher level earns more points, not fewer)
Tools
Do you have tools dedicated to detecting service interruptions or outages used
by your staff to inform you of issues BEFORE customers call to tell you?
• Level 0: There probably are, but they aren’t centrally managed or even really understood, so when a service
goes down we just call that team and ask them to look into it. (8 points)
• Level 1: Our teams use some of the vendor tools but no one else has access to them. We don’t have
any contact management tools or common/documented gathering solutions like chat. (5 points)
• Level 2: We have vendor monitoring tools for most of our systems (but not services) and we use things
like Outlook to manage our contacts along with a group chat room for outages. (4 points)
• Level 3: We have a central monitoring solution used by some of the teams but manage our contacts,
documentation, and collaboration using just standard tools. (3 points)
• Level 4: Our central monitoring solution sees most infrastructure and we have dedicated tools for our
contact and document management but our services themselves are not monitored. (2 points)
• Level 5: Our central monitoring solution sees all of our infrastructure and tracks service uptime along
with a dedicated set of tools for contact and document management. (1 point)
Breadth
Do you centrally monitor all, some or very little of the various infrastructure and
services for which your organization is accountable?
• Level 0: The team that does most of our outage management all comes from the same small group and
they have no real understanding of the other areas in the organization’s domain. (8 points)
• Level 1: One of the major groups (servers OR network) handles most of the monitoring and outage
management and they have a limited understanding of the other areas as well. (5 points)
• Level 2: Several of the key teams all manage the primary infrastructure but other areas such as phones
or data center equipment are still siloed and not well understood. (4 points)
• Level 3: Most of the key infrastructure is monitored centrally including dissimilar types of technology
but services themselves are only monitored based on their underlying hardware and OSes. (3 points)
• Level 4: The infrastructure is all visible to the outage management groups and some of the big common
services (email, wireless, financial systems) are monitored centrally. (2 points)
• Level 5: All infrastructure and services are visible to the central command center / outage management
team and this team can see and understand their current state. (1 point)
Coverage
Do you have a dedicated monitoring/event response team and what is their
coverage window?
• Level 0: There is no team responsible for addressing outages. When something goes down or wrong
anyone who can help just jumps on a call or meets in a room. (8 points)
• Level 1: There is a team who coordinates outages and watches monitoring tools but they have other
duties and are only available during their personal work schedule. (5 points)
• Level 2: There is a dedicated team of people for monitoring and outage management but they work a
business hours schedule. (4 points)
• Level 3: There is a dedicated team of people who do monitoring and outage management and they
work extended hours with some staff redundancy. (3 points)
• Level 4: There is a dedicated team of people who do monitoring and outage management 24/7/365
with limited or no redundancy. (2 points)
• Level 5: There is a dedicated team of people who do monitoring and outage management 24/7/365
with at least two (or more) staff for most of the schedule. (1 point)
Skills
Does your command center team have technical skills such that certain staff in
some technical areas can be a resource to the service teams (Unix, network, etc.)?
• Level 0: The staff that support outages just take messages and call people and do not have any skills
to investigate or fix or even do maintenance on any of the production systems or services. (8 points)
• Level 1: The command center or outage management team comes from one of the teams and they can
address issues in that area only but everything else they have to escalate. (5 points)
• Level 2: The team has skills in one area and can take limited action from documentation and access in
areas to do read only investigation to see if a service or system is up, down, or impacted. (4 points)
• Level 3: The team has skills in one area and can do fairly comprehensive activities to do read only
investigation to see if things are up, down, or impacted. (3 points)
• Level 4: The team has multiple skills represented and can do lots of read-only activity, including limited
direct action to fix or address certain kinds of outages independently. (2 points)
• Level 5: The team has most of the major skill areas covered and is capable of any kind of read-only
investigation along with the ability to do level one maintenance and support tasks. (1 point)
Training
Do you have processes in place to train outage management staff on newly
introduced systems or services and what to do during an outage?
• Level 0: We rely on word of mouth for folks to know about new services and systems, and consider it
their responsibility to reach out to those folks to learn how to support them during an outage. (8 points)
• Level 1: When a new service or system is released or upgraded, someone calls or drops by and
explains it to a member of the outage staff; sometimes an email goes out. (5 points)
• Level 2: For a new service (or major change to a service) there is a requirement that it be documented
and that documentation shared with the support group before it goes live. (4 points)
• Level 3: For a new service there is a requirement that formal training be conducted with the support
team including documentation. (3 points)
• Level 4: The command center / outage management team has an internally tracked training plan to
ensure that all staff are trained on all systems they support, including testing. (2 points)
• Level 5: The command center team has a formal training plan that includes all systems and testing and
also has a refresh cycle and on-boarding process for all new staff. (1 point)
Data
Do you keep an accurate list of who owns and manages what systems, what
vendors are used, and standard architectural documentation?
• Level 0: We rely entirely on tribal knowledge and people’s personal notes when it comes to who
manages what, what vendors we use, and how our systems and architecture are laid out. (8 points)
• Level 1: We use Outlook to find people during an outage and use individual collaboration spaces to
figure out how their service or system works, if they publish or share it. (5 points)
• Level 2: We keep a list of who to call when something is broken and it includes most of their contact
information including home/cell numbers, along with space for some system documentation. (4 points)
• Level 3: We maintain a list of technical contacts’ and service owners’ contact info along with an escalation
tree if they don’t answer, along with our central space for system documentation. (3 points)
• Level 4: We maintain a list of technical contacts’, service owners’, and vendors’ contact data along with
an escalation path, and keep our system documentation in a document management system. (2 points)
• Level 5: We keep all our contact data in a CRM tool with all relevant types of contacts and contact
channels along with documentation in a document management system. (1 point)
Space
Do you locate the staff responsible for outages together and near the people
whose systems they watch as much as possible?
• Level 0: Our staff that support outages sit wherever they were, including remote and work-from-home
staff and may or may not know where the rest of the team sits. (8 points)
• Level 1: Staff all sit on site in cubes and/or offices near one another but have no common screens or
central phone area from which to manage outages. (5 points)
• Level 2: Staff sit near one another in cubes but have a set of monitor screens they can all see that
show the health of the systems and services. (4 points)
• Level 3: Staff all sit near one another in cubes where they can see common screens and have access
to a conference room they can use for outages. (3 points)
• Level 4: Staff all sit in a common “war room” arranged to face the area of screens showing the health of
the systems with a conference room phone for conducting outages. (2 points)
• Level 5: Staff all sit and use the common “command center” with screens and phone and this space is
used by other staff to come and work collaboratively during outages. (1 point)
Access
Does your command center team have access to the management interface of
any of your systems such that they can address simple issues on their own?
• Level 0: Our command center staff do not have access to any of the tools used to fix or even see what
is going on with the systems or services during an outage or service interruption. (8 points)
• Level 1: The staff have limited read-only access to a few systems based not on priority but on who
wanted to share with that team. (5 points)
• Level 2: The staff have limited read-only access and some write access to a few systems, but not all of
the key systems, just the ones whose teams they happen to work well with. (4 points)
• Level 3: The staff have read-only access to most of the key systems but only limited access to anything
else and not consistently. (3 points)
• Level 4: The staff have read-only access to all monitored systems and access to key systems to make
simple fixes via documentation, training, or specific direction. (2 points)
• Level 5: The staff have full access to all key and most other systems and are routinely asked to perform
simple fixes at the behest of the technical teams. (1 point)
Process
Do you have managed processes that you use to manage service interruption
and reduce outage duration?
• Level 0: When there are outages we all just get together and try and figure it out as quickly as we can
letting the situation determine how we respond. (0 points)
• Level 1: All processes are done via emails to the team on what to do, kept in a folder or printed for a
binder, and not reviewed or evaluated for completeness or effectiveness. (1 point)
• Level 2: Processes for outage and support activities are written up formally but none of the processes
for restoring service or prioritizing work are documented and no reviews are done. (2 points)
• Level 3: Templates are used for processes and they are done for support work and for the systems and
services but not required and some review is done but there is no formal process. (3 points)
• Level 4: Process lifecycle is formalized with templates, reviews, and required updates when documents
expire but still not mandatory for all services and systems. (4 points)
• Level 5: Process lifecycle is formalized and mandatory for all services and systems. (5 points)
Metrics
Do you have a standard set of metrics or KPIs that you use to track outage
management (duration, response time, frequency, etc.)?
• Level 0: We don’t have any metrics or KPIs and go by instinct when something is getting bad or needs
fixing and let staff tell us when the work volume is getting too much. (0 points)
• Level 1: When we need metrics we have someone who will go in and review emails or requests and
then we count them up and make charts for management. (1 point)
• Level 2: We have a set of metrics that we use but we don’t keep the data current and no one really trusts
the underlying data (or our system doesn’t do it well) so no one really uses or trusts them. (2 points)
• Level 3: We have current and up to date metrics but because the tools don’t work very well someone
has to go in by hand and count them up and create each chart and graph by hand. (3 points)
• Level 4: We have solid metrics with data visualization but the data isn’t shared and management does
not really look at them to make decisions because they aren’t counting the important things. (4 points)
• Level 5: We have solid metrics that are automatically generated and shared publicly and it drives
decision making and resource management on a daily basis. (5 points)
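The point scales printed on the ten element slides can be captured as a small lookup. What follows is a minimal Python sketch, not part of the original worksheet: the point values are copied from the slides above, while the names `STANDARD_POINTS`, `REVERSED_POINTS`, `points_for`, and `score_worksheet` are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Points per maturity level, copied from the element slides.
# Eight elements: a higher level earns FEWER points.
STANDARD_POINTS = {0: 8, 1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
# Process and Metrics: a higher level earns MORE points.
REVERSED_POINTS = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

ELEMENTS = (
    "Tools", "Breadth", "Coverage", "Skills", "Training",
    "Data", "Space", "Access", "Process", "Metrics",
)
REVERSED_ELEMENTS = {"Process", "Metrics"}


def points_for(element: str, level: int) -> int:
    """Return the worksheet points for one element at one maturity level."""
    if element not in ELEMENTS:
        raise ValueError(f"unknown element: {element!r}")
    table = REVERSED_POINTS if element in REVERSED_ELEMENTS else STANDARD_POINTS
    if level not in table:
        raise ValueError(f"level must be 0-5, got {level!r}")
    return table[level]


def score_worksheet(levels: Dict[str, int]) -> Dict[str, int]:
    """Convert a {element: level} self-assessment into {element: points}."""
    return {element: points_for(element, level) for element, level in levels.items()}
```

For example, an organization at Level 2 for Tools and Level 3 for Metrics would record 4 points and 3 points respectively.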
How are we going to make this fun?
How we play the game
• We are going to go through a series of “outage scenarios” and see who can
“resolve” the issue the fastest.
• We will combine your current maturity levels with a roll of dice to get through
the six stages of an outage.
Phases of Outage
1) Detect – the period of time from when an event happens to when tools or systems log or detect the
event within the system.
2) See – the period of time from when the event is logged or detected in a system to the time
when a key staff member knows about it.
3) Engage – the period of time from when the first key staff member sees the event to the
time when the “right” staff who can fix it are fully engaged.
4) Determine – the period of time from when the event is fully resourced until there is an
agreed-to cause of the issue (this one can iterate).
5) Plan – the period of time from when you know what the issue is to when staff know
exactly how to address and mitigate it.
6) Mitigate – the period of time from when you know what to do (or think you do) to when the
issue itself is no longer having an impact.
Note: At the end of Mitigate, if the issue isn’t resolved, then you go back to Step 4 [Determine].
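Because each phase is defined as the time between two milestones, the six phases can double as a metric model for outage duration. The sketch below is one possible way to do that in Python, under stated assumptions: you record a timestamp each time a phase ends, and a loop back to Determine simply adds more time to the Determine, Plan, and Mitigate totals. The function name `phase_durations` and the milestone format are hypothetical, not something the slides prescribe.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

PHASES = ("Detect", "See", "Engage", "Determine", "Plan", "Mitigate")


def phase_durations(
    event_time: datetime,
    milestones: List[Tuple[str, datetime]],
) -> Dict[str, timedelta]:
    """Total the time spent in each phase of one outage.

    event_time -- when the event actually happened (start of Detect).
    milestones -- ordered (phase_name, end_time) pairs; a phase may appear
                  more than once if the outage looped back to Determine.
    """
    totals = {name: timedelta(0) for name in PHASES}
    previous = event_time
    for name, end_time in milestones:
        if name not in totals:
            raise ValueError(f"unknown phase: {name!r}")
        if end_time < previous:
            raise ValueError("milestones must be in chronological order")
        # Each phase runs from the previous milestone to its own end time.
        totals[name] += end_time - previous
        previous = end_time
    return totals


def total_duration(totals: Dict[str, timedelta]) -> timedelta:
    """Overall outage duration: the sum of all phase totals."""
    return sum(totals.values(), timedelta(0))
```

Tracking the per-phase totals over many outages would show which phase contributes most to duration, which is the kind of counting the Metrics element asks for.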
Worksheet Part II
• Detect = (Tools + Breadth) * “chance” – (Metrics + Process)
• See = (Coverage + Skills + Training) * “chance” – (Metrics + Process)
• Engage = (Coverage + Data + Space) * “chance” – (Metrics + Process)
• Determine = (Access + Breadth) * “chance” – (Metrics + Process)
• Plan = (Access + Skills) * “chance” – (Metrics + Process)
• Mitigate = (Tools + Training) * “chance” – (Metrics + Process)
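The six worksheet formulas above can be sketched in Python as follows. The formulas themselves come from the slide; everything else is an assumption made for illustration: that “chance” is the total of the dice rolled for the round, that the dice are six-sided and re-rolled for each stage, and the names `STAGE_INPUTS`, `stage_score`, `roll_dice`, and `play_round`. The slides do not say what happens when a result is negative, so the sketch returns the raw value.

```python
import random
from typing import Dict, Optional

# Which element scores feed each stage, as listed on the worksheet.
STAGE_INPUTS = {
    "Detect": ("Tools", "Breadth"),
    "See": ("Coverage", "Skills", "Training"),
    "Engage": ("Coverage", "Data", "Space"),
    "Determine": ("Access", "Breadth"),
    "Plan": ("Access", "Skills"),
    "Mitigate": ("Tools", "Training"),
}


def stage_score(stage: str, points: Dict[str, int], chance: int) -> int:
    """(sum of the stage's element points) * chance - (Metrics + Process)."""
    base = sum(points[name] for name in STAGE_INPUTS[stage])
    return base * chance - (points["Metrics"] + points["Process"])


def roll_dice(count: int, rng: Optional[random.Random] = None) -> int:
    """ASSUMPTION: 'chance' is the total of `count` six-sided dice."""
    if rng is None:
        rng = random.Random()
    return sum(rng.randint(1, 6) for _ in range(count))


def play_round(
    points: Dict[str, int],
    dice_count: int,
    rng: Optional[random.Random] = None,
) -> Dict[str, int]:
    """Score all six stages, rolling fresh dice for each stage (an assumption)."""
    return {
        stage: stage_score(stage, points, roll_dice(dice_count, rng))
        for stage in STAGE_INPUTS
    }
```

With every element at Level 0 (8 points each, and 0 for Metrics and Process), a chance of 3 gives a Detect score of (8 + 8) * 3 - 0 = 48. With every element at Level 5, a chance of 1 gives (1 + 1) * 1 - 10 = -8, which shows how strongly the Metrics and Process scores pull the total down.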
Let’s Play!!
Scenario One: one die
After the first three days of the new term without any major incidents you are
starting to feel like everyone is really getting a handle on this outage
management thing. Then you get the call.
EMAIL IS DOWN!
Using your “Round One” section see how long before you can use email again.
Scenario Two: two dice
After a long hard year, you decide to take the week from Christmas to New
Year’s off and really unwind. It is all going great and you are just getting ready
for your first New Year’s out on the town in years when you get the email.
YOUR PRIMARY DATA CENTER IS DARK!
Using your “Round Two” section see how long before you can start celebrating
the New Year.
Scenario Three: three dice
It’s that season again… no, not the Academy Awards but Nobel season. You were
awoken this morning to find out that one of your faculty has just won a prize and
everyone is on their way to campus for the big press release and speech. You
are about to walk over to the pavilion when you get the call.
ALL CAMPUS WIRELESS IS OFFLINE!
Using your “Round Three” section see if you can get your wireless up before
the press arrives on campus.
Scenario Four: four dice
Finally the term is over and summer is here. You’ve got some summer classes
and a few projects to get through but the campus is mostly quiet. It is a nice
relaxing Friday afternoon and you are thinking about heading out a little early.
Then you get the call… from your security officer.
YOU’RE UNDER A DENIAL OF SERVICE ATTACK!
Using your “Round Four” section see how long before you get to go home.
Summary
There is no one thing you can do that will reduce or eliminate outages.
Every outage management strategy can be different and there isn’t any one
answer that will work for everyone.
Often the simplest steps can have a huge impact on reducing outage duration.
Write stuff down (process)
Count stuff (metrics)
Final Question
How many people have already identified one
simple change they could easily and cheaply
make based on what they saw here?
Thank You

Udaipur Call Girls ☎ 9602870969✅ Better Genuine Call Girl in Udaipur Escort S...Udaipur Call Girls ☎ 9602870969✅ Better Genuine Call Girl in Udaipur Escort S...
Udaipur Call Girls ☎ 9602870969✅ Better Genuine Call Girl in Udaipur Escort S...
 
Udaipur Call Girls ☎ 9602870969✅ Best Genuine Call Girl in Udaipur Escort Ser...
Udaipur Call Girls ☎ 9602870969✅ Best Genuine Call Girl in Udaipur Escort Ser...Udaipur Call Girls ☎ 9602870969✅ Best Genuine Call Girl in Udaipur Escort Ser...
Udaipur Call Girls ☎ 9602870969✅ Best Genuine Call Girl in Udaipur Escort Ser...
 
Jamnagar 💋 Call Girl 9748763073 Call Girls in Jamnagar Escort service book now
Jamnagar 💋 Call Girl 9748763073 Call Girls in Jamnagar Escort service book nowJamnagar 💋 Call Girl 9748763073 Call Girls in Jamnagar Escort service book now
Jamnagar 💋 Call Girl 9748763073 Call Girls in Jamnagar Escort service book now
 
Call Girls Nagpur 💋Just Call WhatsApp 7870993772 Top Class Call Girl Service ...
Call Girls Nagpur 💋Just Call WhatsApp 7870993772 Top Class Call Girl Service ...Call Girls Nagpur 💋Just Call WhatsApp 7870993772 Top Class Call Girl Service ...
Call Girls Nagpur 💋Just Call WhatsApp 7870993772 Top Class Call Girl Service ...
 
AGARTALA CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
AGARTALA CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEAGARTALA CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
AGARTALA CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
BHOPAL CALL GIRL 9262871154 HIGH PROFILE BHOPAL ESCORT SERVICE
BHOPAL CALL GIRL 9262871154 HIGH PROFILE BHOPAL ESCORT SERVICEBHOPAL CALL GIRL 9262871154 HIGH PROFILE BHOPAL ESCORT SERVICE
BHOPAL CALL GIRL 9262871154 HIGH PROFILE BHOPAL ESCORT SERVICE
 

Reducing service interruption duration

  • 1. Reducing Service Interruption Duration Erik Giles Command Center Manager / Outage Event Coordinator
  • 2. Are you in the right place? • Have you ever… – Spent hours sitting on a group conference call waiting for services to be restored? – Arrived at work to find your own services down? – Discovered that your services have been going up and down and no one told you? – Been woken up in the middle of the night to address a service interruption? – Had your own service go down because of an upstream infrastructure outage? – Found out your system or service was down from your own customers? • OR BY THE DEAN? – Been unhappy with how your institution manages outages and/or service interruptions? Why are you here?
  • 3. What I am going to do • Decompose the elements of operational recovery • Apply a maturity model to these elements • Evaluate your current environment • Have fun playing a game where we role-play outages against one another • Timing: 10 – 15 minutes, then the game for the whole rest of the time!
  • 4. What you will get from this… hopefully. • A deeper understanding of things that contribute to outage duration. • At least one or two really easy, cheap things you can do to reduce outage duration in less than 30 days. • The six stages of an outage paradigm – Also a convenient way to create a metric model to measure outage duration • Have some fun!
  • 5. Who am I • Currently the Command Center Manager and Outage Coordinator for the University of Chicago. • Have both an MBA and an MS in Systems Engineering (from USC) • Led teams in Boeing’s research division on technology that went into military and government NOCs. • Ran the Command Center for the Orbitz Travel company that managed over 1,200,000 real-time metrics on ~3300 hosts with a 24/7 staff of 25 • With 10 years’ experience before all of this as a system administrator and architect, and implementer of ERP solutions. • Find me here: http://www.linkedin.com/in/erikgiles/
  • 7. Take out your Worksheet: Part I • You are going to score your own organization (as well as you understand it) as we go and fill in your answers on the worksheet. • We have 10 elements to get through, so we only have about a minute per element. • I will read the descriptor and “Level 0” (because it is the most fun) but I’m not going to read the whole thing to you. – Please ask questions if it is unclear • Fill in your points number as you go. – Notice that Metrics and Process are reversed from the others
  • 8. Tools Do you have tools dedicated to detecting service interruptions or outages used by your staff to inform you of issues BEFORE customers call to tell you? • Level 0: There probably are but it isn’t centrally managed or even really understood so when a service goes down we just call that team and ask them to look into it. (8 points) • Level 1: Our teams use some of the vendor tools but no one else has access to them. We don’t have any contact management tools or common/documented gathering solutions like chat. (5 points) • Level 2: We have vendor monitoring tools for most of our systems (but not services) and we use things like Outlook to manage our contacts along with a group chat room for outages. (4 points) • Level 3: We have a central monitoring solution used by some of the teams but manage our contacts, documentation, and collaboration using just standard tools. (3 points) • Level 4: Our central monitoring solution sees most infrastructure and we have dedicated tools for our contact and document management but our services themselves are not monitored. (2 points) • Level 5: Our central monitoring solution sees all of our infrastructure and tracks service uptime along with a dedicated set of tools for contact and document management. (1 point)
  • 9. Breadth Do you centrally monitor all, some or very little of the various infrastructure and services for which your organization is accountable? • Level 0: The team that does most of our outage management all comes from the same small group and they have no real understanding of the other areas in the organization’s domain. (8 points) • Level 1: One of the major groups (servers OR network) handles most of the monitoring and outage management and they have a limited understanding of the other areas as well. (5 points) • Level 2: Several of the key teams all manage the primary infrastructure but other areas such as phones or data center equipment are still siloed and not well understood. (4 points) • Level 3: Most of the key infrastructure is monitored centrally, including dissimilar types of technology, but services themselves are only monitored based on their underlying hardware and OSs. (3 points) • Level 4: The infrastructure is all visible to the outage management groups and some of the big common services (email, wireless, financial systems) are monitored centrally. (2 points) • Level 5: All infrastructure and services are visible to the central command center / outage management team and this team can see and understand their current state. (1 point)
  • 10. Coverage Do you have a dedicated monitoring/event response team and what is their coverage window? • Level 0: There is no team responsible for addressing outages. When something goes down or wrong anyone who can help just jumps on a call or meets in a room. (8 points) • Level 1: There is a team who coordinates outages and watches monitoring tools but they have other duties and are only available during their personal work schedule. (5 points) • Level 2: There is a dedicated team of people for monitoring and outage management but they work a business hours schedule. (4 points) • Level 3: There is a dedicated team of people who do monitoring and outage management and they work extended hours with some staff redundancy. (3 points) • Level 4: There is a dedicated team of people who do monitoring and outage management 24/7/365 with limited or no redundancy. (2 points) • Level 5: There is a dedicated team of people who do monitoring and outage management 24/7/365 with at least two staff for most of the schedule. (1 point)
  • 11. Skills Does your command center team have technical skills such that certain staff in some technical areas can be a resource to the service teams (Unix, network, etc.)? • Level 0: The staff that support outages just take messages and call people and do not have any skills to investigate or fix or even do maintenance on any of the production systems or services. (8 points) • Level 1: The command center or outage management team comes from one of the teams and they can address issues in that area only but everything else they have to escalate. (5 points) • Level 2: The team has skills in one area and can take limited action from documentation and access in areas to do read-only investigation to see if a service or system is up, down, or impacted. (4 points) • Level 3: The team has skills in one area and can do fairly comprehensive activities to do read-only investigation to see if things are up, down, or impacted. (3 points) • Level 4: The team has multiple skills represented and can do lots of read-only activity, including limited direct action to fix or address certain kinds of outages independently. (2 points) • Level 5: The team has most of the major skill areas covered and is capable of any kind of read-only investigation along with the ability to do level one maintenance and support tasks. (1 point)
  • 12. Training Do you have processes in place to train outage management staff on newly introduced systems or services and what to do during an outage? • Level 0: We rely on word of mouth for folks to know about new services and systems, and consider it their responsibility to reach out to those folks to learn how to support them during an outage. (8 points) • Level 1: When a new service or system is released or upgraded, someone calls or drops by and explains it to a member of the outage staff; sometimes an email goes out. (5 points) • Level 2: For a new service (or major change to a service) there is a requirement that it be documented and that documentation shared with the support group before it goes live. (4 points) • Level 3: For a new service there is a requirement that formal training be conducted with the support team including documentation. (3 points) • Level 4: The command center / outage management team has an internally tracked training plan to ensure that all staff are trained on all systems they support, including testing. (2 points) • Level 5: The command center team has a formal training plan that includes all systems and testing and also has a refresh cycle and on-boarding process for all new staff. (1 point)
  • 13. Data Do you keep an accurate list of who owns and manages what systems, what vendors are used, and standard architectural documentation? • Level 0: We rely entirely on tribal knowledge and people’s personal notes when it comes to who manages what, what vendors we use, and how our systems and architecture are laid out. (8 points) • Level 1: We use Outlook to find people during an outage and use individual collaboration spaces to figure out how their service or system works, if they publish or share it. (5 points) • Level 2: We keep a list of who to call when something is broken and it includes most of their contact information including home/cell numbers, along with space for some system documentation. (4 points) • Level 3: We maintain a list of technical contacts’ and service owners’ contact info along with an escalation tree if they don’t answer, along with our central space for system documentation. (3 points) • Level 4: We maintain a list of technical contacts’, service owners’, and vendors’ contact data along with an escalation path, and keep our system documentation in a document management system. (2 points) • Level 5: We keep all our contact data in a CRM tool with all relevant types of contacts and contact channels along with documentation in a document management system. (1 point)
  • 14. Space Do you locate the staff responsible for outages together and near the people whose systems they watch as much as possible? • Level 0: Our staff that support outages sit wherever they were, including remote and work from home staff, and may or may not know where the rest of the team sits. (8 points) • Level 1: Staff all sit on site in cubes and/or offices near one another but have no common screens or central phone area from which to manage outages. (5 points) • Level 2: Staff sit near one another in cubes but have a set of monitor screens they can all see that show the health of the systems and services. (4 points) • Level 3: Staff all sit near one another in cubes where they can see common screens and have access to a conference room they can use for outages. (3 points) • Level 4: Staff all sit in a common “war room” arranged to face the area of screens showing the health of the systems with a conference room phone for conducting outages. (2 points) • Level 5: Staff all sit and use the common “command center” with screens and phone and this space is used by other staff to come and work collaboratively during outages. (1 point)
  • 15. Access Does your command center team have access to the management interface of any of your systems such that they can address simple issues on their own? • Level 0: Our command center staff do not have access to any of the tools used to fix or even see what is going on with the systems or services during an outage or service interruption. (8 points) • Level 1: The staff have limited read-only access to a few systems based not on priority but on who wanted to share with that team. (5 points) • Level 2: The staff have limited read-only access and some write access to a few systems but not all of the key systems, just ones with whom they happen to work well. (4 points) • Level 3: The staff have read-only access to most of the key systems but only limited access to anything else and not consistently. (3 points) • Level 4: The staff have read-only access to all monitored systems and access to key systems to make simple fixes via documentation, training, or specific direction. (2 points) • Level 5: The staff have full access to all key and most other systems and are routinely asked to perform simple fixes at the behest of the technical teams. (1 point)
  • 16. Process Do you have managed processes that you use to manage service interruption and reduce outage duration? • Level 0: When there are outages we all just get together and try and figure it out as quickly as we can, letting the situation determine how we respond. (0 points) • Level 1: All processes are done via emails to the team on what to do, kept in a folder or printed for a binder, and not reviewed or evaluated for completeness or effectiveness. (1 point) • Level 2: Processes for outage and support activities are written up formally but none of the processes for restoring service or prioritizing work are documented and no reviews are done. (2 points) • Level 3: Templates are used for processes and they are done for support work and for the systems and services but not required and some review is done but there is no formal process. (3 points) • Level 4: Process lifecycle is formalized with templates, reviews, and required updates when documents expire but still not mandatory for all services and systems. (4 points) • Level 5: Process lifecycle is formalized and mandatory for all services and systems. (5 points)
  • 17. Metrics Do you have a standard set of metrics or KPIs that you use to track outage management (duration, response time, frequency, etc.)? • Level 0: We don’t have any metrics or KPIs and go by instinct when something is getting bad or needs fixing and let staff tell us when the work volume is getting too much. (0 points) • Level 1: When we need metrics we have someone who will go in and review emails or requests and then we count them up and make charts for management. (1 point) • Level 2: We have a set of metrics that we use but we don’t keep the data current and no one really trusts the underlying data (or our system doesn’t do it well) so no one really uses or trusts them. (2 points) • Level 3: We have current and up to date metrics but because the tools don’t work very well someone has to go in by hand and count them up and create each chart and graph by hand. (3 points) • Level 4: We have solid metrics with data visualization but the data isn’t shared and management does not really look at them to make decisions because they aren’t counting the important things. (4 points) • Level 5: We have solid metrics that are automatically generated and shared publicly and it drives decision making and resource management on a daily basis. (5 points)
  • 18. How are we going to make this fun?
  • 19. How we play the game • We are going to go through a series of “outage scenarios” and see who can “resolve” the issue the fastest. • We will combine your current maturity levels with a roll of dice to get through the six stages of an outage.
  • 20. Phases of Outage 1) Detect – the period of time from when an event happens to when tools or systems log or detect the event within the system. 2) See – the period of time from when the event is logged or detected in a system to the time when a key staff member knows about it. 3) Engage – the period of time from when the first key staff member sees the event to the time when the “right” staff who can fix it are fully engaged. 4) Determine – the period of time from when the event is fully resourced until there is an agreed-upon cause of the issue (this one can iterate). 5) Plan – the period of time from when you know what the issue is to when staff know exactly how to address and mitigate it. 6) Mitigate – the period of time from when you know what to do (or think you do) to when the issue itself is no longer impacting. Note: at the end of Mitigate, if the issue isn’t resolved, then you go back to Step 4 [Determine].
  • 21. Worksheet Part II • Detect = (Tools + Breadth) * “chance” – (metrics + process) • See = (Coverage + Skills + Training) * “chance” – (metrics + process) • Engage = (Coverage + Data + Space) * “chance” – (metrics + process) • Determine = (Access + Breadth) * “chance” – (metrics + process) • Plan = (Access + Skills) * “chance” – (metrics + process) • Mitigate = (Tools + Training) * “chance” – (metrics + process)
  • 23. Scenario One: one die After the first three days of the new term without any major incidents you are starting to feel like everyone is really getting a handle on this outage management thing. Then you get the call. EMAIL IS DOWN! Using your “Round One” section see how long before you can use email again.
  • 24. Scenario Two: two dice After a long hard year, you decide to take the week from Christmas to New Year’s off and really unwind. It is all going great and you are just getting ready for your first New Year’s out on the town in years and you get the email. YOUR PRIMARY DATA CENTER IS DARK! Using your “Round Two” section see how long before you can start celebrating the New Year.
  • 25. Scenario Three: three dice It’s that season again… no, not Academy Awards but Nobel season. You were awoken this morning to find out that one of your faculty has just won a prize and everyone is on their way to campus for the big press release and speech. You are about to walk over to the pavilion and you get the call. ALL CAMPUS WIRELESS IS OFFLINE! Using your “Round Three” section see if you can get your wireless up before the press arrives on campus.
  • 26. Scenario Four: four dice Finally the term is over and summer is here. You’ve got some summer classes and a few projects to get through but the campus is mostly quiet. It is a nice relaxing Friday afternoon and you are thinking about heading out a little early. Then you get the call… from your security officer. YOU’RE UNDER A DENIAL OF SERVICE ATTACK! Using your “Round Four” section see how long before you get to go home.
  • 27. Summary There is no one thing you can do that will reduce or eliminate outages. Every outage management strategy can be different and there isn’t any one answer that will work for everyone. Often the simplest steps can have a huge impact on reducing outage duration. • Write stuff down (process) • Count stuff (metrics)
  • 28. Final Question How many people have already identified one simple change they could easily and cheaply make based on what they saw here?
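
The Worksheet Part II formulas are simple enough to sketch in code, which may help anyone who wants to replay the game outside the session. The sketch below is only an illustration: the function names, the lowercase dictionary keys, and the choice to take “chance” as the sum of the scenario's dice (rolled once per phase) are assumptions, since the slides do not spell those details out, and no time unit is implied for the resulting scores.

```python
import random

# Elements summed for each phase, as listed on the Worksheet Part II slide.
PHASE_ELEMENTS = {
    "Detect": ("tools", "breadth"),
    "See": ("coverage", "skills", "training"),
    "Engage": ("coverage", "data", "space"),
    "Determine": ("access", "breadth"),
    "Plan": ("access", "skills"),
    "Mitigate": ("tools", "training"),
}


def phase_score(phase, points, chance):
    """(sum of the phase's element points) * chance - (metrics + process)."""
    base = sum(points[name] for name in PHASE_ELEMENTS[phase])
    return base * chance - (points["metrics"] + points["process"])


def roll(num_dice, rng):
    """Assumption: 'chance' is the total of the scenario's six-sided dice."""
    return sum(rng.randint(1, 6) for _ in range(num_dice))


def play_round(points, num_dice, rng=None):
    """Score all six phases, rolling separately for each (an assumption)."""
    rng = rng or random.Random()
    return {
        phase: phase_score(phase, points, roll(num_dice, rng))
        for phase in PHASE_ELEMENTS
    }


if __name__ == "__main__":
    # Hypothetical organization at Level 0 everywhere: 8 points per element,
    # 0 points for Metrics and Process (which are scored in reverse).
    level0 = {name: 8 for name in
              ("tools", "breadth", "coverage", "skills",
               "training", "data", "space", "access")}
    level0.update(metrics=0, process=0)
    for phase, score in play_round(level0, num_dice=1).items():
        print(f"{phase}: {score}")
```

Because Metrics and Process are subtracted, a higher maturity level there lowers every phase score, which matches the note on the worksheet that those two elements are scored in reverse.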