Service Watch Proposal

This document describes the personnel/business side of web server uptime monitoring.

Published in Business, Technology, Design

Transcript

  • 1. Service Watch proposal

Author: G.J. Petersen (gerard@gp-net.nl)
Date: 27-03-2008 (rewritten for publication on 10-04-09)
Version: Draft

Table of Contents

Introduction
Processes
Continuity of Service
Monitoring of services
Service watch
Follow-up

Introduction

This document is a proposal for a Company Service Watch division (SW). The primary task of the SW is to guard the availability of customers' services. This proposal describes the processes needed to guarantee the availability described in an SLA, as well as the internal and supporting processes. It defines several service types with corresponding response times, how people are notified in case of failure, and what to do when issues arise. It also describes how 24/7 monitoring can be implemented and what means are needed to do so. This includes event logging, a formal weekly transfer of responsibility, and recommended actions after service returns to normal.

Note: This documentation does not cover the implementation of tools; it is business and process oriented.

Processes

There are three somewhat distinct process groups: external, internal and supporting processes. In other words: customer related, employee related and IT related.
Below are three examples, in order of importance:

A. Customer services (e.g. live websites with customer content)
B. Company internal services (e.g. company filing, printing, email and other internally used services)
C. IT processes (e.g. creating backups, resource monitoring and other IT-related processes)

An issue is given a different priority depending on when it occurs and on which type of service it affects. During office hours all personnel would normally be available, but on a Sunday morning or during a weeknight the available means and employees are reduced to almost nil. Therefore prioritizing, and thus
  • 2. categorizing, issues makes clear what needs to be done right away and what can be postponed to office hours.

Two important response times should be taken into account in the problem life cycle:

Action response time - The time between receiving an issue notification and starting to resolve the problem.
Notification response time - The time between solving the problem and notifying the people affected by it.

The following two tables describe the three aforementioned service types and their response times.

Outside office hours:

Type   Action response                                                        Notification response
A      Immediately                                                            First thing, next business day
B*)    First thing, next business day                                         n/a
C*)    First thing, next business day, after having solved issues of type B   n/a

*) For B- and C-type services, notifications are currently not sent to cell phones.

Inside office hours:

Type   Action response                                         Notification response
A      Immediately                                             Immediately after solution
B      Immediately after having solved A-type service issues   Immediately
C      Immediately after having solved B-type service issues   n/a (will be seen in the logbook)

Continuity of Service

A proper SLA for customers these days guarantees an uptime of around or above 99%, which allows at most about 3.65 days of downtime per year (1% of 365 days, including planned maintenance). Mind you, this applies only to the A-type services. Moving towards as much infrastructure redundancy as possible, at least for the A-type services, is advisable to avoid downtime in case of failure. Nevertheless, round-the-clock monitoring is still necessary to make sure the necessary services are 'up'.

Monitoring of services

Software can monitor everything from the availability of customer services to the internal
  • 3. status of servers and the connectivity of those servers. This is the first line of defense against downtime. The monitoring software itself should also have a redundant setup, so that a failure cannot leave it unable to send notifications.

But what do we do if it detects problems? If one of the A-type services is unavailable, it should send an SMS text message with a short problem description to a cell phone. Then the second line of defense, humans, comes into action to see what the problem is and start working on a solution.

Having a phone next to your bed 24/7, 365 days a year, would not be realistic. So, taking health reasons into account, at least 6 colleagues should be appointed to take turns in weekly shifts. This results in being on standby for one week in every six, or roughly once every 1.5 months.

Service watch

Taking up a Service Watch shift should provide you with the necessary means to take on whatever problems might arise. The equipment for this should contain at least the following:

– A cell phone for receiving alerts
– A laptop (or home workstation) able to work remotely on the servers (and services)
– The proper account privileges to access system and/or process resources, as well as physical access to a data centre, for instance
– A logbook containing:
    – Documentation on the composition of the network infrastructure and servers
    – Procedures for issues that need manual intervention (see note 1)
    – Action sheets for reporting the problems that occurred, logging the steps taken to resolve them and the time it took to get back to normal operation
    – Hard copies of the procedures that evolved from previous issues

Assuming the weekly shift changes on Monday morning, a proper transfer of responsibility should take place, consisting of the following steps:

– The equipment described should be handed over to the person who takes on the next shift.
– A report should be sent to the helpdesk with the issues and actions taken, so possible follow-ups can be carried out (e.g.
officially notifying customers, or recovering B- and/or C-type services if necessary).
– A report should be sent to the helpdesk with an elaborate command history, so issues can be categorized and turned into procedures for future use.

To keep track of who takes the shift in which week, a central online calendar or log should be implemented, so people know when they run the service watch. This also creates flexibility towards personnel's holiday wishes.

1 - It is impossible to document every possible problem up front, so 'in case of' procedures should be written as time passes and stored on a wiki, for instance, as well as in the logbook.
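
The weekly rotation and the shift calendar could be sketched roughly as follows. This is only an illustration, assuming six colleagues taking weekly turns with the shift changing on Monday; the colleague names, the start date and all function names are hypothetical and do not come from the proposal.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Hypothetical colleague names; the proposal only asks for at least six people.
COLLEAGUES = ["anna", "bram", "chris", "daan", "eva", "finn"]

# A rota is a list of (Monday on which the shift starts, person on watch).
Rota = List[Tuple[date, str]]


def monday_of(day: date) -> date:
    """Return the Monday of the week that contains `day`."""
    return day - timedelta(days=day.weekday())


def build_rota(start: date, weeks: int, people: List[str]) -> Rota:
    """Assign one person per week, round-robin, with shifts starting on Monday."""
    first = monday_of(start)
    return [(first + timedelta(weeks=i), people[i % len(people)])
            for i in range(weeks)]


def swap_weeks(rota: Rota, i: int, j: int) -> None:
    """Let two colleagues trade shifts, e.g. to accommodate a holiday."""
    (week_i, person_i), (week_j, person_j) = rota[i], rota[j]
    rota[i], rota[j] = (week_i, person_j), (week_j, person_i)


def on_duty(day: date, rota: Rota) -> Optional[str]:
    """Return who runs the service watch on `day`, or None if that week is unplanned."""
    week = monday_of(day)
    for shift_start, person in rota:
        if shift_start == week:
            return person
    return None


# Twelve weeks of shifts, starting in the week of Monday 13 April 2009 (arbitrary date).
rota = build_rota(date(2009, 4, 13), weeks=12, people=COLLEAGUES)
```

With six people, the same colleague comes up again after six weeks, which matches the "once every 1.5 months" estimate. A shared online calendar would serve the same purpose; the list of tuples here merely stands in for it.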
  • 4. Follow-up

Once the proposal has been approved by company management, the list below can serve as a guideline for actually implementing the Service Watch:

– A knowledge matrix for people and the tools used
– Do the personnel involved have a laptop or home workstation and remote access?
– Admin account access on all servers for the people involved (or one shared admin account)
– Set up action sheet templates and a logbook (what was done and when)
– Procedure inventory and storage (a place to store documentation)
– Personnel / payment proposal (involves the HR department)

Questions are welcome at info@gp-net.nl
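
As a starting point for the action sheet template, a single logbook entry might be modelled as below. This is a minimal sketch under assumptions: the class, field and method names are hypothetical, and only the action-response texts are taken from the proposal's response-time tables.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Action response per (service type, inside office hours?),
# taken from the proposal's two response-time tables.
ACTION_RESPONSE = {
    ("A", True): "Immediately",
    ("B", True): "Immediately after having solved A-type service issues",
    ("C", True): "Immediately after having solved B-type service issues",
    ("A", False): "Immediately",
    ("B", False): "First thing, next business day",
    ("C", False): "First thing, next business day, "
                  "after having solved issues of type B",
}


@dataclass
class ActionSheet:
    """One logbook entry: what went wrong, what was done, and when."""
    service: str                      # hypothetical free-text service name
    service_type: str                 # "A", "B" or "C"
    reported_at: datetime             # when the issue notification arrived
    problem: str                      # short problem description
    steps: List[Tuple[datetime, str]] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def log_step(self, when: datetime, text: str) -> None:
        """Record one step taken to resolve the problem."""
        self.steps.append((when, text))

    def resolve(self, when: datetime) -> None:
        """Mark the moment normal operation was restored."""
        self.resolved_at = when

    def time_to_recover(self) -> Optional[timedelta]:
        """Time from notification to normal operation, or None while unresolved."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.reported_at

    def action_response(self, office_hours: bool) -> str:
        """Look up the action response that applies to this service type."""
        return ACTION_RESPONSE[(self.service_type, office_hours)]


# Example entry for an A-type outage on a Sunday night (made-up data).
sheet = ActionSheet("customer website", "A",
                    datetime(2009, 4, 12, 3, 10), "HTTP check failed")
sheet.log_step(datetime(2009, 4, 12, 3, 25), "Restarted the web server")
sheet.resolve(datetime(2009, 4, 12, 3, 55))
```

Keeping the timestamps on the sheet makes the weekly report to the helpdesk straightforward: the steps list is the command history, and `time_to_recover()` gives the time it took to get back to normal operation.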