This document discusses the gap between reported service levels and actual user experience of those services. It examines how this gap can develop due to incorrect measurement and reporting of service levels. The key points are:
1) Service level reports often focus on simple metrics like incident volumes that don't capture the true impact on users.
2) Goals and measurements may not be properly defined, leading IT and business units to think they are measuring the same thing when they are not.
3) Common metrics like availability can be misleading if they don't account for usability factors like performance that impact the user experience.
4) Dashboards and reports must be scrutinized to ensure they provide the necessary context and
UNDERNEATH THE SPIN A PRACTICAL LOOK AT SERVICE LEVELS
(IS WHAT YOU SEE WHAT YOU GET?)
Malcolm Gunn
Service Availability Management Consultant
Barclays Bank Plc
TEL +44(0)7966224346
E-Mail:malcolm.gunn@barclays.com
Abstract
This paper looks at the gap between the service level figures provided by the technical
teams and the users’ actual experience. By using real-life examples and reporting
models, it examines how the gap between reporting and user experience has evolved, as
well as highlighting potential errors that can be avoided. The paper finishes by looking at
what steps can be taken to ensure that service level reports move closer to
matching the user experience, and at the next stages in reporting development.
Introduction

Setting service levels appears easy, but getting those levels to reflect the flexible requirements of the client is something that seems almost alien to Service Management.

Service levels are in common use across organizations as IT areas endeavor to show how effective they are at delivering their clients’ services. They’re not new and they’re not complex, but in many cases neither are they an accurate reflection of the user experience.

Even the basics can be a challenge; figures are normally based around simple, easy-to-report measures. Often we use the wrong figures for the wrong measure, for example incident volumes to show availability.

Remember, just because something is available doesn’t mean it’s usable. What’s needed is the ability to make the levels flexible enough to meet highs and lows in demand. Levels need to be set in such a way that they allow the client to do their job as effectively as possible.

This paper will look at how easily the gap between the theoretical figures and real-life performance can start to widen until the two bear no relationship to each other.

In extreme cases the service level reporting will show service performance working within agreed parameters when in reality the application is unusable.

Service Levels

The OGS handbook states,

“The improvements in service quality and the reduction in service disruption that can be achieved through effective Service Level Management (SLM) can ultimately lead to significant financial savings. Less time and effort is spent by IT staff in resolving fewer failures and IT Customers are able to perform their business functions without adverse impact.” [1]

With such a clear definition showing what effective service level management can deliver, it would be realistic to expect the delivery of service level reporting to be clear cut, simply defined and a high priority. In practice the choice of service levels, their monitoring and their reporting are often far harder to establish and deliver in a meaningful manner.

The problem often lies in how and why the service levels are agreed and measured. As with most theoretical processes, the practical implementation is not always as simple as it sounds.

If the implementation is not correctly thought out, or the delivery of the service level reporting isn’t clearly defined, a gap will quickly develop between the Service Level Agreement (SLA) and the user experience.

The gap between users and the IT area could equally be said to be the gap between the business owner’s perspective and the users within that business area.

This is true because the business areas as a whole can often be unaware of the true performance of the IT they are being supplied with. This is because they are relying on the data supplied by the IT areas to make
their judgment.

Whilst these areas are responsible for the business users of the applications, they are not, and nor should they be expected to be, aware of the actual day-to-day mechanics of how the applications work. They are, after all, buying a service from the IT area and they expect it to work correctly.

It is because reporting is often reviewed at such a high level that the mismatches fail to be noticed, so the IT areas and business groups both believe they have put something in place that will allow them to monitor and control the performance of the service.

They review the figures on a regular basis and these will show services are working satisfactorily (within documented guidelines), even if in practice the users are struggling to deliver an effective service to the organization’s customers.

So here’s a quick look at how this happens, by looking at a sample conversation that takes place and the actions that are agreed as a result to improve performance.

Business Director’s request: “I want improved stability”

IT Director’s solution: “I’ll reduce the number of high severity incidents” [2]

The resulting user impact: “No real change”

How can this happen?

In practice the conversation will go on for much longer and contain a lot of fine, complex words, but this is essentially what they say. Perhaps if that’s all they said they might realize why things don’t turn out as expected.

As a result of this meeting much work will be done to reduce the volume of high severity incidents, but nothing will have changed for the user. This is because, although at first reading it looks as though the two areas were talking about the same thing, they are in fact not getting close to talking about the same thing.

One is talking about business impact, the other about a particular type of incident and its volume.

The causes of this are many and varied, but the impact can be clear and damaging for the business. Imagine a user facing a business customer who requires a PC to do their job, against which they are measured for sales and contacts.

Then if that PC is unavailable they are unable to operate, which results in missed targets, and ultimately they receive a lower pay package and bonus.

They may be fully capable of performing at a higher level but IT infrastructure has held them back. If the infrastructure and delivery are wrong this will have a significant impact on the business’s ability to retain top quality staff and eventually to recruit the best, as word will travel around the user community.

So poor IT will impact the bottom line. This means that cost savings in the IT area may have an impact on the company’s bottom line, and it may not be the one the company expected.

The danger in using incident volumes in this scenario is that companies will often address the severity 1 symptoms, removing the high severity incidents from the reports, but they don’t fix the underlying root cause. This type of target-driven management will work to mask issues that are then waiting in the wings to return and cause more damage later.

Looking at service management information requires the recipient to understand and question, to ensure that what they are seeing is what they think they are seeing. Sometimes just checking the data behind the figures can reveal what’s really going on.

Does What We Document Show What We Mean

Does the information we see make sense, or is more information required to make the judgment call?

3+2=11

At first glance the sum appears to be wrong. However, if one extra piece of information is supplied and the recipient has the right background knowledge, the sum makes perfect sense.

The missing piece of information is that the sum is working in base 4 (where five is written as 11), and suddenly it all makes sense.

The same principle applies to service level reports. Whilst they may appear to be showing a particular version of the truth, unless the recipient has the required background knowledge and is provided with enough relevant information they may interpret the reports in a completely different way from that which was intended. This doesn’t mean to say that the report will be giving an untrue picture, but it may be a narrow view of the truth.

Whenever you send or, perhaps more importantly, receive reports, challenge whether they mean what they appear to mean. The copies of the reports on the
next pages illustrate this: at first glance they all appear to be showing stable and available services. Closer examination will show that perhaps all is not as it seems.

Do We Really Have a Stable Service

There are times when reports are produced for a specific purpose. In this case the purpose was to show the improvement in the stability of the services delivered by the IT area of an organization.

[Figure: monthly calendar, Mon to Sun, days 1 to 31, each day marked with a green tick or a red dot]

The only background information that was supplied was a description of the symbols:
• Green tick: no new severity 1 incident
• Red dot: new severity 1 incident [3]

The chart is supplied with a table of incident volumes:

                        Sept   Oct   Nov   Dec   Jan   Feb
Severity 1 Green Days      6     5     2     1     7     4
Severity 1                66   108   131   104    82    72
Severity 2              2086  2075  2120  1966  2004  1924

                         Mar   Apr   May   Jun   Jul   Aug
Severity 1 Green Days      3     7    15    17    20    21
Severity 1                88    58    31    17    15    11
Severity 2              1891  1491  1489  1447  1765  1720

At first it appears to show an improvement in the stability of the service: incident volumes have decreased and the number of green days (days with no new severity 1 incident starting) has increased from 6 to 21, almost a fourfold improvement.

Whilst the area producing the report may know the message it wants to deliver, will the recipient understand what they are being told? Or, in some cases, does the area producing the report understand what it has delivered?

So what is this report actually telling us? What the recipient of the report needs to know is what success we are showing here. It looks like we have run without a major incident for 21 days out of 30.

In this case the red dots only indicate when a severity 1 incident starts. An incident could run for 2, 3 or 4 days but only shows up on its first day. So a green day could actually have one or more severity 1 incidents running, which may not be what the recipient wants to see.

As with all incidents, it’s not just how long they run but the damage they cause that the client is interested in. The danger with the IT community is that we focus on volumes rather than impact. Plotting incident volumes does not tell you the impact on the client; it just tells you the number of incidents that have been logged with the help desk. That may be a shock to some organizations that spend a lot of time trending incident volumes.

Reporting and trending incident volumes isn’t wrong, and it is often a good indicator of where potential issues may be impacting the client, but in order to be effective even at this basic level some further categorization will be required.

Unless the data can be split into useful categories, such as specific services or types of infrastructure, the figures will not show anything other than a total number of calls logged.

It’s not clear from this report which month actually did the most damage to the client, as there is no indication of the duration or, more importantly, the impact to the client.

This report gives the impression of a stable environment because it focuses on the high profile incidents, but how many lower severity incidents are eating away at the available up time? This looks good but shows you almost no actual usable facts.

From a client viewpoint they may be quite happy to have high severity incidents on some services, but for specific services that are critical to them they may not tolerate any outages during their online day.

Whenever reports are produced, the area responsible must review them and understand what the report is actually telling the audience. They may know what they mean to say, but will it be clear to the recipients?

Any data should reflect the true position. It’s very easy to produce nice pictures and graphs, but if that’s all they are then they are useless and a waste of time and resources.
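The difference between a day with no new severity 1 incident and a day with no severity 1 incident running at all can be made concrete with a small calculation. The following is a minimal sketch only: the incident start dates and durations are invented for illustration and are not taken from the report discussed above, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical severity 1 incidents for a 30-day month:
# (start date, duration in days). Illustrative data, not from the report.
incidents = [
    (date(2007, 6, 1), 4),
    (date(2007, 6, 8), 3),
    (date(2007, 6, 15), 2),
]


def month_days(year, month, length):
    """Return every calendar day of the reporting month."""
    return [date(year, month, day) for day in range(1, length + 1)]


def green_days_by_start(days, incidents):
    """Days on which no NEW severity 1 incident started (the report's rule)."""
    starts = {start for start, _ in incidents}
    return [d for d in days if d not in starts]


def green_days_by_activity(days, incidents):
    """Days on which no severity 1 incident was running at all."""
    active = {
        start + timedelta(days=offset)
        for start, duration in incidents
        for offset in range(duration)
    }
    return [d for d in days if d not in active]


days = month_days(2007, 6, 30)
reported = len(green_days_by_start(days, incidents))
actual = len(green_days_by_activity(days, incidents))

print(f"Green days as reported (no new incident): {reported}")
print(f"Green days with no incident running:      {actual}")
```

With these assumed incidents the report would show 27 green days, while only 21 days were actually free of a running severity 1 incident. Neither figure says anything about the impact on the client, which would need duration and user-impact data recorded per incident.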
There are a number of lessons to be learnt from this type of reporting.

1. Incident volume alone, even if it lists all incidents, is not a measure of service availability or stability; it’s a measure of incident volumes.
2. Reporting that focuses on a limited selection of severities gives no meaningful data on the client’s services.
3. In this instance there is no indication that we have more than 2 severities, and if we do, the report fails to show how many low severity, high volume incidents are really going on behind the scenes.
4. Without some form of user impact measure, such as lost business hours, the impact to the end user is not clear.

What Does Our Dashboard Really Say?

Dashboards are used across a wide cross section of organizations. Some of these are real time and some are historic. Even though they are in common use, the message they are trying to portray needs to be clearly defined.

When a person buys a car, would they settle for a dashboard that told them the car was moving but gave no indication of the speed, revs, temperature or oil pressure? Would they accept a single warning light saying “fault”? It’s unlikely, yet that is exactly what companies often accept within the service management world.

[Figure: “TOP 10 SERVICE” dashboard, one color-coded tile for each of Service 1 to Service 10, each showing the last incident date and summary]

This is a dashboard that I’m sure everyone in the particular organization that uses it is very proud of. It is a copy of a service dashboard that is in place and projected across multiple sites on plasma screens, and when everything is green everyone looks very happy.

But what is it really saying? To understand that we need to look behind the scenes at what goes together to make up the color coding. Once we can see that, we can then understand if everyone who sees this should be smiling or crying.

In this particular case the dashboard is manually updated, so whilst it appears to be in real time it’s running behind and the delay will be variable. So immediately it’s no use as a service dashboard, and no one should have any confidence that it’s showing a true picture of the client experience.

To use the car speedometer analogy, this is showing a speed that you were traveling at, at some point in the past, although we can’t say how long ago you were traveling at that speed.

Once again this dashboard is driven by the fixation with severity 1 incidents, so it’s not even taking into account the full user experience.

Armed with these basic background facts it becomes clear that the people going around with a smile on their face because the dashboard is green are working under an illusion, and at any minute a reality that they have no idea about could come crashing in.

Is the Availability Figure All It Seems

It’s clear that when working with incident volumes a significant amount of ambiguity can be introduced. When organizations start looking at availability figures the scope for manipulation increases. This brings with it increased danger of misinterpretation and misunderstanding.

The first point to remember with availability figures is that available is just what it says. What organizations need to remember is that just because something is available doesn’t mean it’s usable or fit for purpose.

So if you have an availability figure you need to have some supporting measures that show how the application is performing.

Even though it shows as available, it may not be usable for any number of reasons: it might be running too slowly due to server capacity, or the network links may not be working so no one can actually use the application.

When an organization puts service management in place they always want some SLAs implemented quickly. Normally these are around an availability figure, partly because it’s easy to measure. The required level that is often focused on is 99%. How this will be measured and why this figure is arrived at are often a mystery, and remain so to those involved, but over time they become ingrained in the organization’s culture.

The 99% availability figure is often requested because the client really wants to say they want it available all the time they need it (100%) but are too polite, or too scared because of the potential cost, to ask, so they
use 99% as an acceptable alternative.

The technical areas are too polite to challenge the requirement: why do you need this level of availability, and what will it allow you to deliver?

In theory the setting of availability levels is the first step in a process that expects to increase and improve the number of measures that will sit behind the availability measure going forward. Normally it’s the number of measures that’s focused on, not the quality of those measures or the impact to the users.

The other thing that happens is that, whilst the intention is there to add to the measures, in practice nothing is actually put in place and organizations are left with an availability figure that is used as the only measure of the service. Then, because it’s all they have, they try to use it to measure everything.

Even when we look at availability figures, whilst 99% is the target we need to understand the business requirements to establish if this is required and how it will be measured. It should be simple, but when you measure availability there are a number of questions that need to be asked: do we take 24 hours a day, or do we look at the hours that users are in the office or on the production line? It seems obvious, but if it’s not clearly defined then the figures can be badly out of line with reality.

A 30 minute outage over a 24 hour period will give 97.9% availability.

The same 30 minutes over a 9 to 5 working day gives 93.75% availability.

If there is a business critical session, say 10am till 1pm, then we only have 83.3% availability for this particular part of the day.

So it’s possible to have a report showing high availability when we actually lost nearly 17% of the available service time during our critical business hours.

It all depends on how you take the data, which area you work in, what you want to show and how good your relationship with the users and your understanding of their requirements really are, coupled with the honesty to tell it as it is.

That’s just a basic availability figure using a full outage and 3 options. How complex could this get when dealing with an outage impacting a limited number of users?

Just from these figures we can see that the service level can show the service available but the user experience will be frustrating, especially if they are in sales-related roles that pay commission, because their potential earning time has been reduced.

Looking at the availability report below, there are some interesting deviations which should lead the recipient to question if the reporting has been manipulated. This is a real example of an availability report, and for the two years it was used no one challenged the data.

Service  Potential       Availability  Fully      Infrastructure
         Availability    Target        Available  Partially Available
A        27478380 Mins   99%           99.01%     99.8%
B        27478380 Mins   99%           95.18%     100.00%
C        43200 Mins                    100.00%    100.00%
D        358 Hrs                       97.24%     99.32%
E        720 Hrs                       100.00%    100.00%
F        176 Hrs                       100.00%    100.00%
G        455:30 Hrs      99.2%         100.00%    100.00%
H        720 Hrs                       100.00%    100.00%
I        176 Hrs         98%           99.21%     100.00%
J        1795200 Mins                  100.00%    100.00%

There is a reason that some of the services are reported in minutes and some in hours, and that is to ensure that the report showed green as much as possible. The same is true of the partially available figure.

There were a number of options available to show this data.

We could have measured the length of incident against a 24 hour period or against the working day. Just measuring incident length against the working day meant the figures stayed red. Using a 24 hour period would be too obvious.

So the calculation was based on the minutes in the online day (480), multiplied by the days in the month that the devices were due to be operational (say 23), multiplied by the number of devices in the network (approx 2489). This meant spending time manually reviewing all incidents, looking for phrases such as “50% of devices unavailable”, in order to find out how many devices may still have been working during any incident.

Using this method it was often possible to get the report showing green.

It could be argued that this was an accurate reflection of the overall position even if the drivers were completely wrong. In fact this report started to hide significant hardware issues within the end-to-end design that were having a major impact on the clients’ ability to work effectively.
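How much the choice of denominator moves an availability figure can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes a single 30 minute outage that falls wholly inside each measurement window, and it reuses the 480 minute online day, 23 operational days and approximately 2489 devices quoted above. The one hour, 50% of devices incident and all function names are hypothetical.

```python
def availability_pct(lost_minutes, potential_minutes):
    """Availability as a percentage of the potential service time."""
    if potential_minutes <= 0:
        raise ValueError("potential_minutes must be positive")
    lost = min(lost_minutes, potential_minutes)
    return 100.0 * (1.0 - lost / potential_minutes)


# One 30 minute outage measured against three different windows.
OUTAGE_MINUTES = 30
windows = {
    "24 hour day": 24 * 60,
    "9 to 5 working day": 8 * 60,
    "10am to 1pm critical session": 3 * 60,
}
window_results = {
    name: round(availability_pct(OUTAGE_MINUTES, minutes), 2)
    for name, minutes in windows.items()
}
for name, pct in window_results.items():
    print(f"{name:30s} {pct:6.2f}%")

# Device-minute denominator: online minutes x operational days x devices.
ONLINE_MINUTES_PER_DAY = 480
OPERATIONAL_DAYS = 23
DEVICES = 2489  # approximate figure
potential_device_minutes = ONLINE_MINUTES_PER_DAY * OPERATIONAL_DAYS * DEVICES

# Hypothetical incident: 50% of devices unavailable for one hour.
lost_device_minutes = 0.5 * DEVICES * 60
monthly_pct = round(
    availability_pct(lost_device_minutes, potential_device_minutes), 2
)
# The same hour as experienced by an affected user on that working day.
affected_user_day_pct = round(availability_pct(60, ONLINE_MINUTES_PER_DAY), 2)

print(f"Monthly device-minute availability: {monthly_pct}%")
print(f"Affected user's working day:        {affected_user_day_pct}%")
```

Under these assumptions an hour in which half the estate is down still reports as roughly 99.7% for the month, while each affected user saw 87.5% for that day. The larger the denominator, the smaller any single incident appears.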
From these examples it can be seen that it is easy for the gap between the IT area’s view of the world and the user experience to grow into a chasm very quickly. If the communication lines are not set up correctly it can take a long time before the gap is identified, and then even longer before actions are taken to improve things from a reporting perspective and, more importantly, from a user experience perspective.

Knowing who the “User” is makes a Difference

It may seem obvious, but it’s worth stating what is meant by a user. “User” can mean different things to different people and areas, and defining it can help us understand where the gap between theory and reality starts.

Regardless of whether the IT is in house or outsourced, a user is the person sat in front of the PC. It is not the business area that they work in.

It is the failure to engage all the way down the supply chain that often leads to targets being defined at the wrong level. When the engagement stops at the high or mid tier management there is a greater chance that targets will be set that can be met but will not provide the users with the level of service they need.

Whilst working at that level isn’t in itself wrong, it does start to bring in areas of uncertainty. The managers working at those higher levels own the area, but they are often unaware in any great depth of the tasks completed by the area or the challenges facing the users on the shop floor. Neither should we expect them to have a micro-management view.

Operating at that higher level it is possible to set up any number of metrics. However, if these are not set in conjunction with shop floor users and checked on a regular basis they can be meaningless from the day they are first produced.

Sadly, in many cases the fact that they are unfit for purpose is not understood, and the organization uses them as measures and often holds them up as proof that the IT systems are delivering high quality service.

What’s really sad is that the IT areas and the business unit managers both think things are performing satisfactorily, whilst users struggle to do their job.

To break this mould and identify this kind of MI requires both the IT area and the client to challenge the information they are using and develop methods to ensure the information being provided matches the users’ real day-to-day experience.

In large organizations this may require the development of two separate reporting lines, one from the IT area, the other from the business area. These then need to be reviewed to identify variations. Once any differences have been identified, the causes need to be understood and used to develop an accurate single view.

This approach, although initially painful, will eventually lead to a closer working relationship and drive a more open and honest debate from both sides, allowing an accurate picture of service availability to be supplied that all areas recognize.

Lesson about Targets Learnt the Hard Way

The availability figures show green

The response times are green

But the service is a disaster

How can all the statements be true? The MI shows green, so everything must be working?

This highlights the problems when MI production is based around incorrect assumptions. Of those, perhaps the biggest and simplest mistake is to fail to remain engaged with the client and so be unaware of changes in the client’s working patterns.

Even if the pitfalls highlighted earlier around lack of user input are avoided, once everything is in place, processes must be implemented to remain closely linked to the client in order to understand when their requirements change. The relationship will also be required as the reporting measures are refined to add improved value to the client area.

Once availability figures are in place and acceptable, the next step is to move on to the performance measures, and that’s when the fun really starts as we try to align the figures with the user experience. As with all measures, once in place the trick is to ensure they remain meaningful.

When response times are initially fixed they are often picked to give a certain, and in most cases significant, margin for error.

Often a figure is chosen that may not be operationally acceptable but is signed up to on the basis that at the time of signing the application was performing at a considerably better level than the measure.

Setting figures in such a way leaves the gate open for significant deteriorations in performance to take place whilst the figures still show the application
performing within agreed guidelines.

In simple terms, we sign up to a response time of 15 seconds per transaction at a time when the application works at 1 to 2 seconds per transaction.

An upper level for users to work effectively is 5 seconds per transaction, so the transaction response time can reach twice, or even three times, that level and still meet the agreed management figures.

Whilst this may appear outrageous, it has happened in the past and it will happen in the future. The fact that this happens is normally because engagement has not been made at the correct level to understand the response times that are required to allow the application to be usable.

The service may well run satisfactorily at first. However, pressure eventually starts to build as either more users are added or the functionality of the application is changed. Then from a user perspective the application starts to slow down, but the measures still show things working within agreed levels.

The other failure is the failure to continue to review and revise the levels that are required by users. An application that is initially used in a non business critical area is then used for a business critical task. In this instance a response time of 30 seconds may have been acceptable when it was a non critical function.

Once the application moves to be a business critical function the response time may no longer be acceptable, although once again the metrics being reported will still show the application functioning as agreed.

Another lesson around targets and their use focuses on the fixation that some organizations have with high severity incidents. As we saw with the conversation between the business and the IT area, it’s very easy to come up with a target and a resolution that look to match when in fact they miss each other completely.

For a start we are measuring incident volumes, not user impact, so a reduction in incidents may or may not result in less outage. If we have fewer incidents but they run longer then we may not have had an impact on the end users at all. In order to match the client requirements we have to be measuring the right thing in the first place: outage time and user impact.

The IT areas producing the service metrics must understand the issues that cause the most damage to the business. This may not always be the high profile, high severity incidents.

It is these high profile incidents that drive a reaction from the senior management within a business area, but reacting to them may not solve the issue for the users on the ground. If they don’t fix the issues for the users on the ground then they will not have fixed the issues for the senior management within the business, but they will have masked the symptoms.

Targets Driving the Wrong Behaviors

Once you have everything in place, the setting and agreeing of targets is equally important to ensure that you meet user requirements. The danger lies in targets such as the one below, versions of which are used, and will be used in the future, in organizations across the globe with the intention of improving performance.

CURRENT STATE

We have 1800 outstanding problem records as at 01/06/07

TARGET

To reduce this figure to 600 by the end of the year

This widely used target is a classic example of the mismatch: IT targets drive the IT areas to deliver whilst they think they have made an improvement to the user experience.

At first glance this is an admirable objective and will improve the level of service being supplied. However, when you look deeper at the target and start to understand what behaviors it will generate, it becomes clear that whilst the IT support teams will be working hard to reduce the volume of problem records that are open, the improvement to the users may be negligible.

Why will all the work possibly fail to deliver? Because the IT areas will be working toward their targets and making those simple, quick fixes that will drive down the number of outstanding problem records they are managing.

It is at this point that the missing link becomes clear: the target takes no account of the problem records that the business needs and wants fixing. The business could well be perfectly happy to have the figure reduced only to 1700 outstanding records, provided the 100 problem records that are fixed are the ones that are costing their business the most.

Fix those key records and you improve service. Miss them and service remains as it always was, no matter how much effort is made and how many problem records you close off.
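The contrast between a volume target and an impact target can be sketched with a toy backlog. Everything in the example below is hypothetical: the effort and weekly cost figures are invented for illustration, and only the totals of 1800, 600 and 1700 records come from the target discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemRecord:
    record_id: int
    fix_effort_days: float       # assumed effort to resolve the record
    weekly_business_cost: float  # assumed cost to the business while open


def build_backlog():
    """Hypothetical backlog: 100 costly, hard records and 1700 cheap, easy ones."""
    costly = [ProblemRecord(i, 10.0, 5000.0) for i in range(100)]
    cheap = [ProblemRecord(i, 1.0, 10.0) for i in range(100, 1800)]
    return costly + cheap


def remaining_after_closing(backlog, how_many, priority):
    """Close the first `how_many` records in `priority` order; return the rest."""
    ordered = sorted(backlog, key=priority)
    return ordered[how_many:]


def weekly_cost(records):
    return sum(r.weekly_business_cost for r in records)


backlog = build_backlog()

# Volume-driven: hit the target of 600 open records by closing the easiest 1200.
volume_driven = remaining_after_closing(
    backlog, 1200, priority=lambda r: r.fix_effort_days
)

# Impact-driven: close only the 100 records that cost the business the most.
impact_driven = remaining_after_closing(
    backlog, 100, priority=lambda r: -r.weekly_business_cost
)

print(f"Start:         {len(backlog)} open, weekly cost {weekly_cost(backlog):,.0f}")
print(f"Volume-driven: {len(volume_driven)} open, weekly cost {weekly_cost(volume_driven):,.0f}")
print(f"Impact-driven: {len(impact_driven)} open, weekly cost {weekly_cost(impact_driven):,.0f}")
```

In this invented backlog the volume-driven approach meets the target of 600 open records yet leaves almost all of the weekly business cost in place, while the impact-driven approach misses the numeric target and removes most of the cost. Real backlogs will not split so cleanly, but the ordering question is the same: the business, not the IT area, has to supply the impact figure.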
It’s What You Can’t See That Does the Damage

Even when IT areas focus on the known problems, there are often
significant issues lurking below the surface that the users haven’t
reported. Without knowledge of these hidden issues, the IT areas can
easily spend a significant amount of resource and effort fixing the
wrong issues.

Understanding this means building an understanding of why users stop
reporting issues that actually impact their ability to work as
effectively as possible.

There are two main reasons. Firstly, they learn the workarounds.
Secondly, if their initial calls appear to get no response, they start
to accept a lower level of service. The other side of this is that when
the calls stop coming in, the IT area thinks the issue has gone away
when in fact it has merely moved out of sight. In extreme cases the IT
areas may try to claim credit for the reduction in calls.

If the IT areas are to fix the right issues, then a way needs to be
found to ensure that all the issues are called in. This is the point
where it becomes apparent that you can’t work on one part of the
process in isolation.

So how can we start to get things to match the users’ real-life
experience? The first step is to make sure that we obtain all the
relevant data the first time the call is logged, to allow effective
investigation. This should be coupled with a standard, repeatable
incident logging process that allows multiple incidents to be collated
and linked.

But the help desk is driven by time targets, so we’re back to targets
and behaviors again. The IT help desk needs to change its targets to
extend the incident handling time, so that all relevant information can
be collected while the incident is occurring.

This way all the required information is obtained, all calls are
logged, and the right problems get fixed as the business adds its
impact to each problem.

This does take longer and may require more resource initially; however,
this is short-term pain for long-term gain.

As the right problems are identified and fixes are delivered for the
problems that the business needs solved, the users can operate more
effectively. The volume of incidents called in will decline, reducing
the impact on the help desk.

Making Sure We See the Whole Picture

Once the users have been encouraged to start reporting all the faults,
because they are confident that something will be done when they call
in, the next step is to ensure that the correct issues are addressed.

In order to do that it is necessary to take a step back to see the
whole picture. This means that to understand what’s going on we need to
obtain as much information as possible and not jump to conclusions as
soon as we have the first piece of data to hand.

Even when you think you know, ask some questions just to make sure you
really understand what is going on, making sure you are able to address
the root cause and not just remove the symptoms in the short term.

The danger with performance reporting is that once we have the first
piece of monitoring in place and the reports set up, we try to make the
one measure work for everything. We fill in the blanks with what
limited knowledge we have and guess at other areas.

Fully understanding what’s happening requires a structured questioning
approach linked to sound technical knowledge. The answers to the
questions can come from users or from monitoring and alerting tools:
anything that allows a detailed picture to be built up before actions
are taken.

Without this full picture, how can we start to improve the service we
deliver?

Starting to make things work

Can we get this service level management process to work effectively?

Yes we can, but each area must work together in partnership. Technical
areas need to understand the business areas’ drivers and limitations,
working closely with them to understand the impact of any changes they
make.

In the same way, the business must keep in contact with the technical
areas and ensure they are informed of any changes to its working
practices that may impact the technology.

This means letting the technical areas know about all the changes it
makes, so any potential impact can be tested before implementation.

It will also require some fundamental changes in the way the
organization thinks. It must move from “I have a piece of software, now
what can I do with it?” to “This is what the client wants, how can I
get that information for them?”
Working in this way will take longer to set up and may require the use
of a number of different pieces of software, but once in place it will
start to deliver effective results.

Outside of pure data capture and reporting, make sure that the systems
are built with resilience as standard and then monitor them to make
sure that the resilience is maintained throughout the lifetime of the
application.

So often systems are set up with resilience, perhaps two servers. Over
time the capacity requirements mean that they are constantly running
one server at full capacity and the other at half capacity, so if we
lose one server we have no resilience left and service is lost.

Working together means making sure everyone understands what has been
agreed. Much like service levels, it’s not complex and it shouldn’t be
difficult, but it must be understandable and flexible.

Relationships and more

Building effective relationships is an important part of getting this
right, but relationships alone are not going to get the job done. IT
areas also need to understand what they want to measure, and the impact
this will have on the users and the business.

It’s about identifying what causes the pain and what can be lived with.
This isn’t about fixing everything; it’s not about spending $20,000 to
fix a $100 problem. It’s about finding and fixing the right issues.

The other angle is to establish what it is that we want to measure and
report. Do we want to measure availability, usability or both?

Once the measures have been decided, it’s time to find the right tool
to do the job. This may mean having different specialized tool sets for
each task, but if you want it done right you have to pay for quality.

This again has a high up-front cost, but the payback is quick as the
right issues are highlighted and fixed.

Once everything is in place and the reports are being produced and
distributed, there is a further item that needs monitoring: do the
reports cause a reaction from the recipients? If not, then why are they
being produced?

Keeping it new and alive

Having measures in place is one thing; making sure that people pay
attention to them is the harder part.

One of the best solutions is to keep changing the measures to make sure
that people don’t get complacent and don’t start “playing the system”.

Targets need to be changed to ensure that we are always looking to find
where we need to concentrate our resources to improve service.

If reports are green time after time, then by all means continue to
collect the data, but don’t report it time after time. Look for the
areas of client impact and report those; this will get a response. A
green report often just gets filed because at first glance it requires
no action.

Know the limitations

Whenever you prepare reports, and when you receive them, it is vital
that you fully understand any limitations around the data that you are
working with or viewing.

Even when you think you know what the figures are telling you, if the
producers haven’t documented any limitations then challenge them. It
could be that a significant criterion has been missed, such as the
figures only including specific types of incidents.

Equally important is the need to know whether the data you have is the
base data or whether someone has already manipulated it before you
start working with it. This can have a significant impact on the end
result of your reports: if the data has already been manipulated in
order to paint a particular picture, then anything you do afterwards
will not necessarily show a true and honest picture of the service.

If you know that data has been manipulated before you receive it, go
back to the source and understand what they have done and why. They may
have already done part of your job for you, or alternatively you may
have to go back to the original data source and start from scratch.

When producing reports, always challenge the data you are using and the
results you produce; think what the recipient may ask and answer those
questions yourself. The more you understand, the better the resulting
report and the better the relationship between you and the client will
become.

When we understand the limitations around the data that we use, the
next step is to decide whether, as an organization, we are prepared to
invest in processes and people to ensure that the right data is
collected in the right format at the earliest possible opportunity.

In conclusion
• Listen to the users
• Understand the business
• Identify what the client needs
• Identify what you need
• Continually review the requirements
• Build partnerships not conflicts

3 Severity 1 incident definition: direct threat or damage to the
reputation or credibility of the group. Multiple lines of business or
locations critically affected.
Where do we go from here?
What are the next steps to take in order to deliver
effective measurements, moving our reporting even
closer into line with the client’s requirements?
One area that we’d like to pursue is the idea of flexible
service levels throughout the day: business critical
times and non critical times can have different
availability measures.
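As a rough illustration of the idea, the sketch below scores
availability separately for each time window instead of as a single
daily figure. It is only a sketch under stated assumptions: the window
names, hours and targets are invented for the example, and outages are
supplied as the minutes of the day on which the service was down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    name: str
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive, 1-24
    target: float    # required availability as a fraction, e.g. 0.999


def availability_by_window(windows, outage_minutes):
    """Return {window name: (achieved availability, target met?)}.

    outage_minutes holds each minute of the day (0-1439) on which the
    service was unavailable.
    """
    down_minutes = set(outage_minutes)
    results = {}
    for w in windows:
        start, end = w.start_hour * 60, w.end_hour * 60
        down = sum(1 for m in down_minutes if start <= m < end)
        achieved = (end - start - down) / (end - start)
        results[w.name] = (achieved, achieved >= w.target)
    return results


# Hypothetical windows and targets, for illustration only.
WINDOWS = [
    Window("overnight", 0, 8, 0.95),
    Window("business_critical", 8, 18, 0.999),
    Window("evening", 18, 24, 0.95),
]

# A 30 minute outage starting at 09:00.
outage = range(9 * 60, 9 * 60 + 30)
report = availability_by_window(WINDOWS, outage)
whole_day = 1 - len(set(outage)) / (24 * 60)
```

In this made-up example the 30 minute outage leaves the whole day at
roughly 97.9% but the business critical window at 95%, well below its
assumed target, while both non critical windows still pass.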
Another is convincing the client and the Service Management
Teams to amend the style of reporting so that, rather than
reporting only on all that’s well, we also start
proactively highlighting areas that are starting to
require attention.
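One simple way to highlight areas that are starting to require
attention, sketched below, is to flag a measure that still meets its
target but has fallen in each of the last few reporting periods. The
function name, the three period look-back and the sample figures are
assumptions made for this example, not part of any particular
reporting tool.

```python
def needs_attention(history, target, periods=3):
    """True if the latest value still meets the target but the measure
    has fallen in each of the last `periods` reporting periods."""
    recent = history[-(periods + 1):]
    if len(recent) < periods + 1:
        return False  # not enough history to judge a trend
    still_green = recent[-1] >= target
    falling = all(later < earlier
                  for earlier, later in zip(recent, recent[1:]))
    return still_green and falling


# Hypothetical weekly availability percentages against a 99.5% target.
steady = [99.8, 99.8, 99.8, 99.8]
slipping = [99.9, 99.8, 99.7, 99.6]

flags = {
    "steady": needs_attention(steady, 99.5),
    "slipping": needs_attention(slipping, 99.5),
}
```

Both series would appear green on a conventional report, yet only the
second is flagged.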
Sample Incident Severity Definitions
Severity 1
A direct threat of damage to the image, reputation or
credibility of the group. Multiple lines of business or
locations critically affected.
Severity 2
Significant degradation or outage affecting a line of
business key services or locations
Severity 3
Minor degradation to a key service, business process or
location or a more severe degradation or outage to a
non critical service, business process or location
Severity 4
Small issue with localized scope typically affecting a
single user. Can either be tolerated or worked around
for an extended period of time due to its limited impact
Caveats & Disclaimers
All trademarks and copyrights are acknowledged.
Views and opinions expressed in this paper are those
of the author and do not necessarily reflect those of
Barclays Bank PLC in general.
Bibliography
OGS Service Handbook, Version 2, 2007
1 OGC Service Delivery Handbook
2 An incident that impacts all users of a particular service