     UNDERNEATH THE SPIN A PRACTICAL LOOK AT SERVICE LEVELS
                (IS WHAT YOU SEE WHAT YOU GET?)
                                                   Malcolm Gunn
                                     Service Availability Management Consultant
                                                  Barclays Bank Plc
                                              TEL +44(0)7966224346
                                        E-Mail:malcolm.gunn@barclays.com

Abstract

This paper looks at the gap between the service level figures provided by the technical teams
and the users' actual experience. Using real-life examples and reporting models, it examines how
the gap between reporting and user experience has evolved and highlights errors that can be
avoided. The paper finishes by looking at the steps that can be taken to ensure that service
level reports move closer to matching the user experience, and at the next stages in reporting
development.


Introduction

The setting of service levels appears easy, but getting those levels to reflect the flexible
requirements of the client is something that seems almost alien to Service Management.

Service levels are in common use across organizations as IT areas endeavor to show how effective
they are at delivering their clients' services.

They're not new and they're not complex, but in many cases neither are they an accurate
reflection of the user experience.

Even the basics can be a challenge; figures are normally based around simple, easy-to-report
measures. Often we use the wrong figures for the wrong measure, for example incident volumes to
show availability.

Remember, just because something is available doesn't mean it's usable. What's needed is the
ability to make the levels flexible enough to meet highs and lows in demand. Levels need to be
set in such a way that they allow the client to do their job as effectively as possible.

This paper will look at how easily the gap between the theoretical figures and real-life
performance can start to widen until the two bear no relationship to each other.

In extreme cases the service level reporting will show service performance working within agreed
parameters when in reality the application is unusable.

Service Levels

The OGS handbook states,

“The improvements in service quality and the reduction in service disruption that can be
achieved through effective Service Level Management (SLM) can ultimately lead to significant
financial savings. Less time and effort is spent by IT staff in resolving fewer failures and IT
Customers are able to perform their business functions without adverse impact.” [1]

With such a clear definition of what effective service level management can deliver, it would be
realistic to expect the delivery of service level reporting to be clear cut, simply defined and
a high priority. In practice the choice of service levels, their monitoring and their reporting
are often far harder to establish and deliver in a meaningful manner.

The problem often lies in how and why the service levels are agreed and measured. As with most
theoretical processes, the practical implementation is not always as simple as it sounds.

If the implementation is not correctly thought out, or the delivery of the service level
reporting isn't clearly defined, a gap will quickly develop between the Service Level Agreement
(SLA) and the user experience.

The gap between users and the IT area could equally be said to be the gap between the business
owners' perspective and the users within that business area.

This is true because the business areas as a whole can often be unaware of the true performance
of the IT they are being supplied with. This is because they are relying on the data supplied by
the IT areas to make
their judgment.

Whilst these areas are responsible for the business users of the applications, they are not, and
nor should they be expected to be, aware of the actual day-to-day mechanics of how the
applications work. They are after all buying a service from the IT area and they expect it to
work correctly.

It is because reporting is often reviewed at such a high level that the mismatches fail to be
noticed, so the IT areas and the business groups both believe they have put something in place
that will allow them to monitor and control the performance of the service.

They review the figures on a regular basis and these will show services working satisfactorily
(within documented guidelines), even if in practice the users are struggling to deliver an
effective service to the organization's customers.

So here's a quick look at how this happens, by looking at a sample conversation that takes place
and the actions that are agreed as a result to improve performance.

Business Director's request: “I want improved stability”

IT Director's solution: “I'll reduce the number of high severity incidents” [2]

The resulting user impact: “No real change”

How can this happen?

In practice the conversation will go on for much longer and contain a lot of fine, complex
words, but this is essentially what they say. Perhaps if that's all they said they might realize
why things don't turn out as expected.

As a result of this meeting much work will be done to reduce the volume of high severity
incidents, but nothing will have changed for the user. This is because, although at first
reading it looks as though the two areas were talking about the same thing, they are in fact not
even close to talking about the same thing.

One is talking about business impact, the other about a particular type of incident and its
volume.

The causes of this are many and varied, but the impact can be clear and damaging for the
business. Imagine a user facing a business customer who requires a PC to do their job, against
which they are measured for sales and contacts.

Then if that PC is unavailable they are unable to operate, which results in missed targets, and
ultimately they receive a lower pay package and bonus.

They may be fully capable of performing at a higher level, but IT infrastructure has held them
back. If the infrastructure and delivery are wrong this will have a significant impact on the
business's ability to retain top quality staff and eventually to recruit the best, as word will
travel around the user community.

So poor IT will impact the bottom line. This means that cost savings in the IT area may have an
impact on the company's bottom line, and it may not be the one the company expected...

The danger in using incident volumes in this scenario is that companies will often address the
severity 1 symptoms, removing the high severity incidents from the reports, but they don't fix
the underlying root cause. This type of target-driven management will work to mask issues that
are then waiting in the wings to return and cause more damage later.

Looking at service management information requires the recipient to understand and question, to
ensure that what they are seeing is what they think they are seeing. Sometimes just checking the
data behind the figures can reveal what's really going on.

Does What We Document Show What We Mean

Does the information we see make sense, or is more information required to make the judgment
call?

3+2=11

At first glance the sum appears to be wrong. However, if one extra piece of information is
supplied and the recipient has the right background knowledge, the sum makes perfect sense.

The missing piece of information is that the sum is working in base 4: 3 + 2 is five, and five
in base 4 is written 11 (one four and one unit). Suddenly it all makes sense.

The same principle applies to service level reports. Whilst they may appear to be showing a
particular version of the truth, unless the recipient has the required background knowledge and
is provided with enough relevant information they may interpret the reports in a completely
different way from that which was intended. This doesn't mean to say that the report will be
giving an untrue picture, but it may be a narrow view of the truth.

Whenever you send or, perhaps more importantly, receive reports, challenge whether they mean
what they appear to mean. The copies of the reports on the
next pages illustrate this. At first glance they all appear to be showing stable and available
services. Closer examination will show that perhaps all is not as it seems.

Do We Really Have a Stable Service

There are times when reports are produced for a specific purpose. In this case the purpose was
to show the improvement in the stability of the services delivered by the IT area of an
organization.

[Figure: calendar for one month, Monday to Sunday, days 1 to 31, each day marked with either a
green tick or a red dot]

The only background information that was supplied was a description of the symbols:
• Green tick means no new severity 1 incident [3]
• Red dot means a new severity 1 incident.

The chart is supplied with a table of incident volumes:

                        Sept   Oct   Nov   Dec   Jan   Feb
Severity 1 Green Days      6     5     2     1     7     4
Severity 1                66   108   131   104    82    72
Severity 2              2086  2075  2120  1966  2004  1924

                         Mar   Apr   May   Jun   Jul   Aug
Severity 1 Green Days      3     7    15    17    20    21
Severity 1                88    58    31    17    15    11
Severity 2              1891  1491  1489  1447  1765  1720

At first it appears to show an improvement in the stability of the service: incident volumes
have decreased and the number of green days (days with no new severity 1 incident starting) has
increased from 6 to 21, almost a fourfold improvement.

Whilst the area producing the report may know the message it wants to deliver, will the
recipient understand what they are being told? Or, in some cases, does the area producing the
report understand what it has delivered?

So what is this report actually telling us? What the recipient of the report needs to know is
what success we are showing here. It looks like we have run without a major incident for 21 days
out of 30 days.

In this case the red dots only indicate when a severity 1 incident starts. An incident could run
for 2, 3 or 4 days but only show up on its first day. So a green day could actually have one or
more severity 1 incidents running, which may not be what the recipient wants to see.

As with all incidents, it's not just how long they run but the damage they cause that the client
is interested in. The danger with the IT community is that we focus on volumes rather than
impact. Plotting incident volumes does not tell you the impact on the client; it just tells you
the number of incidents that have been logged with the help desk. That may be a shock to some
organizations that spend a lot of time trending incident volumes.

Reporting and trending incident volumes isn't wrong, and it is often a good indicator of where
potential issues may be impacting the client, but in order to be effective even at this basic
level some further categorization will be required.

Unless the data can be split into useful categories, such as specific services or types of
infrastructure, the figures will not show anything other than a total number of calls logged.

It's not clear from this report which month actually did the most damage to the client, as there
is no indication of the duration or, more importantly, the impact to the client.

This report gives the impression of a stable environment because it focuses on the high profile
incidents, but how many lower severity incidents are eating away at the available up time? This
looks good but shows you almost no actual usable facts.

From a client viewpoint they may be quite happy to have high severity incidents on some
services, but for specific services that are critical to them they may not tolerate any outages
during their on line day.

Whenever reports are produced the area responsible must review them and understand what the
report is actually telling the audience. They may know what they mean to say, but will it be
clear to the recipients?

Any data should reflect the true position. It's very easy to produce nice pictures and graphs,
but if that's all they are then they are useless and are a waste of time and resources.
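
The difference between a day on which no severity 1 incident started and a day on which no severity 1 incident was running can be made concrete with a small sketch. The incident list below is hypothetical (it is not the data behind the report), and the function names are illustrative only; the sketch simply assumes each incident is recorded as a start day and a duration in whole days.

```python
def green_days_by_start(incidents, days_in_month):
    """Count days on which no severity 1 incident STARTED (the chart's definition)."""
    start_days = {start for start, _ in incidents if 1 <= start <= days_in_month}
    return days_in_month - len(start_days)


def days_without_active_incident(incidents, days_in_month):
    """Count days on which no severity 1 incident was RUNNING at all."""
    affected = set()
    for start, duration_days in incidents:
        # An incident occupies its start day plus any following days it runs into.
        last_day = min(start + duration_days - 1, days_in_month)
        affected.update(range(start, last_day + 1))
    return days_in_month - len(affected)


# Hypothetical data: (start day of month, duration in days).
SAMPLE_INCIDENTS = [(3, 4), (10, 1), (15, 3), (15, 1), (24, 2)]
DAYS_IN_MONTH = 30

print("Green days (no new incident started):",
      green_days_by_start(SAMPLE_INCIDENTS, DAYS_IN_MONTH))
print("Days with no incident running:",
      days_without_active_incident(SAMPLE_INCIDENTS, DAYS_IN_MONTH))
```

With the same incident log the two counts differ, which is the ambiguity a recipient of the chart would need to ask about.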
There are a number of lessons to be learnt from this type of reporting.

1. Incident volume alone, even if it lists all incidents, is not a measure of service
   availability or stability; it's a measure of incident volumes.
2. Reporting that focuses on a limited selection of severities actually gives no meaningful data
   on the client's services.
3. In this instance there is no indication that we have more than 2 severities, and if we do, it
   fails to show how many low severity, high volume incidents are really going on behind the
   scenes.
4. Without some form of user impact, such as lost business hours, the impact to the end user is
   not clear.

What Does Our Dashboard Really Say?

Dashboards are used across a wide cross section of organizations. Some of these are real time
and some are historic. Even though they are in common use, the message they are trying to
portray needs to be clearly defined.

When a person buys a car, would they settle for a dashboard that told them the car was moving
but gave no indication of the speed, revs, temperature or oil pressure? Would they accept a
single warning light saying "fault"? It's unlikely, yet that is exactly what companies often
accept within the service management world.

[Figure: "TOP 10 SERVICE" dashboard with one panel each for Service 1 to Service 10, every panel
headed "Last incident date and summary"]

This is a copy of a service dashboard that is in place and projected across multiple sites on
plasma screens. I'm sure everyone in the particular organization that uses it is very proud of
it, and when everything is green everyone looks very happy.

But what is it really saying? To understand that we need to look behind the scenes at what goes
together to make up the color coding. Once we can see that, we can understand whether everyone
who sees this should be smiling or crying.

In this particular case the dashboard is manually updated, so whilst it appears to be in real
time it's running behind, and the delay will be variable. So immediately it's no use as a
service dashboard and no one should have any confidence that it's showing a true picture of the
client experience.

To use the car speedometer analogy, this is showing a speed that you were traveling at at some
point in the past, although we can't clarify how long ago you were traveling at that speed.

Once again this dashboard is driven by the fixation with severity 1 incidents, so it's not even
taking into account the full user experience.

Armed with these basic background facts it becomes clear that the people going around with a
smile on their face because the dashboard is green are working under an illusion, and at any
minute a reality that they have no idea about could come crashing in.

Is the Availability Figure All It Seems

It's clear that when working with incident volumes a significant amount of ambiguity can be
introduced. When organizations start looking at availability figures the scope for manipulation
increases. This brings with it increased danger of misinterpretation and misunderstanding.

The first point to remember with availability figures is that available is just what it says.
What organizations need to remember is that just because something is available doesn't mean
it's usable or fit for purpose.

So if you have an availability figure you need to have some supporting measures that show how
the application is performing.

Even though it shows as available it may not be usable, for any number of reasons. It might be
running too slowly due to server capacity, or the network links may not be working, so no one
can actually use the application.

When an organization puts service management in place they always want some SLAs implemented
quickly. Normally these are around an availability figure, partly because it's easy to measure.
The required level that is often focused on is 99%. How this will be measured and why this
figure is arrived at are often a mystery, and remain so to those involved, but over time they
become ingrained in the organization's culture.

The 99% availability figure is often requested because the client really wants to say they want
it available all the time they need it (100%) but are too polite, or too scared because of the
potential cost, to ask, so they
use 99% as an acceptable alternative.                         level can show the service available but the user
                                                              experience will be frustrating especially if they are in
The technical areas are too polite to challenge the           sales related roles that pay commission because their
requirement why do you need this level of availability        potential earning time has been reduced.
what will it allow you to deliver.
                                                              Looking at the availability report below there are some
In theory the setting of availability levels is the first     interesting deviations which should lead the recipient
step in a process that expects to increase and                question if the reporting has been manipulated. This
improve the number of measures that will sit behind           is a real example of an availability report and for the
the availability measure going forward.                       two years it was used no one challenge the data
Normally it’s the number of measures that’s focused
on not the quality of those measures or the impact to         Service Potential      Availably Fully    Infrastructure
the users.                                                            Availability   Target Available   Partially
                                                                                                        Available
                                                              A       27478380       99%     99.01 %    99.8 %
The other thing that happens is that whilst the                       Mins
intention is there to add to the measures in practice         B       27478380       99%     95.18 %    100.00%
nothing is actually put in place and organizations are                Mins
left with an availability figure that is used as the only     C       43200 Mins             100.00%    100.00%
measure of the service. Then because it’s all they            D       358Hrs                 97.24%     99.32%
have they try to use it to measure everything.                E       720 Hrs                100.00%    100.00%
                                                              F       176 Hrs                100.00%    100.00%
Even when we look at availability figures whilst 99% is
the target we need to understand the business                 G       455:30 Hrs     99.2%   100.00%    100.00%
requirements to establish if this is required and how it      H       720 Hrs                100.00%    100.00%
will be measured. It should be simple but when you            I       176 Hrs        98 %    99.21%     100.00%
measure availability there are a number of questions          J       1795200 Mins           100.00%    100.00%
that need to be asked,do we take 24 hours a day or
do we look at the hours that users are in the office or
on the production line. It seems obvious, but if it’s not clearly
defined then the figures can be badly out of line with reality.

A 30 minute outage over a 24 hour period will give
97.9% availability

The same 30 minutes over a 9 to 5 working day gives
93.75% availability

If there is a business critical session, say 10am till 1pm,
then we only have 83.3% availability for this particular
part of the day

So it’s possible to have a report showing high availability when we
actually lost nearly 17% of the available service time during our
critical business hours.

It all depends on how you take the data, which area you work in, what
you want to show and how good your relationship with the users and your
understanding of their requirements really are, coupled with the
honesty to tell it as it is.

That’s just a basic availability figure using a full outage and three
options. How complex could this get when dealing with an outage
impacting a limited number of users?

Just from these figures we can see that the service

There is a reason that some of the services are reported in minutes and
some in hours: to ensure that the report showed green as much as
possible. The same is true of the partially available figure.

There were a number of options available to show this data.

We could have measured the length of the incident against a 24 hour
period or against the working day. Just measuring incident length
against the working day meant the figures stayed red. Using a 24 hour
period would be too obvious.

So the calculation was based on the minutes in the online day (480),
multiplied by the days in the month that the devices were due to be
operational (say 23), multiplied by the number of devices in the
network (approximately 2489). This meant spending time manually
reviewing all incidents, looking for phrases such as “50% of devices
unavailable”, in order to find out how many devices may still have been
working during any incident.

Using this method it was often possible to get the report showing
green.

It could be argued that this was an accurate reflection of the overall
position even if the drivers were completely wrong. In fact this report
started to hide significant hardware issues within the end to end
design that were having a major impact on the clients’ ability to work
effectively.
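The availability arithmetic discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the reporting tool used at the time: the function names are hypothetical, and the single incident fed to the device-minute calculation (half the devices down for 30 minutes) is an assumed example. Worked through directly, a 30 minute outage gives roughly 97.9% over 24 hours, 93.75% over a 9 to 5 day and roughly 83.3% over a 10am to 1pm session.

```python
def availability_pct(outage_minutes, window_minutes):
    """Percentage of the measurement window during which the service was up."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return 100.0 * (1.0 - outage_minutes / window_minutes)


def device_weighted_availability_pct(incidents, minutes_per_day, days, devices):
    """Availability measured in device-minutes.

    `incidents` is a list of (duration_minutes, fraction_of_devices_down)
    pairs, e.g. (30, 0.5) for "50% of devices unavailable" for 30 minutes.
    """
    total_device_minutes = minutes_per_day * days * devices
    lost_device_minutes = sum(
        duration * fraction * devices for duration, fraction in incidents
    )
    return 100.0 * (1.0 - lost_device_minutes / total_device_minutes)


OUTAGE_MINUTES = 30
WINDOWS = {
    "24 hour period": 24 * 60,
    "9 to 5 working day": 8 * 60,
    "10am to 1pm critical session": 3 * 60,
}

# Same outage, three different denominators.
for name, window_minutes in WINDOWS.items():
    print(f"{name}: {availability_pct(OUTAGE_MINUTES, window_minutes):.2f}%")

# Device-minute method: 480 minutes x 23 days x 2489 devices.
# The incident below is an assumed example, not data from the paper.
weighted = device_weighted_availability_pct([(30, 0.5)], 480, 23, 2489)
print(f"device-minute availability: {weighted:.2f}%")
```

The larger the denominator, the greener the figure: the same half hour that costs a sixth of the critical session barely registers once it is spread across every device-minute in the month.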
                                                           In large organizations this may require the
From these examples it can be seen that it is easy for     development of two separate reporting lines one from
the gap between the IT areas view of the world and         the IT area the other from the business area. These
the user experience to grow into a chasm very quickly      then need to be reviewed to identify variations, once
and if the communication lines are not set up correctly    any differences have been identified the causes need
it can take a long time before it is identified and then   to be understood and used to develop an accurate
even longer before actions are taken to improve            single view.
things from a reporting perspective and more
importantly from a user experience perspective             This approach although initially painful will eventually
                                                           lead to a closer working relationship and drive a more
Knowing who the “User” is makes a Difference               open and honest debate from both sides allowing an
                                                           accurate picture of service availability to be supplied
It may seem obvious but it’s worth stating what is         that all areas recognize.
meant by a user. “User” can mean different things to
different people and areas. As this can help               Lesson about Targets Learnt the Hard Way
understand where the gap between theory and reality
starts.                                                    The availability figures show green

Regardless of the fact the IT may be in house or           The response times are green
outsourced. A user is the person sat in front of the PC.
It is not the business area that they work in.             But the service is a disaster

It is the failure to engage all the way down the supply    How can all the statements be true the MI shows
chain that often leads to targets being defined at the     green so everything must be working?
wrong level. When the engagement stops at the high
or mid tier management there is a greater chance that      This highlights the problems when MI production is
targets will be set that can be met but will not provide   based around incorrect assumptions, of those
the users with the level of service they need.             perhaps the biggest and simplest mistake is to fail to
                                                           remain engaged with the client to be aware of
Whilst working at that level isn’t in its self wrong it    changes in the clients working patterns.
does start to bring in areas of uncertainty. The
managers working at that those higher levels own the       Even if the pitfalls highlighted early around lack of
area but they are often unaware in any great depth of      user input are avoided, once everything is in place,
the tasks completed by the area or the challenges          processes must be implemented to remain closely
facing the users on the shop floor. Neither should we      linked to the client in order to understand when their
expect them to have a micro management view.               requirements change. The relationship will also be
                                                           required as the reporting measures are refined to add
Operating at that higher level it is possible to set up    improved value to the client area.
any number of metrics. However if these are not set
in conjunction with shop floor users and checked on a      Once availability figures are in place and acceptable
regular basis they can be meaningless from the day         the next step is to move onto the performance
they are first produced.                                   measures and that’s when the fun really starts as we
                                                           try to align the figures with the user experience. As
Sadly in many cases the fact that they are unfit for       with all measures once in place the trick is to ensure
purpose is not understood and the organization uses        they remain meaningful
them as measures and often holds them up as proof
that the IT systems are delivering high quality service.   When responses times are initially fixed they are often
                                                           picked to give certain and in most cases significant
What’s really sad is that both the IT areas and the        margin for error.
business unit managers both think things are
performing satisfactorily, whilst users struggles to do    Often a figure is chosen that may not be operationally
their job                                                  acceptable but is signed up to on the basis that at the
                                                           time of signing the application was performing at a
To break this mould and identify this kind of MI           considerably better level than the measure.
requires both the IT area and the client to challenge
the information they are using and develop methods         By setting figures in such away it leaves the gate open
to ensure the information being provided matches the       for significant deteriorations in performance to take
users’ real day to day experience.                         place whilst the figures still show the application
performing within agreed guidelines.

In simple terms, we sign up to a response time of 15 seconds per
transaction at a time when the application works at 1 to 2 seconds per
transaction.

An upper level for users to work effectively is 5 seconds per
transaction, so the transaction response time can reach twice the level
at which the application becomes unusable and still meet the agreed
management figures.

Whilst this may appear outrageous, it has happened in the past and it
will happen in the future. The fact that this happens is normally
because engagement has not been made at the correct level to understand
the response times that are required to allow the application to be
usable.

The service may well run satisfactorily at first. Eventually, however,
pressure starts to build: either more users are added or the
functionality of the application is changed. Then, from a user
perspective, the application starts to slow down, but the measures
still show things working within agreed levels.

The other failure is the failure to continue to review and revise the
levels that are required by users. An application that is initially
used in a non business critical area is then used for a business
critical task. In this instance a response time of 30 seconds may have
been acceptable when it was a non critical function.

Once the application moves to be a business critical function the
response time may no longer be acceptable, although once again the
metrics being reported will still show the application functioning as
agreed.

Another lesson around targets and their use focuses on the fixation
that some organizations have with high severity incidents. As we saw
with the conversation between the business and the IT area, it’s very
easy to come up with a target and a resolution that look to match when
in fact they miss each other completely.

For a start, we are measuring incident volumes, not user impact, so a
reduction in incidents may or may not result in less outage. If we have
fewer incidents but they run longer, then we may not have had an impact
on the end users at all. In order to match the client requirements we
have to be measuring the right thing in the first place: outage time
and user impact.

The IT areas producing the service metrics must understand the issues
that cause the most damage to the business. These may not always be the
high profile, high severity incidents.

It is these high profile incidents that drive a reaction from the
senior management within a business area, but the actions taken may not
solve the issue for the users on the ground. If they don’t fix the
issues for the users on the ground then they will not have fixed the
issues for the senior management within the business, but they will
have masked the symptoms.

Targets Driving the Wrong Behaviors

Once you have everything in place, the setting and agreeing of targets
is equally important to ensure that you meet user requirements. The
danger lies in targets such as the one below, versions of which are,
and will continue to be, used in organizations across the globe with
the intention of improving performance.

CURRENT STATE

We have 1800 outstanding problem records as at 01/06/07

TARGET

To reduce this figure to 600 by the end of the year

This widely used target is a classic example of the mismatch: the
target drives the IT areas to deliver against it, and in doing so they
think they have made an improvement to the user experience.

At first glance this is an admirable objective and will improve the
level of service being supplied. However, when you look deeper at the
target and start to understand what behaviors it will generate, it
becomes clear that whilst the IT support teams will be working hard to
reduce the volume of problem records that are open, the improvement to
the users may be negligible.

Why will all the work possibly fail to deliver? Because the IT areas
will be working toward their targets and picking off the simple, quick
fixes that will drive down the number of outstanding problem records
they are managing.

It is at this point that the missing link becomes clear: the target
takes no account of the problem records that the business needs and
wants fixing. The business could well be perfectly happy to have the
target changed to 1700 outstanding records, providing the 100 problem
records that are fixed are the ones that are costing their business the
most.

Fix those key records and you improve service; miss them and service
remains as it always was, no matter how much effort is made and how
many problem records you close off.
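The effect of the problem-record target described above can be shown with a small sketch. Everything in it is assumed for illustration: the `ProblemRecord` fields, the record references and the effort and cost figures are hypothetical, and a real organization would need the business to supply its own impact figures. It compares closing the quickest fixes (which moves the count) with closing the records that cost the business the most (which moves the service).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemRecord:
    ref: str
    fix_effort_days: float        # hypothetical estimate from the IT area
    monthly_business_cost: float  # hypothetical figure supplied by the business


def quickest_fixes(records, n):
    """The n records that are cheapest for IT to close (count-driven target)."""
    return sorted(records, key=lambda r: r.fix_effort_days)[:n]


def costliest_to_business(records, n):
    """The n records that hurt the business most (impact-driven target)."""
    return sorted(records, key=lambda r: r.monthly_business_cost, reverse=True)[:n]


def remaining_cost(records, closed):
    """Monthly business cost still being incurred after `closed` are fixed."""
    closed_refs = {r.ref for r in closed}
    return sum(r.monthly_business_cost for r in records if r.ref not in closed_refs)


# Illustrative data only.
RECORDS = [
    ProblemRecord("PR-1", 1, 100),
    ProblemRecord("PR-2", 1, 50),
    ProblemRecord("PR-3", 2, 200),
    ProblemRecord("PR-4", 2, 150),
    ProblemRecord("PR-5", 20, 9000),
    ProblemRecord("PR-6", 15, 7000),
]

quick = quickest_fixes(RECORDS, 4)
costly = costliest_to_business(RECORDS, 2)

print("closed 4 quick fixes, cost left:", remaining_cost(RECORDS, quick))
print("closed 2 costly records, cost left:", remaining_cost(RECORDS, costly))
```

In this made-up data set, closing four records looks better against a volume target than closing two, yet leaves almost all of the business cost in place.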
It’s What You Can’t See That Does the Damage

Even when IT areas focus on the known problems, there are often
significant issues lurking below the surface that the users haven’t
reported. Without knowledge of these hidden issues the IT areas can
easily put a significant amount of resource and effort into fixing the
wrong issues.

To understand this means building an understanding of why users stop
reporting issues that actually impact their ability to work as
effectively as possible.

There are two main reasons. Firstly, they learn the workarounds.
Secondly, if their initial calls appear to get no response, they start
to accept a lower level of service.

The other side of that is that when the calls stop coming in, the IT
area thinks the issue has gone away when in fact it has merely moved
out of sight. In extreme cases the IT areas may try to claim credit for
the reduction in calls.

If the IT areas are to fix the right issues then a way needs to be
found to ensure that all the issues are called in. This is the point
where it becomes apparent that you can’t work on one part of the
process in isolation.

So how can we start to get things to match the users’ real life
experience? The first step is to make sure that we obtain all the
relevant data the first time the call is logged, to allow effective
investigation. This needs to be coupled with a standard, repeatable
incident logging process that allows multiple incidents to be collated
and linked.

But the help desk is driven by time targets, so we’re back to targets
and behaviors again. The IT help desk needs to change its targets,
extending the incident handling time to allow all relevant information
to be collected as the incident is occurring.

This way all the required information is obtained, all calls are logged
and the right problems get fixed, as the business adds its impact to
each problem.

This does take longer and may require more resource initially; however,
this is short term pain for long term gain.

As the right problems are identified and fixes are delivered to the
problems that matter to the business, the users are able to operate
more effectively. The volume of incidents called in will decline,
reducing the impact on the help desk.

Making Sure We See the Whole Picture

Once the users have been encouraged to start reporting all the faults,
because they are confident that something will be done when they call
in, the next step is to ensure that the correct issues are addressed.

In order to do that it is necessary to take a step back to see the
whole picture. This means that in order to understand what’s going on
we need to obtain as much information as possible and not jump to
conclusions as soon as we have the first piece of data to hand.

Even when you think you know, ask some questions just to make sure you
really understand what is going on, so that you are able to address the
root cause and not just remove the symptoms in the short term.

The danger with performance reporting is that once we have the first
piece of monitoring in place and the reports set up, we try to make the
one measure work for everything. We fill in the blanks with what
limited knowledge we have and guess at other areas.

Fully understanding what’s happening requires a structured questioning
approach linked to sound technical knowledge. The answers to the
questions can come from users or from monitoring and alerting tools:
anything that allows a detailed picture to be built up before actions
are taken.

Without this full picture, how can we start to improve the service we
deliver?

Starting to make things work

Can we get this service level management process to work effectively?

Yes we can, but the areas must work together in partnership. Technical
areas need to understand the business areas’ drivers and limitations,
working closely with them to understand the impact of any changes they
make.

In the same way the business must keep in contact with the technical
areas and ensure they are informed of any changes to working practices
that may impact the technology.

This means letting the technical areas know about all the changes the
business makes, so any potential impact can be tested before
implementation.

It will also require some fundamental changes in the way the
organization thinks. It must move from “I have a piece of software, now
what can I do with it?” to “this is what the client wants, how can I
get that information for them?”
Working in this way will take longer to set up and may require the use
of a number of different pieces of software, but once in place it will
start to deliver effective results.

Outside of pure data capture and reporting, make sure that the systems
are built with resilience as standard, and then monitor them to make
sure that the resilience is maintained throughout the lifetime of the
application.

So often systems are set up with resilience, perhaps two servers. Over
time the capacity requirements mean that they are constantly running
one server at full capacity and the other at half capacity, so if we
lose one server we have no resilience left and service is lost.

Working together means making sure everyone understands what has been
agreed. Much like service levels, it’s not complex and it shouldn’t be
difficult, but it must be understandable and flexible.

Relationships and more

Building effective relationships is an important part of getting this
right, but relationships alone are not going to get the job done. IT
areas also need to understand what they want to measure, and the impact
this will have on the users and the business.

It’s about identifying what causes the pain and what can be lived with.
This isn’t about fixing everything; it’s not about spending $20,000 to
fix a $100 problem. It’s about finding and fixing the right issues.

The other angle is to establish what it is that we want to measure and
report. Do we want to measure availability, usability or both?

Once the measures have been decided it’s time to find the right tool to
do the job. This may mean having different specialized tool sets for
each task, but if you want it done right you have to pay for quality.

This again has a high up front cost, but the payback is quick as the
right issues are highlighted and fixed.

Once everything is in place and the reports are being produced and
distributed, there is a further item that needs monitoring: do the
reports cause a reaction from the recipients? If not, then why are they
being produced?

Keeping it new and alive

Having measures in place is one thing; making sure that people pay
attention to them is the harder part.

One of the best solutions is to keep changing the measures to make sure
that people don’t get complacent and that people don’t start “playing
the system”.

Targets need to be changed to ensure that we are always looking to find
where we need to concentrate our resources to improve service.

If reports are green time after time then by all means continue to
collect the data, but don’t report it time after time. Look for the
areas of client impact and report those; this will get a response. A
green report often just gets filed because at first glance it requires
no action.

Know the limitations

Whenever you prepare reports, and when you receive them, it is vital
that you fully understand any limitations around the data that you are
working with or viewing.

Even when you think you know what the figures are telling you, if the
producers haven’t documented any limitations then challenge them. It
could be that a significant criterion has been missed, such as the
figures only including specific types of incidents.

Equally important is the need to know if the data you have is the base
data or if someone has already manipulated it before you start working
with it. This can have a significant impact on the end result of your
reports. If the data has already been manipulated in order to paint a
particular picture, then anything you do afterwards will not
necessarily show a true and honest picture of the service.

If you know that data has been manipulated before you receive it, go
back to the source and understand what they have done and why. They may
have already done part of your job for you, or alternatively you may
have to go back to the original data source and start from scratch.

When producing reports always challenge the data you are using and the
results you produce; think what the recipient may ask and answer those
questions yourself. The more you understand, the better the resulting
report and the better the relationship between you and the client will
become.

When we understand the limitations around the data that we use, the
next step is to decide whether, as an organization, we are prepared to
invest in processes and people to ensure that the right data is
collected in the right format at the earliest possible opportunity.

In conclusion

• Listen to the users
• Understand the business
• Identify what the client needs
• Identify what you need
• Continually review the requirements
• Build partnerships not conflicts

3 Severity 1 incident definition: a direct threat of, or damage to, the
reputation or credibility of the group. Multiple lines of business or
locations critically affected.
Where do we go from here?

What are the next steps to take in order to deliver
effective measurements, moving our reporting even
closer in line with the client’s requirements?

One area that we’d like to pursue is the idea of flexible
service levels throughout the day, so that business critical
times and non critical times can have different
availability measures.
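One way the idea of flexible service levels through the day might be expressed is sketched below. It is an illustration under stated assumptions, not an agreed reporting model: the `Band` structure, the three time bands, their target percentages and the single outage are all hypothetical. Each band of the day carries its own target, and each outage is charged only to the bands it overlaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    name: str
    start_min: int     # minutes after midnight, inclusive
    end_min: int       # minutes after midnight, exclusive
    target_pct: float  # hypothetical availability target for this band


def overlap_minutes(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open minute ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def band_report(bands, outages):
    """Return (band name, availability %, target met?) for each band.

    `outages` is a list of (start_min, end_min) pairs within one day.
    """
    rows = []
    for band in bands:
        lost = sum(
            overlap_minutes(band.start_min, band.end_min, start, end)
            for start, end in outages
        )
        length = band.end_min - band.start_min
        pct = 100.0 * (1.0 - lost / length)
        rows.append((band.name, pct, pct >= band.target_pct))
    return rows


# Illustrative bands and targets only.
BANDS = [
    Band("09:00-10:00", 9 * 60, 10 * 60, 98.0),
    Band("10:00-13:00 (critical)", 10 * 60, 13 * 60, 99.5),
    Band("13:00-17:00", 13 * 60, 17 * 60, 98.0),
]

# One assumed 30 minute outage from 10:30 to 11:00.
OUTAGES = [(10 * 60 + 30, 11 * 60)]

rows = band_report(BANDS, OUTAGES)
for name, pct, met in rows:
    status = "green" if met else "red"
    print(f"{name}: {pct:.2f}% ({status})")
```

Reported this way, an outage inside the critical session shows red in that band even though the same outage would pass against a single whole-day target.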

Another is convincing the client and the Service Management
Teams to amend the style of reporting so that, rather than
reporting only on all that’s well, we also start
proactively highlighting areas that are starting to
require attention.

Sample Incident Severity Definitions

Severity 1
A direct threat of damage to the image, reputation or
credibility of the group. Multiple lines of business or
locations critically affected.

Severity 2
Significant degradation or outage affecting a line of
business key services or locations

Severity 3
Minor degradation to a key service, business process or
location or a more severe degradation or outage to a
non critical service, business process or location

Severity 4
Small issue with localized scope typically affecting a
single user. Can either be tolerated or worked around
for an extended period of time due to its limited impact

Caveats & Disclaimers

All trademarks and copyrights are acknowledged.

Views and opinions expressed in this paper are those
of the author and do not necessarily reflect those of
Barclays Bank PLC in general.

Bibliography

OGS Service Handbook
Version 2, 2007

1 OGC Service Delivery Handbook
2 An incident that impacts all users of a particular service

2007 Cmg Paper

  • 1. . UNDERNEATH THE SPIN A PRACTICAL LOOK AT SERVICE LEVELS (IS WHAT YOU SEE WHAT YOU GET?) Malcolm Gunn Service Availability Management Consultant Barclays Bank Plc TEL +44(0)7966224346 E-Mail:malcolm.gunn@barclays.com Abstract This paper looks at the gap between the service level figures provide by the technical teams and the users actual experience. By using real life examples and reporting models, it examines how the gap between reporting and user experience has evolved, as well as highlighting potential errors that can be avoided. The paper finishes by looking at what steps can be taken in order to ensure that service level reports move closer to matching the user experience and the next stages in reporting development. Introduction Service Levels The setting of service levels appears easy but getting The OGS handbook states, those levels to reflect the flexible requirements of the client is something that seems almost alien to Service “The improvements in service quality and the Management. reduction in service disruption that can be achieved through effective Service Level Management (SLM) Service levels are in common use across can ultimately lead to significant financial savings. organizations as IT areas endeavor to show how Less time and effort is spent by IT staff in resolving effective they are at delivering their clients services. fewer failures and IT Customers are able to perform 1 They’re not new and they’re not complex but in many their business functions without adverse impact.” cases neither are they an accurate reflection of the user experience With such a clear definition showing what effective service level management can deliver it would be Even the basics can be a challenge; figures are realistic to expect that the delivery of service level normally based around simple easy to report reporting to be clear cut, simply defined and a high measures. Often we use the wrong figures for the priority. 
In practice the choice of service levels their wrong measure, for example Incident volumes to monitoring and reporting are often far harder to show availability. establish and deliver in a meaningful manner. Remember just because something is available The problem often lies in how and why the service doesn’t mean it’s usable. What’s needed is the ability levels are agreed and measured as with most to make the levels flexible enough to meet highs and theoretical processes the practical implementation is lows in demand. Levels need to be set in such a way not always as simple as it sounds. that they allow the client to do their job as effectively as possible. If the implementation is not correctly thought out or the delivery of the service level reporting isn’t clearly This paper will look at how easily the gap between the defined a gap will quickly develop between the theoretical figures and real life performance can start Service Level Agreement (SLA) and the user to widen until the two bare no relationship to each experience, other. The gap between users and the IT area could equal In extreme cases the service level reporting will show be said to be the gap between the business owners service performance working within agreed perspective and the users within that business area. parameters, when in reality the application is unusable. This is true as often the business areas as a whole can be unaware of the true performance of the IT they are being supplied with. This is because they are relying on the data supplied by the IT areas to make
  • 2. their judgment.

Whilst these areas are responsible for the business users of the applications, they are not, and nor should they be expected to be, aware of the actual day to day mechanics of how the applications work. They are after all buying a service from the IT area and they expect it to work correctly.

It is because reporting is often reviewed at such a high level that the mismatches fail to be noticed, so the IT areas and the business groups both believe they have put something in place that will allow them to monitor and control the performance of the service.

They review the figures on a regular basis and these will show services are working satisfactorily (within documented guidelines), even if in practice the users are struggling to deliver an effective service to the organization’s customers.

So here’s a quick look at how this happens, through a sample conversation and the actions that are agreed as a result to improve performance.

Business Director: “I want improved stability”
IT Director’s solution: “I’ll reduce the number of high severity incidents” [2]
The resulting user impact: “No real change”

How can this happen?

In practice the conversation will go on for much longer and contain a lot of fine, complex words, but this is essentially what they say. Perhaps if that were all they said they might realize why things don’t turn out as expected.

As a result of this meeting much work will be done to reduce the volume of high severity incidents, but nothing will have changed for the user. This is because, although at first reading it looks as though the two areas were talking about the same thing, they are in fact not even close to talking about the same thing.

One is talking about business impact, the other about a particular type of incident and its volume.

The causes of this are many and varied but the impact can be clear and damaging for the business. Imagine a user facing a business customer who requires a PC to do their job, against which they are measured for sales and contacts.

Then if that PC is unavailable they are unable to operate, which results in missed targets, and ultimately they receive a lower pay package and bonus.

They may be fully capable of performing at a higher level but the IT infrastructure has held them back. If the infrastructure and delivery are wrong this will have a significant impact on the business’s ability to retain top quality staff, and eventually to recruit the best, as word will travel around the user community.

So poor IT will impact the bottom line. This means that cost savings in the IT area may have an impact on the company’s bottom line, and it may not be the one the company expected...

The danger in using incident volumes in this scenario is that companies will often address the severity 1 symptoms, removing the high severity incidents from the reports, but they don’t fix the underlying root cause. This type of target driven management will work to mask issues that are then waiting in the wings to return and cause more damage later.

Looking at service management information requires the recipient to understand and to question, to ensure that what they are seeing is what they think they are seeing. Sometimes just checking the data behind the figures can reveal what’s really going on.

Does What We Document Show What We Mean

Does the information we see make sense, or is more information required to make the judgment call?

3+2=11

At first glance the sum appears to be wrong. However, if one extra piece of information is supplied and the recipient has the right background knowledge, the sum makes perfect sense.

The missing piece of information is that the sum is working in base 4 (five is written as one four and one unit, “11”), and suddenly it all makes sense.

The same principle applies to service level reports. Whilst they may appear to be showing a particular version of the truth, unless the recipient has the required background knowledge and is provided with enough relevant information they may interpret the reports in a completely different way from that which was intended. This doesn’t mean that the report will be giving an untrue picture, but it may be a narrow view of the truth.

Whenever you send or, perhaps more importantly, receive reports, challenge whether they mean what they appear to mean. The copies of the reports on the
  • 3. next pages illustrate this: at first glance they all appear to be showing stable and available services. Closer examination will show that perhaps all is not as it seems.

Do We Really Have a Stable Service

There are times when reports are produced for a specific purpose. In this case the purpose was to show the improvement in the stability of the services delivered by the IT area of an organization.

[Figure: one-month calendar, Mon to Sun, days 1 to 31, with each day marked by a green tick or a red dot. The symbols did not survive extraction.]

The only background information that was supplied was a description of the symbols:
• Green tick means no new severity 1 incident [3]
• Red dot means a new severity 1 incident

The chart is supplied with a table of incident volumes.

                          Sept   Oct   Nov   Dec   Jan   Feb
Severity 1 green days        6     5     2     1     7     4
Severity 1 incidents        66   108   131   104    82    72
Severity 2 incidents      2086  2075  2120  1966  2004  1924

                           Mar   Apr   May   Jun   Jul   Aug
Severity 1 green days        3     7    15    17    20    21
Severity 1 incidents        88    58    31    17    15    11
Severity 2 incidents      1891  1491  1489  1447  1765  1720

At first it appears to show an improvement in the stability of the service: incident volumes have decreased and the number of green days (days with no new severity 1 incident starting) has increased from 6 to 21, almost a fourfold improvement.

Whilst the area producing the report may know the message it wants to deliver, will the recipient understand what they are being told? Or, in some cases, does the area producing the report understand what it has delivered?

So what is this report actually telling us? What the recipient of the report needs to know is what success we are showing here. Does it mean we have run without a major incident for 21 days out of 30?

In this case the red dots only indicate when a severity 1 incident starts. An incident could run for 2, 3 or 4 days but only show up on its first day. So a green day could actually have one or more severity 1 incidents running, which may not be what the recipient wants to see.

As with all incidents, it’s not just how long they run but the damage they cause that the client is interested in. The danger with the IT community is that we focus on volumes rather than impact. Plotting incident volumes does not tell you the impact on the client; it just tells you the number of incidents that have been logged with the help desk. That may be a shock to some organizations that spend a lot of time trending incident volumes.

Reporting and trending incident volumes isn’t wrong, and it is often a good indicator of where potential issues may be impacting the client, but in order to be effective even at this basic level some further categorization will be required.

Unless the data can be split into useful categories, such as specific services or types of infrastructure, the figures will not show anything other than a total number of calls logged.

It’s not clear from this report which month actually did the most damage to the client, as there is no indication of the duration or, more importantly, the impact on the client.

This report gives the impression of a stable environment because it focuses on the high profile incidents, but how many lower severity incidents are eating away at the available up time? This looks good but shows you almost no usable facts.

From a client viewpoint they may be quite happy to have high severity incidents on some services, but for specific services that are critical to them they may not tolerate any outages during their online day.

Whenever reports are produced the area responsible must review them and understand what the report is actually telling the audience. They may know what they mean to say, but will it be clear to the recipients?

Any data should reflect the true position. It’s very easy to produce nice pictures and graphs, but if that’s all they are then they are useless and a waste of time and resources.
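The difference between “no new severity 1 incident started today” and “no severity 1 incident was running today” can be made concrete with a short sketch. The incident dates below are invented for illustration and are not taken from the report; the function names are likewise hypothetical.

```python
from datetime import date, timedelta

# Hypothetical severity 1 incidents as (start, end) dates, both inclusive.
INCIDENTS = [
    (date(2007, 8, 1), date(2007, 8, 4)),
    (date(2007, 8, 10), date(2007, 8, 12)),
    (date(2007, 8, 20), date(2007, 8, 20)),
]


def days_in_month(year, month):
    """Return every calendar date in the given month."""
    first = date(year, month, 1)
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    return [first + timedelta(days=i) for i in range((next_first - first).days)]


def green_days_by_start(incidents, days):
    """Days on which no incident STARTED (the definition used by the report)."""
    starts = {start for start, _ in incidents}
    return [d for d in days if d not in starts]


def green_days_by_running(incidents, days):
    """Days on which no incident was RUNNING at all."""
    return [d for d in days if not any(s <= d <= e for s, e in incidents)]


days = days_in_month(2007, 8)
by_start = green_days_by_start(INCIDENTS, days)
by_running = green_days_by_running(INCIDENTS, days)
print(f"green days counted by incident start: {len(by_start)}")
print(f"green days with nothing running:      {len(by_running)}")
```

With these made-up incidents the start-based count reports 28 green days, while only 23 days were actually free of a running severity 1 incident. The same data supports both figures; only the definition changes.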
  • 4. There are a number of lessons to be learnt from this type of reporting.

1. Incident volume alone, even if it lists all incidents, is not a measure of service availability or stability; it’s a measure of incident volumes.
2. Reporting that focuses on a limited selection of severities gives no meaningful data on the client’s services.
3. In this instance there is no indication that we have more than two severities, and if we do, the report fails to show how many low severity, high volume incidents are really going on behind the scenes.
4. Without some form of user impact measure, such as lost business hours, the impact on the end user is not clear.

What Does Our Dashboard Really Say?

Dashboards are used across a wide cross section of organizations. Some of these are real time and some are historic. Even though they are in common use, the message they are trying to portray needs to be clearly defined.

When a person buys a car, would they settle for a dashboard that told them the car was moving but gave no indication of the speed, revs, temperature or oil pressure? Would they accept a single warning light saying “fault”? It’s unlikely, yet that is exactly what companies often accept within the service management world.

[Figure: “TOP 10 SERVICE” dashboard with one colour-coded tile for each of Service 1 to Service 10, each tile showing the last incident date and summary.]

This is a dashboard that I’m sure everyone in the particular organization that uses it is very proud of. It is a copy of a service dashboard that is in place and projected across multiple sites on plasma screens, and when everything is green everyone looks very happy.

But what is it really saying? To understand that we need to look behind the scenes at what goes together to make up the color coding. Once we can see that, we can understand whether everyone who sees this should be smiling or crying.

In this particular case the dashboard is manually updated, so whilst it appears to be in real time it’s running behind, and the delay will be variable. So immediately it’s no use as a service dashboard and no one should have any confidence that it’s showing a true picture of the client experience.

To use the car speedometer analogy, this is showing a speed that you were traveling at, at some point in the past, although we can’t say how long ago you were traveling at that speed.

Once again this dashboard is driven by the fixation with severity one incidents, so it’s not even taking into account the full user experience.

Armed with these basic background facts it becomes clear that the people going around with a smile on their face because the dashboard is green are working under an illusion, and at any minute a reality that they have no idea about could come crashing in.

Is the Availability Figure All It Seems

It’s clear that when working with incident volumes a significant amount of ambiguity can be introduced. When organizations start looking at availability figures the scope for manipulation increases. This brings with it increased danger of misinterpretation and misunderstanding.

The first point to remember with availability figures is that available means just what it says. What organizations need to remember is that just because something is available doesn’t mean it’s usable or fit for purpose.

So if you have an availability figure you need to have some supporting measures that show how the application is performing.

An application that shows as available may not be usable for any number of reasons: it might be running too slowly due to server capacity, or the network links may not be working so no one can actually use the application.

When an organization puts service management in place they always want some SLAs implemented quickly. Normally these are built around an availability figure, partly because it’s easy to measure. The required level that is often requested is 99%. How this will be measured and why this figure is arrived at are often a mystery, and remain so to those involved, but over time they become ingrained in the organization’s culture.

The 99% availability figure is often requested because the client really wants to say they want it available all the time they need it (100%) but are too polite, or too scared because of the potential cost, to ask, so they
  • 5. use 99% as an acceptable alternative.

The technical areas are too polite to challenge the requirement: why do you need this level of availability, and what will it allow you to deliver?

In theory the setting of availability levels is the first step in a process that expects to increase and improve the number of measures that will sit behind the availability measure going forward.

Normally it’s the number of measures that’s focused on, not the quality of those measures or the impact on the users.

The other thing that happens is that, whilst the intention is there to add to the measures, in practice nothing is actually put in place and organizations are left with an availability figure that is used as the only measure of the service. Then, because it’s all they have, they try to use it to measure everything.

Even when we look at availability figures, whilst 99% is the target we need to understand the business requirements to establish if this is required and how it will be measured. It should be simple, but when you measure availability there are a number of questions that need to be asked: do we take 24 hours a day, or do we look at the hours that users are in the office or on the production line? It seems obvious, but if it’s not clearly defined then the figures can be badly out of line with reality.

A 30 minute outage over a 24 hour period gives 97.9% availability.

The same 30 minutes over a 9 to 5 working day gives 93.75% availability.

If there is a business critical session, say 10am till 1pm, and the outage falls within it, then we only have 83.3% availability for this particular part of the day.

So it’s possible to have a report showing high availability when we actually lost nearly 17% of the available service time during our critical business hours.

It all depends on how you take the data, which area you work in, what you want to show and how good your relationship with the users and your understanding of their requirements really are, coupled with the honesty to tell it as it is.

That’s just a basic availability figure using a full outage and three options. How complex could this get when dealing with an outage impacting a limited number of users?

Just from these figures we can see that the service level can show the service as available while the user experience is frustrating, especially if the users are in sales related roles that pay commission, because their potential earning time has been reduced.

Looking at the availability report below, there are some interesting deviations which should lead the recipient to question whether the reporting has been manipulated. This is a real example of an availability report, and for the two years it was used no one challenged the data.

Service  Potential availability  Availability target  Fully available  Infrastructure partially available
A        27478380 mins           99%                  99.01%           99.8%
B        27478380 mins           99%                  95.18%           100.00%
C        43200 mins                                   100.00%          100.00%
D        358 hrs                                      97.24%           99.32%
E        720 hrs                                      100.00%          100.00%
F        176 hrs                                      100.00%          100.00%
G        455:30 hrs              99.2%                100.00%          100.00%
H        720 hrs                                      100.00%          100.00%
I        176 hrs                 98%                  99.21%           100.00%
J        1795200 mins                                 100.00%          100.00%

There is a reason that some of the services are reported in minutes and some in hours, and that is to ensure that the report showed green as much as possible. The same is true of the partially available figure.

There were a number of options available to show this data.

We could have measured the length of the incident against a 24 hour period or against the working day. Just measuring incident length against the working day meant the figures stayed red. Using a 24 hour period would be too obvious.

So the calculation was based on the minutes in the online day (480), multiplied by the days in the month that the devices were due to be operational (say 23), multiplied by the number of devices in the network (approximately 2489). This meant spending time manually reviewing all incidents, looking for phrases such as “50% of devices unavailable”, in order to find out how many devices may still have been working during any incident.

Using this method it was often possible to get the report showing green.

It could be argued that this was an accurate reflection of the overall position even if the drivers were completely wrong. In fact this report started to hide significant hardware issues within the end to end design that were having a major impact on the client’s
  • 6. ability to work effectively.

From these examples it can be seen that it is easy for the gap between the IT area’s view of the world and the user experience to grow into a chasm very quickly. If the communication lines are not set up correctly it can take a long time before the gap is identified, and then even longer before actions are taken to improve things from a reporting perspective and, more importantly, from a user experience perspective.

Knowing who the “User” is makes a Difference

It may seem obvious, but it’s worth stating what is meant by a user. “User” can mean different things to different people and areas, and being clear about it can help us understand where the gap between theory and reality starts.

Regardless of whether the IT is in house or outsourced, a user is the person sat in front of the PC. It is not the business area that they work in.

It is the failure to engage all the way down the supply chain that often leads to targets being defined at the wrong level. When the engagement stops at the high or mid tier management there is a greater chance that targets will be set that can be met but will not provide the users with the level of service they need.

Whilst working at that level isn’t in itself wrong, it does start to bring in areas of uncertainty. The managers working at those higher levels own the area, but they are often unaware in any great depth of the tasks completed by the area or the challenges facing the users on the shop floor. Neither should we expect them to have a micro management view.

Operating at that higher level it is possible to set up any number of metrics. However, if these are not set in conjunction with shop floor users and checked on a regular basis they can be meaningless from the day they are first produced.

Sadly, in many cases the fact that they are unfit for purpose is not understood, and the organization uses them as measures and often holds them up as proof that the IT systems are delivering high quality service.

What’s really sad is that the IT areas and the business unit managers both think things are performing satisfactorily, whilst users struggle to do their job.

To break this mould and identify this kind of management information (MI) requires both the IT area and the client to challenge the information they are using and develop methods to ensure the information being provided matches the users’ real day to day experience.

In large organizations this may require the development of two separate reporting lines, one from the IT area and the other from the business area. These then need to be reviewed to identify variations. Once any differences have been identified, the causes need to be understood and used to develop an accurate single view.

This approach, although initially painful, will eventually lead to a closer working relationship and drive a more open and honest debate from both sides, allowing an accurate picture of service availability to be supplied that all areas recognize.

Lesson about Targets Learnt the Hard Way

The availability figures show green
The response times are green
But the service is a disaster

How can all the statements be true? The MI shows green, so everything must be working?

This highlights the problems that arise when MI production is based around incorrect assumptions. Of those, perhaps the biggest and simplest mistake is to fail to remain engaged with the client and so be unaware of changes in the client’s working patterns.

Even if the pitfalls highlighted earlier around lack of user input are avoided, once everything is in place, processes must be implemented to remain closely linked to the client in order to understand when their requirements change. The relationship will also be required as the reporting measures are refined to add improved value to the client area.

Once availability figures are in place and acceptable the next step is to move on to the performance measures, and that’s when the fun really starts as we try to align the figures with the user experience. As with all measures, once they are in place the trick is to ensure they remain meaningful.

When response times are initially fixed they are often picked to give a certain, and in most cases significant, margin for error.

Often a figure is chosen that may not be operationally acceptable but is signed up to on the basis that at the time of signing the application was performing at a considerably better level than the measure.

Setting figures in such a way leaves the gate open for significant deteriorations in performance to take place whilst the figures still show the application
  • 7. performing within agreed guidelines.

In simple terms, we sign up to a response time of 15 seconds per transaction at a time when the application works at 1 to 2 seconds per transaction.

An upper level for users to work effectively is 5 seconds per transaction, so the transaction response time can get to twice the usable limit and still meet the agreed management figures.

Whilst this may appear outrageous, it has happened in the past and it will happen in the future. The fact that this happens is normally because engagement has not been made at the correct level to understand the response times that are required to allow the application to be usable.

The service may well run satisfactorily at first. However, eventually pressure starts to build: either more users are added or the functionality of the application is changed. Then from a user perspective the application starts to slow down, but the measure still shows things working within agreed levels.

The other failure is the failure to continue to review and revise the levels that are required by users. An application that is initially used in a non business critical area is later used for a business critical task. In this instance a response time of 30 seconds may have been acceptable when it was a non critical function.

Once the application moves to be a business critical function the response time may no longer be acceptable, although once again the metrics being reported will still show the application functioning as agreed.

Another lesson around targets and their use focuses on the fixation that some organizations have with high severity incidents. As we saw with the conversation between the business and the IT area, it’s very easy to come up with a target and a resolution that look as if they match when in fact they miss each other completely.

For a start, we are measuring incident volumes, not user impact, so a reduction in incidents may or may not result in less outage. If we have fewer incidents but they run longer then we may not have had an impact on the end users at all. In order to match the client requirements we have to be measuring the right thing in the first place: outage time and user impact.

The IT areas producing the service metrics must understand the issues that cause the most damage to the business. These may not always be the high profile, high severity incidents.

It is these high profile incidents that drive a reaction from the senior management within a business area, but addressing them may not solve the issue for the users on the ground. If the actions don’t fix the issues for the users on the ground then they will not have fixed the issues for the senior management within the business; they will only have masked the symptoms.

Targets Driving the Wrong Behaviors

Once you have everything in place, the setting and agreeing of targets is equally important to ensure that you meet user requirements. The danger lies in targets such as the one below, versions of which are, and will continue to be, used in organizations across the globe with the intention of improving performance.

CURRENT STATE
We have 1800 outstanding problem records as at 01/06/07

TARGET
To reduce this figure to 600 by the end of the year

This widely used target is a classic example of the mismatch: IT targets drive the IT areas to deliver whilst they believe they have made an improvement to the user experience.

At first glance this is an admirable objective and will improve the level of service being supplied. However, when you look deeper at the target and start to understand what behaviors it will generate, it becomes clear that whilst the IT support teams will be working hard to reduce the volume of problem records that are open, the improvement to the users may be negligible.

Why might all the work fail to deliver? Because the IT areas will be working toward their targets and making those simple quick fixes that will drive down the number of outstanding problem records they are managing.

It is at this point that the missing link becomes clear: the target takes no account of the problem records that the business needs and wants fixed. The business could well be perfectly happy to have the target relaxed to 1700 outstanding records, provided the 100 problem records that are fixed are the ones that are costing the business the most.

Fix those key records and you improve service; miss them and service remains as it always was, no matter how much effort is made and how many problem records you close off.
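The behavior that a pure volume target encourages can be illustrated with a small sketch. Everything in it is an assumption made for the example: the five problem records, their effort estimates and their monthly business cost are invented, and the source gives no such figures. It compares spending the same effort budget on the quickest fixes (which serves the volume target) with spending it on the most costly records (which serves the business).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemRecord:
    ref: str
    effort_days: int    # hypothetical effort needed to fix the problem
    monthly_cost: int   # hypothetical cost to the business per month while open


# Invented sample data for illustration only.
RECORDS = [
    ProblemRecord("PR-1", 1, 100),
    ProblemRecord("PR-2", 1, 50),
    ProblemRecord("PR-3", 2, 200),
    ProblemRecord("PR-4", 5, 20000),
    ProblemRecord("PR-5", 3, 8000),
]


def close_within_budget(records, budget_days, key):
    """Close records in the order given by `key` until the effort budget runs out."""
    closed, remaining = [], budget_days
    for record in sorted(records, key=key):
        if record.effort_days <= remaining:
            closed.append(record)
            remaining -= record.effort_days
    return closed


def cost_removed(closed):
    """Total monthly business cost removed by closing the given records."""
    return sum(record.monthly_cost for record in closed)


budget = 5  # days of effort available
quick_wins = close_within_budget(RECORDS, budget, key=lambda r: r.effort_days)
by_impact = close_within_budget(RECORDS, budget, key=lambda r: -r.monthly_cost)

for label, closed in (("quick wins first", quick_wins), ("highest cost first", by_impact)):
    print(f"{label}: closed {len(closed)} record(s), "
          f"monthly cost removed {cost_removed(closed)}")
```

With this invented data the quick-win ordering closes three records and removes 350 of monthly cost, while the impact ordering closes only one record and removes 20000. A target expressed purely as a count of open records rewards the first outcome.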
  • 8. It’s What You Can’t See That Does the Damage

Even when IT areas focus on the known problems there are often significant issues lurking below the surface that the users haven’t reported. Without knowledge of these hidden issues the IT areas can easily put a significant amount of resource and effort into fixing the wrong issues.

Understanding this means building an understanding of why users stop reporting issues that actually impact their ability to work as effectively as possible.

There are two main reasons. Firstly, they learn the workarounds. Secondly, if their initial calls appear to get no response they start to accept a lower level of service.

The other side of that is that when the calls stop coming in, the IT area thinks the issue has gone away when in fact it has merely moved out of sight. In extreme cases the IT areas may try to claim credit for the reduction in calls.

If the IT areas are to fix the right issues then a way needs to be found to ensure that all the issues are called in. This is the point where it becomes apparent that you can’t work on one part of the process in isolation.

So how can we start to get things to match the users’ real life experience? The first step is to make sure that we obtain all the relevant data the first time the call is logged, to allow effective investigation. This is coupled with a standard, repeatable incident logging process that allows multiple incidents to be collated and linked.

But the help desk is driven by time targets, so we’re back to targets and behaviors again. The IT help desk needs to change its targets to extend the incident handling time, to allow all relevant information to be collected as the incident is occurring.

This way all the required information is obtained, all calls are logged and the right problems get fixed, as the business adds its impact to each problem.

This does take longer and may require more resource initially; however, this is short term pain for long term gain.

As the right problems are identified and fixes are delivered for the problems that matter to the business, the users can operate more effectively. The volume of incidents called in will decline, reducing the impact on the help desk.

Making Sure We See the Whole Picture

Once the users have been encouraged to start reporting all the faults, because they are confident that something will be done when they call in, the next step is to ensure that the correct issues are addressed.

In order to do that it is necessary to take a step back to see the whole picture. This means that in order to understand what’s going on we need to obtain as much information as possible and not jump to conclusions as soon as we have the first piece of data to hand.

Even when you think you know, ask some questions just to make sure you really understand what is going on, so that you are able to address the root cause and not just remove the symptoms in the short term.

The danger with performance reporting is that once we have the first piece of monitoring in place and the reports set up, we try to make the one measure work for everything. We fill in the blanks with what limited knowledge we have and guess at other areas.

Fully understanding what’s happening requires a structured questioning approach linked to sound technical knowledge. The answers to the questions can come from users or from monitoring and alerting tools: anything that allows a detailed picture to be built up before actions are taken.

Without this full picture, how can we start to improve the service we deliver?

Starting to make things work

Can we get this service level management process to work effectively?

Yes we can, but the areas must work together in partnership. Technical areas need to understand the business areas’ drivers and limitations, working closely with them to understand the impact of any changes they make.

In the same way the business must keep in contact with the technical areas and ensure they are informed of any changes to working practices that may impact the technology.

This means letting the technical areas know about all the changes they make so any potential impact can be tested before implementation.

It will also require some fundamental changes in the way the organization thinks. It must move from “I have a piece of software, now what can I do with it?” to “this is what the client wants, how can I get that information for them?”
  • 9. Working in this way will take longer to set up and may require the use of a number of different pieces of software, but once in place it will start to deliver effective results.

Outside of pure data capture and reporting, make sure that the systems are built with resilience as standard and then monitor them to make sure that the resilience is maintained throughout the lifetime of the application.

So often systems are set up with resilience, perhaps two servers, but over time the capacity requirements mean that they are constantly running one server at full capacity and the other at half capacity, so if we lose one server we have no resilience left and service is lost.

Working together and making sure everyone understands what has been agreed is much like service levels: it’s not complex and it shouldn’t be difficult, but it must be understandable and flexible.

Relationships and more

Building effective relationships is an important part of getting this right, but relationships alone are not going to get the job done. IT areas also need to understand what they want to measure, and the impact this will have on the users and the business.

It’s about identifying what causes the pain and what can be lived with. This isn’t about fixing everything, and it’s not about spending $20,000 to fix a $100 problem. It’s about finding and fixing the right issues.

The other angle is to establish what it is that we want to measure and report. Do we want to measure availability, usability or both?

Once the measures have been decided it’s time to find the right tool to do the job. This may mean having different specialized tool sets for each task, but if you want it done right you have to pay for quality.

This again has a high up front cost, but the payback is quick as the right issues are highlighted and fixed.

Once everything is in place and the reports are being produced and distributed there is a further item that needs monitoring: do the reports cause a reaction from the recipients? If not, then why are they being produced?

Keeping it new and alive

Having measures in place is one thing; making sure that people pay attention to them is the harder part.

One of the best solutions is to keep changing the measures to make sure that people don’t get complacent and don’t start “playing the system”.

Targets need to be changed to ensure that we are always looking to find where we need to concentrate our resources to improve service.

If reports are green time after time then by all means continue to collect the data, but don’t report it time after time. Look for the areas of client impact and report those; this will get a response. A green report often just gets filed because at first glance it requires no action.

Know the limitations

Whenever you prepare reports, and when you receive them, it is vital that you fully understand any limitations around the data that you are working with or viewing.

Even when you think you know what the figures are telling you, if the producers haven’t documented any limitations then challenge them. It could be that a significant criterion has been missed, such as the figures only including specific types of incidents.

Equally important is the need to know if the data you have is the base data or if someone has already manipulated it before you start working with it. This can have a significant impact on the end result of your reports: if the data has already been manipulated in order to paint a particular picture then anything you do afterwards will not necessarily show a true and honest picture of the service.

If you know that data has been manipulated before you receive it, go back to the source and understand what they have done and why. They may have already done part of your job for you, or alternatively you may have to go back to the original data source and start from scratch.

When producing reports always challenge the data you are using and the results you produce; think about what the recipient may ask and answer those questions yourself. The more you understand, the better the resulting report and the better the relationship between you and the client will become.

When we understand the limitations around the data that we use, the next step is to decide whether, as an organization, we are prepared to invest in processes and people to ensure that the right data is collected in the right format at the earliest possible opportunity.

In conclusion
• Listen to the users
• Understand the business
• Identify what the client needs
• Identify what you need
• Continually review the requirements
• Build partnerships not conflicts

Where do we go from here?

What are the next steps to take in order to deliver effective measurements, moving our reporting even closer in line with the client’s requirements?

One area that we’d like to pursue is the idea of flexible service levels throughout the day: business critical times and non-critical times can have different availability measures.

Another is convincing the client and the Service Management Teams to amend the style of reporting so that, rather than reporting only on all that’s well, we also start proactively highlighting areas that are starting to require attention.

Sample Incident Severity Definitions

Severity 1: A direct threat of damage to the image, reputation or credibility of the group. Multiple lines of business or locations critically affected.

Severity 2: Significant degradation or outage affecting a line of business, key services or locations.

Severity 3: Minor degradation to a key service, business process or location, or a more severe degradation or outage to a non-critical service, business process or location.

Severity 4: Small issue with localized scope, typically affecting a single user. Can either be tolerated or worked around for an extended period of time due to its limited impact.

Caveats & Disclaimers

All trademarks and copyrights are acknowledged. Views and opinions expressed in this paper are those of the author and do not necessarily reflect those of Barclays Bank PLC in general.

Bibliography

OGS Service Handbook Version 2, 2007

1 OGC Service Delivery Handbook
2 An incident that impacts all users of a particular service
3 Severity 1 Incident definition: direct threat or damage to reputation or credibility of the group. Multiple lines of business or locations critically affected.
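Illustration: flexible service levels through the day

The idea of flexible service levels raised under "Where do we go from here?" can be made concrete with a small calculation. The following is only a minimal sketch, not a description of any reporting tool used by the author: the window boundaries, the targets of 99.9% and 99.0%, the sample outage and all names (Window, availability_by_window) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A reporting window within one day (all values are hypothetical)."""
    name: str
    start: int      # minutes since midnight, inclusive
    end: int        # minutes since midnight, exclusive
    target: float   # availability target in percent


def overlap(a_start, a_end, b_start, b_end):
    """Minutes shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def availability_by_window(windows, outages):
    """Return {window name: (availability %, target met?)} for one day.

    `outages` is a list of (start, end) minute pairs, assumed not to
    overlap each other.
    """
    results = {}
    for w in windows:
        length = w.end - w.start
        down = sum(overlap(w.start, w.end, s, e) for s, e in outages)
        pct = 100.0 * (length - down) / length
        results[w.name] = (round(pct, 2), pct >= w.target)
    return results


# Assumed split of the day: a business critical window with a tighter
# target, and non-critical windows with a looser one.
WINDOWS = [
    Window("overnight", 0, 480, 99.0),
    Window("business hours", 480, 1080, 99.9),
    Window("evening", 1080, 1440, 99.0),
]

# Assumed sample data: one 30-minute outage from 09:00 to 09:30.
OUTAGES = [(540, 570)]

report = availability_by_window(WINDOWS, OUTAGES)

# The single all-day figure that a conventional report would show.
total_down = sum(end - start for start, end in OUTAGES)
whole_day = round(100.0 * (1440 - total_down) / 1440, 2)

for name, (pct, met) in report.items():
    print(f"{name}: {pct}% ({'met' if met else 'missed'})")
print(f"whole day: {whole_day}%")
```

With this sample outage the whole-day figure is 97.92%, while the business hours window shows 95.0% against its target. The same incident looks milder in the single daily measure than it does in the window where the client actually felt it, which is the gap between reporting and user experience that this paper describes.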