
Creating Unified IT
Monitoring and
Management in
Your Environment
Don Jones


Introduction to Realtime Publishers
by Don Jones, Series Editor
For several years now, Realtime has produced dozens and dozens of high‐quality books
that just happen to be delivered in electronic format—at no cost to you, the reader. We’ve
made this unique publishing model work through the generous support and cooperation of
our sponsors, who agree to bear each book’s production expenses for the benefit of our
readers.
Although we’ve always offered our publications to you for free, don’t think for a moment
that quality is anything less than our top priority. My job is to make sure that our books are
as good as—and in most cases better than—any printed book that would cost you $40 or
more. Our electronic publishing model offers several advantages over printed books: You
receive chapters literally as fast as our authors produce them (hence the “realtime” aspect
of our model), and we can update chapters to reflect the latest changes in technology.
I want to point out that our books are by no means paid advertisements or white papers.
We’re an independent publishing company, and an important aspect of my job is to make
sure that our authors are free to voice their expertise and opinions without reservation or
restriction. We maintain complete editorial control of our publications, and I’m proud that
we’ve produced so many quality books over the past years.
I want to extend an invitation to visit us at http://nexus.realtimepublishers.com, especially
if you’ve received this publication from a friend or colleague. We have a wide variety of
additional books on a range of topics, and you’re sure to find something that’s of interest to
you—and it won’t cost you a thing. We hope you’ll continue to come to Realtime for your
educational needs far into the future.
Until then, enjoy.
Don Jones


Introduction to Realtime Publishers
Chapter 1: Managing Your IT Environment: Four Things You’re Doing Wrong
IT Management: How We Got to Where We Are Today
Problem 1: You’re Managing IT in Silos
Problem 2: You Aren’t Connecting Your Users, Service Desk, and IT Management
Problem 3: You’re Measuring the Wrong Things
Problem 4: You’re Losing Knowledge
How Truly Unified Management Can Fix the Problems
Summary
Chapter 2: Eliminating the Silos in IT Management
Too Many Tools Means Too Few Solutions
Domain‐Specific Tools Don’t Facilitate Cooperation
The Cloud Question: Unifying On‐Premise and Off‐Premise Monitoring
Missing Pieces
Not All of IT Is a Problem: Ordering, Routing, and Providing Services
Coming Up Next…
Chapter 3: Connecting Everyone to the IT Management Loop
Starting the Loop: Connecting Monitoring to the Service Desk
Making Changes: How to Find a Change Management Window
Communicating: How to Bring Users into the Loop
SLAs: Setting and Meeting Realistic Expectations
Tell Me What You Really Think
When Everyone Doesn’t Need to See Everything: A Multi‐Tenant Approach
Call It a Private Management Cloud: Allocating Costs
Conclusion
Coming Up Next…
Chapter 4: Monitoring: Look Outside the Data Center
Monitoring Technical Counters vs. the End‐User Experience
How the EUE Drives Better SLAs
How It’s Done: Synthetic Transactions, Transaction Tracking, and More
Top‐Down Monitoring: From the EUE to the Root Problem
Agent vs. Agentless Monitoring
Monitoring What Isn’t Yours
Critical Capability: You Need to Monitor Everything
Conclusion
Coming Up Next…
Chapter 5: Turning Problems into Solutions
Closing the Loop: Connecting the Service Desk to Monitoring
Retaining Knowledge Means Faster Future Resolution
Knowledge Bases
Tickets as Knowledge Base Articles
Unifying the Knowledge Base
Making Tickets an Asset
Past Performance Is an Indication of Future Results
It’s the Performance Database
Summary
Coming Up Next…
Chapter 6: Unified Management, Illustrated
The Case Studies
Detecting and Solving Problems
Fulfilling User Orders
A Shopping List for Unified IT Management
Ways to Buy Your Unified IT
Conclusion


Copyright Statement
© 2012 Realtime Publishers. All rights reserved. This site contains materials that have
been created, developed, or commissioned by, and published with the permission of,
Realtime Publishers (the “Materials”) and this site and any such Materials are protected
by international copyright and trademark laws.
THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice
and do not represent a commitment on the part of Realtime Publishers or its web site
sponsors. In no event shall Realtime Publishers or its web site sponsors be held liable for
technical or editorial errors or omissions contained in the Materials, including without
limitation, for any direct, indirect, incidental, special, exemplary or consequential
damages whatsoever resulting from the use of any information contained in the Materials.
The Materials (including but not limited to the text, images, audio, and/or video) may not
be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any
way, in whole or in part, except that one copy may be downloaded for your personal, non-
commercial use on a single computer. In connection with such use, you may not modify
or obscure any copyright or other proprietary notice.
The Materials may contain trademarks, services marks and logos that are the property of
third parties. You are not permitted to use these trademarks, services marks or logos
without prior written consent of such third parties.
Realtime Publishers and the Realtime Publishers logo are registered in the US Patent &
Trademark Office. All other product or service names are the property of their respective
owners.
If you have any questions about these terms, or if you would like information about
licensing materials from Realtime Publishers, please contact us via e-mail at
info@realtimepublishers.com.


Chapter 1: Managing Your IT Environment:
Four Things You’re Doing Wrong
At the very start of the IT industry, “monitoring” meant having a guy wander around inside
the mainframe looking for burnt‐out vacuum tubes. There wasn’t really a way to locate the
tubes that were working a bit harder than they were designed for, so monitoring—such as
it was—was an entirely reactive affair.
In those days, the “Help desk” was probably that same guy answering the phone when one
of the other dozen or so “computer people” needed a hand feeding punch cards into a
hopper, tracking down a burnt‐out tube, and so on. The concepts of tickets, knowledge
bases, service level agreements (SLAs), and so forth hadn’t yet been invented.
IT management has certainly evolved since those days, but it unfortunately hasn’t evolved
as much as it could or should have. Our tools have definitely become more complex and
more mature, but the way in which we use those tools—our IT management processes—
are in some ways still stuck in the days of reactive tube‐changing.
Some of the philosophies that underpin many organizations’ IT management practices are
really becoming a detriment to the organizations that IT is meant to support. The
discussion in this chapter will revolve around several core themes, which will continue to
drive the subsequent chapters in this book. The goal will be to help change your thinking
about how IT management—particularly monitoring—should work, what value it should
provide to your organization, and how you should go about building a better‐managed IT
environment.
IT Management: How We Got to Where We Are Today
In the earliest days of IT, we dealt with fairly straightforward systems. Even simplistic, by
today’s standards. The IT team often consisted of people who could fix any of the problems
that arose, simply because there weren’t all that many “moving parts.” It’s as if IT were a car:
a machine capable of complexity and of doing many different things, but perfectly
comprehensible, in its entirety, by a single human being.


As we started to evolve that IT car into a space shuttle, we gradually needed to allow for
specialization. Individual systems became so complex in and of themselves that we needed
domain‐specific experts to be able to monitor, maintain, and manage each system.
Messaging systems. Databases. Infrastructure components. Directory services. The vendors
who produced these systems, along with third parties, developed tools to help our experts
monitor and manage each system. That’s really where things went wrong. It seemed
perfectly sensible at the time, and indeed there was probably no other way to have done
things, but that establishment of domain‐specific silos—each with their own tools, their
own procedures, and their own expertise—was the seed for what would become a
towering problem inside many IT shops.
Fast forward to today, when our systems are vastly more complex, vastly interconnected,
and increasingly not even hosted within our own data centers. When a user encounters a
problem, they obviously can’t tell us which of our many complex systems is at fault. They
simply tell us what they observe and experience about the problem, which may be the
aggregate result of several systems’ interactions and interdependencies. Our users see a
holistic environment: “IT.” That doesn’t correspond well to what we see on the back end:
databases, servers, directories, files, networks, and more. As a result, we often spend a lot
of time trying to track down the root cause of problems. Worse, we often don’t even see the
problems coming, because the problems only exist when you look at the end result of the
entire environment rather than at individual subsystems. Users feel completely
disconnected from the process, shielded from IT by a sometimes‐helpful‐sometimes‐not
“Help desk.” IT management has a difficult time wrapping their heads around things like
performance, availability, and so on, simply because they’re forced to use metrics that are
specific to each system on the network rather than look at the environment as a whole.
The way we’ve built out our IT organizations has led to very specific business‐level issues,
which have become common concerns and complaints throughout the world:
• IT has difficulty defining and meeting business‐level SLAs. “The messaging server
will be up 99% of the time” isn’t a business‐level SLA; it’s a technical one. “Email will
flow between internal and external users 99% of the time” is a business‐level SLA,
but it can be difficult to measure because that statement involves significantly more
systems than just the email server.
• IT has difficulty proactively predicting problems based on system health, and
remains largely reactive to problems.
• When problems occur, IT often spends far too much time pinpointing the root cause
of the problem.
• IT’s concept of performance and system health is driven by systems—database
servers, directory services, network devices, and so forth—rather than by how users
and the organization as a whole are experiencing the services delivered by those
systems.


• IT has a tough time rapidly adopting new technologies that can benefit the business.
Paradoxically, IT is often the part of the organization most opposed to change,
because change is usually the trigger for problems. Broken systems don’t help
anyone, but an inability to quickly incorporate changes can also be a detriment to
the organization’s competitiveness and flexibility.
• IT has a really tough time adopting new technologies that are significantly outside
the team’s experience or physical reach—most specifically the bevy of outsourced
offerings commonly grouped under the term “cloud computing.” These technologies
and approaches to technology are so different from what’s come before that IT
doesn’t feel confident that they can monitor and manage these new systems. Thus,
they resist implementing these types of systems for fear that doing so will simply
damage the organization.
• Even with modern self‐service Help desk systems, users feel incredibly powerless
and out of touch when it comes to IT.
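To make the first point in this list concrete, here is a minimal sketch of why a business‐level SLA is harder to meet than a technical one. The component names and the 99% figures are hypothetical, and the calculation assumes that components fail independently of one another, which real systems rarely do.

```python
from math import prod

# Hypothetical availability figures for each system that email delivery
# depends on. None of these numbers come from a real environment.
component_availability = {
    "email server": 0.99,
    "directory service": 0.99,
    "firewall": 0.99,
    "internet link": 0.99,
}


def end_to_end_availability(availabilities):
    """Availability of a service that needs every component up at once.

    Assumes independent failures, so the individual figures multiply.
    """
    return prod(availabilities)


service = end_to_end_availability(component_availability.values())
print(f"Each component: 99.00%  End to end: {service:.2%}")
```

Under these assumptions, four components that each meet a 99% technical SLA deliver email only about 96% of the time, which is why the business‐level number has to be measured across the whole chain rather than at the email server alone.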
All of these business‐level problems are the direct result of how we’ve always managed IT.
Our processes for monitoring and managing IT basically have four core problems. Not
every organization has every single one of these, of course, and most organizations are at
least aware of some of these and work hard to correct them. Ultimately, however,
organizations need to ensure that all four of these core problems are addressed. Doing so
will immediately begin to resolve the business‐level issues I’ve outlined.
Problem 1: You’re Managing IT in Silos
Figures 1.1, 1.2, and 1.3 illustrate one of the fundamental problems in IT monitoring and
management today.

Figure 1.1: Windows Performance Monitor.



Figure 1.2: SQL Server Performance.

Figure 1.3: Router Performance.
These figures each illustrate a different performance chart for various components of an IT
system. Each of these images was produced using a tool that is more or less specialized for
the exact thing that was being monitored. The tool that produced the router performance
chart, for example, can’t produce the same chart for a database server or even for a router
that’s located on someone else’s network.


This is such a core, fundamental problem that many IT experts can’t even recognize that it
is a problem. Using these domain‐specific tools is such an integrated and seemingly natural
part of how IT works that many of us simply can’t imagine a different way. But we need to
move past using these domain‐specific tools as our first line of defense when it comes to
monitoring and troubleshooting.
Why?
One major reason is that these tools keep us all from being on the same page. IT experts
can’t even have meaningful cross‐discipline discussions when these tools become involved.
“I’m looking at the database server, and the performance is at more than 200 TPMs,” one
expert says. “Well, that must be a problem because the router is running well over 10,000
PPMs,” another replies. Those two experts don’t even have a common language for performance because
they’re locked into the domain‐specific, deeply‐technical aspects of the technologies they
manage.
Domain‐specific tools also encourage what is probably the worst single practice in all of IT:
looking at systems in isolation. The database guy doesn’t have the slightest idea what
makes a router tick, what constitutes good or bad performance in a messaging server, or
what to look for to see if the directory services infrastructure is running smoothly. So the
database guy puts on a set of blinders and just looks at his database servers. But those
servers don’t exist in a vacuum; they’re impacted by, and they in turn impact, many other
systems. Everything works together, but we can’t see that using domain‐specific tools.
We have to permanently remove the walls between our technical disciplines, breaking
down the silos and getting everyone to work as a single team. In large part, that means
we’re going to have to adopt new tools that enable IT silos to work as a team, putting the
information everyone needs into a common context. Sure, domain‐specific tools will always
have their place, but they can’t be our first line of information.
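One simple way to picture a “common context” is to express every domain‐specific reading relative to what is normal for that system. The sketch below is only an illustration: the `Reading` class, the baseline values, and the “percent of normal” idea are assumptions made for this example, not a description of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    system: str      # which silo the number came from
    metric: str      # domain-specific unit, opaque to the other teams
    value: float     # current measurement
    baseline: float  # hypothetical "normal" level for this system


def percent_of_baseline(reading: Reading) -> float:
    """Express a reading relative to its own normal level."""
    return 100.0 * reading.value / reading.baseline


# The values echo the conversation above; the baselines are invented.
readings = [
    Reading("database server", "TPM", 200.0, 180.0),
    Reading("router", "PPM", 10_000.0, 4_000.0),
]

for r in readings:
    print(f"{r.system}: {percent_of_baseline(r):.0f}% of normal "
          f"({r.value:g} {r.metric})")
```

With the invented baselines, the database server sits slightly above its normal load while the router is at two and a half times its normal load, a comparison both experts can read without knowing each other’s units.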
Case Study
Jerry works for a typical IT department in a midsize company. His specialty is
Windows server administration, and his team includes specialists for Web
applications, Microsoft SQL Server and Oracle, VMware vSphere, and for the
network infrastructure. The company outsources certain enterprise
functionality, including their Customer Relationship Management (CRM) and
email.
Recently, a problem occurred that caused the company’s main Web site to
stop sending customer order confirmation emails. Jerry was initially called to
solve the problem, on the assumption that it was with the company’s
outsourced messaging solution. Jerry discovered, however, that user email
was flowing normally. He passed the problem to the Web specialist, who
confirmed that the Web site was working properly but that emails sent by it
were being rejected. Jerry filed a ticket with the messaging hosting company,
who responded that their systems were in working order and that he should
check the passwords that the Web servers were using.


After more than a day of back‐and‐forth with the hosting company and
various experts, the problem was traced to the company’s firewall. It had
recently been upgraded to a new version, and that version was now blocking
outgoing message traffic from the company’s perimeter network, which is
where the Web servers were located. The network infrastructure specialist
was called in to reconfigure the firewall, and the problem was solved.

This narrative precisely demonstrates the problem: By managing our IT teams as domain‐
specific silos, we significantly hinder their ability to work together to solve problems. The
fact that IT experts require domain‐specific tools shouldn’t be a barrier to breaking down
those silos and getting our team to work more efficiently together. This becomes especially
important when pieces of the infrastructure are outsourced; those hosting companies are
an unbreakable silo, as they’re not responsible for any systems other than the ones they
provide to us. However, the dependencies that our systems and processes have on their
systems mean our own team still has to be able to monitor and troubleshoot those
outsourced systems as if they were located right in the data center.
Problem 2: You Aren’t Connecting Your Users, Service Desk, and IT
Management
Communication is a key component of making any team work, and the “team” that is your
organization is no exception. In the case of IT, we typically use Help desk systems as our
means of enabling communications—but that isn’t always sufficient. Help desk systems are
almost always built around the concept of reacting to problems, then managing that
reaction; they’re almost by definition not proactive.
For example, how do you tell your users that a given system will have degraded
performance or will be offline for some period of time? Probably through email, which
creates a couple of problems:
• Important messages tend to get lost in the glut of email that users deal with daily.
• Users who don’t get the message tend to go the “Help desk route,” which doesn’t
include a means of intercepting their mental process and letting them know that the
“problem” was planned for.


Most IT teams do know the things that need to be communicated throughout the
organization, for example:
• SLAs
• The current status of SLAs and whether they’re being met
• Planned outages and degraded service
• Average response times for specific services
• Known issues that are being worked on
What most IT teams have a problem with is communicating these items consistently across
the entire organization. Some organizations rely on email, which as I’ve already pointed out
can be inefficient and not consistently effective. Some organizations will use an intranet
Web site, such as a SharePoint portal, to post notices—but these sites aren’t directly
integrated with the Help desk, making it an extra step to keep them updated and requiring
users to remember to check them.
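As a rough sketch of what integration between notices and the Help desk could look like, the example below checks a list of published notices before a new ticket is opened. Every name here (`Notice`, `matching_notice`, `open_ticket`) and the sample outage are hypothetical; a real service desk product would supply its own data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Notice:
    service: str
    start: datetime
    end: datetime
    message: str


def matching_notice(notices, service: str, when: datetime) -> Optional[Notice]:
    """Return the first notice covering this service at this time, if any."""
    for notice in notices:
        if notice.service == service and notice.start <= when <= notice.end:
            return notice
    return None


# Hypothetical planned outage published by the IT team.
notices = [
    Notice("order entry",
           datetime(2012, 3, 5, 18, 0), datetime(2012, 3, 5, 22, 0),
           "Planned maintenance; expect degraded performance."),
]


def open_ticket(service: str, when: datetime) -> str:
    """Tell the user about a known issue instead of opening a duplicate ticket."""
    notice = matching_notice(notices, service, when)
    if notice is not None:
        return f"Known issue until {notice.end:%H:%M}: {notice.message}"
    return f"New ticket opened for {service}."


print(open_ticket("order entry", datetime(2012, 3, 5, 19, 30)))
print(open_ticket("email", datetime(2012, 3, 5, 19, 30)))
```

The point of the design is that the notice and the ticket live in the same place, so a user who takes the “Help desk route” is told about the planned work at the moment they report it.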
Case Study
Tom works as an inside salesperson for a midsize manufacturing company.
Recently, the application that Tom uses to track prospects and create new
orders started responding very slowly, and over the course of the day,
stopped working completely.
Tom’s initial action was to call his company’s IT Help desk. The Help desk
technician sounded harried and frustrated, and told Tom, “We know, we’re
working on it,” and hung up. Tom had no idea when the system might
return to normal, and he was afraid to bother the Help desk by calling back for
more details.
Over the course of that day, the Help desk logged calls from nearly every
salesperson, each of whom called on their own to find out what was going on.
Eventually, the Help desk simply stopped logging the calls, telling everyone
that, “A ticket is already open,” and disconnecting the call.
Someone on the IT management team eventually sent out an email explaining
that a server had failed and that the application wasn’t expected to be online
until the next morning. Tom wished he had known earlier; although he’d
originally planned to make sales calls all day, if he’d known that the
application would be down for that long, he could have switched to other
activities for the day or even just taken the day off.


Management communications are equally important, and equally challenging. Providing
frank numbers on service levels, response times, outages, and so forth is crucial in order for
management to make better decisions about IT—but that information can often be difficult
to come by.
Problem 3: You’re Measuring the Wrong Things
This problem is very likely at the heart of everything IT is not doing to help better align
technology with business needs. The following case study outlines the scenario.
Case Study
Shelly works in the Accounting department for her company. Recently, while
she was trying to close the books, the accounting application began
to respond very slowly. She called her company’s IT Help desk to report the
problem.
The Help desk technician listened to her, then said, “Everything on that
server looks fine right now. I’ll open a ticket and ask someone to look at it,
but since we are currently within our service level agreement for response
times, it will be a low‐priority ticket.”
Shelly continued to struggle with the slowly‐responding application.
Eventually, someone was dispatched to her desktop. She demonstrated that
every other application was responding normally. She pointed out that other
people in her department were having similar problems with the application.
The technician made her close all of her applications and then restarted her
computer, to no effect. He shrugged, entered some notes into his smartphone,
and left.
By the next morning, the application’s response times were better, but they
were far from normal. Shelly continued to call the Help desk for updates on
her ticket’s status, but it seemed as if the IT team had given up on trying to fix
the problem—and refused to even admit that there was a problem.

This kind of scenario unfortunately happens all too often in many organizations. It exactly
illustrates what happens when several problems are happening at once: IT is operating as a
set of individual silos rather than as a team, and each silo has its own definition for words
like “slow.” A root issue here is that everyone is measuring the wrong thing. Figure 1.4
shows how the average IT team sees a multi‐component, distributed application.



Figure 1.4: IT perspective of a distributed application.


They see the components. Domain experts measure the performance of each component
using technical metrics, such as processor utilization, response time, and so forth. When a
component’s performance exceeds certain predefined thresholds, someone in IT pays
attention. Figure 1.5, however, shows how a user sees this same application.

Figure 1.5: User’s perspective of a distributed application.
The user doesn’t (and often can’t) see any of the components. They simply see an application,
and either it’s responding the way they expect, or it isn’t. It doesn’t matter a bit to the user
if every single constituent component is running at an “acceptable level of processor
utilization,” whatever that means. They simply care whether the application is working.
This creates a major disconnect between the user population and IT, as Figure 1.6
illustrates.



Figure 1.6: IT vs. user measurements of performance.


Users and IT measure very different things. An IT‐centric SLA might specify a given
response time for queries sent to a database server; that often has little to do with whether
an application is seen as “slow” by users. Worse, as we start to migrate services and
components to “the cloud,” we lose much of our ability to measure those components’
performance the way we do for things that are in our own data center. The result? Nobody
can agree on what an SLA should say.
This all has to change. We have to start measuring things more from a user perspective.
The performance of individual components is important, but only as they contribute to the
total experience that a user perceives. We need to define SLAs that put everyone—users
and IT—on the same page, then manage to those SLAs using tools that enable us to do so.
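The gap between component metrics and the user’s experience can be sketched in a few lines. The response times, the per‐component thresholds, and the two‐second user target below are all hypothetical, and the example makes the simplifying assumption that a request passes through each component in sequence.

```python
# Hypothetical response times, in seconds, for one user request.
# "threshold" is the level at which each silo's own tool would alert.
components = {
    "web server":  {"measured": 0.8, "threshold": 1.0},
    "application": {"measured": 0.9, "threshold": 1.0},
    "database":    {"measured": 0.7, "threshold": 1.0},
    "network":     {"measured": 0.6, "threshold": 1.0},
}

USER_TARGET_SECONDS = 2.0  # assumed user-facing SLA for the whole request


def component_alerts(components):
    """Names of components that exceed their own technical threshold."""
    return [name for name, c in components.items()
            if c["measured"] > c["threshold"]]


def end_to_end_seconds(components):
    """Total wait seen by the user, assuming components are called in sequence."""
    return sum(c["measured"] for c in components.values())


alerts = component_alerts(components)
total = end_to_end_seconds(components)

print("Component alerts:", alerts or "none")
print(f"User waited {total:.1f}s against a {USER_TARGET_SECONDS:.1f}s target")
```

Here no component raises an alert, yet the user waits about three seconds against a two‐second target. Every silo can truthfully report that its system “looks fine” while the application is slow from the only perspective the user has.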
Some organizations will tell you that they’re moving, or have moved, to a service‐based IT
offering. What that generally means in broad terms is that the organization is seeking to
provide IT as a set of services to the organization’s various departments and users. In many
instances, however, those “service‐oriented” organizations are still focused on components
and devices, which isn’t a service‐oriented approach at all. When your phone line goes
down, you don’t call the phone company (on your cell phone, probably) and start asking
questions about switches and trunk lines—you ask when your dial tone will be back. The
back‐end infrastructure is meaningless to the user. You don’t ask for a service credit based
on how long a particular phone company office will be offline; you ask for that credit based
on how long you went without a dial tone. That’s the model IT needs to move toward.
Problem 4: You’re Losing Knowledge
The last problematic practice we’ll look at is the issue of lost institutional knowledge. This
problem is a purely human one, and frankly it’s going to be difficult to address. Here’s a
quick scenario to set the scene.
Case Study
Aaron works for his company’s IT department. He’s been with the company
for 3 years and is responsible for several of the company’s systems and
infrastructure components. One Tuesday, Aaron is contacted by his
company’s IT Help desk. “We’re assigning you a ticket about the Oracle
system,” he’s told. “Once every couple of months it starts acting really weird,
and someone has to fix it.”
“I’m not the Oracle guy,” Aaron says. “That’s Jill.”
“Yeah, but Jill’s out on vacation for 2 weeks. So you’ll have to fix it.”
“I’ve no idea what to do!”
“Well, figure something out. The CEO gets upset when this takes too long to
fix.”


Unfortunately, too much knowledge gets wrapped up in the heads of specific individuals. In
fact, it’s a sad truth that many organizations “deal” with this problem by simply
discouraging IT team members from taking lengthy vacations, and often by resisting other
activities that would put them out of touch, such as attending conferences and classes to
continue their education and to learn new skills.
More than a few organizations have made halfhearted attempts at building “knowledge
bases,” in a hope that some of this institutional knowledge can be committed to electronic
paper, preserved, and made more accessible. The problem is that IT professionals aren’t
necessarily good writers, so the act of producing the knowledge base is difficult for them. It
also takes time—time the organization is often unwilling to commit, especially in the face
of other daily pressures and demands.
As I said, this is a problem that’s difficult to fix. The IT team realizes it’s a problem, and is
generally willing to fix it—but they’re not tech writers, and often have a limited ability to
fix the problem. You can usually create management requirements that require problems
and solutions be logged in a Help desk ticketing system, but searching through that system
for problems and solutions can often be difficult and time‐consuming—much like searching
for solutions on an Internet search engine, with all of the false “hits” such a search generally
produces.
But we must find a way to address this problem. Knowledge about the company’s
infrastructure—and how to solve problems—has to be captured and preserved. This
requirement is crucial not only to solving problems faster in the future but also to
eventually preventing those problems by making better IT management decisions.
How Truly Unified Management Can Fix the Problems
This book is going to be all about fixing these four problems, and the means by which I’ll
propose to do so falls under the umbrella term unified management. Essentially, unified
management is all about bringing everything together in one place.
We’ll break down the silos between IT disciplines, putting everyone onto the same console,
getting everyone working from the same data set, and getting everyone working together
on problems. We’ll do that in a way that brings users, IT, and management into a single
viewport of IT service and performance. We’ll create more transparency about things like
service levels, letting users see what’s happening in the environment so that they’re more
informed.
We’ll inform users in a way that’s meaningful to them rather than using invisible, back‐end
technical metrics. We’ll rebuild the entire concept of SLAs into something that’s meaningful
first to users and management, and that can withstand the transition to “hybrid IT” that’s
being brought about by outsourcing certain IT services to “the cloud.”

Finally, we’ll find a way to capture information about our environment, including solutions
to problems, to enable faster time‐to‐resolution when problems occur. In addition, this
information will enable management to make smarter decisions about future technology
directions and investments.
We’ll try to do all of this in a way that won’t cost the organization an arm and a leg nor take
half a lifetime to actually implement. That will involve a certain amount of creativity,
including looking at outsourced solutions. The idea of an outsourced solution providing
monitoring for in‐sourced components is fairly innovative, and we’ll see what applicability
it has.
I should point out that much of what we’ll be looking at can work to support the IT
management frameworks that many organizations are adopting these days, including the
ITIL framework that’s become popular in the past few years. You certainly don’t have to be
an ITIL expert to take advantage of the new processes and techniques I’ll suggest—nor do
you even have to think about implementing ITIL (or any other framework) if your
organization isn’t already doing so. If you are using a framework, however, you’ll be
pleased to know that everything I have to propose should fit right into it.
Summary
This chapter has established the four main themes that will drive the remaining chapters in
this book. These core themes represent what many experts believe are the biggest and most
fundamental problems with how IT is managed today, and represent the things that we’ll
focus on fixing throughout the remainder of this book. Our focus will be on changing
management philosophies and practices, not on simply picking out new tools—although
new tools may be something you’ll acquire to help support these new practices.
Chapter 2 will focus on the first problematic practice, which is the fact that IT tends to be
managed in domain‐specific silos. We’ll look at the technical reasons organizations have
been more or less forced to manage this way, and explore ways in which you can start to
change that practice.
Chapter 3 will look at connecting people: IT management, your users, your service desk,
and more. Only by bringing everyone into the process can IT better align itself to the needs
of the organization.
Our third problem practice will be the subject of Chapter 4, where we dive into looking
outside the data center for monitoring. The goal will be to solve the problems we’ve
discussed in this chapter, further focusing IT on its value to the organization.

Chapter 5 will discuss ways to turn problems into future solutions. Although modern
organizations are fully aware of the need for Help desk tracking and knowledge building,
how those activities are managed as part of the larger IT management process can make a
huge difference in their value‐add to the organization.
We’ll conclude in Chapter 6, with an attempt to visualize an IT environment where these
new, unified management practices are in place. I’ll provide narratives from several case
studies, helping you see how these modernized practices work in a real environment.


Chapter 2: Eliminating the Silos in IT Management
In the previous chapter, I proposed that one of the biggest problems in modern IT is the
fact that we manage our environment in technology‐specific silos: database administrators
are in charge of databases, Windows admins are in charge of their machines, VMware
admins run the virtualization infrastructure, and so forth. I’m not actually proposing that
we change that exact practice—having domain‐specific experts on the team is definitely a
benefit. However, having these domain‐specific experts each using their own unique,
domain‐specific tool definitely creates problems. In this chapter, we’ll explore some of
those problems, and see what we can do to solve them and create a more efficient, unified
IT environment.
Too Many Tools Means Too Few Solutions
“Comparing apples to oranges” is an apt phrase when it comes to how we manage
performance, troubleshooting, and other core processes in IT. Tell an Exchange Server
administrator that there’s a performance problem with the messaging system, and he’ll
likely jump right into Windows’ Performance Monitor, perhaps with a pre‐created counter
set that focuses on disk throughput, processor utilization, RPC request count, and so
forth—as shown in Figure 2.1.

Figure 2.1: Monitoring Exchange.

If the Exchange administrator can’t find anything wrong with the server, he might pass the
problem over to someone else. Perhaps it will be the Active Directory administrator
because Active Directory plays such a crucial role in Exchange’s operation and
performance. Out comes the Active Directory administrator’s favorite performance tool,
perhaps similar to the one shown in Figure 2.2. This is truly a domain‐specific tool, with
special displays and measurements that relate specifically to Active Directory.

Figure 2.2: Monitoring Active Directory.
If Active Directory looks fine, then the problem might be passed over to the network
infrastructure specialist. Out comes another tool, this one designed to look at the
performance of the organization’s routers (see Figure 2.3).


Figure 2.3: Monitoring router performance.
Combined, all of these tools have led these three specialists to the same decision:
Everything’s working fine. In spite of the fact that Exchange is clearly, from the users’ point
of view, not working fine, there’s no evidence that points to a problem.
Simply put, this is a “too many tools, too few answers” problem. In today’s complex IT
environments, performance, along with other characteristics like availability and
scalability, is the result of many components interacting with each other and working
together. You can’t manage IT by simply looking at one component; you have to look at
entire systems of interacting, interdependent components.
Our reliance on domain‐specific tools holds us back from finding the answers to our IT
problems. That reliance also holds us back when it comes time to grow the environment,
manage service level agreements (SLAs), and handle other core tasks. I’ve actually seen instances
where domain‐specific tools acted almost as blinders, preventing an expert who should
have been able to solve a problem, or at least identify it, from doing so as quickly as he or
she might have done.

Case Study
Heather is a database administrator for her organization. She’s responsible
for the entire database server, including the database software, the operating
system (OS), and the physical hardware.
One day she receives a ticket indicating that users are experiencing sharply
reduced performance from the application that uses her database. She whips
out her monitoring tools, and doesn’t see a problem. The server’s CPU is
idling along, disk throughput is well within norms, and memory consumption
is looking good. In fact, she notices that the amount of workload being sent to
the server is lower than she’s used to seeing. That makes her suspect the
network is having traffic jams, so she re‐assigns the ticket to the company’s
infrastructure team. That team quickly re‐assigns the ticket right back to her,
assuring her that the network is looking a bit congested, but it’s all traffic
coming from her server.
Heather looks again, and sees that the server’s network interface is humming
along with a bit more traffic than usual. Digging deeper, she finally realizes
that the server is experiencing a high level of CRC errors, and is thus having
to retransmit a huge number of packets. Clients experience this problem as a
general slowdown because it takes longer for undamaged packets to reach
their computers.
Heather’s focus on her specific domain expertise led her to “toss the problem
over the wall” to the infrastructure team, wasting time. Because she wasn’t
accustomed to looking at her server’s network interface, she didn’t check it
as part of her routine performance troubleshooting process.
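A routine check that covers the network interface alongside CPU and memory might have surfaced Heather's problem on the first pass. The following is only a minimal sketch: the field names and the thresholds (90 percent utilization, a 1 percent error rate) are hypothetical illustrations, not values from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    """One snapshot of server counters (all field names are hypothetical)."""
    cpu_percent: float
    memory_percent: float
    packets_sent: int
    crc_errors: int


def find_warnings(sample: ServerSample, max_error_rate: float = 0.01) -> list[str]:
    """Return human-readable warnings; thresholds are illustrative only."""
    warnings = []
    if sample.cpu_percent > 90:
        warnings.append("CPU is saturated")
    if sample.memory_percent > 90:
        warnings.append("memory is nearly exhausted")
    # The check that was skipped in the case study: network interface errors.
    if sample.packets_sent > 0:
        error_rate = sample.crc_errors / sample.packets_sent
        if error_rate > max_error_rate:
            warnings.append(f"CRC error rate is {error_rate:.1%}")
    return warnings


# An otherwise idle server that is retransmitting heavily.
sample = ServerSample(cpu_percent=12.0, memory_percent=55.0,
                      packets_sent=200_000, crc_errors=30_000)
print(find_warnings(sample))
```

With these made-up numbers, CPU and memory look healthy and the only warning comes from the network interface, which is exactly the signal a server-only view would miss.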
Domain‐Specific Tools Don’t Facilitate Cooperation
If the components of our complex IT systems are cooperative and interdependent, our IT
professionals are often anything but. In other words, IT management tends to encourage
the silos that are built around specific technology domains. There’s the database
administration group, the Active Directory group, the infrastructure group, and so forth.
Even companies that practice “matrix management,” in which multiple domain experts are
grouped into a functional team, still tend to accept the silos around each technical domain.

There are two major reasons that these silos persist, and almost any IT professional can
describe them to you:
• “I don’t know anything about that.” Each domain expert is an expert in his technical
area. The database administrator isn’t proficient at monitoring or managing routers,
and doesn’t especially want to work with them anyway. There’s little real value in
extensive technical cross‐training for most organizations, simply because their staff
doesn’t have the time. Devoting time to secondary and tertiary disciplines reduces
the amount of time available for their primary job responsibilities.
• “I don’t want anyone messing with my stuff.” IT professionals want to do a good job,
and they’re keenly aware that most problems come about as the result of change.
Allow someone to change something, and you’re asking for trouble. If someone
changes something in your part of the environment, and you don’t know about their
activity, you’ll have a harder time fixing any resulting problems.
Both of these reasons are completely valid, and I’m in no way suggesting that everyone on
the IT team become an expert in every technology that the organization must support.
However, the attitudes reflected in these two perspectives require some minor adjustment.
One reason I keep coming back to domain‐specific tools is because they encourage this kind
of walled‐garden separation, and do nothing to encourage even the most cursory
cooperation between IT specialists. Cooperation, when it exists, comes about through good
human working relationships—and those relationships often struggle with the fact that
each specialist is looking at a different set of data and working from a different “sheet of
music,” so to speak. I’ve been in environments where administrators spent hours
arguing about whose “fault” something was, each pointing to their own domain‐specific
tools as “evidence.”
Case Study
Dan is an Active Directory administrator for his company, and is responsible
for around two dozen domain controllers, each of which runs in a virtual
machine. Peg is responsible for the organization’s virtual server
infrastructure, and manages the physical hosts that run all of the virtual
machines.
One afternoon, Peg gets a call from Dan. Dan’s troubleshooting a performance
problem on some of the domain controllers, and suspects that something is
consuming resources on the virtualization host that his domain controllers
need.


Peg opens her virtual server console and assures Dan that the servers aren’t
maxed out on either physical CPU or memory, and that disk throughput is
well within expected levels. Dan counters by pointing to his Active Directory
monitoring tools, which show maxed‐out processor and memory statistics,
and lengthening disk queues that indicate data isn’t being written to and read
from disk as quickly as it should be. Peg insists that the physical servers are
fine. Dan asks if the virtual machines’ settings have been reconfigured to
provide fewer resources to them, and Peg tells him no.
The two go back and forth like this for hours. They’re each looking at
different tools, which are telling them completely different things. Because
they’re not able to speak a common technology language, they’re not able to
work together to solve the problem.
We don’t need to have every IT staffer be an expert in every IT technology; we do need to
make it easier for specialists to cooperate with one another on things like performance,
scalability, availability, and so forth. That’s difficult to do with domain‐specific tools. The
router administrator doesn’t want a set of database performance‐monitoring tools, and the
database administrator doesn’t especially want the router admin to have those tools.
Having domain‐specific tools for someone else’s technical specialization is exactly how the
two attitudes I described earlier come into play.
Ultimately, the problem can be solved by having a unified tool set. Get everyone’s
performance information onto the same screen. That way, everyone is playing from the
same rule book, looking at the same data—and that data reflects the entire, interdependent
environment. Everyone will be able to see where the problem lies, then they can pull out
the domain‐specific tools to start fixing the actual problem area, if needed.
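To make the "same screen" idea concrete, here is a minimal sketch that merges readings from several domain-specific sources into one view keyed by component. The component names, metric names, and values are all hypothetical, and a real unified console would collect them automatically rather than from a hard-coded list.

```python
from collections import defaultdict

# Hypothetical readings, as each domain-specific tool might export them:
# (domain, component, metric, value, healthy?)
readings = [
    ("virtualization", "host-01", "cpu_ready_percent", 18.0, False),
    ("active-directory", "dc-03", "cpu_percent", 99.0, False),
    ("active-directory", "dc-03", "disk_queue_length", 14.0, False),
    ("network", "router-02", "packet_loss_percent", 0.1, True),
]


def unified_view(rows):
    """Group every reading by component so all specialists see one table."""
    view = defaultdict(list)
    for domain, component, metric, value, healthy in rows:
        view[component].append((domain, metric, value, healthy))
    return dict(view)


def unhealthy_components(view):
    """List components with at least one unhealthy metric, sorted by name."""
    return sorted(
        name for name, metrics in view.items()
        if any(not healthy for _, _, _, healthy in metrics)
    )


view = unified_view(readings)
print(unhealthy_components(view))
```

The design choice worth noting is that the view is organized by component rather than by specialty, so a virtualization admin and a directory admin are looking at the same rows when they talk.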
The Cloud Question: Unifying On‐Premise and Off‐Premise Monitoring
This concept of a unified monitoring console becomes even more important as
organizations begin shifting more of their IT infrastructure into “the cloud.”
The Cloud Is Nothing New
I have to admit that I’m not a big fan of “the cloud” as a term. It’s very sales‐
and‐marketing flavored, and the fact is that it isn’t a terribly new concept.
Organizations have outsourced IT elements for years. Probably the most‐
outsourced component is Web hosting, either outsourcing single Web sites
into a shared‐hosting environment, or outsourcing collocated servers into
someone else’s data center.


For the purposes of this discussion, “the cloud” simply refers to some IT
element being outsourced in a way that abstracts the underlying
infrastructure. For example, if you have collocated servers in a hosting
company’s data center, you don’t usually have details about their internal
network architecture, their Internet connectivity, their routers, and so
forth—the data center is the piece you’re paying to have abstracted for you.
In a modern cloud computing model like Windows Azure or Amazon Elastic
Compute Cloud (EC2), you don’t have any idea what physical hosts are running your virtual
machines—that physical server level is what you’re paying to have
abstracted, along with supporting elements like storage, networking, and so
on. For a Software as a Service (SaaS) offering, you don’t even know what
virtual machines might be involved in running the software because you’re
paying to have the entire underlying infrastructure abstracted.
Regardless of which bits of your infrastructure wind up in some outsourced service
provider’s hands, those bits are still a part of your business. Critical business applications
and processes rely on those bits functioning. You simply have less control over them, and
typically have less insight into how well they’re running at any given time.
This is where domain‐specific tools fall apart completely. Sure, part of the whole point of
outsourcing is to let someone else worry about performance—but outsourced IT still
supports your business, so you at least need the ability to see how the performance of
outsourced elements is affecting the rest of your environment. If nothing else, you need the
ability to authoritatively “point the finger” at the specific cause of a problem—even if that
cause is an outsourced IT element, and you can’t directly effect a solution. This is where
unified monitoring truly earns a place within the IT environment. For example, Figure 2.4
shows a very simple “unified dashboard” that shows the overall status of several
components of the infrastructure, including several outsourced components such as
Amazon Web Services.


Figure 2.4: Unified monitoring dashboard.
The idea is to be able to tell, at a glance, where performance is failing, to drill through for
more details, and then to either start fixing the problem—if it exists on your end of the
cloud—or escalate the problem to someone who can.
Let’s be very clear on one thing: Any organization that’s outsourcing any portion of its
business IT environment and cannot monitor the basic performance of those outsourced
elements is going to be in big trouble when something eventually goes wrong. Sure, you
have SLAs with your outsourcing partners—but read those SLAs. Typically, they only
commit to a refund of whatever fees you pay if the SLA isn’t met. That does nothing to
compensate you for lost business that results from the unmet SLA. It’s in your best
interests, then, to keep a close watch on performance. That way, when it starts to go bad,
you can immediately contact your outsourcing partner and get someone working on a fix so
that the impact on your business can at least be minimized.
Missing Pieces
There’s another problem when it comes to performance monitoring and management,
scalability planning, and so forth: missing pieces. Our technology‐centric approach to IT
tends to give us a myopic view of our environment. For example, consider the diagram in
Figure 2.5. This is a typical (if simplified) diagram that any IT administrator might create to
help visualize the components of a particular application.


Figure 2.5: Application diagram.
The problem is that there are obviously missing pieces. For example, where’s the
infrastructure? Whoever created this diagram clearly doesn’t have to deal with the
infrastructure—routers and switches and so forth—so they didn’t include it. It’s assumed,
almost abstracted like an outsourced component of the infrastructure. Maybe Figure 2.6 is
a more accurate depiction of the environment.


Figure 2.6: Expanded application diagram.
And even with this diagram, there are still probably missing pieces. This reality is probably
one of the biggest dangers in IT management today: We forget about pieces that are outside
our purview.

Again, this is where a unified monitoring system can create an advantage. Rather than
focusing on a single area of technology—like servers—it can be technology‐agnostic,
focusing on everything. There’s no need to leave something out simply because it doesn’t fit
within the tool’s domain of expertise; everything can be included.
In fact, an even better approach is to focus on unified monitoring tools that can actually go
out and find the components in the environment. Software doesn’t have to make the same
assumptions, or have the same technology prejudices, as humans. A unified monitoring
console doesn’t care if you happen to be a Hyper‐V expert, or if you prefer Cisco routers
over some other brand. It can simply take the environment as it is, discovering the various
components and constructing a real, accurate, and complete diagram of the environment. It
can then start monitoring those components (perhaps prompting you for credentials for
each component, if needed), enabling you to get that complete, all‐in‐one, unified
dashboard. I’ve been in environments where not using this kind of auto‐discovery became a
real problem.
Case Study
Terry is responsible for the infrastructure components that support his
company’s primary business application. Those components include routers,
switches, database servers, virtualization hosts, messaging servers, and even
an outsourced SaaS sales management application. Terry’s heard about the
unified monitoring idea, and his organization has invested in a service that
provides unified monitoring for the environment. Terry’s carefully
configured each and every component so that everything shows up in the
monitoring solution’s dashboard.
One afternoon, the entire application goes down. Terry leaps to the unified
monitoring console, and sees several “alarm” indications. He drills down and
discovers that the connection to the SaaS application is unavailable. Drilling
further, he sees that the router for that connection is working fine, and that
the firewall is up and responsive. He’s at a complete loss.
Several hours of manual troubleshooting and wire‐tracing reveal something
about the environment that Terry didn’t know: There’s a router on the other
side of the firewall as well, and it’s failed. Normal Internet communications
are still working because those travel through a different connection, but the
connection that carries the SaaS application’s traffic is offline. The “extra”
router is actually a legacy component that pretty much everyone had
forgotten about.
A monitoring solution capable of automated discovery wouldn’t have
“forgotten,” though. It could have detected the extra router and included it in
Terry’s dashboard, making it much easier for him to spot the problem. In fact,
it might have prompted him to replace or remove that router much earlier,
once he realized it existed.

Discovery can also help identify components that don’t fit neatly within our technology
silos, and that don’t “belong” to anyone. Infrastructure components like routers and
switches are commonly‐used examples of these “orphan” components because not every
organization maintains a dedicated infrastructure specialist to support these devices.
However, legacy applications and servers, specialty equipment, and other components can
all be overlooked when they’re not anyone’s specific area of responsibility. Discovery helps
keep us from overlooking them.
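At its simplest, the value of discovery comes down to comparing what is actually on the network with what someone remembered to configure. The sketch below assumes a discovery scan has already produced a set of device names; the device names and both sets are hypothetical.

```python
def find_unmonitored(discovered: set[str], monitored: set[str]) -> list[str]:
    """Return devices seen on the network that nobody configured for monitoring."""
    return sorted(discovered - monitored)


# Hypothetical data: what a discovery scan found vs. what was set up by hand.
discovered = {"router-edge", "router-legacy", "firewall-01", "db-01"}
monitored = {"router-edge", "firewall-01", "db-01"}

print(find_unmonitored(discovered, monitored))
```

Anything returned by a comparison like this is a candidate "orphan": a component that exists in the environment but belongs to no one's dashboard.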
Not All of IT Is a Problem: Ordering, Routing, and Providing Services
Most organizations tend to get into the habit of thinking of their IT department as “fire
fighters.” IT exists to solve problems. That isn’t true, of course, and any organization
probably (hopefully) depends on IT to carry out day‐to‐day tasks and requests more
than it relies on IT to solve problems. But the day‐to‐day tasks are easy to overlook,
whereas “fire fighting” gets everyone’s attention.
The result of this way of thinking is that IT management tends to focus on tools that help
make problem‐solving easier. Unified monitoring is exactly that kind of tool: If nothing ever
went wrong, we wouldn’t need it. It’s there to make problem‐solving faster, primarily in the
areas of performance and availability. Right?
Not quite. Truly unified management also entails making day‐to‐day IT tasks easier for
everyone involved. Users, for example, need to order and receive routine services, from
simple password resets and account unlocks to new hardware and software requests. I’ll
make what some consider to be a bold statement and say that those routine requests
should be treated in the exact same way as a problem. Look at any IT management
framework, such as ITIL, and you’ll find that concept runs throughout: Routine IT requests
should be part of a unified management process, which also includes problem‐solving.
Consider some of these broad functional capabilities that unified management (versus
mere “monitoring”) can offer both to problem‐solving activities and to routine IT services:
• Workflow. When problems arise, following a structured process, or workflow, can
help make problem‐solving more consistent and efficient. Similarly, structured
workflows can help make routine IT services more efficient and consistent. The
workflows will be different for problem‐solving and for various routine services, but
having the ability to manage and monitor workflows can be a real benefit.
• Approvals. Workflows should include approvals. This capability is most obvious
for routine services like hardware and software requests, security requests, and so
on—but it can be just as important for problem solving. Not every problem can be
fixed by changing a setting or rebooting a device; sometimes you’ll need to make a
more significant change, and having the ability to formally route approval to make
that change is a benefit.

• Routing. The specialist who fixes a problem is usually the last one to hear about it.
Front‐line resources, such as your Help desk and your end users, are the first
“responders.” Being able to select a problem category and have a ticket routed to the
right individual helps speed problem resolution. The same is true for routine
services: Things get done quicker when the right person has the request. Automated
routing capabilities can help get the right person on the job more quickly and more
accurately.
• Self‐service. Reducing phone calls and manual email juggling is crucial to achieving
better efficiency. Self‐service can help do that for both problems and routine
requests. When users experience a problem, self‐service can allow them to submit
tickets as well as help them solve the problem on their own, through a knowledge
base. When users need routine service, self‐service helps them submit that request
without having to engage additional IT services.
• Service catalog. Part of self‐service is the ability to create an “online store” for
services that users can request.
There are more capabilities, of course, but we’ll cover them in upcoming chapters. These
are simply some of the basic capabilities that we need in order to make both routine IT
requests and problem‐solving more consistent and efficient.
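The routing and approval capabilities listed above can be sketched with a simple lookup table that treats routine requests and problems the same way. This is a rough illustration only: the categories, group names, and status values are hypothetical and do not reflect any specific service desk product.

```python
from dataclasses import dataclass

# Hypothetical routing table: category -> (assignee group, needs approval?)
ROUTES = {
    "password-reset": ("help-desk", False),
    "software-request": ("desktop-team", True),
    "email-outage": ("messaging-team", False),
}
DEFAULT_ROUTE = ("help-desk", False)


@dataclass
class Ticket:
    category: str
    summary: str
    assigned_to: str = ""
    status: str = "new"


def route(ticket: Ticket) -> Ticket:
    """Assign the ticket and hold it for approval when the category requires one."""
    group, needs_approval = ROUTES.get(ticket.category, DEFAULT_ROUTE)
    ticket.assigned_to = group
    ticket.status = "awaiting-approval" if needs_approval else "assigned"
    return ticket


request = route(Ticket("software-request", "Need a diagramming tool"))
problem = route(Ticket("email-outage", "Mail server is down"))
print(request.assigned_to, request.status)
print(problem.assigned_to, problem.status)
```

Unknown categories fall back to the Help desk in this sketch, so nothing is dropped simply because a user picked a category no one anticipated.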
Coming Up Next…
This chapter has been about breaking down the silos between technology specialties, or at
least building doorways between them. That helps to solve one of the major problems in
modern IT monitoring and management. The next chapter will tackle a somewhat more
complicated problem: Keeping everyone in the management loop. It’s about improving
communications. Unfortunately, communications are too often a voluntary, secondary
exercise—we have to make an effort to communicate, and when we’re really feeling the
pressure, it’s easy to want to put that effort elsewhere. So we need to adopt processes and
tools that make communications more automatic, helping keep everyone in the loop
without requiring a massive secondary effort to do so.


Chapter 3: Connecting Everyone to the IT Management Loop
IT management has for too long involved discrete, disconnected processes that often leave
key participants wondering what’s going on. Bringing everyone—users, managers, IT
professionals, and more—into the loop can create significant benefits as well as reduce the
tendency to fall back into discipline‐based silos. This is where the integration between
monitoring and service desk truly happens, and these concepts deliver the most critical,
central themes discussed throughout this book. It’s all about communication: ways to
better achieve communication as well as create opportunities for continuous improvement.
Users sometimes perceive their IT department as out‐of‐touch, ivory‐tower geeks with
poor people skills. Whether or not that’s true depends on the actual IT team members, but
the perception, fair or not, often exists. That’s because IT is too often the last to
know about things that users perceive as problems. Sure, the server might be humming
along within specs, but the order‐entry application is incredibly slow. IT says that email is
working fine, but I’ve been waiting on an incoming purchase order for an hour—the email
system can’t possibly be working correctly!
IT has its own unique problems to deal with, and they sometimes involve a disconnect with
management. Finding windows in which to make approved changes, for example, can be
incredibly tricky. Simply coordinating the changes that are proposed, approved, under
development, ready for implementation, and so forth can be difficult. Many organizations
have adopted change management frameworks, such as those proposed by ITIL, that
outline specific processes for reviewing and approving changes. Physically coordinating
that process, however, can seem like herding cats. It’s even worse when IT has been
divided into silos: The database team might have a change scheduled for tonight, but that
change is going to conflict with the power supply changes being implemented by the data
center team. We need to get everyone on the same page.


Starting the Loop: Connecting Monitoring to the Service Desk
Most organizations today have a ticket‐based system for coordinating IT activities. These
organizations also usually have monitoring systems in place to watch their IT systems and
alert them to any problems. Too few organizations, however, have connected these two
systems. Ideally, that’s what you want: A single, integrated IT management system that can
detect problems and then automatically open tickets for the appropriate individuals. If the
email server is down, the appropriate administrator should get a ticket. Those tickets, of
course, should include notifications via text message, email, or whatever other medium is
appropriate so that alerted individuals know they have an alert.
That auto‐assignment—you might even choose to call it auto‐routing—of tickets needs to
be pretty intelligent. Different systems, in different locations, at different times, all might
change how the ticket is created, thus changing who is assigned to work the problem.
Tickets should be as complete as possible, meaning as many fields as possible should be
automatically populated—you shouldn’t have to rely on a Help desk, or someone else, to fill
in the details. Those details might include the affected server’s information. Figure 3.1
shows what this kind of auto‐generated ticket might look like, with several key bits of
information pre‐populated by the system.

Figure 3.1: Automatically‐generated tickets in response to alarms.
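To make the auto‐routing idea concrete, here is a minimal sketch of rule‐based ticket assignment. It is not taken from any particular service desk product: the `Alarm`, `Ticket`, and `Rule` types, the team names, and the 8:00 to 18:00 business‐hours window are all assumptions made for illustration. The point is only that system, location, and time of day can each change who receives the ticket, and that the ticket arrives pre‐populated.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, List


@dataclass
class Alarm:
    system: str      # hypothetical service name, e.g. "email"
    location: str    # hypothetical site code, e.g. "NYC"
    raised_at: time  # local time the monitoring system raised the alarm
    host: str        # affected server


@dataclass
class Ticket:
    summary: str
    host: str
    location: str
    assignee: str


@dataclass
class Rule:
    matches: Callable[[Alarm], bool]
    assignee: str


def business_hours(t: time) -> bool:
    # Assumed business day of 08:00 to 18:00.
    return time(8, 0) <= t < time(18, 0)


# Hypothetical rules. The first matching rule wins; the last is a catch-all.
RULES: List[Rule] = [
    Rule(lambda a: a.system == "email" and business_hours(a.raised_at), "messaging-team"),
    Rule(lambda a: a.system == "email", "on-call-admin"),
    Rule(lambda a: a.location == "NYC", "nyc-datacenter-team"),
    Rule(lambda a: True, "service-desk"),
]


def open_ticket(alarm: Alarm) -> Ticket:
    """Create a pre-populated ticket and assign it using the first matching rule."""
    assignee = next(r.assignee for r in RULES if r.matches(alarm))
    return Ticket(
        summary=f"{alarm.system} alarm on {alarm.host}",
        host=alarm.host,
        location=alarm.location,
        assignee=assignee,
    )
```

An ordered list with a catch‐all keeps the routing easy to audit: every alarm lands somewhere, and changing who handles after‐hours email problems is a one‐line edit.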


The idea is to have a service desk solution—that’s the software that helps coordinate and
manage IT activities, often through tickets—working with the monitoring solution, thus
creating a truly integrated response to IT problems.
This is all intended to provide specific benefits. First and foremost is faster problem
resolution. By not waiting for users to inform you of a problem, you’re getting started on
solving the problem faster. By having pre‐populated tickets, the IT team is able to work
more quickly because they’re starting with more information.
There’s a bit more depth that can be added, if you have the right service desk software in
place. Frameworks like ITIL encourage root cause analysis, meaning your team should focus
not only on solving today’s specific problem but also on making the overall environment
more stable and problem‐resistant. To that end, a service desk solution can define two
types of problems: global issues and specific incidents.
Specific incidents might be day‐to‐day problems like, “Email moving slowly throughout the
organization,” “Order entry application operating slowly,” and so forth. Those might all be
tied to a global issue of “Unexplained network slowdowns,” which could be examined and
solved—perhaps locating a router that was overheating and dropping more packets than
usual.
Sometimes, specific incidents might not be entirely solved until the overarching global
issue is solved. By tracking those individual incidents along with the global issue, you can
help keep your users and managers more informed. For example, once that overheating
router is discovered and replaced, everyone affected by an associated specific issue could
be notified: “Hey, we think we’ve found the root cause for all the slowdowns, so things
should be better from here on out.” Figure 3.2 shows how a single global problem can be
attached to multiple incidents.



Figure 3.2: Relating multiple incidents to a single problem.
I’ve used a couple of keywords in the foregoing discussion and want to take a moment to
specifically define them in the context of this book:
• An incident is something that happens in the environment, such as a failed server or
a slow application.
• IT staff create problem records to help manage the incident. Problems may in fact be
associated with multiple incidents, as in the case of that overheating router, which
caused multiple disparate failures throughout the environment.
I’m going to start using those two terms more consistently from here on. Hopefully, some of
the benefits of combining monitoring with problem solving will become clear. For example,
more simplistic Help desk solutions allow multiple tickets to be opened against what is
essentially the exact same issue. That can result in a lot of duplicated effort, as multiple IT
team members attempt to work the issues on their own. It can also result in a lot of
paperwork because solving the root cause then requires technicians to spend time
laboriously closing each ticket. With a more sophisticated system in place, everything can
be consolidated into a single, managed problem record. Doing so creates additional
benefits, such as identifying solutions or workarounds, which I’ll discuss in upcoming
chapters.
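The relationship between incidents and a problem record can be sketched as a simple one‐to‐many data structure. This is an illustration only: the class names, fields, and notice wording are assumptions, not the data model of any specific product. It shows how resolving the single problem record can close every attached incident in one step, instead of a technician closing each ticket by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    summary: str
    reporter: str
    status: str = "open"


@dataclass
class Problem:
    summary: str
    incidents: List[Incident] = field(default_factory=list)
    status: str = "open"

    def attach(self, incident: Incident) -> None:
        """Link an incident to this problem record."""
        self.incidents.append(incident)

    def resolve(self, root_cause: str) -> List[str]:
        """Resolve the problem and every linked incident.

        Returns one notice per linked incident; a real system would
        email these rather than return them.
        """
        self.status = "resolved"
        notices = []
        for incident in self.incidents:
            incident.status = "resolved"
            notices.append(
                f"To {incident.reporter}: '{incident.summary}' resolved. "
                f"Root cause: {root_cause}"
            )
        return notices
```

For example, attaching "Email moving slowly" and "Order entry application operating slowly" to a problem named "Unexplained network slowdowns", then calling `resolve("overheating router replaced")`, would mark both incidents resolved and produce a notice for each reporter.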


Problems and incidents, however, aren’t the only reason that users interact with IT.
Hopefully, they’re not even the major reason your users interact with IT! Aside from
reporting incidents, users also need to request routine services: advice, new hardware
requests, routine change requests, access requests, and so forth. These interactions should
be managed through a more formal workflow in which users submit their request, have it
assigned to the appropriate technician after being approved, and are able to track the status
of their request.
For example:
1. A user might visit a Web site to browse a “catalog” of items they can request, such as
access to systems, changes to hardware, and so forth.
2. A user selects an item from the catalog, and provides whatever details are necessary
to complete the request.
3. A ticket is created in the service desk that represents the user’s request. Depending
upon the request, the ticket might first be routed to the user’s manager for approval.
4. Once approved, the ticket would be automatically routed to the appropriate
technician or IT team for completion.
5. The user would receive status updates, perhaps via email, throughout this process,
keeping them informed of its progress. The status updates would include a
“completed” update once the request was finished.
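The steps above amount to a small state machine. The sketch below is one possible way to model it, with status names and transitions chosen for illustration (they are assumptions, not a prescribed ITIL or vendor workflow). Each status change appends a message to a list, which stands in for the email updates sent to the user.

```python
from enum import Enum
from typing import List


class Status(Enum):
    SUBMITTED = "submitted"
    PENDING_APPROVAL = "pending approval"
    APPROVED = "approved"
    REJECTED = "rejected"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


# Hypothetical transition table: which statuses may follow which.
# SUBMITTED may skip straight to APPROVED for requests needing no manager sign-off.
ALLOWED = {
    Status.SUBMITTED: {Status.PENDING_APPROVAL, Status.APPROVED},
    Status.PENDING_APPROVAL: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.COMPLETED},
    Status.REJECTED: set(),
    Status.COMPLETED: set(),
}


class ServiceRequest:
    def __init__(self, requester: str, catalog_item: str) -> None:
        self.requester = requester
        self.catalog_item = catalog_item
        self.status = Status.SUBMITTED
        self.updates: List[str] = []  # stand-in for emailed status updates

    def move_to(self, new_status: Status) -> None:
        """Advance the request, refusing transitions the workflow does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status
        self.updates.append(
            f"{self.requester}: your request for {self.catalog_item} "
            f"is now {new_status.value}"
        )
```

Keeping the allowed transitions in one table makes the approval step hard to bypass by accident: a request cannot reach a technician unless it has passed through the approved status.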
By using the same ticket‐based system employed for problem‐solving to address routine
requests, IT technicians can rely on a single interface to manage their workload. Figure 3.3
shows what a routine request ticket might look like.



Figure 3.3: Routine requests can also be made into tickets.
Even better, IT management can rely on all IT work being documented and tracked in a
single system, enabling management to stay informed through reports, dashboards, and
other mechanisms. Figure 3.4 shows an example of what such a report might look like.



Figure 3.4: Management reports become more effective when they include all IT
workload.
The idea is to keep everyone in the loop: users remain informed, IT remains informed,
management remains informed. Much of the burden of keeping everyone informed is
handled by the software, which can send email updates and other kinds of notifications so
that everyone is aware of what’s happening at all times.
Making Changes: How to Find a Change Management Window
Large, multi‐discipline IT departments have inherent problems. In the previous chapter, I
discussed the problem of silo‐based problem solving, where domain experts spend time
passing a problem back and forth because everyone is looking at different tools and data to
determine whether the problem is “theirs.” We’re certainly not going to get rid of domain
experts, so the solution is to get tools that could put everything into a single console in
order to unify everyone’s efforts.


Another problem created by those silos relates to change management. At the start of this
chapter, I outlined one of those problems: The database team is ready to implement a
change, but it’s going to be in conflict with a change being implemented by another group.
Managing change windows is becoming increasingly difficult. Not only are applications and
services needed round‐the‐clock, creating tiny change windows in the first place, but the
varying needs of different experts create contention for those already‐small windows.
“Boss, we’d have that fix in place, but we can only implement it at night. It’s going to take 4
hours, which just fits inside the window management allows us. But all this week, other
teams have been using the window, and the changes they’re making are blocking us from
doing anything at the same time.” It’s not an unusual situation. It gets tough for
management to even track what changes are pending and to slot them into the shrinking
time that’s available to make them.
The lack of visibility into these windows, and the contention for them, makes it impossible
to even make a management decision. For example, if management could see the number of
changes stacked up, and see the contention, they might decide to expand the window for a
period of time in order to get the changes implemented. They might not decide to do that,
but they’d be consciously making a decision rather than remaining ignorant of the actual
problem.
The solution, of course, is software that facilitates the coordination of departments. Think
about it: If you’re using a service desk solution to track tickets, then tickets can be created
for proposed changes. Those tickets would be assigned to a technician, routed for reviews
and approvals, and so forth, all via some workflow you designed. That’s an excellent way to
support ITIL processes, by the way. The tickets themselves can then feed a unified
calendar, built right into the service desk, which allows change planners to schedule
activities. They can see agreed maintenance windows, manage contention between
conflicting changes, and so forth. By getting this information into a familiar calendar form,
they can also make decisions about whether to widen maintenance windows if doing so is
necessary and beneficial to the organization. Figure 3.5 shows a change management
calendar.



Figure 3.5: Managing change schedules in a calendar view.
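Behind a calendar like this sits a fairly simple check: two changes are in contention when they overlap in time and touch a shared resource. The following sketch assumes each change ticket carries a start time, an end time, and a set of affected resources; those field names and the resource labels are hypothetical, used only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Change:
    team: str
    summary: str
    start: datetime
    end: datetime
    resources: FrozenSet[str]  # hypothetical labels, e.g. {"rack-12", "db01"}


def conflicts(a: Change, b: Change) -> bool:
    """True when the two changes overlap in time and share a resource."""
    overlap_in_time = a.start < b.end and b.start < a.end
    return overlap_in_time and bool(a.resources & b.resources)


def find_conflicts(changes: List[Change]) -> List[Tuple[Change, Change]]:
    """Return every conflicting pair among the scheduled changes."""
    return [(a, b) for a, b in combinations(changes, 2) if conflicts(a, b)]
```

With this check, the database team's change and the data center team's power supply work would be flagged as a pair if both were scheduled for the same night against the same rack, while an unrelated change in the same window would pass.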
This is just another way to help keep everyone in the loop. Management now has a clear
visual depiction of change and schedule contention. Such a calendar could even be made
available to users so that they could see what changes were scheduled and plan their own
activities accordingly.
Communicating: How to Bring Users into the Loop
The idea of keeping users informed certainly isn’t new, but many organizations that have
attempted to better engage their users haven’t met with unqualified success. Too often,
“keep users in the loop” solutions take the form of self‐service Web portals, where users
can log in to check the status of their tickets or to check the status of a particular service.
That’s all well and good, but Web portals like that don’t always fall within the natural
workflow of a user. For example, most users, when confronted with some kind of problem,
don’t necessarily think to check a Web site and see if something’s wrong—they call the
Help desk.


Users do, however, spend a lot of time in their email inbox. Why not make that your
channel for communication? Organizations don’t use this method of communication in part
because doing so could easily become a time burden for your IT team. “So on top of solving
the problem, I have to send out hourly update emails with the status of the problem?”
Sounds like a Dilbert cartoon!
In reality, a good service desk solution can do it for you. Sending an email update when a
user’s ticket is updated, for example, is an easy operation for a piece of software. Such
emails can be informative, and help users feel comfortable that their request is being
handled. Figure 3.6 shows what one might look like.

Figure 3.6: Keeping users informed with detailed emails.


What’s more compelling is a service desk solution that can actually accept requests via
email rather than expecting users to go to a self‐service Web portal and open a ticket. Face
it: Your users are more likely to pick up the phone than visit a Web site, unless you’ve
placed significant artificial barriers in the way, like complex voice menus in the phone
system. Users are more likely to send an email. If your service desk, rather than a human
technician, can receive those emails and use them to create a ticket, you’ve truly created a
system your users are likely to embrace. Such tickets could still be auto‐assigned and
auto‐routed, helping the right technician to start working the problem more quickly.
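As a rough illustration of email‐to‐ticket conversion, the sketch below uses Python's standard `email` package to pull the sender, subject, and plain‐text body out of a raw message. The ticket is just a dictionary with hypothetical field names; a real service desk would also handle attachments, encoded bodies, replies to existing tickets, and spam, none of which are covered here.

```python
from email import message_from_string
from email.utils import parseaddr


def ticket_from_email(raw: str) -> dict:
    """Build a minimal ticket (hypothetical fields) from a raw email message."""
    msg = message_from_string(raw)
    _, sender = parseaddr(msg.get("From", ""))

    # Use the first plain-text part as the ticket details.
    # Simplification: transfer-encoded bodies are not decoded here.
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload()
            break

    return {
        "requester": sender,
        "summary": msg.get("Subject", "(no subject)"),
        "details": body.strip(),
        "status": "open",
    }
```

The resulting ticket could then pass through the same auto‐assignment rules as a ticket raised by the monitoring system.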
Even for your users’ routine, non‐problem requests, email updates can be valuable. When
their request is approved, rejected, underway, completed, and so forth, an email update
helps keep users informed without additional human effort.
Note
I want to emphasize that self‐service portals are a good thing. They can
provide a rich user experience, help guide users to self‐service solutions, and
more. They just shouldn’t be the only means of communicating with users.
SLAs: Setting and Meeting Realistic Expectations
Unless you’ve been living under a rock for the past decade or so, Service Level Agreements
(SLAs) are probably pretty familiar to you. These are, in their simplest form, an agreement
by the IT team to provide a specific level of performance or availability for a specific service
or application. “The email service will be available 99.999% of the time on an annualized
basis” is an example of a very simple SLA.
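It helps to translate an availability percentage into the downtime it actually permits. A short calculation, assuming a 365‐day year, shows that 99.999% leaves a little over five minutes of downtime per year. The function name is an illustrative choice.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    total_minutes = days * 24 * 60  # 525,600 minutes in a 365-day year
    return total_minutes * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

At 99.9% the allowance is about 525.6 minutes (under nine hours) per year; at 99.999% it is about 5.26 minutes. A single server reboot can consume the stricter budget, which is worth knowing before committing to the number.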
But SLAs can get complicated quickly. You can’t just pull a number out of thin air; what
level of service can you reasonably provide? What level of service have you historically
provided, and is that meeting the business’ needs? Once established, how do you track the
SLA to make sure you’re actually meeting it—and ideally get some kind of notification
when you’re in danger of breaking the agreement?
SLAs might not be the only type of agreement you need to define and track. Some
organizations also use underpinning contracts (UCs) or operational level agreements (OLAs)
for different in‐ and out‐sourced services; these often support SLAs.
A well‐built service desk and monitoring solution can help you handle these agreements
more precisely. You’ll start by defining top‐level SLAs, then creating and managing UCs and
OLAs as appropriate.
Once defined, the solution should be able to track ongoing performance and availability,
perhaps offering a simple dashboard—like the one shown in Figure 3.7—that illustrates
your compliance with your SLAs. You might also have more comprehensive and detailed
reports on SLA metrics.



Figure 3.7: Managing SLAs with at‐a‐glance dashboards.
Most importantly, however, the solution needs to provide you with the ability to define
rules for your SLAs so that tickets can be created—and auto‐assigned to the appropriate
technicians—when SLAs are in danger of being broken. Further, the solution should
support escalation rules so that if an SLA that is in danger of being broken is not corrected
within a certain amount of time, the solution can automatically call for backup, summoning
additional technicians, notifying management, and so forth.
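Escalation rules of this kind can be expressed as a list of time thresholds. The sketch below is a simplified illustration: the thresholds (0, 30, and 60 minutes) and the roles notified are invented for the example, and a real solution would send notifications rather than return a list of names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationStep:
    after_minutes: int  # minutes the SLA has been at risk without correction
    notify: str         # hypothetical role to bring in at this point


# Hypothetical escalation ladder, ordered by time.
STEPS = [
    EscalationStep(0, "assigned technician"),
    EscalationStep(30, "backup technician"),
    EscalationStep(60, "IT manager"),
]


def who_to_notify(minutes_at_risk: int, steps: List[EscalationStep] = STEPS) -> List[str]:
    """Everyone who should have been notified by now, in escalation order."""
    return [s.notify for s in steps if minutes_at_risk >= s.after_minutes]
```

After 45 minutes without a fix, this ladder would have involved the assigned technician and the backup technician; the manager joins at the one‐hour mark.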
There’s also a strong need to recognize that no SLA is perfect. Sometimes, for whatever
reason, the business will decide to take a service offline. Perhaps it’s for a software upgrade
or for some kind of infrastructure maintenance. In those cases, you’re not breaking the SLA;
you’re agreeing—along with whatever part of the business will be affected—to temporarily
suspend the SLA to get the work done. A service desk solution should support these types of
exceptions, including SLAs that are only valid during certain hours, holiday exceptions,
agreed‐upon reduced service windows, maintenance windows, and so forth.
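One way to picture these exceptions is as time windows that are subtracted from any outage before it counts against the SLA. The sketch below assumes outages and exempt windows are simple start and end pairs, and that the exempt windows do not overlap each other; both are simplifications made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end)


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of time shared by two intervals (zero if they do not overlap)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


def counted_downtime(outages: List[Interval], exempt: List[Interval]) -> timedelta:
    """Total outage time that counts against the SLA.

    Time falling inside an agreed exempt window (maintenance, holiday, and
    so forth) is excused. Assumes exempt windows do not overlap each other.
    """
    total = timedelta(0)
    for outage in outages:
        excused = sum((overlap(outage, w) for w in exempt), timedelta(0))
        total += (outage[1] - outage[0]) - excused
    return total
```

For instance, a two‐hour outage from 01:00 to 03:00 that runs into a maintenance window starting at 02:00 would count as only one hour of SLA‐relevant downtime.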
The idea is to automate SLA definition and management—and to automate the notifications
that go with SLAs. If an SLA is broken, you might agree that the affected business users will
receive an automatic notification. That lets them know that IT knows about the problem
and is working on it—without forcing users to visit a self‐service portal and open a ticket.
That kind of proactive response can go a long way toward improving IT‐user relationships,
and in helping IT be viewed as responsive to, and supportive of, business requirements.
