E book creating-unified-it-monitoring-and-management-in-your-environment-chap-1-6

sponsored by
Creating Unified IT
Monitoring and
Management in
Your Environment
Don Jones

Creating Unified IT Monitoring and Management in Your Environment Don Jones

i
Introduction to Realtime Publishers
by Don Jones, Series Editor
For several years now, Realtime has produced dozens and dozens of high‐quality books
that just happen to be delivered in electronic format—at no cost to you, the reader. We’ve
made this unique publishing model work through the generous support and cooperation of
our sponsors, who agree to bear each book’s production expenses for the benefit of our
readers.
Although we’ve always offered our publications to you for free, don’t think for a moment
that quality is anything less than our top priority. My job is to make sure that our books are
as good as—and in most cases better than—any printed book that would cost you $40 or
more. Our electronic publishing model offers several advantages over printed books: You
receive chapters literally as fast as our authors produce them (hence the “realtime” aspect
of our model), and we can update chapters to reflect the latest changes in technology.
I want to point out that our books are by no means paid advertisements or white papers.
We’re an independent publishing company, and an important aspect of my job is to make
sure that our authors are free to voice their expertise and opinions without reservation or
restriction. We maintain complete editorial control of our publications, and I’m proud that
we’ve produced so many quality books over the past years.
I want to extend an invitation to visit us at http://nexus.realtimepublishers.com, especially
if you’ve received this publication from a friend or colleague. We have a wide variety of
additional books on a range of topics, and you’re sure to find something that’s of interest to
you—and it won’t cost you a thing. We hope you’ll continue to come to Realtime for your
far into the future. educational needs
enjoy. Until then,
Don Jones

ii

Introduction to Realtime Publishers ................................................................................................................. i
Ch

apter 1: Managing Your IT Environment: Four Things You’re Doing Wrong ........................... 1
IT Management: How We Got to Where We Are Today ..................................................................... 1
Problem 1: You’re Managing IT in Silos ..................................................................................................... 3
Problem 2: You Aren’t Connecting Your Users, Service Desk, and IT Management ............... 6
Problem 3: You’re Measuring the Wrong Things ................................................................................. 8
Problem 4: You’re Losing Knowledge ..................................................................................................... 12
How Truly Unified Management Can Fix the Problems ................................................................... 13
Summary .............................................................................................................................................................. 14
Ch apter 2: Eliminating the Silos in IT Management ............................................................................... 16
Too Many Tools Means Too Few Solutions ........................................................................................... 16
Domain‐Specific Tools Don’t Facilitate Cooperation ........................................................................ 19
The Cloud Question: Unifying On‐Premise and Off‐Premise Monitoring................................. 21
Missing Pieces .................................................................................................................................................... 23
Not All of IT Is a Problem: Ordering, Routing, and Providing Services ..................................... 27
Coming Up Next… ............................................................................................................................................. 28
Ch apter 3: Connecting Everyone to the IT Management Loop ........................................................... 29
Starting the Loop: Connecting Monitoring to the Service Desk ................................................... 30
Making Changes: How to Find a Change Management Window .................................................. 35
Communicating: How to Bring Users into the Loop .......................................................................... 37
SLAs: Setting and Meeting Realistic Expectations .............................................................................. 39
Thin Tell Me What You Really k ................................................................................................................... 41
When Everyone Doesn’t Need to See Everything: A Multi‐Tenant Approach ........................ 42
Call It a Private Management Cloud: Allocating Costs ...................................................................... 43
Conclusion ........................................................................................................................................................... 44
Coming Up Next… ............................................................................................................................................. 44
Ch apter 4: Monitoring: Look Outside the Data Center .......................................................................... 45
Monitoring Technical Counters vs. the End‐User Experience ...................................................... 45


iii
How the EUE Drives Better SLAs ............................................................................................................... 46

How It’s Done: Synthetic Transactions, Transaction Tracking, and More ............................... 49
Top‐Down Monitoring: From the EUE to the Root Problem ......................................................... 50
Agent vs. Agentless Monitoring .................................................................................................................. 51
Monitoring What Isn’t Yours ....................................................................................................................... 54
Critical Capability: You Need to Monitor Everything ........................................................................ 57
Conclusion ........................................................................................................................................................... 59
Coming Up Next… ............................................................................................................................................. 59
Ch apter 5: Turning Problems into Solutions ............................................................................................. 60
Closing the Loop: Connecting the Service Desk to Monitoring ..................................................... 60
Re taining Knowledge Means Faster Future Resolution .................................................................. 62
Knowledge Bases ......................................................................................................................................... 63
Tickets as Knowledge Base Articles .................................................................................................... 64
Unifying the Knowledge Base ................................................................................................................. 65
Making Tickets an Asset ........................................................................................................................... 69
Pa st Performance Is an Indication of Future Results ........................................................................ 69
It’s the Performance Database ............................................................................................................... 72
Summary .............................................................................................................................................................. 73
Coming Up Next… ............................................................................................................................................. 73
Ch apter 6: Unified Management, Illustrated ............................................................................................. 74
Th e Case Studies ............................................................................................................................................... 74
Detecting and Solving Problems ........................................................................................................... 74
Fulfilling User Orders ................................................................................................................................. 79
A Shopping List for Unified IT Management ......................................................................................... 82
Ways to Buy Your Unified IT ....................................................................................................................... 84
Conclusion ........................................................................................................................................................... 85


iv
Copyright Statement
© 2012 Realtime Publishers. All rights reserved. This site contains materials that have
been created, developed, or commissioned by, and published with the permission of,
Realtime Publishers (the “Materials”) and this site and any such Materials are protected
by international copyright and trademark laws.
THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice
and do not represent a commitment on the part of Realtime Publishers its web site
sponsors. In no event shall Realtime Publishers or its web site sponsors be held liable for
technical or editorial errors or omissions contained in the Materials, including without
limitation, for any direct, indirect, incidental, special, exemplary or consequential
damages whatsoever resulting from the use of any information contained in the Materials.
The Materials (including but not limited to the text, images, audio, and/or video) may not
be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any
way, in whole or in part, except that one copy may be downloaded for your personal, non-
commercial use on a single computer. In connection with such use, you may not modify
or obscure any copyright or other proprietary notice.
The Materials may contain trademarks, services marks and logos that are the property of
third parties. You are not permitted to use these trademarks, services marks or logos
without prior written consent of such third parties.
Realtime Publishers and the Realtime Publishers logo are registered in the US Patent &
Trademark Office. All other product or service names are the property of their respective
owners.
If you have any questions about these terms, or if you would like information about
licensing materials from Realtime Publishers, please contact us via e-mail at
info@realtimepublishers.com.


1
Chapter 1: Managing Your IT Environment:
Four Things You’re Doing Wrong
At the very start of the IT industry, “monitoring” meant having a guy wander around inside
the mainframe looking for burnt‐out vacuum tubes. There wasn’t really a way to locate the
tubes that were working a bit harder than they were designed for, so monitoring—such as
it was—was an entirely reactive affair.
In those days, the “Help desk” was probably that same guy answering the phone when one
of the other dozen or so “computer people” needed a hand feeding punch cards into a
hopper, tracking down a burnt‐out tube, and so on. The concepts of tickets, knowledge
bases, service level agreements (SLAs), and so forth hadn’t yet been invented.
IT management has certainly evolved since those days, but it unfortunately hasn’t evolved
as much as it could or should have. Our tools have definitely become more complex and
more mature, but the way in which we use those tools—our IT management processes—
are in some ways still stuck in the days of reactive tube‐changing.
Some of the philosophies that underpin many organizations’ IT management practices are
really becoming a detriment to the organizations that IT is meant to support. The
discussion in this chapter will revolve around several core themes, which will continue to
drive the subsequent chapters in this book. The goal will be to help change your thinking
about how IT management—particularly monitoring—should work, what value it should
provide to your organization, and how you should go about building a better‐managed IT
environment.
IT Management: How We Got to Where We Are Today
In the earliest days of IT, we dealt with fairly straightforward systems. Even simplistic, by
today’s standards. The IT team often consisted of people who could fix any of the problems
that arose, simply because there weren’t all that many “moving parts.” It’s as if IT was a car:
A machine capable of complexity and of doing many different things, but perfectly
comprehendible, in its entirety, by a single human being.

2

As we started to evolve that IT car into a space shuttle, we gradually needed to allow for
specialization. Individual systems became so complex in and of themselves that we needed
domain‐specific experts to be able to monitor, maintain, and manage each system.
Messaging systems. Databases. Infrastructure components. Directory services. The vendors
who produced these systems, along with third parties, developed tools to help our experts
monitor and manage each system. That’s really where things went wrong. It seemed
perfectly sensible at the time, and indeed there was probably no other way to have done
things, but that establishment of domain‐specific silos—each with their own tools, their
own procedures, and their own expertise—was the seed for what would become a
towering problem inside many IT shops.
Fast forward to today, when our systems are vastly more complex, vastly interconnected,
and increasingly not even hosted within our own data centers. When a user encounters a
problem, they obviously can’t tell us which of our many complex systems is at fault. They
simply tell us what they observe and experience about the problem, which may be the
aggregate result of several systems’ interactions and interdependencies. Our users see a
holistic environment: “IT.” That doesn’t correspond well to what we see on the back end:
databases, servers, directories, files, networks, and more. As a result, we often spend a lot
of time trying to track down the root cause of problems. Worse, we often don’t even see the
problems coming, because the problems only exist when you look at the end result of the
entire environment rather than at individual subsystems. Users feel completely
disconnected from the process, shielded from IT by a sometimes‐helpful‐sometimes‐not
“Help desk.” IT management has a difficult time wrapping their heads around things like
performance, availability, and so on, simply because they’re forced to use metrics that are
specific to each system on the network rather than look at the environment as a whole.
The way we’ve built out our IT organizations has led to very specific business‐level issues,
which have become common concerns and complaints throughout the world:
• IT has difficulty defining and meeting business‐level SLAs. “The messaging server
will be up 99% of the time” isn’t a business‐level SLA; it’s a technical one. “Email will
flow between internal and external users 99% of the time” is a business‐level SLA,
but it can be difficult to measure because that statement involves significantly more
systems than just the email server.
• IT has difficulty proactively predicting problems based on system health, and
remains largely reactive to problems.
• When problems occur, IT often spends far too much time pinpointing the root cause
of the problem.
• IT’s concept of performance and system health is driven by systems—database
servers, directory services, network devices, and so forth—rather than by how users
and the organization as a whole are experiencing the services delivered by those
systems.

3

• IT has a tough time rapidly adopting new technologies that can benefit the business.
Oxymoronically, IT is often the part of the organization most opposed to change,
because change is usually the trigger for problems. Broken systems don’t help
anyone, but an inability to quickly incorporate changes can also be a detriment to
the organization’s competitiveness and flexibility.
• IT has a really tough time adopting new technologies that are significantly outside
the team’s experience or physical reach—most specifically the bevy of outsourced
offerings commonly grouped under the term “cloud computing.” These technologies
and approaches to technology are so different from what’s come before that IT
doesn’t feel confident that they can monitor and manage these new systems. Thus,
they resist implementing these types of systems for fear that doing so will simply
damage the organization.
• Even with modern self‐service Help desk systems, users feel incredibly powerless
and out of touch when it comes to IT.
All of these business‐level problems are the direct result of how we’ve always managed IT.
Our processes for monitoring and managing IT basically have four core problems. Not
every organization has every single one of these, of course, and most organizations are at
least aware of some of these and work hard to correct them. Ultimately, however,
organizations need to ensure that all four of these core problems are addressed. Doing so
will immediately begin to resolve the business‐level issues I’ve outlined.
Problem 1: You’re Managing IT in Silos
Figures 1.1, 1.2, and 1.3 illustrate one of the fundamental problems in IT monitoring and
management today.

Figure 1.1: Windows Performance Monitor.


4

Figure 1.2: SQL Server Performance.

Figure 1.3: Router Performance.
These figures each illustrate a different performance chart for various components of an IT
system. Each of these images was produced using a tool that is more or less specialized for
the exact thing that was being monitored. The tool that produced the router performance
chart, for example, can’t produce the same chart for a database server or even for a router
that’s located on someone else’s network.

5

This is such a core, fundamental problem that many IT experts can’t even recognize that it
is a problem. Using these domain‐specific tools is such an integrated and seemingly natural
part of how IT works that many of us simply can’t imagine a different way. But we need to
move past using these domain‐specific tools as our first line of defense when it comes to
ring and troubleshooting. monito
Why?
One major reason is that these tools keep us all from being on the same page. IT experts
can’t even have meaningful cross‐discipline discussions when these tools become involved.
“I’m looking at the database server, and the performance is at more than 200 TPMs,” one
expert says. “Well, that must be a problem because the router is running well over 10,000
PPMs.” Those two experts don’t even have a common language for performance because
they’re locked into the domain‐specific, deeply‐technical aspects of the technologies they
manage.
Domain‐specific tools also encourage what is probably the worst single practice in all of IT:
looking at systems in isolation. The database guy doesn’t have the slightest idea what
makes a router tick, what constitutes good or bad performance in a messaging server, or
what to look for to see if the directory services infrastructure is running smoothly. So the
database guy puts on a set of blinders and just looks at his database servers. But those
servers don’t exist in a vacuum; they’re impacted by, and they in turn impact, many other
systems. Everything works together, but we can’t see that using domain‐specific tools.
We have to permanently remove the walls between our technical disciplines, breaking
down the silos and getting everyone to work as a single team. In large part, that means
we’re going to have to adopt new tools that enable IT silos to work as a team, putting the
information everyone needs into a common context. Sure, domain‐specific tools will always
have their place, but they can’t be our first line of information.
Case Study
Jerry works for a typical IT department in a midsize company. His specialty is
Windows server administration, and his team includes specialists for Web
applications, Microsoft SQL Server and Oracle, VMware vSphere, and for the
network infrastructure. The company outsources certain enterprise
functionality, including their Customer Relationship Management (CRM) and
email.
Recently, a problem occurred that caused the company’s main Web site to
stop sending customer order confirmation emails. Jerry was initially called to
solve the problem, on the assumption that it was with the company’s
outsourced messaging solution. Jerry discovered, however, that user email
was flowing normally. He passed the problem to the Web specialist, who
confirmed that the Web site was working properly but that emails sent by it
were being rejected. Jerry filed a ticket with the messaging hosting company,
who responded that their systems were in working order and that he should
check the passwords that the Web servers were using.


6
After more than a day of back‐and‐forth with the hosting company and
various experts, the problem was traced to the company’s firewall. It had
recently been upgraded to a new version, and that version was now blocking
outgoing message traffic from the company’s perimeter network, which is
where the Web servers were located. The network infrastructure specialist
was called in to reconfigure the firewall, and the problem was solved.

This narrative precisely demonstrates the problem: By managing our IT teams as domain‐
specific silos, we significantly hinder their ability to work together to solve problems. The
fact that IT experts require domain‐specific tools shouldn’t be a barrier to breaking down
those silos and getting our team to work more efficiently together. This becomes especially
important when pieces of the infrastructure are outsourced; those hosting companies are
an unbreakable silo, as they’re not responsible for any systems other than the ones they
provide to us. However, the dependencies that our systems and processes have on their
systems means our own team still has to be able to monitor and troubleshoot those
outsourced systems as if they were located right in the data center.
Problem 2: You Aren’t Connecting Your Users, Service Desk, and IT
Management
Communication is a key component of making any team work; and the “team” that is your
organization is no exception. In the case of IT, we typically use Help desk systems as our
means of enabling communications—but that isn’t always sufficient. Help desk systems are
almost always built around the concept of reacting to problems, then managing that
reaction; they’re almost by definition not proactive.
For example, how do you tell your users that a given system will have degraded
performance or will be offline for some period of time? Probably through email, which
creates a couple of problems:
• Important messages tend to get lost in the glut of email that users deal with daily
• Users who don’t get the message tend to go the “Help desk route,” which doesn’t
include a means of intercepting their mental process and letting them know that the
“problem” was planned for.

7

Most IT teams do know the things that need to be communicated throughout the
organization, for example:
• SLAs
• ey’re being met The current status of SLAs—whether th
• Planned outages and degraded service
• ices Average response times for specific serv
• Known issues that are being worked on
What most IT teams have a problem with is communicating these items consistently across
the entire organization. Some organizations rely on email, which as I’ve already pointed out
can be inefficient and not consistently effective. Some organizations will use an intranet
Web site, such as a SharePoint portal, to post notices—but these sites aren’t directly
integrated with the Help desk, making it an extra step to keep them updated and requiring
users to remember to check them.
Case Study
Tom works as an inside salesperson for a midsize manufacturing company.
Recently, the application that Tom uses to track prospects and create new
orders started responding very slowly, and over the course of the day,
stopped working completely.
Tom’s initial action was to call his company’s IT Help desk. The Help desk
technician sounded harried and frustrated, and told Tom, “We know, we’re
working on it,” and hung up. Tom had no expectation when the system might
return to normal, and was afraid to bother the Help desk by calling back for
more details.
Over the course of that day, the Help desk logged calls from nearly every
salesperson, each of whom called on their own to find out what was going on.
Eventually, the Help desk simply stopped logging the calls, telling everyone
that, “A ticket is already open,” and disconnecting the call.
Someone on the IT management team eventually sent out an email explaining
that a server had failed and that the application wasn’t expected to be online
until the next morning. Tom wished he had known earlier; although he’d
originally planned to make sales calls all day, if he’d known that the
application would be down for that long, he could have switched to other
activities for the day or even just taken the day of

f.

8

Management communications are equally important, and equally challenging. Providing
frank numbers on service levels, response times, outages, and so forth is crucial in order for
management to make better decisions about IT—but that information can often be difficult
to come by.
Problem 3: You’re Measuring the Wrong Things
This problem is very likely at the heart of everything IT is not doing to help better align
technology with business needs. The following case study outlines the scenario.
Case Study
Shelly works in the Accounting department for her company. Recently, while
trying to close the books for her company, the accounting application began
to react very slowly. She called her company’s IT Help desk to report the
problem.
The Help desk technician listened to her then said that, “Everything on that
server looks fine right now. I’ll open a ticket and ask someone to look at it,
but since we are currently within our service level agreement for response
times, it will be a low‐priority ticket.”
Shelly continued to struggle with the slowly‐responding application.
Eventually, someone was dispatched to her desktop. She demonstrated that
every other application was responding normally. She pointed out that other
people in her department were having similar problems with the application.
The technician made her close all of her applications and then restarted her
computer, to no effect. He shrugged, entered some notes into his smartphone,
and left.
By the next morning, the application’s response times were better, but they
were far from normal. Shelly continued to call the Help desk for updates on
her ticket’s status, but it seemed as if the IT team had given up on trying to fix
the problem—and refused to even admit that there was a problem.

This kind of scenario unfortunately happens all too often in many organizations. It exactly
illustrates what happens when several problems are happening at once: IT is operating as a
set of individual silos rather than as a team, and each silo has its own definition for words
like “slow.” A root issue here is that everyone is measuring the wrong thing. Figure 1.4
shows how the average IT team sees a multi‐component, distributed application.


9

Figure 1.4: IT perspective of a distrib

uted application.

10

They see the components. Domain experts measure the performance of each component
using technical metrics, such as processor utilization, response time, and so forth. When a
component’s performance exceeds certain predefined thresholds, someone in IT pays
attention. Figure 1.5, however, shows how a user sees this same application.

Figure 1.5 er’s perspective of a distributed application.
The user doesn’t—often can’t—see any of the components. They simply see an application,
and either it’s responding the way they expect, or it isn’t. It doesn’t matter a bit to the user
if every single constituent component is running at an “acceptable level of processor
utilization”—whatever that means. They simply care whether the application is working.
This creates a major disconnect between the user population and IT, as Figure 1.6
illustrates.
: Us


11

Figure 1.6: IT vs. user measurements

of performance.

12

Users and IT measure very different things. An IT‐centric SLA might specify a given
response time for queries sent to a database server; that often has little to do with whether
an application is seen as “slow” by users. Worse, as we start to migrate services and
components to “the cloud,” we lose much of our ability to measure those components’
performance the way we do for things that are in our own data center. The result? Nobody
can agree on what an SLA should say.
This all has to change. We have to start measuring things more from a user perspective.
The performance of individual components is important, but only as they contribute to the
total experience that a user perceives. We need to define SLAs that put everyone—users
and IT—on the same page, then manage to those SLAs using tools that enable us to do so.
Some organizations will tell you that they’re moving, or have moved, to a service‐based IT
offering. What that generally means in broad terms is that the organization is seeking to
provide IT as a set of services to the organization’s various departments and users. In many
instances, however, those “service‐oriented” organizations are still focused on components
and devices, which isn’t a service‐oriented approach at all. When your phone line goes
down, you don’t call the phone company (on your cell phone, probably) and start asking
questions about switches and trunk lines—you ask when your dial tone will be back. The
back‐end infrastructure is meaningless to the user. You don’t ask for a service credit based
on how long a particular phone company office will be offline, you ask for that credit based
on how long you went without a dial tone. That's the model IT needs to move toward.
Problem 4: You’re Losing Knowledge
The last problematic practice we’ll look at is the issue of lost institutional knowledge. This
problem is a purely human one, and frankly it’s going to be difficult to address. Here’s a
quick scenario to set the scene.
Case Study
Aaron works for his company’s IT department. He’s been with the company
for 3 years and is responsible for several of the company’s systems and
infrastructure components. One Tuesday, Aaron is contacted by his
company’s IT Help desk. “We’re assigning you a ticket about the Oracle
system,” he’s told. “Once every couple of months it starts acting really weird,
and someone has to fix it.”
“I’m not the Oracle guy,” Aaron says. “That’s Jill.”
“Yeah, but Jill’s out on vacation for 2 weeks. So you’ll have to fix it.”
“I’ve no idea what to do!”
“Well, figure something out. The CEO gets upset when this takes too long to
fix.”

13

Unfortunately, too much knowledge gets wrapped up in the heads of specific individuals. In
fact, it’s a sad truth that many organizations “deal” with this problem by simply
discouraging IT team members to take lengthy vacations, and often resist other activities
that would put them out of touch—such as sending them to conferences and classes to
continue their education and to learn new skills.
More than a few organizations have made halfhearted attempts at building “knowledge
bases,” in a hope that some of this institutional knowledge can be committed to electronic
paper, preserved, and made more accessible. The problem is that IT professionals aren’t
necessarily good writers, so the act of producing the knowledge base is difficult for them. It
also takes time—time the organization is often unwilling to commit, especially in the face
of other daily pressures and demands.
As I said, this is a problem that’s difficult to fix. The IT team realizes it’s a problem, and is
generally willing to fix it—but they’re not tech writers, and often have a limited ability to
fix the problem. You can usually create management requirements that require problems
and solutions be logged in a Help desk ticketing system, but searching through that system
for problems and solutions can often be difficult and time‐consuming—much like searching
for solutions on an Internet search engine, with all of the false “hits” such a search generally
s. produce
But we must find a way to address this problem. Knowledge about the company’s
infrastructure—and how to solve problems—has to be captured and preserved. This
requirement is crucial not only to solving problems faster in the future but also to
eventually preventing those problems by making better IT management decisions.
How Truly Unified Management Can Fix the Problems
This book is going to be all about fixing these four problems, and the means by which I’ll
propose to do so falls under the umbrella term unified management. Essentially, unified
management is all about bringing everything together in one place.
We’ll break down the silos between IT disciplines, putting everyone onto the same console,
getting everyone working from the same data set, and getting everyone working together
on problems. We’ll do that in a way that brings users, IT, and management into a single
viewport of IT service and performance. We’ll create more transparency about things like
service levels, letting users see what’s happening in the environment so that they’re more
informed.
We’ll inform users in a way that’s meaningful to them rather than using invisible, back‐end
technical metrics. We’ll rebuild the entire concept of SLAs into something that’s meaningful
first to users and management, and that can withstand the transition to “hybrid IT” that’s
cloud.” being brought about by outsourcing certain IT services to “the

14

Finally, we’ll find a way to capture information about our environment, including solutions
to problems, to enable faster time‐to‐resolution when problems occur. In addition, this
information will enable management to make smarter decisions about future technology
directions and investments.
We’ll try to do all of this in a way that won’t cost the organization an arm and a leg nor take
half a lifetime to actually implement. That will involve a certain amount of creativity,
including looking at outsourced solutions. The idea of an outsourced solution providing
monitoring for in‐sourced components is fairly innovative, and we’ll see what applicability
it has.
I should point out that much of what we’ll be looking at can work to support the IT
management frameworks that many organizations are adopting these days, including the
ITIL framework that’s become popular in the past few years. You certainly don’t have to be
an ITIL expert to take advantage of the new processes and techniques I’ll suggest—nor do
you even have to think about implementing ITIL (or any other framework) if your
organization isn’t already doing so. If you are using a framework, however, you’ll be
pleased to know that everything I have to propose should fit right into it.
Summary
This chapter has established the four main themes that will drive the remaining chapters in
this book. These core things represent what many experts believe are the biggest and most
fundamental problems with how IT is managed today, and represent the things that we’ll
focus on fixing throughout the remainder of this book. Our focus will be on changing
management philosophies and practices, not on simply picking out new tools—although
new tools may be something you’ll acquire to help support these new practices.
Chapter 2 will focus on the first problematic practice, which is the fact that IT tends to be
managed in domain‐specific silos. We’ll look at the technical reasons organizations have
been more or less forced to manage this way, and explore ways in which you can start to
change that practice.
Chapter 3 will look at connecting people: IT management, your users, your service desk,
and more. Only by bringing everyone into the process can IT better align itself to the needs
of the organization.
Our third problem practice will be the subject of Chapter 4, where we dive into looking
outside the data center for monitoring. The goal will be to solve the problems we’ve
to the organization. discussed in this chapter, further focusing IT on its value

15

Chapter 5 will discuss ways to turn problems into future solutions. Although modern
organizations are fully aware of the need for Help desk tracking and knowledge building,
how those activities are managed as part of the larger IT management process can make a
huge difference in their value‐add to the organization.
We’ll conclude in Chapter 6, with an attempt to visualize an IT environment where these
new, unified management practices are in place. I’ll provide narratives from several case
work in a real environment. studies, helping you see how these modernized practices


16
Chapter 2: Eliminating the Silos in IT
Management
In the previous chapter, I proposed that one of the biggest problems in modern IT is the
fact that we manage our environment in technology‐specific silos: database administrators
are in charge of databases, Windows admins are in charge of their machines, VMware
admins run the virtualization infrastructure, and so forth. I’m not actually proposing that
we change that exact practice—having domain‐specific experts on the team is definitely a
benefit. However, having these domain‐specific experts each using their own unique,
domain‐specific tool definitely creates problems. In this chapter, we’ll explore some of
those problems, and see what we can do to solve them and create a more efficient, unified
IT environment.
Too Many Tools Means Too Few Solutions
“Comparing apples to oranges” is an apt phrase when it comes to how we manage
performance, troubleshooting, and other core processes in IT. Tell an Exchange Server
administrator that there’s a performance problem with the messaging system, and he’ll
likely jump right into Windows’ Performance Monitor, perhaps with a pre‐created counter
set that focuses on disk throughput, processor utilization, RPC request count, and so
forth—as shown in Figure 2.1.

Figure 2.1: Monitoring Exchange.

17

If the Exchange administrator can’t find anything wrong with the server, he might pass the
problem over to someone else. Perhaps it will be the Active Directory administrator
because Active Directory plays such a crucial role in Exchange’s operation and
performance. Out comes the Active Directory administrator’s favorite performance tool,
perhaps similar to the one shown in Figure 2.2. This is truly a domain‐specific tool, with
special displays and measurements that relate specifically to Active Directory.

Figure 2.2: Monitoring Active Directory.
If Active Directory looks fine, then the problem might be passed over to the network
infrastructure specialist. Out comes another tool, this one designed to look at the
performance of the organization’s routers (see Figure 2.3).


18

Figure 2.3: Monitoring router performance.
Combined, all of these tools have led these three specialists to the same decision:
Everything’s working fine. In spite of the fact that Exchange is clearly, from the users’ point
of view, not working fine, there’s no evidence that points to a problem.
Simply put, this is a “too many tools, too few answers” problem. In today’s complex IT
environments, performance—along with other characteristics like availability and
scalability—are the result of many components interacting with each other and working
together. You can’t manage IT by simply looking at one component; you have to look at
entire systems of interacting, interdependent components.
Our reliance on domain‐specific tools holds us back from finding the answers to our IT
problems. That reliance also holds us back when it comes time to grow the environment,
manage service level agreements (SLAs), and other core tasks. I’ve actually seen instances
where domain‐specific tools acted almost as blinders, preventing an expert who should
have been able to solve a problem, or at least identify it, from doing so as quickly as he or
she might have done.

19

Case Study
Heather is a database administrator for her organization. She’s responsible
for the entire database server, including the database software, the operating
system (OS), and the physical hardware.
One day she receives a ticket indicating that users are experiencing sharply
reduced performance from the application that uses her database. She whips
out her monitoring tools, and doesn’t see a problem. The server’s CPU is
idling along, disk throughput is well within norms, and memory consumption
is looking good. In fact, she notices that the amount of workload being sent to
the server is lower than she’s used to seeing. That makes her suspect the
network is having traffic jams, so she re‐assigns the ticket to the company’s
infrastructure team. That team quickly re‐assigns the ticket right back to her,
assuring her that the network is looking a bit congested, but it’s all traffic
coming from her server.
Heather looks again, and sees that the server’s network interface is humming
along with a bit more traffic than usual. Digging deeper, she finally realizes
that the server is experiencing a high level of CRC errors, and is thus having
to retransmit a huge number of packets. Clients experience this problem as a
general slowdown because it takes longer for undamaged packets to reach
their computers.
Heather’s focus on her specific domain expertise led her to “toss the problem
over the wall” to the infrastructure team, wasting time. Because she wasn’t
accustomed to looking at her server’s network interface, she didn’t check it
as part of her routine performance troubleshooting process.
Domain‐Specific Tools Don’t Facilitate Cooperation
If the components of our complex IT systems are cooperative and interdependent, our IT
professionals are often anything but. In other words, IT management tends to encourage
the silos that are built around specific technology domains. There’s the database
administration group, the Active Directory group, the infrastructure group, and so forth.
Even companies that practice “matrix management,” in which multiple domain experts are
os around each technical domain. grouped into a functional team, still tend to accept the sil

20

There are two major reasons that these silos persist, and almost any IT professional can
describe them to you:
• “I don’t know anything about that.” Each domain expert is an expert in his technical
area. The database administrator isn’t proficient at monitoring or managing routers,
and doesn’t especially want to work with them anyway. There’s little real value in
extensive technical cross‐training for most organizations, simply because their staff
doesn’t have the time. Devoting time to secondary and tertiary disciplines reduces
the amount of time available for their primary job responsibilities.
• “I don’t want anyone messing with my stuff.” IT professionals want to do a good job,
and they’re keenly aware that most problems come about as the result of change.
Allow someone to change something, and you’re asking for trouble. If someone
changes something in your part of the environment, and you don’t know about their
activity, you’ll have a harder time fixing any resulting problems.
Both of these reasons are completely valid, and I’m in no way suggesting that everyone on
the IT team become an expert in every technology that the organization must support.
minor adjHowever, the attitudes reflected in these two perspectives require some ustment.
One reason I keep coming back to domain‐specific tools is because they encourage this kind
of walled‐garden separation, and do nothing to encourage even the most cursory
cooperation between IT specialists. Cooperation, when it exists, comes about through good
human working relationships—and those relationships often struggle with the fact that
each specialist is looking at a different set of data and working from a different “sheet of
music,” so to speak. I’ve been in environments and seen administrators spend hours
arguing about whose “fault” something was, each pointing to their own domain‐specific
tools as “evidence.”
Case Study
Dan is an Active Directory administrator for his company, and is responsible
for around two dozen domain controllers, each of which runs in a virtual
machine. Peg is responsible for the organization’s virtual server
infrastructure, and manages the physical hosts that run all of the virtual
machines.
One afternoon, Peg gets a call from Dan. Dan’s troubleshooting a performance
problem on some of the domain controllers, and suspects that something is
consuming resources on the virtualization host that his domain controllers
need.


21
Peg opens her virtual server console and assures Dan that the servers aren’t
maxed out on either physical CPU or memory, and that disk throughput is
well within expected levels. Dan counters by pointing to his Active Directory
monitoring tools, which show maxed‐out processor and memory statistics,
and lengthening disk queues that indicate data isn’t being written to and read
from disk as quickly as it should be. Peg insists that the physical servers are
fine. Dan asks if the virtual machines settings have been reconfigured to
provide fewer resources to them, and Peg tells him no.
The two go back and forth like this for hours. They’re each looking at
different tools, which are telling them completely different things. Because
they’re not able to speak a common technology language, they’re not able to
work together to solve the problem.
We don’t need to have every IT staffer be an expert in every IT technology; we do need to
make it easier for specialists to cooperate with one another on things like performance,
scalability, availability, and so forth. That’s difficult to do with domain‐specific tools. The
router administrator doesn’t want a set of database performance‐monitoring tools, and the
database administrator doesn’t especially want the router admin to have those tools.
Having domain‐specific tools for someone else’s technical specialization is exactly how the
two attitudes I described earlier come into play.
Ultimately, the problem can be solved by having a unified tool set. Get everyone’s
performance information onto the same screen. That way, everyone is playing from the
same rule book, looking at the same data—and that data reflects the entire, interdependent
environment. Everyone will be able to see where the problem lies, then they can pull out
the domain‐specific tools to start fixing the actual problem area, if needed.
The Cloud Question: Unifying On‐Premise and Off‐Premise M
This concept of a unified monitoring console becomes even more important as
organizations begin shifting more of their IT infrastructure into “the cloud.”
onitoring
The Cloud Is Nothing New
I have to admit that I’m not a big fan of “the cloud” as a term. It’s very sales‐
and‐marketing flavored, and the fact is that it isn’t a terribly new concept.
Organizations have outsourced IT elements for years. Probably the most‐
outsourced component is Web hosting, either outsourcing single Web sites
into a shared‐hosting environment, or outsourcing collocated servers into
someone else’s data center.


22
For the purposes of this discussion, “the cloud” simply refers to some IT
element being outsourced in a way that abstracts the underlying
infrastructure. For example, if you have collocated servers in a hosting
company’s data center, you don’t usually have details about their internal
network architecture, their Internet connectivity, their routers, and so
forth—the data center is the piece you’re paying to have abstracted for you.
In a modern cloud computing model like Windows Azure or Amazon Elastic
Cloud, you don’t have any idea what physical hosts are running your virtual
machines—that physical server level is what you’re paying to have
abstracted, along with supporting elements like storage, networking, and so
on. For a Software as a Service (SaaS) offering, you don’t even know what
virtual machines might be involved in running the software because you’re
paying to have the entire underlying infrastructure abstracted.
Regardless which bits of your infrastructure wind up in some outsourced service
provider’s hands, those bits are still a part of your business. Critical business applications
and processes rely on those bits functioning. You simply have less control over them, and
typically have less insight into how well they’re running at any given time.
This is where domain‐specific tools fall apart completely. Sure, part of the whole point of
outsourcing is to let someone else worry about performance—but outsourced IT still
supports your business, so you at least need the ability to see how the performance of
outsourced elements is affecting the rest of your environment. If nothing else, you need the
ability to authoritatively “point the finger” at the specific cause of a problem—even if that
cause is an outsourced IT element, and you can’t directly effect a solution. This is where
unified monitoring truly earns a place within the IT environment. For example, Figure 2.4
shows a very simple “unified dashboard” that shows the overall status of several
components of the infrastructure—including several outsourced components, such as
mazon Web Services. A


23

Figure 2.4: Unified monitoring dashboard.
The idea is to be able to tell, at a glance, where performance is failing, to drill through for
more details, and then to either start fixing the problem—if it exists on your end of the
cloud—or escalate the problem to someone who can.
Let’s be very clear on one thing: Any organization that’s outsourcing any portion of its
business IT environment and cannot monitor the basic performance of those outsourced
elements is going to be in big trouble when something eventually goes wrong. Sure, you
have SLAs with your outsourcing partners—but read those SLAs. Typically, they only
commit to a refund of whatever fees you pay if the SLA isn’t met. That does nothing to
compensate you for lost business that results from the unmet SLA. It’s in your best
interests, then, to keep a close watch on performance. That way, when it starts to go bad,
you can immediately contact your outsourcing partner and get someone working on a fix so
that the impact on your business can at least be minimized.
Missing Pieces
There’s another problem when it comes to performance monitoring and management,
scalability planning, and so forth: missing pieces. Our technology‐centric approach to IT
tends to give us a myopic view of our environment. For example, consider the diagram in
Figure 2.5. This is a typical (if simplified) diagram that any IT administrator might create to
help visualize the components of a particular application.


24

Figure 2.5: Application diagram.
The problem is that there are obviously missing pieces. For example, where’s the
infrastructure? Whoever created this diagram clearly doesn’t have to deal with the
infrastructure—routers and switches and so forth—so they didn’t include it. It’s assumed,
almost abstracted like an outsourced component of the infrastructure. Maybe Figure 2.6 is
a more accurate depiction of the environment.


25

Figure 2.6: Expanded application diagram.
And even with this diagram, there are still probably missing pieces. This reality is probably
one of the biggest dangers in IT management today: We forget about pieces that are outside
our purview.

26

Again, this is where a unified monitoring system can create an advantage. Rather than
focusing on a single area of technology—like servers—it can be technology‐agnostic,
focusing on everything. There’s no need to leave something out simply because it doesn’t fit
within the tool’s domain of expertise; everything can be included.
In fact, an even better approach is to focus on unified monitoring tools that can actually go
out and find the components in the environment. Software doesn’t have to make the same
assumptions, or have the same technology prejudices, as humans. A unified monitoring
console doesn’t care if you happen to be a Hyper‐V expert, or if you prefer Cisco routers
over some other brand. It can simply take the environment as it is, discovering the various
components and constructing a real, accurate, and complete diagram of the environment. It
can then start monitoring those components (perhaps prompting you for credentials for
each component, if needed), enabling you to get that complete, all‐in‐one, unified
dashboard. I’ve been in environments where not using this kind of auto‐discovery became a
real problem.
Case Study
Terry is responsible for the infrastructure components that support his
company’s primary business application. Those components include routers,
switches, database servers, virtualization hosts, messaging servers, and even
an outsourced SaaS sales management application. Terry’s heard about the
unified monitoring idea, and his organization has invested in a service that
provides unified monitoring for the environment. Terry’s carefully
configured each and every component so that everything shows up in the
monitoring solution’s dashboard.
One afternoon, the entire application goes down. Terry leaps to the unified
monitoring console, and sees several “alarm” indications. He drills down and
discovers that the connection to the SaaS application is unavailable. Drilling
further, he sees that the router for that connection is working fine, and that
the firewall is up and responsive. He’s at a complete loss.
Several hours of manual troubleshooting and wire‐tracing reveal something
about the environment that Terry didn’t know: There’s a router on the other
side of the firewall as well, and it’s failed. Normal Internet communications
are still working because those travel through a different connection, but the
connection that carries the SaaS application’s traffic is offline. The “extra”
router is actually a legacy component that pretty much everyone had
forgotten about.
A monitoring solution capable of automated discovery wouldn’t have
“forgotten,” though. It could have detected the extra router and included it in
Terry’s dashboard, making it much easier for him to spot the problem. In fact,
it might have prompted him to replace or remove that router much earlier,
once he realized it existed.

27

Discovery can also help identify components that don’t fit neatly within our technology
silos, and that don’t “belong” to anyone. Infrastructure components like routers and
switches are commonly‐used examples of these “orphan” components because not every
organization maintains a dedicated infrastructure specialist to support these devices.
However, legacy applications and servers, specialty equipment, and other components can
all be overlooked when they’re not anyone’s specific area of responsibility. Discovery helps
keep us from overlooking them.
Not All of IT Is a Problem: Ordering, Routing, and Providing Services
Most organizations tend to get into the habit of thinking of their IT department as “fire
fighters.” IT exists to solve problems. That isn’t true, of course, and any organization
probably (hopefully) depends more on IT to carry out day‐to‐day tasks and requests more
than they rely on them to solve problems. But the day‐to‐day tasks are easy to overlook,

whereas “fire fighting” gets everyone’s attention.
The result of this way of thinking is that IT management tends to focus on tools that help
make problem‐solving easier. Unified monitoring is exactly that kind of tool: If nothing ever
went wrong, we wouldn’t need it. It’s there to make problem‐solving faster, primarily in the
rform d availabilityareas of pe ance an . Right?
Not quite. Truly unified management also entails making day‐to‐day IT tasks easier for
everyone involved. Users, for example, need to order and receive routine services, from
simple password resets and account unlocks to new hardware and software requests. I’ll
make what some consider to be a bold statement and say that those routine requests
should be treated in the exact same way as a problem. Look at any IT management
framework, such as ITIL, and you’ll find that concept runs throughout: Routine IT requests
should be part of a unified management process, which also includes problem‐solving.
Consider some of these broad functional capabilities that a unified management (versus
mere “monitoring”) can offer both to problem‐solving activities and to routine IT services:
• Workflow—When problems arise, following a structured process, or workflow, can
help make problem‐solving more consistent and efficient. Similarly, structured
workflows can help make routine IT services more efficient and consistent. The
workflows will be different for problem‐solving and for various routine services, but
having the ability to manage and monitor workflows can be a real benefit.
• Approvals—Workflows should include approvals. This capability is most obvious
for routine services like hardware and software requests, security requests, and so
on—but it can be just as important for problem solving. Not every problem can be
fixed by changing a setting or rebooting a device; sometimes you’ll need to make a
more significant change, and having the ability to formally route approval to make
that change is a benefit.

28

• Routing. The specialist who fixes a problem is usually the last one to hear about it.
Front‐line resources, such as your Help desk and your end users, are the first
“responders.” Being able to select a problem category and have a ticket routed to the
right individual helps speed problem resolution. The same is true for routine
services: Things get done quicker when the right person has the request. Automated
routing capabilities can help get the right person on the job more quickly and more
accurately.
• Self‐service. Reducing phone calls and manual email juggling is crucial to achieving
better efficiency. Self‐service can help do that for both problems and routine
requests. When users experience a problem, self‐service can allow them to submit
tickets as well as help them solve the problem on their own, through a knowledge
base. When users need routine service, self‐service helps them submit that request
without having to engage additional IT services.
• Service catalog. Part of self‐service is the ability to create an “online store” for
services that users can request.
There are more capabilities, of course, but we’ll cover them in upcoming chapters. These
are simply some of the basic capabilities that we need in order to make both routine IT
requests and problem‐solving more consistent and efficient.
Coming Up Next…
This chapter has been about breaking down the silos between technology specialties, or at
least building doorways between them. That helps to solve one of the major problems in
modern IT monitoring and management. The next chapter will tackle a somewhat more
complicated problem: Keeping everyone in the management loop. It’s about improving
communications. Unfortunately, communications are too often a voluntary, secondary
exercise—we have to make an effort to communicate, and when we’re really feeling the
pressure, it’s easy to want to put that effort elsewhere. So we need to adopt processes and
tools that make communications more automatic, helping keep everyone in the loop
without requiring a massive secondary effort to do so.


29
Chapter 3: Connecting Everyone to the IT
Management Loop
IT management has for too long involved discrete, disconnected processes that often leave
key participants wondering what’s going on. Bringing everyone—users, managers, IT
professionals, and more—into the loop can create significant benefits as well as reduce the
tendency to fall back into discipline‐based silos. This is where the integration between
monitoring and service desk truly happens, and these concepts deliver the most critical,
central themes discussed throughout this book. It’s all about communication—ways to
ent. better achieve communication as well as create opportunities for continuous improvem
Users sometimes perceive their IT department as out‐of‐touch, ivory‐tower geeks with
poor people skills. Whether or not that’s true depends on the actual IT team members, but
the perception, fair or not, often exists. That’s because IT can too often be the last ones to
know about things that users perceive as problems. Sure, the server might me humming
along within specs, but the order‐entry application is incredibly slow. IT says that email is
working fine, but I’ve been waiting on an incoming purchase order for an hour—the email
system can’t possibly be working correctly!
IT has its own unique problems to deal with, and they sometimes involve a disconnect with
management. Finding windows in which to make approved changes, for example, can be
incredibly tricky. Simply coordinating the changes that are proposed, approved, under
development, ready for implementation, and so forth can be difficult. Many organizations
have adopted change management frameworks, such as those proposed by ITIL, that
outline specific processes for reviewing and approving changes. Physically coordinating
that process, however, can seem like herding cats. It’s even worse when IT has been
divided into silos: The database team might have a change scheduled for tonight, but that
change is going to conflict with the power supply changes being implemented by the data
enter team. We need to get everyone on the same page. c


30
Starting the Loop: Connecting Monitoring to the Service Desk
Most organizations today have a ticket‐based system for coordinating IT activities. These
organizations also usually have monitoring systems in place to watch their IT systems and
alert them to any problems. Too few organizations, however, have connected these two
systems. Ideally, that’s what you want: A single, integrated IT management system that can
detect problems and then automatically open tickets for the appropriate individuals. If the
email server is down, the appropriate administrator should get a ticket. Those tickets, of
course, should include notifications via text message, email, or whatever other medium is
t. appropriate so that alerted individuals know they have an aler
That auto‐assignment—you might even choose to call it auto‐routing—of tickets needs to
be pretty intelligent. Different systems, in different locations, at different times, all might
change how the ticket is created, thus changing who is assigned to work the problem.
Tickets should be as complete as possible, meaning as many fields as possible should be
automatically populated—you shouldn’t have to rely on a Help desk, or someone else, to fill
in the details. Those details might include the affected server’s information. Figure 3.1
shows what this kind of auto‐generated ticket might look like, with several key bits of
information pre‐populated by the system.

Figure 3.1: Automaticallygenerated tickets

in response to alarms.

31

The idea is to have a service desk solution—that’s the software that helps coordinate and
manage IT activities, often through tickets—working with the monitoring solution, thus
creating a truly integrated response to IT problems.
This is all intended to provide specific benefits. First and foremost is faster problem
resolution. By not waiting for users to inform you of a problem, you’re getting started on
solving the problem faster. By having pre‐populated tickets, the IT team is able to work
more quickly because they’re starting with more information.
There’s a bit more depth that can be added, if you have the right service desk software in
place. Frameworks like ITIL encourage root cause analysis, meaning your team should focus
not only on solving today’s specific problem but also on making the overall environment
more stable and problem‐resistant. To that end, a service desk solution can define two
types of problems: global issues and specific incidents.
Specific incidents might be day‐to‐day problems like, “Email moving slowly throughout the
organization,” “Order entry application operating slowly,” and so forth. Those might all be
tied to a global issue of “Unexplained network slowdowns,” which could be examined and
solved—perhaps locating a router that was overheating and dropping more packets than
usual.
Sometimes, specific incidents might not be entirely solved until the overarching global
issue is solved. By tracking those individual incidents along with the global issue, you can
help keep your users and managers more informed. For example, once that overheating
router is discovered and replaced, everyone affected by an associated specific issue could
be notified: “Hey, we think we’ve found the root cause for all the slowdowns, so things
should be better from here on out.” Figure 3.2 shows how a single global problem can be
attached to multiple incidents.


32

Figure 3.2: Relating multiple incidents to a single problem.
I’ve used a couple of keywords in the forgoing discussion and want to take a moment to
specific define ally them in the context of this book:
• An incident is something that happens in the environment, such as a failed server or
ion. a slow applicat
• IT staff create problem records to help manage the incident. Problems may in fact be
associated with multiple incidents, as in the case of that overheating router, which
caused multiple disparate failures throughout the environment.
I’m going to start using those two terms more consistently from here on. Hopefully, some of
the benefits of combining monitoring with problem solving will become clear. For example,
more simplistic Help desk solutions allow multiple tickets to be opened against what is
essentially the exact same issue. That can result in a lot of duplicated effort, as multiple IT
team members attempt to work the issues on their own. It can also result in a lot of
paperwork because solving the root cause then requires technicians to spend time
laboriously closing each ticket. With a more sophisticated system in place, everything can
be consolidated into a single, managed problem record. Doing so creates additional
benefits, such as identifying solutions or workarounds, which I’ll discuss in upcoming
chapters.

33

Problems and incidents, however, aren’t the only reason that users interact with IT.
Hopefully, they’re not even the major reason your users interact with IT! Aside from
reporting incidents, users also need to request routine services: advice, new hardware
requests, routine change requests, access requests, and so forth. These interactions should
be managed through a more formal workflow in which users submit their request, have it
assigned to the appropriate technician after being approved, and be able to track the status
st. of their reque
For a ex mple:
1. A user might visit a Web site to browse a “catalog” of items they can request, such as
access to systems, changes to hardware, and so forth.
2. A user selects an item from the catalog, and provides whatever details are necessary
to complete the request.
ending
proval.
3. A ticket is created in the service desk that represents the user’s request. Dep
upon the request, the ticket might first be routed to the user’s manager for ap
4. Once approved, the ticket would be automatically routed to the appropriate
technician or IT team for completion.
5. The user would receive status updates, perhaps via email, throughout this process,
keeping them informed of its progress. The status updates would include a
“completed” update once the request was finished.
By using the same ticket‐based system employed for problem‐solving to address routine
requests, IT technicians can rely on a single interface to manage their workload. Figure 3.3
shows what a routine request ticket might look like.


34

Figure 3.3: Routine requests can also be made into tickets.
Even better, IT management can rely on all IT work being documented and tracked in a
single system, enabling management to stay informed through reports, dashboards, and
other mechanisms. Figure 3.4 shows an example of what such a report might look like.


35

Figure 3.4: Management reports become more effective when they include all IT
workload.
The idea is to keep everyone in the loop: users remain informed, IT remains informed,
management remains informed. Much of the burden of keeping everyone informed is
handled by the software, which can send email updates and other kinds of notifications so
that everyone is aware of what’s happening at all times.
Making Changes: How to Find a Change Management Window
Large, multi‐discipline IT departments have inherent problems. In the previous chapter, I
discussed the problem of silo‐based problem solving, where domain experts spend time
passing a problem back and forth because everyone is looking at different tools and data to
determine whether the problem is “theirs.” We’re certainly not going to get rid of domain
experts, so the solution is to get tools that could put everything into a single console in
order to unify everyone’s efforts.

36

Another problem created by those silos relates to change management. At the start of this
chapter, I outlined one of those problems: The database team is ready to implement a
change, but it’s going to be in conflict with a change being implemented by another group.
Managing change windows is becoming increasingly difficult. Not only are applications and
services needed round‐the‐clock, creating tiny change windows in the first place, but the
varying needs of different experts creates contention for those already‐small windows.
“Boss, we’d have that fix in place, but we can only implement it at night. It’s going to take 4
hours, which just fits inside the window management allows us. But all this week, other
teams have been using the window, and the changes they’re making are blocking us from
doing anything at the same time.” It’s not an unusual situation. It gets tough for
management to even track what changes are pending and to slot them into the shrinking
time that’s available to make them.
The lack of visibility into these windows, and the contention for them, makes it impossible
to even make a management decision. For example, if management could see the number of
changes stacked up, and see the contention, they might decide to expand the window for a
period of time in order to get the changes implemented. They might not decide to do that,
but they’d be consciously making a decision rather than remaining ignorant of the actual
problem.
The solution, of course, is software that facilitates the coordination of departments. Think
about it: If you’re using a service desk solution to track tickets, then tickets can be created
for proposed changes. Those tickets would be assigned to a technician, routed for reviews
and approvals, and so forth, all via some workflow you designed. That’s an excellent way to
support ITIL processes, by the way. The tickets themselves can then feed a unified
calendar, built right into the service desk, which allows change planners to schedule
activities. They can see agreed maintenance windows, manage contention between
conflicting changes, and so forth. By getting this information into a familiar calendar form,
they can also make decisions about whether to widen maintenance windows if doing so is
necessary and beneficial to the organization. Figure 3.5 shows a change management
calendar.


37

Figure 3.5: Managing change schedules in a calendar view.
This is just another way to help keep everyone in the loop. Management now has a clear
visual depiction of change and schedule contention. Such a calendar could even be made
available to users so that they could see what changes were scheduled and plan their own
activities accordingly.
Communicating: How to Bring Users into the Loop
The idea of keeping users informed certainly isn’t new, but many organizations that have
attempted to better engage their users haven’t met with unqualified success. Too often,
“keep users in the loop” solutions take the form of self‐service Web portals, where users
can log in to check the status of their tickets or to check the status of a particular service.
That’s all well and good, but Web portals like that don’t always fall within the natural
workflow of a user. For example, most users, when confronted with some kind of problem,
don’t necessarily think to check a Web site and see if something’s wrong—they call the
Help desk.

38

Users do, however, spend a lot of time in their email inbox. Why not make that your
channel for communication? Organizations don’t use this method of communication in part
because doing so could easily become a time burden for your IT team. “So on top of solving
the problem, I have to send out hourly update emails with the status of the problem?”
Sounds like a Dilbert cartoon!
In reality, a good service desk solution can do it for you. Sending an email update when a
user’s ticket is updated, for example, is an easy operation for a piece of software. Such
emails can be informative, and help users feel comfortable that their request is being
handled. Figure 3.6 shows what one might look like.

Figure 3.6: Keeping users informed wi

th detailed emails.

39

What’s more compelling is a service desk solution that can actually accept requests via
email rather than expecting users to go to a self‐service Web portal and open a ticket. Face
it: Your users are more likely to pick up the phone than visit a Web site, unless you’ve
placed significant artificial barriers in the way, like complex voice menus in the phone
system. Users are more likely to send an email. If your service desk, rather than a human
technician, can receive those emails and use them to create a ticket, you’ve truly created a
system your users are likely to embrace. Such tickets could still be auto‐assigned and –
routed, helping the right technician to start working the problem more quickly.
Even for your users’ routine, non‐problem requests, email updates can be valuable. When
their request is approved, rejected, underway, completed, and so forth, an email update
helps keep users informed without additional human effort.
Note
I want to emphasize that self‐service portals are a good thing. They can
provide a rich user experience, help guide users to self‐service solutions, and
more. They just shouldn’t be the only means of communicating with users.
SLAs: Setting and Meeting Realistic Expectations
Unless you’ve been living under a rock for the past decade or so, Service Level Agreements
(SLAs) are probably pretty familiar to you. These are, in their simplest form, an agreement
by the IT team to provide a specific level of performance or availability for a specific service
or application. “The email service will be available 99.999% of the time on an annualized
basis” is an example of a very simple SLA.
But SLAs can get complicated quickly. You can’t just pull a number out of thin air; what
level of service can you reasonably provide? What level of service have you historically
provided, and is that meeting the business’ needs? Once established, how do you track the
SLA to make sure you’re actually meeting it—and ideally get some kind of notification
when you’re in danger of breaking the agreement?
SLAs might not be the only type of agreement you need to define and track. Some
organizations also use underpinning contracts (UCs) or operational level agreements (OLAs)
for different in‐ and out‐sourced services; these often support SLAs.
A well‐built service desk and monitoring solution can help you handle these agreements
more precisely. You’ll start by defining top‐level SLAs, then creating and managing UCs and
OLAs as appropriate.
Once defined, the solution should be able to track ongoing performance and availability,
perhaps offering a simple dashboard—like the one shown in Figure 3.7—that illustrates
your compliance with your SLAs. You might also have more comprehensive and detailed
reports on SLA metrics.


40

Figure 3.7: Managing SLAs with ataglance dashboards.
Most importantly, however, the solution needs to provide you with the ability to define
rules for your SLAs so that tickets can be created—and auto‐assigned to the appropriate
technicians—when SLAs are in danger of being broken. Further, the solution should
support escalation rules so that if an SLA that is in danger of being broken is not corrected
within a certain amount of time, the solution can automatically call for backup, summoning
additional technicians, notifying management, and so forth.
There’s also a strong need to recognize that no SLA is perfect. Sometimes, for whatever
reason, the business will decide to take a service offline. Perhaps it’s for a software upgrade
or for some kind of infrastructure maintenance. In those cases, you’re not breaking the SLA;
you’re agreeing—along with whatever part of the business will be affected—to temporarily
suspend the SLA to get the work done. A service desk solution should support these types of
exceptions, including SLAs that are only valid during certain hours, holiday exceptions,
agreed‐upon reduced service windows, maintenance windows, and so forth.
The idea is to automate SLA definition and management—and to automate the notifications
that go with SLAs. If an SLA is broken, you might agree that the affected business users will
receive an automatic notification. That lets them know that IT knows about the problem
and is working on it—without forcing users to visit a self‐service portal and open a ticket.
That kind of proactive response can go a long way toward improving IT‐user relationships,
and in helping IT be viewed as responsive to, and supportive of, business requirements.


41
Tell Me What You Really Think
IT managers like IT to think of users as “customers.” In some cases, your users might
actually be customers, in the sense of “sending you a check for specific services you
provide” customers. In other cases, your users might be internal users—but still
“customers” in the sense that they consume services you, the IT department, provides, and
that you get paid for your efforts.
A big problem that IT has always struggled with is its perception by its customers. Do
customers think you’re doing a good job? What is a good job?
For this reason, monitoring End‐User Experience (EUE) metrics, which I discussed in the
first chapter, has become a hot trend in the IT industry. You might see that your servers’
performance is within norms, but by the time you throw in old client computers, routers,
network cabling, and everything else involved in delivering a service to users, they have a
completely different perception of the server’s performance. Measuring the EUE is a way to
get some insight into that aggregate perception that your users—your customers, if you
prefer—deal with.
Businesses have traditionally used another important technique to discover their
customers’ perceptions: surveys. Phone your credit card company, and the robot who
answers the phone might inform you that you’ve been selected to participate in a short
satisfaction survey, which will begin when you finish speaking with the agent who is about
to come on the line. Walk into a theme park, and a smiling employee with a tablet computer
asks you a few questions. Look at the register receipt from your last purchase, and you
might find that you’re eligible to win a gift card or other prize if you complete an online
survey about your shopping experience.
Surveys are an effective way of finding out what users really think, and a good service desk
application should provide you with the ability to survey your customers. Perhaps you
want to ask them their opinion after each request that’s completed. Maybe you want to be a
bit less intrusive, and only survey them after every 3 or 4 requests. Whatever you decide, a
service desk solution should be able to automate the process. You might even want to
engage customers in ad‐hoc surveys to further your understanding of their perceptions
about day‐to‐day performance, availability, service levels, and so forth.
Of course, surveys are useless without the ability to aggregate the data and see how you’re
doing. The back end of a survey system must include reporting capabilities, perhaps with
charts and graphs, that help you visualize your customers’ perception of your service.
Compare this report to your SLA compliance report—do you see any differences? If your
SLA shows that you’re doing a great job, and your customer surveys aren’t so glowing, then
maybe your SLAs aren’t set at the right levels—or maybe your SLAs aren’t the only metric
you should be looking at.

42

I’ve worked with a number of customers who have found themselves in exactly that
situation: “Our SLAs are all being met, every day, but our users still don’t think we do a
good job. What’s the problem?” We discovered the answer with a few ad‐hoc surveys that
touched on “soft” issues, such as the IT team’s “attitude” when helping users. Turns out that
the team came across as brusque and sometimes rude. We spent some time with the team,
and discovered that they felt an incredible amount of pressure because of the number of
tickets assigned to them. In the end, the company developed internal metrics to track each
IT member’s workload and efficiency, and worked to bring each person’s workload to a
more manageable level—while continuing to survey those “soft” issues such as attitude.
The moral of the story is that SLAs aren’t the only metric you need to concern yourself
with, and integrated surveying can help reveal critical information to help pinpoint overall
service problems.
When Everyone Doesn’t Need to See Everything: A Multi‐Tenant
Approach
Multitenant is a growing trend amongst IT solution vendors, and for good reason.
Obviously, service providers operate the very definition of multi‐tenant systems. If you’re a
service provider, or perhaps more specifically a Managed Service Provider (MSP), then you
know the importance of having tools that can be customized and partitioned for each of
your customers. Customer A wants these dashboards, while Customer B wants those.
Customer B certainly doesn’t want to see Customer A’s tickets (and Customer A doesn’t
want Customer B to see them!). In the past, it’s been pretty common for such multi‐tenant
features to only be present in solutions that were designed for MSPs.
Today, however, that’s changing. Large, multi‐divisional companies want to deploy
solutions that can serve all of their divisions’ needs without necessarily deploying a unique
solution for each division. That’s where multi‐tenancy can help, enabling a single solution
to be customized, partitioned, and presented to each division as if they were the only ones
using the solution, when in fact the solution is consistently serving everyone. Different
divisions can get a different view of just their portion of the environment. For example,
Division A might see a dashboard, while Division B saw something completely different.
Again, multi‐tenancy isn’t something that every single company or organization is going to
need. However, it’s a nice feature to have in your back pocket if the time comes when you
do need it, so be sure to consider this functionality as you’re evaluating various solutions—
even if multi‐tenancy isn’t an immediate need. Of course, if you’re an MSP, multi‐tenancy is
definitely a must‐have feature.
We’re continuing to support this chapter’s theme of keeping everyone in the loop: The
ability to provide specific, customized, partitioned environments to your varying
customers—whether internal or external—helps keep them more informed and more
ccurately informed. a


43
Call It a Private Management Cloud: Allocating Costs
There’s one more thing we should look at to keep everyone in the loop, and that’s with
regard to their costs: The ability to provide customers with detailed reports on their usage
of the infrastructure, and to potentially bill them for their usage based upon those reports.
Figure 3.8 shows what such a report might look like.

Figure 3.8: Reporting on usagebased metering and billing.
Again, this kind of reporting is an obvious, must‐have feature for MSPs—but it has
increasing applicability to organizations who deal only with internal customers.
One of the key elements of cloud computing is the concept of billing you based on your
actual usage. The cloud provider builds and manages the infrastructure, which is shared
amongst their customers. Each customer then pays for the bits they use. That’s an obvious
and well‐understood model for the public cloud—but it’s becoming a model for the private
cloud as well. Rather than accepting IT as a giant bucket of overhead, companies are
looking more and more at allocating IT’s costs across the consumers of those IT services.
“Marketing wants to spin up a dozen virtual Web servers for a new Web site? Okay—do
they have the budget to pay for it?”
Chargebacks, as they’re called, are certainly nothing new. But monitoring and service desk
solutions are increasingly able to provide the level of detail that you need to actually make
chargebacks work. The technological advancements that have made public clouds possible
can be readily integrated into private data centers for the same purposes: billing (or

allocating costs) for actual usage.
Tying IT costs directly to the consumers of those IT services is a great approach for helping
IT make better business decisions. Rather than putting IT in the role of “gatekeeper” for
who can and cannot have specific services, the organization’s management gets to decide
what money will be spent, by whom, on what services. That’s how it should be. In one
sense, IT has always been an outsourced activity: Although the IT team might be paid by
the organization, they don’t materially participate in the organization’s actual profit‐
making activities. They’re a separate division. Essentially, the business has “outsourced” IT
(albeit to an internal team)—why not have IT deliver usage‐based billing statements just
like any other vendor would be expected to do?

44

It’s just another way of keeping everyone in the loop. Even if you don’t use your usage‐
based billing reports for actual billing or chargeback, they’re a useful way of helping upper
management understand the cost and value of their IT investment. “Yes, you spent a zillion
dollars on IT last quarter but here’s why, and here’s how that investment was consumed by
the organization. If you want to cut back, start by looking at the consumers, and finding
ways to make them consume less.”
Conclusion
This chapter has been all about keeping people in the loop when it comes to IT
management. From keeping users more updated and engaged in the IT process, to keeping
technicians more connected to ongoing events, to keeping management more informed so
that they can make better decisions—it’s been about communications. There’s very little
I’ve discussed in this chapter that any organization couldn’t start doing today, if they were
willing to expend enough effort. The key, however, is in accomplishing these goals with
little or no effort, by using a system of integrated software tools that understand how to do
these things for you.
Coming Up Next…
In the next chapter, we’re going to look at a challenge that’s become more and more
common in IT: key services and IT elements exist outside the data center. Yes, you can call
it “the cloud” or you could simply call it “outsourced services.” Whatever you call them,
they’re still critical to the business, and you need to treat them the same way you treat all of
the in‐house services. You can’t treat them as a separate silo, because then you’ll be forcing
yourself to manage them differently. Of course, monitoring outsourced services is a whole
different ball game than managing in‐house services, so we’ll need to find some clever
solutions.


45
Chapter 4: Monitoring: Look Outside the
Data Center
IT has moved beyond our own data centers, and nearly every organization has at least one
or two outsourced IT services. Although we’re probably always going to have on‐premise
assets to manage and monitor, we need to realize that in most cases, monitoring has to
start outside the data center—both in the sense of accommodating off‐premises services as
well as focusing more closely on what end users are actually experiencing.
Monitoring Technical Counters vs. the End‐User Experience
The traditional IT monitoring approach is what I call inside out: It starts within the data
center and moves outward toward the end user. Figure 4.1 provides a visual for this idea,
illustrating how typical monitoring focuses on the backend: database servers, application
servers, Web servers, cloud services, and so forth. The general reason for this approach is
that we have the best control and insight over what’s inside the data center. If everything
inside the data center is running smoothly, it stands to reason that the end users who
consume the data center’s services will be happy.

Figure 4.1: Monitoring from the data center outward.

46

Most Service Level Agreements (SLAs) derive from this approach: We promise a certain
amount of uptime, and we set up monitoring thresholds around data center‐centric
measurements like CPU utilization, network utilization, disk utilization, and so forth. We
also tend to look at low‐level response times: query response time, disk response time,
network latency, and so on.
There’s something deeply and inherently inaccurate about the underlying assumption of
this approach: Even if you start with a perfect pile of bricks, there’s no guarantee that
you’re going to end up with a stable building in the end. In other words, what end users
experience isn’t merely the sum of the data center’s various metrics. A smoothly‐running
data center usually leads to satisfied users, but that isn’t always the case.
It’s obviously important for us to continue monitoring these data center‐centric
measurements, but those can’t be the only thing we monitor and measure. Current thinking
in the industry is that we need to more directly measure what the end user experiences. In
fact, “end user experience,” or EUE, has become a common term in more forward‐thinking
management circles.
Here’s another way to think of it: Suppose you go to a restaurant to eat. Your steak comes
out cooked wrong, they brought the wrong side items, and the waitress is rude. The
manager, standing back in the kitchen, thinks everything is fine: the steaks are hot, the
veggies are hot, and the waitress smiles at him every time she goes back there. He’s focused
on the backend, with no knowledge of your expectations. Restaurants address this by
having the manager periodically roam around and ask, “Is everything okay?” That’s
monitoring the EUE: Rather than looking at his back‐of‐house metrics, he’s going out into
the cube farm—er, restaurant floor—and testing the waters.
How the EUE Drives Better SLAs
You establish metrics for what the EUE should be for various operations: so many number
of seconds to complete such‐and‐such a transaction, and so forth. When that metric isn’t
met, you start drilling down into the infrastructure to find out why. That’s where more‐
traditional data center monitoring re‐enters the picture. Rather than using query response
time or whatever to derive the end user’s experience, we’re using it to troubleshoot things
when the end user’s experience is clearly not where we need it to be. Figure 4.2 shows how
EUE monitoring sort of reverses the model.


47

Figure 4.2: Monitoring the EUE.
You’ll still have thresholds and other considerations, but they’re set at levels that have
historically been able to deliver an acceptable EUE. As Figure 4.3 shows, a failed EUE is
your cue to start looking at deeper, more technical‐level measurements so that you can see
what’s contributing to the end user problem.


48

Figure 4.3: Tracking the cause of a poor EUE.
In reality, it doesn’t always take a major change in the backend to cascade into a real
problem for the EUE. A database server’s response times slow by a millisecond or two,
resulting in an application server taking an extra half‐second to process a transaction,
resulting in a front‐end server taking an extra second to present the next screen of
information, resulting in the user’s client application taking a couple of extra seconds. Add
up those couple extra seconds over the course of a day, and you’ve lost an hour or so, and
told a lot of customers, “Sorry this is taking so long, the computer is slow today.”
In Figure 4.3, both an internal database server and a cloud computing service are
responding slowly (indicated by the red flags). Neither one might be alarming in and of
itself, but together they’re combining to form a noticeable problem for the end user.
Normally, a minor a fluctuation on the database server might not raise any alarms. It’s the
cascade of effects that result in a poor EUE. Once we definitively know—because we’ve
been watching it—that the EUE is declining, we can start looking for causes. Because we’re
looking for a problem, rather than just routinely monitoring, that minor back‐end
performance decrease will be more noticeable.

49

The ability to measure the EUE lets you create much more realistic SLAs. Instead of telling
users, “We’ll guarantee a query response time of 2 seconds,” you tell them, “Such‐and‐such
a transaction will take no more than 3 seconds to process.” That’s something an end user
can monitor for themselves: “Click enter, and count one‐one thousand, two‐one thousand,
three‐one… ah, it’s done.” That kind of SLA sets an expectation that users can relate to.
They’ll know when “the system is slow” because they’re measuring the same thing you are.
Ideally, you’ll know of slowdowns before the user, or at close to the same time, because
you’ll have tools in place to monitor things from the users’ perspective.
How It’s Done: Synthetic Transactions, Transaction Tracking, and
More
This kind of monitoring isn’t always easy. It’s possible to throw monitoring agents onto end
user computers when they’re all employees, but what about a Web application, where the
end users are actually external customers? They probably wouldn’t be excited about having
you install monitoring agents on their computers just to track your application’s
performance.
Instead, modern monitoring tools rely on techniques like transaction tracking. With this
technique, monitoring components on your end watch an individual transaction as it flows
through your systems, literally measuring the time it takes the transaction to be processed.
This can be done at a variety of levels of detail. For example, tools usually associated with
software performance profiling can get very detailed, tracking transactions through
individual software modules. At a higher systems management level, you might just track
. the transaction’s start‐to‐finish time
Often, this is also done by inserting synthetic transactions into the system. Essentially, a
monitoring system pretends to be a client, then inserts transactions into the system that
will be processed but then later ignored. These allow the monitoring system to more
precisely figure out the actual time‐to‐complete for various transactions. This idea is
illustrated in Figure 4.4.


50

Figure 4.4: Using transaction tracking to monitor the EUE.
There are a lot of variations on these techniques, and a lot of specialized tools that you can
acquire to actually implement them. In the end, though, it’s important to remember that the
entire activity is designed to measure just one thing: the EUE. You’re not, at this point,
trying to figure out what individual systems’ performance is or trying to track down the
root cause of the problem. You’re simply trying to determine whether there is a problem.
Top‐Down Monitoring: From the EUE to the Root Problem
The EUE is intended to be an extremely high‐level diagnostic; it tells you that “something is
wrong.” It won’t tell you what. For that, you’re going to need to go back to the traditional
monitoring you’ve always known and loved, only this time you won’t just be watching in a
xists. vacuum: You’ll be looking for a problem that you know e

51

This is not the time to pull out the domain‐specific monitoring tools—we’ve discussed that
in prior chapters. You still want to stick with a monitoring system that can monitor
everything in a single “pane of glass.” That doesn’t necessarily mean a framework that’s
aggregating domain‐specific tools, either—it means a monitoring system designed
specifically to look at each of your systems. With the right understanding of what
performance should look like at each component level, such a system can quickly tell you
where the problem lies. Then you can dig out the domain‐specific tools to troubleshoot the
particular problem—again, with the knowledge that there is a problem, and that the
component you’re looking at is the one causing the problem.
Deriving the EUE
So why can’t you simply use better thresholds on your back‐end monitoring
to figure out when the EUE is declining? Because EUE focuses on the entire
system. The database can be slower, provided that time doesn’t cascade
through the rest of the system. A slow router doesn’t necessarily mean a slow
EUE, although in combination with other factors, it might be the tipping
point. That’s why you need to look directly at what end users are
experiencing, then go looking for the root cause.
Agent vs. Agentless Monitoring
There’s a lot of argument in the monitoring industry about the best way to monitor. Do you
install agents? Some folks believe so, and feel that agents provide the best and most‐
detailed information. Other folks don’t like installing and maintaining agents throughout
their environment, and correctly point out that not every component of a system can even
have agents installed. Routers, off‐premise services, and so forth typically can’t support
dedicated monitoring software, after all. So one approach is definitely to install agents, as
illustrated in Figure 4.5.


52

Figure 4.5: Monitoring via agents.
You’ll typically have those agents reporting back to some centralized monitoring server or
system. Depending on your approach, you might have agents installed on every system
where they can be installed, potentially even on some end‐user computers for spot‐
monitoring (although that’s pretty unusual).
Some monitoring solutions will let you get away without installing agents on every system,
and might not even need agents on any of your components. As illustrated in Figure 4.6,
these solutions typically use external means of picking up on system performance.


53

Figure 4.6: Agentless monitoring.
Whether agentless monitoring can pick up as much data, or pick up all the data you need,
depends a lot on what kind of components are in your network, and what kind of
monitoring techniques and technologies are in use. It’s a major competitive point between
different vendors, so it’s something to pay close attention to.
That “monitoring provider” in Figure 4.6 is my lead‐in to a key point about modern
applications. You’re almost always going to wind up with some kind of hybrid system that
relies partly on agents and partly on agentless monitoring—because some of what you’ll be
monitoring won’t be in your own data center.


54
Monitoring What Isn’t Yours
The off‐premise stuff is where our traditional monitoring falls part. It’s unlikely that
Amazon is going to give you detailed performance statistics into their Elastic Compute
Cloud (EC2), and it’s unlikely that Microsoft would do so for Windows Azure.
SalesForce.com isn’t going to send you database query response times or Web server CPU
utilization. Even the hosting company where you’ve collocated your own servers isn’t going
to be sending you detailed data about their routers’ dropped packet percentages, or any
other infrastructure‐level statistic.
Yet those numbers matter to you. If you have an application that relies on cloud computing
components, collocated servers, Software as a Service (SaaS) solutions, or any other
outsourced component, then the performance of those components affects your application
performance your EUE. In short, when Amazon feels performance issues, so do your users.
That’s where hybrid monitoring enters the picture. As Figure 4.7 shows, it usually takes the
form of some external monitoring service, which collects key external performance
information from major outsource providers—the “cloud components,” if you will, with the
data collection shown as red lines—and reports back to your central monitoring console, as
indicated by the green line.
This is where a lot of what I’ve been outlining in the previous chapters really starts to come
together:
• You need both your internal systems and any external components monitored in the
same view. There’s no way you can treat your systems as systems if you can’t get all
of the constituent components into a single monitoring space.
• The key competitive point for these hybrid monitoring services is the breadth of
external components that they can monitor. Make sure you’re choosing one that can
. monitor everything you’ve got—including all the outsourced dependencies
• Monitoring the EUE becomes important because there may well be a lot of
fluctuation with your external services and your use of them. For example, you
won’t care that Azure is experiencing slow response times during periods when
your users aren’t relying on that service. You only need to pay attention—and be
alerted—when your users are experiencing a problem.


55

Figure 4.7: Hybrid monitoring.
In fact, this kind of monitoring of external, outsourced components is the key piece that
makes many organizations feel they can’t rely on cloud computing. “How will we manage
it?” they ask. “How will we monitor it?” Along with data security, it’s probably the biggest
question asked when organizations start considering adding “the cloud” to their IT
portfolio. This is how you’ll monitor it: Using specialized monitoring services that add “the
cloud” to your single pane of glass. These tools put outsourced components on the same
r them the same way. level as your on‐premise components, and let you monito

56

What’s interesting is the way in which some vendors are architecting these solutions. Many
of them sell on‐premise monitoring solutions, which look a lot like what’s in Figure 4.7.
They actually monitor the cloud components on their end, but deliver that information to
you; their solution then collects your on‐premise data and presents everything in a
consolidated view.
But it doesn’t have to be that way. As Figure 4.8 shows, you could also go with a hosted
monitoring solution, where your internal performance data is shipped to the cloud (shown
by blue lines), combined with performance information on your outsourced components
(red lines), and presented in a single view via a Web portal or some other tool.

Figure 4.8: Outsourced, hybrid monitoring.

57

It’s an interesting model because it takes much of the responsibility of monitoring out of
your hands, and lets you focus on the services you’re delivering to your users. It isn’t the
right model for every organization, of course—but it’s an interesting option.
Critical Capability: You Need to Monitor Everything
The last really important piece of the puzzle is to make sure you’re monitoring everything.
Every‐everything. Take a look at Figure 4.9, which is the example system I’ve been using all
along. Is anything missing? If we monitored each of the components shown, in some
fashion, would we be monitoring enough?

Figure 4.9: Is this everything you’d need to monitor?
Definitely not. There’s a lot missing from this diagram, and it’s mostly things that can have
a massive impact on performance. Take a look at Figure 4.10—it’s a bit more complete.


58

Figure 4.10: Make sure you’re monitoring everything.
Routers. Switches. Firewalls. Proxy servers. Directory servers. DNS—both internal and
external. And probably a lot more. If your EUE is declining, you need to be able to find the
root cause—and you can only do that if your monitoring solution can see every possible
root cause.
This is why a lot of monitoring solutions these days offer automated discovery, in addition
to letting you add components on your own. Discovery can find the stuff you’re likely to
miss because you’re not thinking of it as “part of the system.” Infrastructure elements like
routers and switches. Dependencies like directory services and DNS. Potential bottlenecks
like firewalls and proxies. It all matters, so it’s important that it all get onto that “single
ane of glass” that you use to monitor overall system health. p


59
Conclusion
Monitoring is the thing that lets us manage SLAs, lets us spot problems brewing, and lets us
keep our systems running the way the business needs them to. But traditional monitoring
isn’t necessarily the only or best way to go about meeting the business’ needs. More
importantly, as businesses start to rely more and more on external components in their
systems, traditional monitoring just can’t get all of the necessary facts into a single view.
Hybrid monitoring can. By using a combination of traditional monitoring techniques, cloud‐
provided performance data, and other techniques, we can get entire systems onto a single
view, into a single dashboard, and into a single focus.
Coming Up Next…
In the next chapter, we’ll address a fundamental problem that all organizations seem to
struggle with: repeatability. In other words, once you’ve solved a problem, how can you
solve it more quickly if it happens again in the future? We’ll look at turning problems into
solutions and improving service delivery in the future.


60
Chapter 5: Turning Problems into Solutions
The satirical news outlet The Onion recently ran a story related to the economy. In it, the
publication claimed that a special kind of scientist called a historian was advancing the
novel idea of looking at the past. “Sometimes,” one pseudo‐historian was quoted, “we can
look at how people tried to solve problems which are similar to those problems we are
having today. We can look and see how their solutions worked, and that can give us an idea
of whether or not the same solution will work for us.” Hah!
Although targeted at politicians who seem to keep making the same mistakes over and
over, The Onion’s jibe is pretty applicable to IT as well. “Look, if this same problem
happened 3 months ago, and we solved it then, perhaps we can solve it more quickly now.
What, exactly, did we do last time? Maybe doing the same thing again will have the same
effect that it did then!”
I’ll put it another way: Perhaps you have children, or at least know someone who does.
Ever tell a kid not to touch the hot pot that’s on the stove? Sure. Did they touch it? Of
course. How many times? Usually just once. That’s because human beings are designed to
learn primarily by making mistakes. Provided we remember the mistake, and that we
remember how to avoid it or solve it, we can do so in the future very quickly. Memory
becomes the key factor, and as we get older, stop touching hot pots and start playing with
computers at work, it sometimes gets harder to remember. This chapter is all about the
final aspect of unified management: Taking problems that we’ve solved, and turning those
into solutions for the future.
Closing the Loop: Connecting the Service Desk to Monitoring
Before we dive into the memory aspect of solving problems, we’ve first got to close the
operational loop in our unified monitoring toolset. Earlier in this book, we discussed that
one aspect of a unified management system is the ability to monitor devices and services,
such as a database server. When a problem condition is monitored, the monitoring system
creates an alert, which is typically shown on a console, and may involve notifying someone
via email or text message. A truly unified system may also open a problem ticket in the
organization’s IT ticket‐tracking system. The ticket enables management to track the
problem and its time to resolution. It also allows the ticket to be passed to different
personnel who collaborate to solve the problem. The ticket can even be pre‐populated with
information germane to the case, helping the person working the problem to get going
more quickly. Figure 5.1 illustrates this first step: The alert showing up in the console, and
the ticket being created from that.


61

Figure 5.1: Getting an alert and opening a ticket.
Eventually, one hopes, the problem will be corrected. At that point, it’s common for the
o close the ticket, marking it as completed. person who completed it t
But what about the alert?
Of course, the real‐time monitoring component of the system will realize that the problem
no longer exists—but that doesn’t necessarily get rid of the alert. Typically, you want alerts
left in‐place until the problem is confirmed as being handled, which means that in addition
to closing the ticket, you’ve also got to go in and clear the alert. This is actually pretty
common in organizations that don’t have a unified management system: Close the ticket in
one system, then log into the monitoring system and mark the alert as handled. In a truly
unified system, however, it would make sense for the closed ticket to also clear the
associated alert because the alert is what created the ticket in the first place. Figure 5.2
shows how this loop can be closed within a single system.


62

Figure 5.2: Closing a ticket clears the original alert.
There’s a perfectly good reason to have alerts and tickets remain separate from each other.
A ticket tends to be an internal‐use‐only type of thing. It contains technical information,
intended for use in solving a problem and for reporting on that resolution process. An alert,
however, is consumable by a wider range of people. An alert might surface in a
companywide dashboard, for example, showing users that a given system is indeed
impacted. You don’t necessarily clear the alert just because the monitoring system is no
longer seeing a problem, because temporary relief from a problem doesn’t necessarily
indicate a resolved problem. You might want the alert to remain in place as a high‐level
indicator that, “we know it’s not working perfectly right now.” But at some point you’ll
want to clear the alert and return the affected system to “operating normally” status;
having that happen automatically as part of closing the ticket can be a convenient way to
keep the two different audiences updated more easily.
Retaining Knowledge Means Faster Future Resolution
Once a problem is solved, it doesn’t go away. At least, you hope it doesn’t go away. As I
pointed out in the beginning of this chapter, every problem solved is a potential turbo‐
boost for solving problems in the future—both that exact same problem as well as related
ones. In other words, you want to retain information about the problem, as well as its
olution, so that it can become useful in the future. s


63
Knowledge Bases
Probably the oldest formal means of retaining this information is the knowledge base.
Originally, these were separate databases, consisting of articles about how to solve
problems. When you have a problem, you first search the knowledge base to see if any clues
exist to help solve the problem.
One of the earliest knowledge bases to be widely distributed was Microsoft’s, who made it
available in the early 1990s on CD. Today, it’s a massive collection of online articles—so
massive, in fact, that there’s actually a knowledge base article on how to query the
knowledge base (shown in Figure 5.3, just in case you don’t believe me).

Figure 5.3: Microsoft Knowledge

Base article.

64

This illustrates just one of the problems with a knowledge base: People have to learn to use
it, and have to remember to use it. Unfortunately, IT professionals aren’t necessarily the
audience most likely to reach for the manual—or a knowledge base—when a problem
crops up. They’re a lot more likely to just dive in and try to use their own knowledge to
solve the problem. Using the knowledge base—“search the KB,” in the vernacular—usually
only happens after they’ve exhausted their internal knowledge. Part of this attitude comes
from their professional competency, part from the poor usability of most knowledge bases,
and part from the fact that knowledge bases can get outdated pretty quickly.
Which illustrates another major problem with a knowledge base: The task of keeping it
updated. Unless you’re careful at the outset to tag articles with things like product versions
and so forth, it can get really easy for the knowledge base to become a repository of
misinformation. Consider a line of business application, version 1.5, that has a particular
problem. You document that in a KB article, then rely upon that knowledge to fix the
problem whenever it arises. Finally, your developers correct the problem in v1.6. Does
anyone go back and update the KB article? No. Even if the KB article indicates that it applies
to v1.5, it doesn’t provide guidance beyond that. Was the problem fixed in v1.6? Will the
same fix procedure work in v1.5? If you’re using v1.6 and the problem occurs again, should
you follow the v1.5 procedure or report the problem as a new one—since the developers
thought they’d fixed it?
All of this presumes, of course, that you’ve addressed the major problem of knowledge
bases: Getting articles into them in the first place. Vendors like Microsoft spend millions of
dollars per year on the salaries of people who do little more than write documentation and
contribute articles to knowledge bases. Are you willing to make that kind of investment?
I’ve seen many companies set up a knowledge base, use it enthusiastically for a few months,
and then let it slide and fall into disuse.
Tickets as Knowledge Base Articles
The first solution to many of the knowledge base’s inherent problems was to simply
discard the separate knowledge base and instead use closed tickets as a form of knowledge
base. This is pretty much what every major ticket‐tracking system these days offers.
This approach solves the primary knowledge base problem of how to get content into the
system, because it simply re‐purposes content that’s already in the system: tickets. With a
good ticketing system, it can also help solve the problem of “what this applies to,” because
your tickets are typically categorized with a specific product or service. So you’ll at least
know when you’re reading an old ticket, what product and version it applied to—although
. you won’t necessarily know if it still applies to the current version of that product or device
Using Help desk tickets as a knowledge base doesn’t solve the problem of getting people to
actually search them for answers. In fact, using Help desk tickets can make the problem
worse. Think about it: Every time a problem arises, a new ticket is created. So when you
search the knowledge base (for example, the old tickets) using a keyword or by just
selecting a product or device, you’re going to get a lot more search results because you’re
going to be looking at every ticket that matches your criteria.

65

Help desk tickets don’t always make a great source of self‐service documentation, either.
Not all IT folks are the best writers in the world, and tickets have a way of gathering… let’s
just call it “informal” language, which you might not want to surface to your end users. For
example, a user who logs onto your self‐service knowledge base, trying to solve a problem
on their own rather than bugging your Help desk, might not be encouraged if the result
they found said something like, “Rebooted stupid user’s computer.” Technicians might also
provide very little in the way of detail. For example, it’s not unusual to see “fixed” as the
resolution to a ticket—not very useful for future reference. But using Help desk tickets as a
knowledge base source isn’t far off from the real solution.
Unifying the Knowledge Base
There are two things you can do to turn more Help desk tickets into useful knowledge base
articles. First, you need some automation. Whenever a new ticket is created, the ticketing
system should automatically go looking for related past tickets, presenting them as
candidates to whatever technician is working the problem. One great example of this
technique is used by the StackOverflow.com site—itself a kind of ticket/knowledge base
combination—when you ask a new question. It automatically searches past questions and
presents them to you in a visually interruptive fashion: They’re inserted below your
question, but above where you’d type the details of your question, as Figure 5.4 shows. It
essentially forces you to review those suggestions so that you can quickly see if your
question has, perhaps, already been answered.

Figure 5.4: Suggested answers to a question.

66

As you begin to type your question’s details, additional suggestions are shown off to the
side, again helping you use the database of past answers rather than requiring you to
explicitly search it in an extra step.
So a unified system can help take that extra step (see Figure 5.5). By including potentially‐
relevant tickets, or links to them, in with a newly‐created ticket, the system can give
technicians a jump start on solving the problem by calling their attention to similar
situations in the past—along with the solutions to those situations.

Figure 5.5: Using old tickets to solve

new problems.

67

In fact, for tickets generated automatically from an alert, the ticketing system can
potentially do a much better job of finding older tickets that actually relate to the problem.
Because the system doesn’t mind taking the extra steps, it can include more detailed search
criteria, such as the nature of the problem, the device or service affected, and so forth. A
technician might not think to include all the detail, which would net them a large set of
search results, which is often what discourages them from searching in the first place. By
getting a narrow result set to begin with, the automatically‐referenced tickets are more
likely to relate to the problem at hand.
Such a system can be made even better if the Help desk system includes a couple of check
boxes in its tickets. When closing a ticket, a technician should be able to independently
indicate:
• Whether this ticket contains a bona fide resolution to the problem. For example,
sometimes the technician might solve the problem by looking at an older ticket,
meaning the current ticket might not contain a lot of detail on how the problem was
fixed. But if the technician has fixed the problem, and has filled in the details of what
was done, then the current ticket can be marked as a “solution” ticket, making it
bubble to the top of future search results.
• Whether this ticket contains an end‐user consumable, self‐service solution. Many
ticket‐tracking systems these days include both “public” and “private” notes fields,
helping to ensure that end users don’t see anything that they might find upsetting or
insulting, while accepting the fact that technicians will put that stuff into a ticket
sometimes. By having a specific indication of which tickets are consumable outside
the IT team, and which ones specifically contain user‐implementable solutions, you
can build a self‐service knowledge base that actually works.
Figure 5.6 shows how a system might implement this—in this case, rather than a checkbox,
the system uses a “visibility” drop‐down to change a ticket from being held to being
published.


68

Figure 5.6: Controlling ticket visibility.
Simply the presence of these check boxes (or other indicators) can help to remind
technicians that documented solutions are desirable. From a management perspective,
organizations might set quotas for technicians: At least 75% of the tickets you close must
either contain a detailed solution or refer to the ticket that does contain the solution.
Metrics like that are manageable through ticket system reports, and can help ensure that
ickets really do begin to serve as a basis for knowledge retention. t


69
Making Tickets an Asset
The overall idea is to take your tickets from being a way of tracking problems and work to
being a complete life cycle for problem‐solving. In order to be effective, tickets‐as‐a‐
solution has to overcome some of the common human behaviors and implementation
issues that have often been hurdles in the past:
• Technicians don’t always search the ticket database—so that should happen
automatically to some degree, with tickets being offered as potential solutions.
• Technicians’ search skills aren’t always that great—so a unified system should,
using the information it already has at its disposal, make a first attempt at finding
relevant tickets.
• Technicians’ writing skills aren’t always a priority—so the system should emphasize
the need for complete solutions, management should focus on that as a metric, and
technicians should be able to offer both “internal” and “externally‐consumable”
versions of a solution, when appropriate.
With the right system—particularly one connected to your monitoring system, making a
truly unified environment—solutions to problems can be just a click away.
Past Performance Is an Indication of Future Results
Another way to use historical data is in developing service level expectations. I’m
deliberately avoiding the phrase “service level agreement” because an SLA is a formal
document that often includes some element of an organization’s politics. A service level
expectation, however, is the level of service that, based upon past performance, you can
realistically expect to achieve in the future. An SLA will ideally be based upon those real‐
world expectations—if you can provide them.
One issue with many organizations’ SLAs is that they’re not actually based in reality.
Someone will either make up an ambitious goal to “look good,” like promising 99.999%
availability and then boldly stating that they will just “manage to that number.” Other
times, someone will take an overly‐cautious approach when establishing service levels,
forcing the organization to expect a lesser level of service than they realistically could.
At fault are the tools we use. This gets all the way back to the first chapter of this book,
when I wrote about the silos that IT tends to work in, and the varying domain‐specific tools
we rely on to troubleshoot and solve problems. Those same domain‐specific tools are what
we use to measure our existing performance levels. Because those tools don’t all speak a
common language or use a common set of metrics, it’s actually really tough to figure out
what our actual service levels are.

70

The bottom line is this: You have an existing environment. All political and internal issues
aside, your existing infrastructure is capable of delivering some level of technically‐
measurable performance. You just need to discover what that is, using a common and
easily‐communicated set of metrics, based upon your infrastructure’s current capability.
You can’t really do that by using a hodgepodge of domain‐specific tools, though—and you
really can’t do it when your infrastructure starts to contain outsourced elements. Start
bringing in cloud computing platforms, collocated servers, software as a service (SaaS)
elements, and so on, and you’ll find that your domain‐specific tools just can’t provide
enough information. So how can you establish a good service level expectation?
This drives right back to the previous chapters in this book. Say you’ve got a complex set of
services and applications—who doesn’t these days? Figure 5.7 shows an infrastructure
offering a lot of different elements, some inside the data center, some out.

Figure 5.7: Modern environments include many different components.

71

You start your measuring at the one place it matters most: the end user. Put some probes,
agents, synthetic transactions, or whatever else you need in place to figure out what your
users are actually seeing, today, in terms of performance. Monitor that over several days
that represent real, normal workload—no fair picking holidays as your day to monitor—
and you’ll know what your infrastructure is actually providing. It stands to reason that you
can’t expect any better but that you also shouldn’t put up with anything worse. If that
service level expectation isn’t as good as your SLA—well, that’s fine. You can start looking
for areas where you can improve, bringing things up to that SLA level.
You’ll also want to capture individual performance from each component—and this is
where things can get tricky. It’s crucial that, at this level of monitoring, you get everything
onto a single console, in a single language, using a single set of metrics. What you’re looking
for is a performance range for each component that represents a normal workday.
Provided each component operates within that observed range, you should be delivering
the end‐user experience that you measured. Those ranges provide the basis for your
monitoring thresholds: Anything outside those ranges is something you need to be alerted
to.
With that service level expectation established, you can start measuring different workload
levels. See how things look on an especially busy day, for example, and what they look like
on a light day (this is where it’s okay to pick a holiday, for example). You’ll start to get a feel
for how your end user experience differs under those different workloads, and how the
elements of your infrastructure change under different workloads.
Making sure that any outsourced elements are included in all of this is, of course, absolutely
crucial. As I’ve pointed out in earlier chapters, monitoring those is a bit different than
monitoring things that live inside your own data center. You’ll either need a unified
monitoring solution that’s truly capable of hybrid monitoring, or you’ll need a special set of
tools to gather performance information on those outsourced pieces.
Notice that I’ve laid out two sets of metrics for you to monitor: performance and workload.
Too often, I see SLAs that don’t take the workload into account. “We will provide a 100ms
response time.” Okay—under what workload, specifically? Because maybe I can give you a
100ms response time under what I consider to be a normal workload, but if you start
loading on additional users and functions, then that response time is obviously going to fall
off. Again, monitoring solutions can help with this by not only measuring performance of
things like processor, memory, disk, and so forth, but also workload, like the number of
transactions being processed, the number of network packets being routed, and so on. It’s
important that your performance expectations include that workload context so that you
an begin to make better service level agreements in the future. c


72
It’s the Performance Database
All of this performance data needs to not only be captured but also stored. That’s where a
lot of monitoring solutions miss the point: They’re monitoring in real‐time, and they’re
alerting to problems, but they’re not always saving the data they see. Let’s expand the
example application to include that performance database—it’s in Figure 5.8.

Figure 5.8: Adding a performance databas

e to the environment.

73

The point of this figure is simply that you need to get performance data from every
component—even the outsourced ones—into that database. Why? Two reasons:
• This database is what’s going to show you what your performance really looks like
on what you consider to be a normal day. This is where your performance
expectations will come from, and it’s hopefully what you’ll use to derive more
realistic and meaningful SLAs.
• This database is what’s going to tell you when your performance is trending away
from previously‐established norms. I’m not referring here to a situation where one
component’s performance goes wonky due to a problem—live monitoring and
alerting will take care of that. The database is there to spot the long‐term trends:
“Hey, did you know that performance is down 1% from last month, which was down
.75% from the month before? At this rate, you’ll be unable to meet your SLAs in 6
months.”
And frankly, a good monitoring solution shouldn’t even show you the prototypical “trend
line” in performance as its first step. The first step should be a simple dashboard: “You’re
meeting your SLA, and based on current trends, will continue to do so for the foreseeable
future.” Or, “You’re meeting your SLA—but barely, and based on trends, you aren’t going to
be able to meet your SLA for more than a month or two.”
From there, you can drill down into graphs and charts that give you more detail so that you
can find the component or components that look like the current bottleneck, and start
making plans to get more capacity in place before you miss your SLA.
Summary
Embracing the past to make a better future—that’s been the theme of this chapter.
Whether you’re gathering ticket‐resolution information so that it can be used to solve
future problems more quickly, or gathering performance information so that you can
establish expectations and predict capacity, it’s all about keeping historical data and
leveraging it to put the organization on a better footing for tomorrow.
Coming Up Next…
In the last chapter of this book, we’re going to step all the way back to the beginning and
look at unified management from a case study perspective. I’ll use my consulting and field
experience to construct a composite case study, drawing elements of unified management
together to show you what a modern, truly‐unified environment can look like. I’ll share
specific problems from each environment, and explain how unified management helped
solve those problems more quickly and effectively.


74
Chapter 6: Unified Management,
Illustrated
In this final chapter of the book, I want to revisit everything from the first five chapters.
However, I’m going to do so in the form of case studies. I’ve been fortunate enough to speak
with several consulting clients of mine who’ve been struggling with the same issues I’ve
outlined, and who’ve recently been trying solutions that follow the basic approach I’ve
described. They’ve agreed to let me share their stories (although they’ve asked that I not
use their names or company names) so that you can get a before‐and‐after look at how this
“unified management” thing should work. Along the way, I’ll also share some of the
challenges and roadblocks they’ve encountered. A switch to unified management isn’t
always going to be hassle‐free, so I think it’s valuable for you to see what they’ve had to
deal with, and how they think they’re going to do so.
This chapter will also include some of the practical information on unified management
that hasn’t made it into the previous chapters. I’ll provide a consolidated shopping list of
unified management features so that when you start examining solutions, you can have that
list in hand to help you. I’ll also look at different purchasing models that vendors are
offering these days to give you an idea what kind of flexibility you might have for acquiring
and implementing a solution.
The Case Studies
A unified management solution has to provide features for what I believe are two distinct
broad use cases. The first is in helping you to react to problems, while the second helps you
manage non‐problem requests—such as requests for changes within the environment. I’m
going to provide two distinct stories for each of these. They’re actually both drawn from the
same consulting customer, although you’ll meet different people from those organizations
in each narrative.
Detecting and Solving Problems
Lisa is a senior systems administrator, responsible primarily for the Windows‐based
systems in her environment. Her counterpart, Peter, is responsible for the company’s Unix‐
and Linux‐based server infrastructure. Both have considerable areas of overlapping
responsibility, as many of the company’s line‐of‐business (LOB) applications rely on both
Windows‐ and *nix‐based resources.

75

“It isn’t just the servers, of course,” Lisa told me. “It’s what’s running on those servers:
databases, Web services, you name it. Someone else supports those different pieces, so
there used to be a lot of time spent arguing about whose fault something was.”
I asked her for an example of how things worked in their environment prior to
implementing a unified management system. She laughed and brought out a file that she’d
clearly held on to for some time. It looked like the text from a Help desk ticket’s notes.
Here’s the complete text, with names edited; I’ve added some [editorial] notes for items
that I had to ask Lisa to explain.
OPENED BY HelpDesk AT 2009‐06‐14 13:34
User states that BOS [an LOB application] is extremely slow. Have several e‐
r BOSDB02 responding slowly to pings. mails about this in the q also. Serve
ASSIGNED TO LHarte [this is Lisa]
NOTES BY LHarte AT 2009‐06‐14 15:26
BOSDB02 is working fine, apart from the fact that SQL is hogging 100% of the
CPU. Passing to DBA.
ASSIGNED TO DShields
NOTES BY DShields AT 2009‐06‐14 16:53
Probably the indexes again, SQL is taking longer to complete queries than it
tonight should. Will schedule indexes to be rebuilt
NOTES BY HelpDesk at 2009‐06‐15 10:44
Still getting calls on this
Indexes rebuilt
ASSIGNED TO HelpDesk
NOTES BY HelpDesk AT 2009‐06‐15 11:34
SDB02 is still slow to ping Still getting calls that BO
SQL is still slow—looks like it is in disk IO. Fragmented disk? Need server
support.
ASSIGNED TO LHarte
NOTES BY LHarte AT 2009‐06‐15 13:47
Server disk shows less than 2% frag—not the problem. IO is slow because
s. Maybe your DB is fragged. I’ll call you. SQL is thrashing the disk

76

The conversation clearly went offline at that point because the next entry simply indicates
“problem resolved.” Unfortunately, there was no official documentation of what went
wrong or what was done to fix it, but Lisa explained. “We kept going back and forth
between us—he’d see something in Performance Monitor that looked like the server was
slow, and bounce it to me, and I’d tell him that it’s because his SQL Server was causing the
problem and bounce it right back. I don’t even have permission to look inside SQL Server,
and he just kept wanting to get the ticket out of his queue.
“In the end, it actually turned out to be a problem with the SAN, which was Peter’s problem.
Something had gone wrong with our main SAN connection and we were on a slower
backup link, and something was wrong with that link’s configuration, so it wasn’t running
at full speed or something. We were seeing it as slow disk IO because Windows obviously
thinks that the SAN is just one big locally‐attached volume. We were running all kinds of
tests on the server and in SQL Server to try and find the problem, but none of our tools
were able to realize that the real problem was further under the hood someplace.”
Peter recalled the incident. “It was weird because there wasn’t anything actually broken, so
none of the tools I use to monitor the SAN gave off any alerts. The problem was a
configuration problem on several of our hosts. The tools don’t see that as broken, of course,
although it was causing them to access the SAN a lot slower than they’re used to.
“The real problem was that this cropped up on about seven machines all at once. We didn’t
correlate the problem at first, because every single machine was affected slightly
differently because they all use the SAN for different purposes. There’s only one major
database on that SAN, but there’s a small Web farm and a file server. So the symptoms the
users saw were different, and the problems were all routed to different people to handle. It
was the file server guys who brought the problem to me. They saw the disk queue length
going up pretty dramatically, and they knew that had to be the SAN, so I got involved.”
“That was the problem we dealt with all the time back then,” Lisa said. “We all focused
specifically on the bit we were responsible for, but these days there are so many
interactions and dependencies that we can’t see from a tool level that we’d get all tied up
when a problem happened.”
I also spoke with Kevin, who manages the company’s Help desk. He says those types of
problems were especially trying for his team because users would keep calling and the
Help desk had no idea what was related and what wasn’t, or what the status of anything
was. “Users would call in with something that sounded new to whoever answered the
phone, so they’d open a new ticket. We were probably slowing down whoever was trying to
fix the problem just by loading new tickets on them for the same problem. But we had no
real communication. If you answered the phone, you looked to see if there was an open
ticket on anything that sounded similar. But there was no one place where we kept track of
all the currently‐open problems. I finally just had a white board installed in the Help desk
office, and outstanding problems would get written up there. So when a call came in you
could at least look to see if the problem was already open, then look up that ticket to see
what was happening and give some status to the user on the phone.”

77

I asked Lisa how things worked now, after the company had implemented a unified
management system. “We’ve been on it for about a year now,” she told me, “and it’s
completely different.” She showed me a ticket for a problem that had occurred recently.
“This is what we see, now.”
ALARM 2011‐06‐14 12:13:42
NODE Windows Server BOSDB02
SQL Server Instance DEFAULT
erver response time exceeds threshold SYMPTOM: SQL S
IP: 10.10.15.212
SQL Server database shows 34% free
SQL Server fragmentation shows <5%
Disk queue length <1
Network utilization <40%
CPU utilization <60%
Memory utilization <75%
RELATED ALARM 2011‐06‐14 12:10:52
NODE Router MBS3667
Interface fault
“Just looking at that, I can start to guess what the problem is.” She showed me the
monitoring console that the entire IT team now worked from, which looks similar to Figure
6.1. “You can see that it’s basically a network diagram. It shows the servers and the services
they run, but it also shows things like routers and switches. So when a server alarms, it’ll
also look for alarms on any dependencies, like a router. In this case, we had a router
interface that was going bad and starting to drop packets. That triggered an alarm right
away to the router guy, but it also alarmed all of the servers that use that router to
communicate because clients—and the monitoring system—saw the servers’ response
times go up. Just having that data in front of us saves a ton of time testing for problems. The
system basically runs a series of basic checks whenever there’s a problem, so it gets those
preliminary steps out of the way for us.”


78

Figure 6.1: Tracing alarms visually.
She said the team spends a lot less time passing problems back and forth because it’s
k. usually much clearer where the problem lies when the system is looking at the entire stac
“This is huge when the problem is actually outside the data center. We have a number of
applications that interface with SalesForce.com, and whenever those guys have a problem,
or more commonly when our ISP gets a little slow, our users see it as ‘our’ application being
slow. But the monitoring system knows about the dependencies, and it’s usually already
alarmed us. We’ll post a message about the affected applications on our end, and start
calling the service provider to log a ticket with them.”

79

Posting a message, Kevin says, has helped the Help desk tremendously. “We have this Web
portal where users can log tickets, and current system status is shown right there. So
before they even open a ticket, they can see that we know there’s a problem. Once we
trained them to trust us on that, they stopped logging duplicate tickets.”
He admits that the training was a big step. “We didn’t do it initially,” he said, “but once
users realized we were being pretty honest and consistent about posting problems, they
started to trust us more. We had a big communications effort, and now there’s even a
mailing list users can add themselves to so that they get a message whenever a system they
use is affected. Being proactive cuts back on the Help desk volume a ton.”
The benefits of a unified management system were pretty clear for this team: faster time to
resolution, less passing the buck, and more proactive communications with their end users.
The biggest challenge they faced?
“A trust thing,” Lisa told me. “We had to learn to trust this new system to monitor
everything as well as we could with the tools we were familiar with. So the first few times
things went wrong, we went right back to what we knew to troubleshoot the problem. Once
we realized that we were seeing the same data, we started trusting the new system more,
and just started relying on it. We’ll still dig out the old tools if we have to dive deep into an
affected system, but by the time we do that we know the problem is in that system, so we’re
not wasting time. You don’t pass the buck to someone else at that point, you stay in that
problem area until you spot the problem.”
Fulfilling User Orders
Kevin provides the link to the other side of the unified management story. “We’re not just
responsible for opening tickets for problems. We also open tickets when routine changes
need to be made.” I asked him to give me an example of how this was handled prior to the
implem , and he pulled out an archived ticket. entation of their unified management system
OPENED BY HelpDesk AT 2010‐08‐12 15:50
User BDOUDS needs a new SharePoint site deployed as
iversitybid. User will be site admin. intranet/projects/un
ASSIGNED TO JHoltz
NOTES BY JHoltz AT 2010‐08‐13 08:27
Sent e‐mail to Bill’s manager confirming. Also sent e‐mail to Special Projects
confirming.
waiting to hear from Special Projects. Bill’s manager, KHICKEY, confirms. Still
eft VM. Still waiting to hear from Special Projects. L
NOTES BY HelpDesk AT 2010‐08‐20 10:34
User is asking for status.


80
Tell him to call Special Projects. I just need them to confirm since this comes
out of their budget.
d BDOUDS as site owner. Special Projects confirmed. Set up site and assigne
STATUS SET TO RESOLVED AT 2010‐08‐22 13:12
“That kind of thing went on all the time. Someone would call us asking for some access or
whatever. We’d assign the ticket to someone in IT, but then they’d spend time figuring out
who was responsible. We used to have a big book,” he added, pointing to a thick three‐ring
binder on his shelf, “that told us who was responsible for pretty much everything. Then
you’d wait and wait to hear back from them. This one took, what, two weeks to resolve?
That’s insane, and the whole time the user is calling us to check on the status, when we’re
eff 10 minutes to do once he got approval.” not the ones holding things up. This took J
And in the world of unified management?
“It’s actually pretty cool,” Kevin said. “Now we have a big online catalog with everything a
user might want. It’s kind of like an online store. They submit their request through there,
and the system opens a ticket automatically. But each item is associated with a workflow,
so IT doesn’t even hear about it until the ticket has been routed through the proper
approvers and been approved. Once we see it, it’s a done deal, so we just implement it. For
some things, we’re even implementing scripts that do the implementation for us, so it’s
completely hands‐off.” The organization worked out, and documented, the desired
workflows for each possible product. Kevin provided an example of that documentation,
shown in Figure 6.2. “This kind of documentation is important because we worked off of
this to implement the workflows. The business owners can come up with these flowcharts
on their own, then we just implement them on the designated products in the catalog.”


81

Figure 6.2: Documented workflow used to drive automated review/approvals for
catalog requests.

82

We were discussing access permissions as an example, so I asked what happened when
those needed to change. “They never did,” Kevin admitted. “Once you had access, you
usually kept it until you left the company. We just didn’t keep track of it. Now, the catalog
keeps track of it. If you don’t need something, you ‘return’ it to the store, it goes through
whatever approvals, and we get a ticket to remove your access. Different managers also
have to occasionally complete an attestation, which is where they review who has access to
their resources and let us know if anyone needs to be removed, or if everyone can stay.
We’re not the gatekeepers anymore.”
I noted that an automated workflow wouldn’t necessarily guarantee a speedy response
time. “Oh, users still have to wait two weeks for approvals, sometimes. But when they
submit their request through the catalog, they can check the status of that request on their
own. They can see that it hasn’t made it to us, and they can take it on themselves to bug
their manager or whatever. We’re totally out of the loop until it’s approved, and they know
that, because the request shows it hasn’t even made it to us yet.” Such a system does a
better job of keeping users informed, and helping them to understand what’s really holding
things up.
A Shopping List for Unified IT Management
I want to use this section to present a list of what I believe are the must‐have features of a
true unified management system. As you’re evaluating solutions, make sure they offer
these features—and make sure the features operate in a way that makes sense for your
environment’s needs.
• Workflow. Unified management solutions should offer workflows that can help
automate responses and service management. Workflow construction should be as
drag‐and‐drop as possible, involving as little programming as possible.
• Agents. I know there’s a huge divide between people who are fine with deploying
agents, and those who hate the idea; I’d suggest looking for a solution that supports
both models. Agentless data collection is fine in some instances, although it can offer
less performance and coverage than an installed agent. I think a hybrid approach is
probably best for most organizations, and unified monitoring solutions ought to
support that.
• Alarm integration. When a problem arises, a unified management solution should
obviously tell the designated individuals; it should also open a Help desk ticket and
automatically search for related alarms from the past. Doing so will help speed up
the time to resolution. This kind of “knowledge automation” is really crucial.
• Approvals. As I’ve pointed out, “tickets” aren’t always for problems—sometimes
they’re for new work, like change requests. A unified management system should
support a review/approval workflow for these requests so that IT can be taken out
of its traditional “gatekeeper” role and instead simply work those tickets that have
been approved for implementation by the business.

83

• Discovery and deployment. A unified management solution should help you
discover manageable nodes and services and deploy any necessary agents to
monitor them. This discovery should happen more or less continuously, or at least
be able to be run regularly, so that changes to your environment can be captured.
• Routing. Tickets—whether for problems or for requests—should be automatically
routed based on custom business rules that you can define. In other words, tickets
should head straight to the correct implementer as quickly as possible.
• Scheduling. A unified management system should have some kind of internal
calendar that lets you schedule maintenance tasks. This functionality helps to
resolve maintenance window conflicts and schedule work to happen at the right
time.
• Catalog. This is a key part of making a unified management solution part of a self‐
service, managed system. In addition, a catalog helps work toward bringing process
compliance—such as ITIL compliance—into your environment. A catalog provides
users with a list of “orderable products,” not unlike shopping at an online Web host.
Users’ “purchases” translate into tickets, which go through review/approval prior to
being passed to IT for implementation.
• Communications. Users need to be able to submit requests, and users and your team
must be able to review them from a familiar place. A Web portal is the traditional
way to enable this communication, but systems that can integrate via users’
inboxes—which they’re in all the time, anyway—is even better.
• Interface. You can’t have too many interfaces into a unified management system,
and whatever solution you pick should offer both Web‐based and mobile‐friendly
versions of its UI.
• Metering. If you’re monitoring actual paying customers, you’ll need the ability to
charge them for what they use. Even if you’re just dealing with internal “customers,”
being able to perform “charge backs” for their IT resource consumption is going to
be critical as business managers advance their management strategies. There’s no
reason for IT to be seen purely as overhead when resources can—and should—be
tracked back to the business components that are consuming them.
• SLAs. A unified management system should assist you in both defining and
monitoring service level agreements (SLAs) based on actual historic trends.
• Trends. A unified management solution should include a performance database that
lets you track historical performance trends. This database can be used to help
define and report on SLAs as well as perform capacity planning.
• Surveys. Closing the loop with your end users is crucial because technical SLAs
aren’t the only way your success is being measured, whether you know it or not.
Being able to poll users helps you define SLAs in their terms, creating more
appropriate expectations.

84

• Reports. Look for reports and dashboards that provide managerial‐ and executive‐
level views of items such as workload, SLA compliance, and so forth. Heck, even
dashboards that can be exposed to end users, helping them see that the
environment is performing as it should, can go a long way toward helping IT be seen
as more responsive and engaged with the business.
• Visualization. Being able to visualize your environment can help make root cause
analysis and problem resolution faster and easier.
• Everything in one place. As I’ve written several times in this guide, a unified
management system’s primary value is unity, or the ability to get all your
performance concerns into a single place, using a single set of metrics, alarms,
identifiers, and so forth. This singular view helps to break down the traditional
domain‐based “silos” that IT is built around, and gets everyone focused on the root
cause of a problem more quickly.
• Knowledge retention. A unified management system should help your organization
retain critical knowledge by turning Help desk tickets into an automated, searchable
knowledge base.
• Pre‐loading information. When an alarm generates a ticket, that ticket should
include whatever details the unified management system can provide: IP addresses,
response times, and so forth. The more information included in the ticket, the less
the responder has to go look up, and the sooner they can start working on resolving
the problem.
This list obviously isn’t comprehensive but provides a starting point. If a potential solution
offers these features and meets your organization’s specific needs, that solution is probably
worth looking at in detail during an evaluation. Make sure you gain not only a “check mark”
on these features but also a detailed explanation of how they’re implemented. Also, ensure
that the implementation is one that will work within your organization’s requirements.
Ways to Buy Your Unified IT
I want to briefly outline different approaches that vendors take for delivering unified
management solutions. Let me emphasize up front that I don’t regard any of these as
“right” or “wrong;” there’s merely “what’s right for you,” which you’ll need to decide on
your own.
Typically, you’ll find that solutions of this kind are priced based on the number of nodes
you need to manage, possibly also incorporating the number of users in your organization.
A “node” is typically defined as any manageable device: a router, a server, and so forth.
Some vendors are more creative than others with this portion of their licensing model;
don’t let a complex model scare you off. In some cases, more complex license models are
actually to your benefit because vendors are trying to precisely accommodate a wide range
of scenarios. You should be more concerned about what you’re licensing.

85

For example, at one end of the spectrum, you’ll find what I call monolithic solutions. With
these, you get—and pay for—every feature that the vendor offers, regardless of which ones
you’ll need right away. I think it’s hugely important to make sure you’re acquiring a
solution that can do everything you want, although I’m not sure you necessarily want to
pay for all of that up front. In some cases, you may want to implement a solution in a
phased approach, licensing just the functionality you need for each phase, thus allowing
yourself to kind of “ramp up” into the full licensing and functionality of a product. The nice
thing about monolithic solutions is that they’re often well‐integrated because everything is
u in a single piece. delivered to yo
There are also pluggable frameworks. I tend to view big frameworks like HP OpenView as
fitting into this kind of model. With these solutions, you buy a base product, then add in the
various bits and pieces you need to speak to your environment. These models offer a ton of
flexibility, of course, and if you’re going with a big enough vendor, you should be able to
find plug‐ins for every bit of functionality you need. These solutions run the risk of
becoming a massive do‐it‐yourself project, though, and the plug‐ins aren’t always as well‐
integrated as you might like. Licensing can also be really, really complex because you’re
g‐ins separatoften licensing the plu ely from the base framework.
Another model is the pay as you go approach. With this model, the solution offers all the
functionality you might ever need, but you don’t “switch it all on” right away. Instead, you
turn on the modules, or functionality, that you need immediately, and you just pay for that.
As you add more responsibility to the solution, you pay a bit more. This setup is a bit more
like a “cloud” model, where you can grow as large as you like but only pay for what you
need right now. You’re not typically dealing with plug‐ins, or if you are, they’re usually all
. delivered by the same solution vendor. I’m seeing more clients considering this approach
The last thing you’ll need to think about is where the solution will live. In this age of “the
cloud,” you actually have a choice of hosting your monitoring and management solution in
your own data center or simply purchasing it as a hosted service that lives in the vendor’s
data center. Either way, the vendor’s agents get installed into your environment. I won’t dig
into the “on‐premise versus hosted” debate; you probably know what’s right for you, and
you can certainly discuss that option with whatever solution vendors you’re investigating.
Regardless which side of that debate you’re on, I think it’s nice to have a solution that offers
both options.
Conclusion
Where, there you have it: unified management. The overall idea behind this book was
simple: really focusing on the straightforward theme of “get everything in one place, and
get everyone on one page.” It’s only “revolutionary” compared with the disjointed approach
that our existing technology tools have more or less forced us into.

86

Of course, I don’t expect you to just rush right out and start switching over to a new
monitoring and management framework. These things can be done in small steps so that
they create less impact on your organization and allow you to learn to use various
techniques and features properly in an organic, rather than disruptive, fashion.
The goal should be there: Stop wasting time with the back‐and‐forth and instead get
yourself onto a single pane of glass for your organization’s top‐level monitoring. Integrate
that with a Help desk system that lets you keep everyone informed and gives you the
need to analyze your IT performance objectively. metrics you
Good luck.

E book creating-unified-it-monitoring-and-management-in-your-environment-chap-1-6

More Related Content

Similar to E book creating-unified-it-monitoring-and-management-in-your-environment-chap-1-6

Recently uploaded

E book creating-unified-it-monitoring-and-management-in-your-environment-chap-1-6