Presented at the University of Iowa IT Tech Forum on May 30, 2013. If the history of system administration resembles firefighting, then the future more closely resembles city planning. As IT systems grow in complexity and scale, system administrators are uniquely positioned to help their organizations “connect the dots” by identifying the information that will improve decision-making and then communicating it in the right context. Topics will include: making service-level metrics part of routine system monitoring, reducing the on-call burden and shifting the balance from reactive to proactive work, anticipating useful diagnostic data, capturing historical data to identify trends, using forecasts to manage capacity and prevent problems before they occur, sharing information throughout the organization using blogs and other media, and how to demonstrate the value of these efforts to senior management.
3. 1995: My first numeric pager
1999: Upgraded to alphanumeric (fancy!)
Now: iPhone, PagerDuty, push notifications, e-mail alerts
Tomorrow: Nagios for Google Glass?
19. David S. Shafer
ITS Storage Services
The University of Iowa
david-shafer@uiowa.edu
DavidScottShafer
@DavidSShafer
QUESTIONS?
Editor's Notes
My name is Dave Shafer, and I’m going to talk about a bunch of random things. We’ll see if we can find some connection between them. (I’m serious… This is pretty random.)
There are many ways to spell my last name. You can see the one I use, though I’ll respond to any of them. I’ve received mail for all of these at one time or another. (If we’re being honest, I don’t think “Shafner” is really an option, but that didn’t stop someone from using that variation once.) I managed to snag david.shafer@gmail.com early on, during the decade when Gmail was considered Beta. Now all the other David Shafers with Gmail accounts can’t remember their addresses, so when they give somebody the wrong address, I end up getting their mail. I’ve received e-mail for a state legislator in Georgia, a guy who works in craft services in Hollywood, and a racist amateur pilot in Texas. I work with the ITS Storage Services group along with Susanne Branson, Seth Clarke, and Mark Weber. Our group is responsible for 1.5 petabytes of disk storage for applications you use every day, from personal file storage to ICON and institutional databases. I started in ITS 14 years ago as a Unix systems administrator, and 7 years ago I moved from Unix administration to storage services. For the past 14 years, and 4 years before that, I’ve been on-call.
My first pager in 1995 was a numeric pager. It couldn’t even do text. Then I upgraded to an alphanumeric pager when I started in ITS in 1999. I thought that was pretty fancy, and I should’ve stopped there, because sometime later I got a Motorola RAZR and figured out how to check my e-mail on it, and it was all downhill from there. Now I’ve got my iPhone and push notifications from PagerDuty and HawkAlerts and I’m pretty much always tethered to the University. This scenario will sound familiar to many of you: You’re sound asleep, burrowed under the covers getting some well-deserved rest, when it happens. Your phone is trying to get your attention.
(If you’re like me, your phone is repeating the chorus from Foreigner’s 1981 hit, “Urgent”.) You squint to see the time and think, “What could possibly be broken on a Thursday night at 3 a.m.?” But you already know the answer, and it’s bad. You instantly go into troubleshooter mode. Because you are the fixer. You’re going to fix this. You’ll wrestle with servers, you’ll tame wild processes, you’ll beat HTML forms into submission. You will emerge victorious, and you will be the hero.
The problem is that being a hero all the time takes a toll. As budgets get leaner, and systems get more complex, it gets harder to keep up with all the fires. The basic dysfunction is that we spend so much time reacting to incidents, we don’t have time to manage our systems proactively, which would prevent the incidents from happening in the first place and save time in the long run. We don’t notice the application that’s been slower for the past month. We don’t see the disks that are filling up twice as fast. We’re fighting fires when we should have been looking for smoke. The worst part is that we know we’re doing it, but we can’t break out of the cycle. In our data centers, we monitor for smoke in a literal sense. But we don’t always monitor for the metaphorical smoke. We have access to so much data, but we don’t use it to make better decisions _or_ to give clarity to the rest of the organization. Today the I.T. hero isn’t the person who puts out the fires. The hero is the person who uses complex data in clever ways to make plans, to explain things to others, and to prevent the fires from happening. The hero is the person who can look at all that data and help themselves and others to connect the dots. (See how I worked the title in there?)
As a first step, you have to take control of incident response. The emergencies will happen. You can’t stop them altogether, but you can make them less painful and start to understand the size of the problem. When I first started in ITS, we used (and still use) a system called Spong, or “Son of Pong” (“pong” being the only appropriate response to a “ping”). That worked great for servers, but not for proprietary storage systems. When a storage system has a problem, it expects to send an e-mail. If you’re lucky, you can get it to e-mail you a text message. As we added more storage systems, this model wasn’t really sustainable. One of the most important things we did was move the alerting function to a service called PagerDuty (www.pagerduty.com), which we started using in February 2012. All of our systems send their alerts to PagerDuty’s servers. Through the PagerDuty web interface, we’re able to define groups of systems, on-call schedules and overrides, escalation policies, and notification methods. When an incident happens, PagerDuty decides who should be notified based on the schedule. It’s not free (the cost is about $16/month per system administrator), but it’s been worth every penny.
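To make the idea of schedules and escalation policies concrete, here is a minimal Python sketch of how "who gets notified" could be decided. It is only an illustration of the concept, not PagerDuty's implementation or API; the `Shift` type, the function names, the 15-minute acknowledgement timeout, and the people in the usage example are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Shift:
    person: str
    start: datetime
    end: datetime  # exclusive


def on_call(shifts, when):
    """Return whoever's shift covers `when`, or None if nobody is scheduled."""
    for shift in shifts:
        if shift.start <= when < shift.end:
            return shift.person
    return None


def who_to_notify(levels, opened_at, now, ack_timeout=timedelta(minutes=15)):
    """Pick the escalation level by how long the incident has gone unacknowledged.

    `levels` is a list of schedules (each a list of Shift), primary first.
    Every `ack_timeout` without an acknowledgement moves one level up,
    stopping at the last level.
    """
    steps = int((now - opened_at) / ack_timeout)
    level = min(steps, len(levels) - 1)
    return on_call(levels[level], now)


if __name__ == "__main__":
    opened = datetime(2013, 5, 30, 3, 0)  # the 3 a.m. page
    primary = [Shift("alice", opened - timedelta(hours=3), opened + timedelta(hours=5))]
    secondary = [Shift("bob", opened - timedelta(hours=3), opened + timedelta(hours=5))]
    print(who_to_notify([primary, secondary], opened, opened + timedelta(minutes=20)))
```

The point of the sketch is that the decision depends only on the schedule and the clock, so nobody has to remember who is on call at 3 a.m.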
This is a graph of incidents received by PagerDuty for the past year. Even as our storage systems have continued to grow, both in size and complexity, the number of alertable incidents has actually decreased, because now we’re better able to 1) filter out the noise, and 2) identify the recurring problems: things that need a long-term solution instead of a quick fix. PagerDuty has worked so well for us that we’ve expanded it to two other groups in ITS: Core I.T. Facilities, and the DNA Team. (Educational discounts are available, so talk to me if you’d like to try it out.)
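One way to spot recurring problems in an incident history is simply to count repeats. The sketch below assumes the incident log has been exported as a list of dictionaries with hypothetical `service` and `summary` fields; the field names, the sample values, and the threshold of three are illustrative assumptions, not a real export format.

```python
from collections import Counter


def recurring_incidents(incidents, threshold=3):
    """Return ((service, summary), count) pairs seen at least `threshold` times.

    Results are ordered from most to least frequent, so the problems most
    in need of a long-term fix come first.
    """
    counts = Counter((i["service"], i["summary"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    log = (
        [{"service": "san-01", "summary": "path failover"}] * 4
        + [{"service": "nas-02", "summary": "volume 90% full"}] * 3
        + [{"service": "nas-02", "summary": "fan warning"}]
    )
    for (service, summary), count in recurring_incidents(log):
        print(f"{count}x {service}: {summary}")
```

Anything that falls below the threshold is a candidate for noise filtering; anything well above it probably deserves a project rather than another quick fix.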
The service catalog is an area where we’re still improving. We’ve defined our available storage services, and the pricing. Now that we’ve finished the move to the new data center, we’ll be revisiting the service definitions and we’ll have a new site detailing the infrastructure. Where I think we can also improve is in defining service level objectives, especially for performance and uptime. We have some rough guidelines (for example, we want our high-performance EMC SAN storage to provide response times at or below 20 milliseconds), but we don’t have a good way to get alerts when we exceed those performance objectives. This will be something we continue to work on over the next year. We also need to quantify uptime: I know we’ve had very few outages, and people seem to be happy with our uptime, but I want to put some numbers on it.
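As a sketch of what checking these objectives might look like, the Python below flags sustained runs of response times above the 20-millisecond guideline and turns outage minutes into an uptime percentage. The sample format, the "three consecutive samples" rule, and the function names are assumptions for illustration; they do not describe any existing ITS tooling.

```python
def slo_breaches(samples, limit_ms=20.0, min_consecutive=3):
    """Return the start timestamps of sustained breaches.

    `samples` is a list of (timestamp, latency_ms) pairs in time order.
    A breach is reported once, at the first sample of a run of at least
    `min_consecutive` samples above `limit_ms`, so one slow reading does
    not page anybody.
    """
    breaches = []
    run_start, run_length = None, 0
    for timestamp, latency_ms in samples:
        if latency_ms > limit_ms:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length == min_consecutive:
                breaches.append(run_start)
        else:
            run_length = 0
    return breaches


def uptime_percent(period_minutes, outage_minutes):
    """Uptime over a period, given a list of outage durations in minutes."""
    return 100.0 * (period_minutes - sum(outage_minutes)) / period_minutes


if __name__ == "__main__":
    readings = [(0, 12), (1, 25), (2, 30), (3, 22), (4, 10), (5, 40)]
    print(slo_breaches(readings))
    print(round(uptime_percent(30 * 24 * 60, [43.2]), 3))
```

Requiring several consecutive samples is a design choice that trades a little detection delay for far fewer false alarms.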
This leads to the next set of questions: We have so much information at our disposal, but do we have the right data when we need it, and can we relate it to other things that matter? Imagine you’re staring at a single data point, isolated. Maybe it’s a process completion time. Or a disk I/O rate. Or a page load time. It’s not terribly useful on its own. To make it useful, you have to relate the data in four directions: up, down, backward, and forward. Relate the data up to understand how it affects services. What other things will be impacted? Is it relevant to our service level objectives? Is it something we should communicate to management, other workgroups, or our users? How can we help them understand? Relate down to lower-level measures to understand deeper meaning, causes, and hidden factors. Relate backward in time to understand the data in historical context, establish benchmarks and baselines, and expose trends. And relate the data forward in time with forecasts and projections that help plan for future capacity. (My husband used to work in IT engineering at GoDaddy.com. GoDaddy’s customers are the people who own domain names and web sites, but the engineering team’s ultimate customer was GoDaddy.com owner and CEO Bob Parsons. They were able to summarize every service into one single metric: dollars per minute. When dollars per minute was in line with historical trends and forward-looking projections, the customer (Bob) was happy.)
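Relating data backward and forward can be as simple as fitting a straight line through historical usage and extending it. The sketch below uses an ordinary least-squares fit in plain Python to estimate how many days remain before a storage pool fills. The numbers are made up, and a straight line is an assumption: real growth is often seasonal or bursty, so treat the result as a rough planning aid rather than a prediction.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Project the trend forward and return days left after the last sample.

    Returns None when usage is flat or shrinking, since the trend line
    never reaches capacity in that case.
    """
    slope, intercept = linear_fit(days, used_tb)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - days[-1])


if __name__ == "__main__":
    # Weekly samples of used capacity in TB (illustrative values).
    print(days_until_full([0, 7, 14, 21], [100, 107, 114, 121], capacity_tb=150))
```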
In a perfect world, we’d have the tools to do all these things. But the reality is that we usually don’t. In the storage world, for instance, every storage platform has a different management console. Here you can see the management interfaces for our EMC, Dell EqualLogic, and NetApp systems. They don’t talk to each other, and they don’t always expose the data we need. And the problem isn’t unique to storage. In 2011, I worked with Steve Troester (ITS Network Services) to look at how different groups in ITS monitored their systems, and where we might be able to improve things. We found every group was doing something different, because the systems they were responsible for were so different. To this day, all of our monitoring tools are separate, so there’s no easy way to find correlations between the networks, storage, servers, databases, and applications. There’s also no easy way to present information in real-time about the status of our services to users. There’s more than one piece of commercial software that promises a unified view of all your systems, across the entire environment, with visibility into every layer of the stack. But it’s all very expensive, and doesn’t work particularly well, and requires a lot of effort to implement. In some cases, you need dedicated full-time staff just to keep everything in sync. (When you hear the sales team mention “a single pane of glass”, run.)
So instead we look to the Internet, because the largest sites out there have also tackled this problem, but they haven’t done it with commercial software. Instead, they each use a customized tool chest of open source and homegrown tools. It turns out that sending a text message alert through PagerDuty is a fairly universal need (that’s not something we have to write for ourselves), but maintaining records about storage allocations across EMC, EqualLogic, and NetApp storage systems and reconciling that with the ITS service billing system? That’s more specialized, so we developed our own solution using SQL Server and PowerShell. Susanne Branson is responsible for our storage accounting system, and you can ask her if you’d like to know more about it.
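The actual storage accounting system is built on SQL Server and PowerShell and is not shown here. Purely to illustrate what "reconciling" means, the Python sketch below compares two hypothetical mappings of volume name to gigabytes, one gathered from the storage arrays and one from billing, and reports where they disagree. The data shapes, volume names, and messages are assumptions for this example.

```python
def reconcile(allocated_gb, billed_gb):
    """Compare array allocations with billing records.

    Both arguments map volume name -> size in GB. Returns a dict of
    volume name -> description for every volume where the two disagree.
    """
    report = {}
    for volume in sorted(set(allocated_gb) | set(billed_gb)):
        actual = allocated_gb.get(volume)
        billed = billed_gb.get(volume)
        if actual is None:
            report[volume] = "billed but not found on any array"
        elif billed is None:
            report[volume] = "allocated but not billed"
        elif actual != billed:
            report[volume] = f"allocated {actual} GB, billed {billed} GB"
    return report


if __name__ == "__main__":
    arrays = {"vol_home": 500, "vol_db": 2000, "vol_scratch": 100}
    billing = {"vol_home": 500, "vol_db": 1500, "vol_old": 250}
    for volume, problem in reconcile(arrays, billing).items():
        print(f"{volume}: {problem}")
```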
Pingdom is another tool we’ve looked at, and it’s being used actively by some other groups in ITS. Like PagerDuty, it’s a hosted service with a monthly subscription fee. Pingdom can monitor your web site or your application from their servers around the world, tell you how fast it’s responding, and notify you when problems occur. Our users aren’t always on campus. Because Pingdom is monitoring from off-campus, it can recreate the user’s experience better than anything we could do ourselves. As a result, it gets closer to our goal of measuring the service the way that users experience it, instead of just measuring the back-end technology. You can monitor one system for free with Pingdom, so here we’re using it to monitor one of our monitoring servers.
Most of the open source monitoring packages we’ve investigated haven’t been very useful. Here you can see Cacti, Zabbix, Nagios, and a Nagios derivative called OpsView. They’ve been useful for other things in the past, but not terribly useful for proprietary storage systems. We needed a new approach.
Graphite (https://github.com/graphite-project) is an example of something we’re doing with open source software and locally developed scripts. Graphite offers some advantages over some of the traditional open source monitoring tools. It’s really just a data collection and graphing engine. Using Graphite, we can ingest massive amounts of data from different sources and then make graphs on the fly (literally connecting the dots). We’re just starting to use Graphite with our NetApp storage systems. The upper graphs show increased latency on one of our VMware volumes, and the lower graph shows that one VMware is generating more NFS operations than the rest. We may add our EMC and EqualLogic systems to Graphite in the future. Ask Mark Weber if you’d like to know more.
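For a sense of how little code it takes to feed Graphite, here is a minimal Python sketch that uses Carbon's plaintext protocol: one line per data point in the form `metric.path value timestamp`, sent over TCP (port 2003 by default). The host name and metric path below are placeholders, and this is not the script we run against our NetApp systems.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.edu", port=2003):
    """Send pre-formatted metric lines to a Carbon listener over TCP.

    The host is a placeholder; 2003 is Carbon's default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric name; printing instead of sending keeps this runnable offline.
    print(format_metric("storage.netapp.vol1.latency_ms", 4.2, 1369900800), end="")
```

Because metric paths are dot-separated, a consistent naming scheme (system, then volume, then measurement) makes it easy to graph many volumes side by side later.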
When you have good data about your services, and you can make the data relevant to others, it becomes a very powerful tool. It’s especially powerful for an introvert like me, because good data, when presented in the right context, can often speak for itself. And it’s a lot easier to ask for things (money, staff, equipment) when you have data explaining where your capacity has gone, and what’s driving growth. You can highlight where you’ve done well, something that too often goes unnoticed in I.T. And you can also explain where you haven’t done so well. None of us likes to fail, but the best thing you can do in that situation is be absolutely transparent: show you understand the problem, you understand the cause, and you have a plan to prevent it from happening again. Of course it doesn’t stop after you submit that request for a new server. You continue the narrative, keep your eyes on the road ahead, and continue telling the story.
In April 2011, I started writing a weekly Friday status update for the Storage Services group, which I publish to a blog on the ITS Intranet and e-mail to a dozen people who’ve asked, including most of the ITS Leadership Team. In the two years since I started, I’ve only missed a handful of weeks when I was out of the office. I include updates on our current projects, and any news about our services, good or bad. The whole thing is less than a page, even in a busy week. It’s a chance for me to highlight the work we’re doing and communicate what I think our priorities are. It’s also a great way to cap off the week; I get to review our accomplishments and collect my thoughts going into the weekend. It’s become such an important part of my process, I can’t imagine ever stopping. It’s another way I can use our understanding of our services, and the data we collect, to continue telling a story about the work we do. This talk is also an example of storytelling: I’ve shown you data demonstrating how we’re reducing alertable incidents; I’ve told you about new tools we’ve developed and are using; I’ve shown you how we’re collecting and using in-depth performance data; and I’ve told you about our weekly blog updates. So even as I’m talking to you, I’m reinforcing the messages about the services we provide.
The bottom line is this: To break out of the reactive cycle, you have to begin using data in a proactive way. Define your services first. Understand what you’re doing, why you’re doing it, and what others expect. This tells you what you need to be monitoring. Then monitor those things: not just the underlying technologies, but the high-level metrics that matter most to people. Make sure you’re meeting people’s expectations. Communicate frequently, not just when you need something. Highlight your successes, and be transparent about problems. Help people relate to the data. Review your service definitions regularly, listen to feedback, and think about whether your service definitions can be refined. When you start to do all these things, then you help other people to see through the smoke and flames, to appreciate the complexity and the work you’re doing, and in their own heads they can begin to connect the dots as well.
Feel free to send me any feedback, add me on LinkedIn, follow me on Twitter, etc. I’d love to hear your thoughts!