Hello. Today we're going to talk about the effects of three popular technologies on root cause analysis. RCA is about ferreting out complexity, which IT always has to deal with.
The quote shows that IT has faced complexity for a long time. Once we fix the problems of the past, we get new problems. New technologies bring new complexity. Today, we look at the effect of three technologies on problem solving in IT: virtualization, multi-tier applications, and cloud computing. *** I like the quote here, not so much for what it says, but for when it was said: it was published over 20 years ago, in 1985. IT has always had problems dealing with complexity, and it always will. New technologies have given IT management equal portions of simplicity and difficulty.
In this context, today I'm going to speak to one area of responsibility in IT: root cause analysis, or "RCA." Root cause analysis is just a fancy, academically well-documented way of saying "figuring out what's going wrong and starting to figure out how to fix it." In complex systems there are so many moving parts and sub-systems that any given piece of software can rarely tell you what's going wrong with it.
As complexity grows, your only way to battle it is discipline and tools, and that's exactly what RCA for IT is: a cyclical approach to finding problems, digging down to their causes, kicking off a fix, and then improving that whole process for next time. Most of us spend our time doing the first few, but we're terrible at giving ourselves time to do the last.
To pick three new technologies: virtualization, multi-tiered applications, and the emergence of cloud computing are all solving problems of the past, but at the same time creating new ones. That's the issue we constantly battle in IT: once we fix today's problems, we have just enough time to start creating new ones. The IT department can't avoid this. If it were a problem-free task, there wouldn't be a whole department for it, and businesses would be lacking one important tool for revenue generation and innovation: IT.
Virtualization, for all its wonders, can create more and newer problems if it's not properly managed and cared for. Consolidating physical servers is fantastic, but it introduces two problems: (1.) More to manage - the virtualization layer and stack itself can introduce new technologies and relationships to manage, interface with, and care for. Here, you need visibility into virtual servers, clusters, and, most importantly, networks. Additionally, virtualization raises the importance of network and storage management, so you must focus on those elements more as well. (2.) Tracking transients - as the use of virtualization matures, IT can expect to see a fair number of transient servers and shifting resource use. This more dynamic environment requires more tracking of the configuration and topology across virtual and physical resources. Here, the basic RCA question of "what changed?" becomes more common and important than ever.
For RCA, the common virtualization trouble-makers are disk space, CPU, memory, and networking: usually running out of resources or bad configurations in the virtualization layer. Some common examples along these lines are:
* Correlating virtual instances to physical problems - bad physical storage causes bad virtual storage, physical network latency, or just hardware on the fritz.
* Configuration drift - occurs more frequently, especially as self-service scenarios arise. There's much more provisioning happening, and it's not yet perfected, so it's easy for the sanctioned configuration to stray, causing problems. Also, simply rebooting a VM might cause the configuration to drift away from a good state.
* Network configuration - DNS problems often crop up with virtual networks, and even with virtual machines on physical networks.
* VM access management - VMs not being configured with the right access to other systems.
* Capacity management - moving an application to a VM that doesn't fit quite right.
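To make the "what changed?" question concrete for configuration drift, here is a minimal sketch in Python. It assumes you can capture a VM's settings as a flat dictionary; the function name `config_drift` and the setting names and values are hypothetical, made up for illustration, and not taken from any particular virtualization product.

```python
from typing import Any


def config_drift(baseline: dict[str, Any], current: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (baseline_value, current_value)} for every setting that differs.

    A setting missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(baseline.keys() | current.keys()):
        old, new = baseline.get(key), current.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift


# Hypothetical VM settings; the keys and values are illustrative only.
sanctioned = {"vcpus": 2, "memory_mb": 4096, "dns": "10.0.0.2", "vlan": 12}
observed = {"vcpus": 2, "memory_mb": 2048, "dns": "10.0.0.9", "vlan": 12, "snapshot": "tmp"}

for setting, (old, new) in config_drift(sanctioned, observed).items():
    print(f"{setting}: {old!r} -> {new!r}")
```

Run against a stored "sanctioned" snapshot after each provisioning step or reboot, a comparison like this turns drift into a short list of suspects instead of a hunch.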
Multi-tiered applications are a popular delivery method that gives IT the ability to deliver a new class of networked, flexible applications that try to avoid the monolithic application problems of the past. With different application delivery methods, of course, come different management challenges. Namely, by breaking up applications, there are more parts to deal with and, as is typical for RCA problems, more connections and relationships to manage.
The Russian-doll nature of multi-tier applications makes root cause analysis seem custom-built for diagnosing their problems. This layered structure also lends itself to RCA tools that can walk those relationships, often pulling multiple incidents in a reverse tree down into one underlying problem.
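As a rough illustration of walking those relationships, the sketch below starts from several incidents and narrows them to the unhealthy components they all depend on. Everything in it is an assumption for the sake of the example: the `DEPENDS_ON` map, the component names, and the idea that you already have a list of components flagged as unhealthy. Real RCA tools are considerably more sophisticated.

```python
# Hypothetical dependency map: each item lists what it directly depends on.
DEPENDS_ON = {
    "web-ui": ["app-server"],
    "reports": ["app-server"],
    "app-server": ["database", "cache"],
    "database": ["san-storage"],
    "cache": [],
    "san-storage": [],
}


def upstream(item, depends_on):
    """Return the item plus everything it depends on, directly or indirectly."""
    seen, stack = set(), [item]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def candidate_root_causes(incidents, unhealthy, depends_on):
    """Unhealthy items that every incident depends on, lowest layer first."""
    if not incidents:
        return []
    shared = set.intersection(*(upstream(i, depends_on) for i in incidents))
    suspects = shared & set(unhealthy)
    # An item with fewer upstream dependencies sits lower in the stack,
    # so sorting ascending puts the deepest layer first.
    return sorted(suspects, key=lambda s: (len(upstream(s, depends_on)), s))


suspects = candidate_root_causes(
    incidents=["web-ui", "reports"],
    unhealthy={"app-server", "database", "san-storage"},
    depends_on=DEPENDS_ON,
)
print(suspects)  # ['san-storage', 'database', 'app-server']
```

Two separate incidents collapse into one ordered list of suspects, with the lowest layer (here the hypothetical storage) first, which is usually where you want to start digging.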
Issues in multi-tier applications are usually cascading issues from the layers below, typically due to resources running out or network connections failing:
* The blizzard of events means you need lots of filtering and help narrowing down when you do RCA.
* Memory leaks in one tier can ripple up to the top layers, causing cascading problems that call out for RCA.
* The network connection between the middle and database layers can fail because of authentication issues or, as happened to me once, changes in Windows that restrict the number of open TCP/IP connections. Bad DNS configurations often cause communication breakdowns between layers.
* Storage issues, with network mounts or shared disks, have been the root of countless problems as well.
* Cache updates, introduced in the hope of fixing performance problems, are a common cause of problems themselves.
Over recent years, cloud computing has grown from a speculative offering to a technology set to change how IT services are delivered both behind and beyond the firewall. This area is still new enough that you see lots of adjectives prefixing the word "cloud," namely "public" and "private."
"Public cloud" is used to describe compute power and storage available over the Internet, managed as generic server nodes that can scale up and down as needed. Rather than owning the hardware, consumers typically are metered for usage.
"Private cloud" is currently the looser of the two terms. It can mean using public cloud technology behind the firewall, but also a methodology for delivering IT services using automation, self-service, and virtualization that bumps up against the longer-term trend of the consumerization of IT.
All of the concerns of virtualization and multi-tier applications come into play here, mixed with a potentially even more opaque view into the innards of the stack. Add in self-service options, and you're essentially giving less savvy end-users a loaded gun.
With something as new as cloud computing, the keys for RCA have much to do with setting expectations with the rest of the business unit and having the right kind of tools to diagnose problems. Some issues to plan for and questions to ask are:
* How do you access remote data and data under someone else's control? Can you instrument the cloud as you want, or do you need to use new methods?
* Because cloud computing relies on virtualization, many of the same problems exist: configuration drift, physical resource degradation, DNS issues, and so on.
* What's the policy for escalating a problem? Once escalated, what's the process for collaborating with the cloud provider?
* Look for "status pages" from service providers.
With cross-firewall clouds, network visibility and tools clearly become critical for doing RCA. And in such a chaotic environment, maintaining the discipline of RCA - keeping your cool - is key.
The fundamentals of RCA become important as complexity increases:
* Relationship Tracking - Your process and tools must continually maintain the relationships between the various resources in IT (or "Configuration Items," to use ITIL-speak). For example, multi-tiered applications depend heavily on the separation of database, middleware, and UI layers, connecting each layer over networks or APIs. A problem at one of those layers may manifest itself at another layer, or there could be simple but damaging configuration issues that prevent communication between layers.
* Event & Change Tracking - In addition to having an accurate model of IT and its current configuration, RCA often depends on being able to answer the question "what changed?" Nowadays, this question comes up more than ever. For example, in virtualized and cloud-driven worlds, the list of answers to that question can seem endless. Virtualization and cloud methodologies encourage frequent changes and even transient resources. Your tools and process need to support this rapid updating of your IT model and configuration; you have to keep pace with how fast modern IT moves.
* Ticket Tracking - When it comes to the process of RCA, you're best served by picking an official system of record and notifications. IT's coordination when doing RCA should be centralized in one place, like a ticket system. Additionally, when incidents occur, there should be one notification source that IT staff must respond to; otherwise, many different systems can send out alarms, leaving IT staff to judge which to respond to. At the least, a basic ticket tracking system is needed to give staff a view into what co-workers are doing, to help reduce duplicated effort. For example, a cloud-hosted promotional web site could be failing to integrate with SalesForce, leading multiple inside sales reps to report incidents and kicking off multiple RCA processes around the same problem.
* Business Impact - Once RCA finds the underlying problem and the fix is discovered, how will the fix's configuration changes affect not only the original problem, but other parts of IT? It's often too easy to solve one problem but at the same time introduce new ones. CMDBs that properly track configuration and relationships come into play here to validate a change before it's applied and afterwards.
* How much does RCA cost? How much does RCA save? How much does RCA make? Is RCA justified?
* Tools to support RCA and BSM.
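The duplicated-effort problem under Ticket Tracking can be sketched in a few lines. This is only an illustration under simple assumptions: each incident report is a `(timestamp, service, reporter)` tuple, and reports about the same service that arrive within 30 minutes of each other belong to the same underlying problem. The function name, the window, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def group_incidents(incidents, window=timedelta(minutes=30)):
    """Group (timestamp, service, reporter) reports into problems.

    A report joins an existing problem when it names the same service and
    arrives within `window` of that problem's most recent report.
    """
    problems = []  # each problem: {"service", "last_seen", "reports"}
    for when, service, reporter in sorted(incidents):
        for problem in problems:
            if problem["service"] == service and when - problem["last_seen"] <= window:
                problem["reports"].append(reporter)
                problem["last_seen"] = when
                break
        else:
            # No open problem matched, so this report starts a new one.
            problems.append({"service": service, "last_seen": when, "reports": [reporter]})
    return problems


# Hypothetical reports: three sales reps hit the same failing promo site.
start = datetime(2010, 3, 1, 9, 0)
reports = [
    (start, "promo-site", "rep-a"),
    (start + timedelta(minutes=4), "promo-site", "rep-b"),
    (start + timedelta(minutes=6), "email", "rep-c"),
    (start + timedelta(minutes=11), "promo-site", "rep-d"),
]

for problem in group_incidents(reports):
    print(problem["service"], problem["reports"])
```

With the sample data, the three promo-site reports land on one problem and the email report on another, so one RCA process gets kicked off for the promo site rather than three.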
How can you track and validate your success and improvement at RCA? Here are some metrics that may help:
* Mean Time to Isolate - tracking how long it takes to go from the initial signs of failure to the underlying problem. At first, you'll want to create a baseline to know where you stand. Then, as time goes on, you want to see the Mean Time to Isolate shrinking.
* SLAs - IT is often accountable to a business through SLAs. The business side may not care about SLAs except on a quarterly basis, but IT must keep track of its own performance to make sure the SLA expectations are properly managed and that time and money can be budgeted to meet SLAs.
* Costs - how much does RCA for a problem typically cost? What business impact did the problem have? How much did the fix cost? And so on.
* Revenue generation - how much money did finding and solving the problem save or make for the business?
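Mean Time to Isolate is simple enough to compute yourself if your ticket system records two timestamps per problem. The sketch below assumes each record is a `(first_sign_at, isolated_at)` pair; the function name and the sample timestamps are hypothetical and only there to show the arithmetic.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_isolate(records):
    """Average of (isolated_at - first_sign_at) over a non-empty list of problems."""
    durations = [(isolated - first_sign).total_seconds() for first_sign, isolated in records]
    return timedelta(seconds=mean(durations))


# Hypothetical baseline quarter and a later quarter to compare against.
baseline = [
    (datetime(2010, 1, 5, 9, 0), datetime(2010, 1, 5, 11, 0)),    # 2 hours
    (datetime(2010, 2, 9, 14, 0), datetime(2010, 2, 9, 15, 0)),   # 1 hour
]
later = [
    (datetime(2010, 4, 6, 10, 0), datetime(2010, 4, 6, 10, 45)),  # 45 minutes
    (datetime(2010, 5, 3, 16, 0), datetime(2010, 5, 3, 16, 15)),  # 15 minutes
]

print("baseline:", mean_time_to_isolate(baseline))
print("later:   ", mean_time_to_isolate(later))
```

The baseline number on its own tells you little; what matters is tracking the same calculation quarter over quarter and watching it shrink.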
These and other metrics are used for internal tracking and continuous improvement, but they're also used for the more vital function of justifying IT's existence to the rest of the business. A problem in IT may seem disastrous, but if you can tell the business how solving it helps their bottom line, you can have a different discussion than "why are things always breaking?"
Finally, let's bring this back to reality. The conversation I get into with many people in IT is that they're too busy putting out fires to worry about all this high-minded, business-service hoopla. How are they supposed to do the right thing when they barely have enough time to do the wrong thing?
It starts with a metaphor battle. Actual fire-fighters, the ones putting out real fires, don't spend all their time spraying water on burning buildings. There's plenty of training, hose coiling, and scouting out of fire hazards that they do. They spend much of their time preventing fires and preparing for the best way to solve problems when they occur. More importantly, through studying past fires and passing regulations and laws, the government tries to prevent fires. It's continuous improvement.
You can only get to that metaphoric point if the process you use for fighting fires, RCA, gives you the raw data you need afterwards to justify why IT should get time for hose coiling and fire-hazard hunting and prevention. The aspirational goal of IT has always been to be proactive instead of reactive. The only way to get enough breathing room to be proactive is to prove to the business and higher level management why all this time spent away from fire-fighting will prevent fires, namely, stacks of the business' cash going up in a blaze.
A little bit of help from tools goes a long way here, but you need a process and a culture to match what the tools and new technologies provide.
Root Cause Analysis - Updated practices for new problems
Michael Coté / cote@RedMonk.com
a persistent problem
“We also know that the increasing use of larger and more complex systems potentially results in a greater number of problems. Furthermore, many of these problems have a more serious impact on the business and on the use of systems than ever before, and they are also in many cases much more difficult to solve.”
--A Management System for the Information Business
same RCA, new technologies
• Cloud Computing
• More to manage -
virtualization layer, virtual
networks, virtual storage,
• Tracking transients -
removing the constraints
of physical reality
• Correlating virtual instances to physical problems
• Configuration drift
• Network configuration
• VM access management
• Capacity management
• Dividing monolithic
• More parts, more
connections to break
• Inter-dependencies cause
• Cascading problems well
suited for RCA
common multi-tier problems
• Blizzard of logs, events, and messages
• Memory leaks and resource limit cascade
• Network connection failure
• Storage issues - network drives and shared disks
• Cache updates and gardening
• Public vs. private cloud
• Virtualization problems &
• Managing cloud provider
common cloud problems
• Remote access & instrumentation
• Configuration drift, resource degradation, DNS issues
• Provider escalation policy & process
• Start with “status pages”
Root Cause Analysis in
• Relationship tracking
• Event & change tracking
• Tickets & system of record
• Business impact
• Mean time to isolate
• Costs - how much does RCA cost,
• Revenue generation