MD/GAO Hello, I ’m Mike Ducy… Hello, I ’m Greg Opaczewski a Tech Lead at Orbitz WorldWide. I’m a part of development team named Operations Architecture. We develop site health and performance monitoring tools, primarily for operations teams. But also tools for development teams that need to know how their applications are performing in production.
GO I ’m excited to say that two of the major technologies in the Orbitz monitoring platform are now open source software. I encourage you to check out the project sites on launchpad at the URLs listed on the screen. We welcome any feedback you might have for the projects. We will display these again at the end of the presentation as well
MD SOA/Distrubuted architectures create problems for administration, support and development teams. Instrumentation of the various applications can provide valuable insights into how they interact and can ease the administration headaches. Orbitz Worldwide (OWW) operates dozens of applications running on hundreds of servers connected in a multi-layered Jini network. In this kind of environment it can be difficult to obtain consistent, uniform instrumentation at all application process boundaries. It is also not ideal to require each and every application development team to become experts at leveraging an instrumentation API and monitoring tools.
GO The Orbitz technology platform is very large and the business has grown as well. Orbitz WorldWide operates websites around the world in over a dozen locales. These point of sale and internationalization variables can make service operations even more challenging due to the number of key metrics that must be monitored. In 2007 the sales of over $10 Billion in travel products were dependent on the health of our technology platform. Therefore, OWW has made substantial investments in technology to detect problems early and minimize mean-time-to-repair.
MD It can be difficult for a technology organization to commit to provide the level of application instrumentation required to effectively monitor availability, reliability and performance. In this presentation we will examine several myths and explain how our technology overcame them. We ’re huge fans of the Mythbusters show on the Discovery channel, hopefully we have some fans in the audience as well.
GO Some believe that it has to be a time consuming process to apply monitoring code. In many approaches , every method call has to be wrapped with instrumentation code. Additionally, standards need to be defined for how the instrumentation will be applied consistently across a system. This obviously requires additional effort on the part of the development teams as well as technical leaders responsible for ensuring the standards are being followed. At Orbitz we ’ve observed that instrumentation (and monitoring concerns in general) are often the last concerns of developers. Naturally a majority or all of the development cycle producing code for new features.
We addressed this need to make the process of applying instrumentation simple by creating the Extremely Reusable Monitoring API (ERMA). ERMA consists of an API used for instrumenting Java applications and a library used to process the data produced by the instrumentation. This separation of concerns makes it easy for developers to apply the instrumentation without needing to be concerned with the details of how the data will be consumed.
GO Monitor objects in ERMA are Plain Old Java Objects (POJOs). To instrument a transaction, you construct a TransactionMonitor. Upon construction a stopwatch is started, that is used to measure latency. The code to be monitored is surrounded with try/catch/finally blocks. If the business code executes without exception, succeeded is invoked on the TM. However, if an exception is caught, it is recorded in the failedDueTo method. In the finally block, done is invoked in order to stop the stopwatch and pass the Monitor to the MonitoringEngine for processing. This is the handoff point to the processors implemented in the ERMA library.
GO So what I ’ve shown you on the previous slide is ERMA applied explicitly, wrapped around the business logic by a developer. The API is simple enough to use on its own. But we wanted to make the application of monitoring even easier. So we have implemented several techniques of self-instrumentation in order to achieve monitoring of the business objects with a minimal amount of effort.
GO We use the Spring Framework throughout our system. Spring is a popular open-source framework in the Java development community. Spring MVC and Spring Web Flow (SWF) are used in the web application architecture. Both of these frameworks provide hooks that can be used for monitoring. For example, there is a HandlerInterceptor interface in Spring MVC that we ’ve implemented and configured such that each and every web request is intercepted. ERMA is applied to these requests in a consistent and reusable manner. Spring webflow acts as the controller – it allows you to define flows between components such as actions and views in a webapp. WebFlow provides the FlowExecutionListener. By implementing this listener interface, we provide detailed metrics on how users are interacting with these flows in production.
GO Abstraction is another important technique we have used. Orbitz applications are networked together using Jini technology. Jini provides for dynamic service discovery and remote invocation. In a service oriented architecture, applications need a way to find out where the services they depend on are running. Jini provides this as well the ability to add and remove services from the network seamlessly. We created the Orbitz Jini Framework (OJF) in order to abstract away the details of our Jini service network from end developers. The abstraction layer contains a FilterChain facility that we have leveraged for monitoring. ERMA filters are executed both on the client and server side for each and every request. Because OJF is a shared library used consistently across our system, all developers get monitoring of remote method calls for free.
GO In the absence of hooks in the form of APIs that can be leveraged for monitoring, Aspect Oriented Programming (AOP) is another good option for providing reusable monitoring code. Spring provides integration with the popular AspectJ AOP framework. We have implemented an ERMA aspect that applies monitoring to all Action component invocations with just a few lines of reusable XML configuration. Spring creates a dynamic proxy for each Action object once at startup, and overhead at runtime is minimal as just one extra method invocation through the proxy is involved.
GO As a result of these techniques for applying reusable instrumentation, a developer at Orbitz needs to spend almost no time at all to get basic monitoring coverage. The frameworks that we use were instrumented by a small group of platform developers, many other development teams benefit without the need to spend any additional development time. So have we have BUSTED this myth of no time for instrumentation
MD From a standard ROI perspective, instrumentation does not provide real dollars back for the money invested in it ’s development. The value provided is often in reduced downtime, better understanding of code performance, better understanding of code dependencies and interactions of systems, opportunities to increase application performance and enhance the customer experience. While from a long term perspective these enhancements can provide increased revenue, it is not as immediate as implementing something like a new feature with has a more immediate ROI.
MD The ERMA Instrumented applications sends monitoring data back to the Event Processor engine. The data is sent by a background thread in the ERMA instrumented application which prevents latency from being introduced for the other incoming remote service calls. Since Event Processing is done outside of the instrumented application, this helps to reduce the introduction of latency in the instrumented application. The event processor aggregates and summarizes the various metrics, computing summary statistics (Average, Standard Deviation, % Fail, % Success, etc), and sends these metrics over to Graphite for storage and visualization. The event processor is also capable of sending SNMP alarms when Aggregated data points exceed certain thresholds (e.g. latency is high, or rate of failures is high).
MD Graphite consists of several components. The 2 primary components are Carbon and the Web Application. The Carbon component is responsible for reading data into the system and storing it in fixed size database files (similar to RRD files). The web application then reads these files to graphically represent the data for the end user.
MD The Graphite composer interface allows you to browse various metrics available for reporting in a hierarchical tree. When a metric is selected a graph of that metric ’s data is drawn in the composer interface. The user can manipulate the graph by selecting size, duration of the data to be graphed, as well as other elements.
MD The Graphite Command Line Interface allows a user to draw graphs in individual windows. These windows can be arranged and sized within the browser window. The window layout can also be saved which allows a user to create “dashboard” of commonly used graphs.
MD Value is not in the instrumentation itself, but in the data that the instrumentation provides. Gartner estimates that on average an hour of downtime can cost an organization $42,000 per hour. Instrumentation data can help reduce the length of outages by making it easier for Operators to locate the problem (via SNMP alarms), and through the tools used to visualize the data.
MD Instrumentation provides a Return On Investment by maximizing the ROI of the applications that are monitored.
GO Another myth that we ’d like to address is the belief that instrumentation only causes bugs. Boiler plate code often used to apply instrumentation makes code harder to read and maintain. This has a direct effect on developer productivity. It also gives developers an argument to not add the instrumentation at all. We use several techniques that allow our developers to avoid the need to write boilerplate code. We provide reusable, well tested instrumentation packaged in libraries and applied via hooks, abstraction and AOP as described previously.
GO Another good option to avoid boilerplate code with ERMA is Annotations. This feature, supported with Java 5 and above, applies instrumentation at build time and requires no ERMA code to be mixed with business code. The example shown here will apply an ERMA TransactionMonitor to each method in this service.
GO This example will wrap a TransactionMonitor around only the purchase method. Setting includeArguments to true will include method parameters in the monitor object as an attribute. The nice thing about this approach is how cleanly separated the business code is from the monitoring, it is simply declarative monitoring versus intrusive instrumentation
MD Our use of ERMA has introduced very few bugs. In fact, far more bugs have been uncovered using the ERMA data.
MD Instrumentation data allows to base line your current application performance against historical data. You can also use instrumentation data to build theoretical models to help verify that testing tools are correctly measuring application performance.
MD Historical instrumentation data can be used to build models based on Multivariate Adaptive Statistical Filtering and Statistical Process Control. This allows you to determine if your current application is performing within historical bounds and if something has changed.
GO For a large system there is a need to provide an abstraction for monitoring so that developers, operators and business analysts can all share the same language for describing system functionality. Example abstractions from our domain are “hotel search”, “air purchase”, “package selection”, etc. ERMA has some unique design features that enable detailed monitoring put in the context of these abstractions. ERMA assembles hierarchies of events transparently within its MonitoringEngine component. It does this by maintaining a stack of Monitors for each application thread. Whenever a new monitor is created during request processing, a parent-child relationship is introduced with the Monitor previously on top of the stack. At the completion of request processing, the result is a tree data structure can be analyzed to find event patterns. Our Jini framework passes monitoring data back and forth, allowing these event patterns to even span the boundaries of all applications involved in servicing a user request.. As a result, we can accelerate root cause analysis by delivering alarms to our operations teams that contain both the low level root cause of a problem and the impact to our customer.
GO What you are looking at here is an example of an ERMA event pattern captured from an air search request in our system. We present these patterns to the operator in such a way that is obvious where an exception originated and how it bubbled up through the stack. Yellow represents any monitor that has recorded a failure and red represents the lowest level failure. We use this information to zero in on the application and component that is contributing most significantly to a site issue. An e.g. alarm that may be sent to our operations center based on this data would read “Air search is failing at 80% due to a maket application DbPoolExhaustedException.” So it is very clear as to the top-level impact (air search is failing) and points to the underlying issue (likely that there are no available database connections). Before we implemented this approach an alarm would be generated for every failure in this pattern and our operators would be left trying to figure out the bigger picture. The improved alarms help ensure proper development resources are engaged quickly when support teams are troubleshooting production issues. They also help to prioritize action on alarm conditions by making clear the impact to our customers. ERMA patterns also enable you to drill down into latency metrics in order to see which components are contributing the most to latency. We are working on a user interface that will make it easier to visualize this kind of data. For example, by generating dynamic UML sequence diagrams based on the runtime behavior of the system.
GO So in our experience many more bugs are uncovered using the data produced by instrumentation than are caused by it. Bugs that would otherwise be difficult or impossible to diagnose without the instrumentation. So the myth that instrumentation only causes bugs is busted.
MD ??? GO Pragmatic in the design of our monitoring platform / tools. We have acknowledged developers are focused on implementing new features and site improvements. So monitoring of all core metrics is already in place. These include JVM and machine-level statistics such as: cpu, memory and threads. Resource pools such as database connections, also network connections to external suppliers. Frameworks contain monitoring of our business services and detailed monitoring of every request into the web application. Lastly, we have invested in tools that allow us to get tremendous value out of the instrumentation. These tools translate detailed metric data into an improved customer experience.
GAO I want to thank CMG for allowing us to share our story with everyone here. I too want to thank Neil Gunther again for all of his help in putting the paper and this presentation together
MD Both ERMA and Graphite have been open sourced by Orbitz Worldwide and the teams welcome your feedback and contributions.