Enterprise applications in the cloud - are providers ready?



Enterprise applications in the Cloud: Are service providers ready?

Leonid Grinshpan, Oracle Corporation (www.oracle.com)

Subject

Managing the performance of enterprise applications is hard. Managing and optimizing the performance of enterprise applications on shared virtualized infrastructure (i.e. cloud computing) is even harder. This article outlines the specifics of capacity planning and performance management of EAs deployed in the cloud.

Forrester research [www.teamquest.com/pdfs/.../forrester-key-cloud-virtual-computing.pdf] indicates that 98% of interviewed executives in North America and Europe believe that the main challenges of virtualized and cloud environments have their roots in capacity and performance management.
Those findings are directly applicable to enterprise applications (EAs), which feature high complexity, multiplatform deployments, and strict requirements for service quality. As an Oracle consultant, the author has witnessed numerous confirmations of Forrester's findings while working with a diversity of customers on sizing and tuning Oracle's EAs.

The following is a real-life story. A large international bank deployed in a private cloud a financial EA featuring spikes in user workload. The spikes occurred during each financial reporting period because of the high rate of financial consolidations. The IT department did not have the tools to measure transaction times and was not aware that consolidations were unacceptably long during workload peaks. Utilization of hardware resources, monitored by IT around the clock, did not exceed 70% on any of the servers, and IT was under the impression that the EA worked as expected. That impression evaporated immediately after the first complaints from users were logged. IT reexamined all collected hardware performance counters, did not find any indication of hardware resource shortage, and decided to turn to EA experts.

Analysis of the consolidation transaction indicated that most of the time it was running on the OLAP (on-line analytical processing) database. In order to be in an active state, it has to acquire database connections. Monitoring of the EA under peak load found a shortage in the number of database connections, which limited the number of consolidations the OLAP server was able to process concurrently. The finding clarified that the unacceptable increase in consolidation time was due to long waits for available database connections. Increasing the number of database connections noticeably improved consolidation time, but produced an unwanted effect: the database server was running at almost 100% of its total CPU capacity. Fortunately, IT was proficient in detecting and treating such a malaise. By raising the number of CPUs on the database virtual machine from 24 to 32, IT delivered the expected consolidation time and brought CPU utilization back to a normal level. Unfortunately, because of the over-commitment of CPUs, other applications in the cloud started to experience performance issues, but that is the beginning of another real-life story. The IT department learned a few lessons: monitor transaction time, become skilled at detecting and fixing software bottlenecks, and forecast the consequences of changes aimed at performance improvement.
Failure is not an option when launching an EA into the Cloud; failure equates to a disruption of a company's operations with all accompanying fiscal and public relations consequences. EAs are critical vehicles carrying out day-to-day business functions; they must perform as expected at any instant of the production cycle and efficiently process workloads that fluctuate within broad limits.

EAs can be implemented in different ways and target diverse business tasks. In this article we consider EAs deployed inside corporations for use by their employees; this means that they are not retail or customer-facing applications. We also define an EA as a complex, unified object consisting of hardware infrastructure, business-oriented software, and operating systems.

A nascent trend in EA deployment is relocation to the Cloud. The beauty of cloud computing from an IT perspective is that along with the EA itself, the headache of EA management also migrates from the company's IT department to a cloud provider. The latter becomes responsible for meeting the Service Level Agreement (SLA), a task significantly complicated by the great expectations of cloud customers.

Cloud providers face two major challenges: allocation of appropriate resources to an EA (capacity planning) and maintaining an acceptable SLA (performance management). What significantly complicates both tasks is that any cloud inherently represents a collection of resources shared among a number of applications, unlike the traditional IT environment with servers and appliances mostly dedicated to applications. The cloud's shared resources are under dynamically changing demands from a variety of users working with diverse EAs that have only one characteristic in common: mission-critical importance for their corporations.

This article outlines the specifics of capacity planning and performance management of EAs deployed in the Cloud. We use queuing models of EAs to emulate and analyze performance-related events on the cloud's shared platforms. The methodological foundation for this study can be found in the author's book [Leonid Grinshpan. Solving Enterprise Application Performance Puzzles: Queuing Models to the Rescue, Wiley-IEEE Press; available in bookstores and from Web booksellers from January 2012].
The article affirms that in order to ensure successful EA hosting, a cloud provider (in addition to perfect execution of traditional system management duties) has to be capable of efficiently carrying out:

- Monitoring and characterizing of EA workload.
- Proactive evaluation of EA capacity as well as compliance of transaction times with SLA.
- Monitoring of business transactions.
- Identification and fixing of software bottlenecks.

EA workload characterization

EA workload characterization includes three components:

- List of business transactions.
- Number of executions of each transaction per user per hour (transaction rate).
- Number of users requesting each transaction.

Only one of the three components is not prone to frequent fluctuations: the list of application transactions. It reflects application functionality, which tends to change slowly, usually by small additions/deductions with new software releases. The two other components are inclined to be highly volatile. They normally feature daily, weekly, monthly, and yearly fluctuations, usually exhibiting repeatable patterns.

Any cloud hosting several EAs has to service a number of diversified dynamic workloads and manage to process them according to SLAs. This is possible only if the cloud provider is equipped with tools to monitor all three components of workload characterization and can collect and correctly interpret the workload data necessary for cloud capacity planning.

Planning for capacity

A cloud with permanently changing workloads can deliver expected services only by systematically implementing capacity planning. The highest degree of accuracy in cloud capacity planning can be achieved by using queuing network models of EAs. Queuing models are capable of factoring in cloud architecture, processing times on different servers, and the parameters of hardware as well as user workload. Models also can
assess the effects and limitations of software parameters like the number of threads, connections to system resources, etc. Queuing models take into account the fundamental behavior of any system servicing users: user requests wait in queues whenever the speed of a service is slower than the rate of incoming requests. Wait time in any queue contributes to transaction time. The ability of queuing models to assess it for different workloads and system architectures enables the cloud provider to estimate needed capacity as well as compliance with SLA.

A cloud infrastructure has to have the unique ability to quickly reallocate system resources as workload changes. We demonstrate how queuing models help find the capacity needed to satisfy particular workloads, as well as how to predict usage cost for the cloud's customers.

Let's consider an EA's queuing model in Figure 1. It represents a classical three-tiered EA with Web, Application, and Database servers. Each server corresponds to a model node with the number of processing units equal to the number of CPUs in the server. The users and the network are modeled by dedicated nodes. We assume that all servers (whether physical or virtual) are allocated by the cloud provider to our EA and each one has 8 CPUs.

Figure 1 Model 1 of a three-tiered enterprise application
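The models in the article were analyzed with a commercial solver; as a rough illustration of how such solvers work, here is a minimal Mean Value Analysis (MVA) sketch for a closed queuing network. It simplifies each tier to a single FIFO queue (the article's model gives each tier 8 CPUs), and the per-tier service demands are illustrative assumptions; the 360 s think time follows from ten transactions per user per hour.

```python
def mva(demands, think_time, n_users):
    """Exact Mean Value Analysis for a closed queuing network with
    single-queue stations (a simplification of the article's multi-CPU
    tiers).

    demands:    service demand (sec) per transaction at each tier
    think_time: average user think time between transactions (sec)
    n_users:    number of concurrent users (closed population)
    Returns (response_time_sec, throughput_trans_per_sec).
    """
    q = [0.0] * len(demands)  # mean queue length at each station
    r = x = 0.0
    for n in range(1, n_users + 1):
        # Residence time at each station: own service plus queueing
        # behind the customers already there (arrival theorem).
        res = [d * (1 + qk) for d, qk in zip(demands, q)]
        r = sum(res)                 # total transaction response time
        x = n / (r + think_time)     # system throughput (Little's law)
        q = [x * rk for rk in res]   # updated mean queue lengths
    return r, x

# Hypothetical demands (sec) at the Web, Application, and Database tiers.
for users in (1, 100, 200, 300, 400):
    r, x = mva([0.2, 0.4, 1.2], 360.0, users)
    print(f"{users:4d} users: response time {r:7.2f} s, throughput {x:.3f} trans/s")
```

With these assumed numbers the sketch reproduces the qualitative behavior of Figure 2: response time stays low until the bottleneck tier saturates around 300 users, then degrades steeply while throughput levels off near 1/1.2 ≈ 0.83 trans/s.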
The workload for Model 1 is presented in Table 1. For simplicity it has only one transaction, named "Interactive transaction"; each user initiates an interactive transaction ten times per hour. We have analyzed the model for 1, 100, 200, 300, and 400 users.

Table 1 Workload for Model 1

  Transaction name          Number of users          Transaction executions per user per hour
  Interactive transaction   1, 100, 200, 300, 400    10

The models in this article were analyzed using the TeamQuest solver [http://teamquest.com/products/model/index.htm]. Model 1 predicts exponential response time degradation starting from 300 users (Figure 2). It also estimates that for up to 250 users transaction time will stay under the 10 sec required by SLA (the vertical line on the chart).

Figure 2 Transaction response time and system throughput

System throughput (measured in the number of transactions per second or per hour) grows linearly until it reaches a breaking point at 300 users, where its growth slows down (point A on the chart). At point A system throughput is 0.8 trans/sec * 3600 sec = 2880 trans/hour (for convenient representation on the chart we have scaled the system throughput line 100 times). At point B (where transaction time is still in line with SLA), system throughput is 0.65 trans/sec * 3600 sec = 2340 trans/hour.

System throughput is a must-have parameter for cost estimation when the cloud's price policy requires customers to pay per transaction. Usage cost is calculated per the formula:

  Application Usage Cost = Cost of one transaction * System throughput

Per Figure 2, throughput increases as the number of users grows. As our system can support no more than 250 users without SLA violation, we consider the throughput supported by the system for 250 users as the SLA-compliant maximum throughput. The cloud provider, in order to receive the highest revenue, has to monitor workload and dynamically allocate to the EA a volume of resources that keeps the system at the SLA-compliant maximum throughput level. In Model 1 at that level the Database server is efficiently utilized (Figure 3), the transaction time is in line with SLA, and the cloud provider has the highest return on investment. If the workload is fewer than 250 users, the Database server is underutilized and the cloud provider can reallocate it to another EA.

Figure 3 Utilization of system servers
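Under the pay-per-transaction policy above, the revenue calculation is a one-liner. This sketch plugs in the article's point-B throughput; the $0.01 per-transaction price is an invented placeholder, not a figure from the article.

```python
def application_usage_cost(cost_per_transaction: float,
                           throughput_per_hour: float) -> float:
    """Hourly usage cost per the article's formula:
    Application Usage Cost = Cost of one transaction * System throughput."""
    return cost_per_transaction * throughput_per_hour

# Point B (250 users): 0.65 trans/sec is still SLA-compliant.
sla_max_throughput = 0.65 * 3600   # 2340 trans/hour
cost = application_usage_cost(0.01, sla_max_throughput)
print(f"{sla_max_throughput:.0f} trans/hour -> ${cost:.2f}/hour")
# -> 2340 trans/hour -> $23.40/hour
```

Running at point A (2880 trans/hour) would nominally bill more, but with the SLA violated; that is why the article pegs revenue to the SLA-compliant maximum throughput.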
Price policy can be based not only on system throughput but also on hardware utilization or on the specifications of allocated hardware (number of CPUs, their speed, memory size, etc). No matter the price policy, queuing models provide data for scientific estimates of the provider's return on investment as well as the customer's cost.

Monitoring business transactions

The most important EA performance indicator is transaction response time. It is specified in the SLA and seemingly should be under IT monitoring and control 24/7. Ironically, that is not the predominant case; many IT departments instead are obsessed with hardware capacity monitoring and do not even have appropriate instruments and policies for business transaction monitoring. In a relentless pursuit of IT optimization they strive to get the most out of hardware servers and appliances, often compromising transaction time. Figures 2 and 3 show how such a policy can jeopardize EA performance: after exceeding 250 users, hardware utilization and system throughput keep going up, but the price we pay is the exponential degradation of transaction time.

Software bottlenecks

Transferring an EA into the Cloud puts cloud providers in charge of detecting and troubleshooting all kinds of performance bottlenecks. EAs may suffer from one of two distinct groups of bottlenecks: hardware and software. The first group is well familiar to any IT department and cloud provider; the remediation prescriptions can be found in textbooks on performance. Usually when a hardware bottleneck is identified, it can be fixed by either vertical or horizontal scaling.

Things are much trickier with software bottlenecks. Software bottlenecks are caused by the settings of EA tuning parameters that limit the application's ability to use available hardware capacity to satisfy workload.
Examples of tuning parameters: Java Virtual Machine heap size, number of Web server connections, number of database connections, number of software threads used to execute a particular EA function, etc. Bottlenecks of this group can be fixed by changing the values of tuning parameters; their identification requires knowledge of EA functionality, which might be terra incognita for a cloud provider. The inability of a cloud provider to detect and fix EA software bottlenecks is equivalent to a provider's failure to deliver service acceptable to cloud customers. Queuing Model 2 demonstrates the impact of software bottlenecks on EA performance.

Model 2 has the same topology as Model 1 presented in Figure 1, the same count of CPUs for each server, and the same workload described in Table 1. The difference between the models is that Model 2 emulates database connection pooling. A transaction has to acquire two database connections before it can be processed by the database; after processing is completed, the connections are returned to the pool and can be allocated to another transaction. If the pool does not have idle connections, then the transaction waits until they become available, by the same token increasing its time.

We start model analysis for a pool size of ten connections. Model 2 forecasts that for 250 users transaction time starts to degrade exponentially, but the reason for its degradation is not overutilization of system servers: the most loaded server is the Database server, and it is running at only up to 50% of its capacity for 300 users (Figure 4).

Figure 4 Utilization of system servers and transaction time (Database server has 8 CPUs, connection pool has 10 connections)

Increasing the Database server capacity by an additional 8 CPUs brings its total number to 16 but does not fix the bottleneck; it reduces Database server utilization to 25% but almost doubles transaction time (Figure 5).
Figure 5 Utilization of system servers and transaction time (Database server has 16 CPUs, connection pool has 10 connections)

The explanation of such EA behavior is rooted in the insufficient size of the database connection pool, which prevents the EA from using the database server until idle connections become available. Let's increase the pool size to 30 for a Database server with 8 CPUs. Model 2 predicts that the software bottleneck will be fixed (Figure 6) and transaction time for 300 users will be down to 15 sec.

Figure 6 Utilization of system servers and transaction time (Database server has 8 CPUs, connection pool has 30 connections)
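The connection-pool effect can also be seen analytically. If the pool is treated as an M/M/c queue with c equal to the pool size (a transaction holds a connection for its whole database visit), the Erlang C formula gives the mean wait for a free connection. This is a simplification of Model 2, and the arrival rate and connection holding time below are illustrative assumptions, not the article's solver inputs.

```python
from math import factorial

def mean_connection_wait(arrival_rate: float, hold_time: float, pool: int) -> float:
    """Mean wait (sec) for a free database connection, modeling the pool
    as an M/M/c queue with c = pool size (Erlang C formula)."""
    a = arrival_rate * hold_time     # offered load: connections busy on average
    if a >= pool:
        return float("inf")          # pool saturated: waits grow without bound
    rho = a / pool
    # Erlang C: probability an arriving transaction finds no idle connection.
    top = a**pool / (factorial(pool) * (1 - rho))
    p_wait = top / (sum(a**k / factorial(k) for k in range(pool)) + top)
    return p_wait * hold_time / (pool - a)

# Hypothetical peak load: 0.8 trans/sec, each holding a connection ~11 sec.
for pool in (10, 30):
    w = mean_connection_wait(0.8, 11.0, pool)
    print(f"pool of {pool:2d} connections: mean wait {w:.2f} s")
```

With the same load, the ten-connection pool adds seconds of queueing per database visit while the thirty-connection pool adds essentially none, mirroring Figures 4 and 6: transactions wait even though the database CPUs are far from busy.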
Apparently, a more powerful Database server is needed, because the one with eight CPUs is utilized at 90% for 300 users. We solved Model 2 for a Database server with 16 CPUs and a connection pool with 30 connections (Figure 7).

Figure 7 Utilization of system servers and transaction time (Database server has 16 CPUs, connection pool has 30 connections)

This architecture eliminated the software and hardware bottlenecks, delivered transaction time consistent across all numbers of users, and brought utilization of the Database server under 35%.

Takeaways from the article

Successful launch of EAs into the Cloud is conditioned on a cloud provider's ability to perform the functions below, which expand the traditional system management framework.

Monitoring and characterizing EA workload

The only unvarying feature of an EA's workload is its dynamism. A changing workload requires a change in hardware capacity to keep EA transaction times in line with SLA. The estimates of hardware capacity and the values of software tuning parameters needed to satisfy SLA are based on EA modeling; input data for the models are collected by workload monitoring.

Proactive evaluation of EA capacity and transaction time compliance with SLA using queuing models

Queuing models factor in cloud architecture, processing times on different servers, and hardware specifications as well as user workload. They also assess performance implications of software parameters like the number of threads, connections to system resources, etc. In addition, models provide data needed for the assessment of service cost to the cloud's customers and the provider's revenue. Models convey an estimate of the SLA-compliant maximum throughput, a parameter the cloud provider has to know in order to receive the highest revenue. When the EA is fine-tuned and delivers SLA-compliant maximum throughput, these three goals are achieved: 1) the cloud's hardware capacity is efficiently utilized; 2) transaction time is in line with SLA; 3) the cloud provider has the highest return on investment (assuming customers pay per executed transaction).

Monitoring business transactions

Hardware performance counters have always been observed by IT, and they continue to be on the dashboards of cloud providers. Unfortunately, transaction time monitoring is predominantly neglected by system management applications, despite the fact that it represents the most important SLA requirement. As we have demonstrated using models, software bottlenecks lead to underutilization of hardware; if the provider takes notice of only hardware utilization, he might come to the wrong conclusion that the system has sufficient capacity and can process an even more intense workload.
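Closing that monitoring gap does not require much machinery. Here is a minimal sketch of transaction time monitoring; the 10-second limit is the SLA target from the article's Model 1 example, while the function, transaction names, and sample times are hypothetical.

```python
SLA_LIMIT_SEC = 10.0  # SLA target from the article's Model 1 example

def sla_violations(samples, limit=SLA_LIMIT_SEC):
    """Return transactions whose measured response times break the SLA.

    samples maps a business transaction name to a list of measured
    response times (sec); we flag a transaction on its 95th-percentile
    time, since averages hide the peaks that users actually notice."""
    flagged = {}
    for name, times in samples.items():
        ordered = sorted(times)
        p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
        if p95 > limit:
            flagged[name] = p95
    return flagged

measured = {
    "Interactive transaction": [2.1, 3.0, 2.7, 9.5, 48.0],  # degrading under peak
    "Report export": [4.0, 4.2, 3.9],
}
print(sla_violations(measured))  # -> {'Interactive transaction': 48.0}
```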
By measuring business transaction times, the provider knows when the SLA is violated and can immediately start looking for hardware and software bottlenecks.

Identification and fixing of software bottlenecks

The queuing models analyzed in this article have shown that software bottlenecks block EA access to available hardware resources, increasing transaction time and keeping hardware utilization low. If a software bottleneck is not identified and transaction time is not monitored, a cloud provider can be under the impression that the system works well, because hardware utilization does not exceed the predetermined critical levels that trigger alarms. Trying to fix software bottlenecks by increasing hardware capacity brings hardware utilization even lower; such a solution also increases transaction times as more transactions compete for a limited number of software threads or database connections. The only efficient corrective action is to change the values of the appropriate software parameters, which requires a cloud provider to be familiar with EA functionality. When a cloud provider is in charge of a number of EAs, this holds true for all of them.

The cloud provider has to evaluate its ability to deal with the challenges noted in this article while offering services to customers. From their side, customers have to be vigilant and execute their own due diligence to ensure that EAs are handed over to providers that are capable of successfully maintaining them in the Cloud.

About the author

For the last fifteen years, as an Oracle consultant, the author has been hands-on engaged in performance tuning and sizing of enterprise applications for various corporations (Dell, Citibank, Verizon, Clorox, Bank of America, AT&T, Best Buy, Aetna, Halliburton, Pfizer, AstraZeneca, Starbucks, etc).