Enterprise applications in the cloud - are providers ready?

Enterprise applications in the Cloud:

Are service providers ready?

Leonid Grinshpan, Oracle Corporation (www.oracle.com)

Subject

Managing the performance of enterprise applications is hard. Managing and optimizing
the performance of enterprise applications on shared virtualized infrastructure (i.e.
cloud computing) is even harder. This article outlines the specifics of capacity planning
and performance management of EAs deployed in the cloud.

Forrester research [www.teamquest.com/pdfs/.../forrester-key-cloud-virtual-computing.pdf]
indicates that 98% of interviewed executives in North America and Europe believe
that the main challenges of virtualized and cloud environments are rooted in
capacity and performance management.

Those findings apply directly to enterprise applications (EAs), which feature high
complexity, multiplatform deployments, and strict service-quality requirements.
As an Oracle consultant, the author has witnessed numerous confirmations of
Forrester’s findings while working with a variety of customers on sizing and tuning
Oracle’s EAs.

The following is a real-life story. A large international bank deployed a financial EA
with spikes in user workload in a private cloud. The spikes occurred during each
financial reporting period because of the high rate of financial consolidations. The IT
department did not have the tools to measure transaction times and was not aware that
consolidations took unacceptably long during workload peaks. Hardware utilization,
which IT monitored around the clock, did not exceed 70% on any of the servers, and IT
was under the impression that the EA worked as expected. That impression evaporated
as soon as the first user complaints were logged. IT reexamined all collected hardware
performance counters, did not find any indication of a hardware resource shortage, and
decided to turn to EA experts.

Analysis of the consolidation transaction indicated that it spent most of its time in
the OLAP (on-line analytical processing) database. To be processed there, a transaction
has to acquire database connections. Monitoring of the EA under peak load revealed a
shortage of database connections, which limited the number of consolidations the OLAP
server was able to process concurrently. The finding made clear that the unacceptable
increase in consolidation time was due to long waits for available database
connections. Increasing the number of database connections noticeably improved
consolidation time but produced an unwanted effect: the database server was running at
almost 100% of its total CPU capacity. Fortunately, IT was proficient in detecting and
treating such a malaise. By raising the number of CPUs on the database virtual machine
from 24 to 32, IT delivered the expected consolidation time and brought CPU utilization
back to a normal level. Unfortunately, because of the over-commitment of CPUs, other
applications in the cloud started to experience performance issues, but that is the
beginning of another real-life story. The IT department learned a few lessons: monitor
transaction time, become skilled at detecting and fixing software bottlenecks, and
forecast the consequences of changes aimed at performance improvement.

Failure is not an option when launching an EA into the Cloud – failure equates to a
disruption of a company’s operations with all the accompanying fiscal and public
relations consequences. EAs are critical vehicles carrying out day-to-day business
functions; they must perform as expected at any moment of the production cycle and
efficiently process workloads that fluctuate within broad limits.

EAs can be implemented in different ways and target diverse business tasks. In this
article we consider EAs deployed inside corporations and used by their
employees; this means that they are not retail or customer-serving apps. We also
define an EA as a complex, unified object consisting of hardware infrastructure,
business-oriented software, and operating systems.

A nascent trend in EA deployment is relocation to the Cloud. The beauty of cloud
computing from an IT perspective is that, along with the EA, the headache of EA
management also migrates from the company’s IT to the cloud provider. The latter
becomes responsible for meeting the Service Level Agreement (SLA), a task significantly
complicated by the great expectations of cloud customers.

Cloud providers face two major challenges – allocating appropriate resources
to the EA (capacity planning) and keeping service within the SLA (performance management).
What significantly complicates both tasks is that any cloud is inherently a
collection of resources shared among a number of applications, unlike the traditional IT
environment, where servers and appliances are mostly dedicated to individual
applications. The cloud’s shared resources are under dynamically changing demands from
a variety of users working with diverse EAs that have only one characteristic in
common - mission-critical importance for the corporations.

This article outlines the specifics of capacity planning and performance management of
EAs deployed in the Cloud. We use queuing models of EAs to emulate and
analyze performance-related behavior on the cloud’s shared platforms.
The methodological foundation for this study can be found in the author’s book [Leonid
Grinshpan. Solving Enterprise Application Performance Puzzles: Queuing Models to the
Rescue, Wiley-IEEE Press; available in bookstores and from Web booksellers from January
2012].

The article affirms that in order to host EAs successfully, a cloud provider (in
addition to perfectly executing traditional system management duties) has to be
able to carry out efficiently:

- Monitoring and characterization of EA workload.
- Proactive evaluation of EA capacity as well as of transaction time compliance with
the SLA.
- Monitoring of business transactions.
- Identification and fixing of software bottlenecks.

EA workload characterization

EA workload characterization includes three components:

- List of business transactions.
- Number of executions of each transaction requested by one user during one hour
(transaction rate).
- Number of users requesting each transaction.

Only one of the three components is not prone to frequent fluctuations: the list of
business transactions. It reflects application functionality, which tends to change
slowly, usually through small additions or removals in new software releases. The two
other components are highly volatile. They normally feature daily, weekly,
monthly, and yearly fluctuations, usually exhibiting repeatable patterns.

Any cloud hosting several EAs has to service a number of diversified dynamic
workloads and manage to process them according to SLAs. This is possible only if a
cloud provider is equipped with the tools to monitor all three components of workload
characterization and can collect and correctly interpret workload data necessary for
cloud capacity planning.
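
The three workload components listed above map naturally onto a small record per business transaction. The following is a minimal sketch, not a tool from the article: the class and function names are hypothetical, and the example values are the 300 users and ten executions per user per hour that appear in the Model 1 workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionWorkload:
    """One row of a workload characterization (hypothetical structure)."""
    name: str                  # business transaction name
    users: int                 # number of users requesting the transaction
    rate_per_user_hour: float  # executions per user per hour (transaction rate)

    def requested_per_hour(self) -> float:
        """Transactions per hour that users ask for, not what the system delivers."""
        return self.users * self.rate_per_user_hour


def total_requested_per_second(workload: list[TransactionWorkload]) -> float:
    """Aggregate requested rate across all business transactions."""
    return sum(t.requested_per_hour() for t in workload) / 3600.0


workload = [TransactionWorkload("Interactive transaction", users=300, rate_per_user_hour=10)]
print(f"requested: {workload[0].requested_per_hour():.0f} trans/hour, "
      f"{total_requested_per_second(workload):.3f} trans/sec")
```

Keeping the list of transactions separate from the two volatile numbers makes it easy to replay the same transaction list against daily, weekly, or monthly user counts.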

Planning for capacity

A cloud with permanently changing workloads can deliver the expected services only by
systematically implementing capacity planning. The highest degree of accuracy in
cloud capacity planning can be achieved by using queuing network models of EAs.
Queuing models are capable of factoring in cloud architecture, processing times on
different servers, and the parameters of hardware as well as user workload. Models can
also assess the effects and limitations of software parameters like the number of threads,
connections to system resources, etc. Queuing models take into account the
fundamental behavior of any system servicing users – the fact that user requests
wait in queues when service is slower than the rate of incoming requests.
Wait time in any queue contributes to transaction time. The ability of queuing models
to assess it for different workloads and system architectures enables the cloud provider
to estimate the needed capacity as well as compliance with the SLA.
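
How wait time grows as the arrival rate approaches the service capacity can be illustrated with a single multi-CPU node. The sketch below uses the textbook Erlang C formula for an M/M/c queue; it assumes Poisson arrivals and exponentially distributed service times, which is a simplification chosen for illustration and not necessarily what the solver used for this article assumes. The function name and the sample rates are hypothetical.

```python
from math import factorial


def mmc_mean_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean time a request waits in queue at an M/M/c node (Erlang C formula).

    arrival_rate -- requests per second arriving at the node
    service_rate -- requests per second one processing unit can complete
    servers      -- number of processing units (for example, CPUs)
    """
    offered = arrival_rate / service_rate   # offered load in Erlangs
    rho = offered / servers                 # utilization of each processing unit
    if rho >= 1.0:
        return float("inf")                 # arrivals outpace service: queue grows without bound
    tail = offered ** servers / factorial(servers) / (1.0 - rho)
    p_wait = tail / (sum(offered ** k / factorial(k) for k in range(servers)) + tail)
    return p_wait / (servers * service_rate - arrival_rate)


# An 8-CPU node where each CPU completes one request per second (hypothetical numbers).
for rate in (2.0, 4.0, 6.0, 7.0, 7.8):
    wait = mmc_mean_wait(rate, service_rate=1.0, servers=8)
    print(f"arrival rate {rate:4.1f}/sec  utilization {rate / 8:6.1%}  mean wait {wait:7.3f} sec")
```

The wait stays negligible at moderate utilization and rises steeply close to saturation, which is the behavior a capacity plan has to stay clear of.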

A cloud infrastructure has to have the unique ability to quickly reallocate system
resources as workload changes. We demonstrate how queuing models help determine the
capacity needed to satisfy particular workloads, as well as how to predict usage cost for
the cloud’s customers.

Let’s consider the EA queuing model in Figure 1. It represents a classical three-tiered
EA with Web, Application, and Database servers. Each server corresponds to a model
node with the number of processing units equal to the number of CPUs in the server. The
users and the network are modeled by dedicated nodes. We assume that all servers
(whether physical or virtual) are allocated by a cloud provider to our EA and that each
one has 8 CPUs.

Figure 1 Model 1 of a three-tiered enterprise application

The workload for Model 1 is presented in Table 1. For simplicity it has only one
transaction named “Interactive transaction”; each user initiates an interactive
transaction ten times per hour. We have analyzed the model for 1, 100, 200, 300, and 400
users.

Table 1

Workload for Model 1

Transaction name          Number of users          Number of transaction executions
                                                   per user per hour
Interactive transaction   1, 100, 200, 300, 400    10
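
A closed model of this shape can be approximated in a few lines of code. The sketch below is not the solver used for this article: it applies mean value analysis (MVA) with Seidmann's approximation for multi-CPU servers, omits the network node, and assumes that ten transactions per user per hour corresponds to a think time of 360 sec. The service demands are hypothetical placeholders, so the output is not expected to match the article's charts.

```python
def solve_closed_model(users, think_time, stations):
    """Approximate MVA for a closed interactive model.

    users      -- number of users in the system
    think_time -- seconds a user waits between transactions
    stations   -- list of (service_demand_seconds, cpu_count), one entry per server
    Returns (response_time_seconds, throughput_per_second).
    """
    # Seidmann's approximation: a server with m CPUs and demand D is replaced by
    # a single-CPU queueing node with demand D/m plus a pure delay of D*(m-1)/m.
    queueing = [d / m for d, m in stations]
    delay = sum(d * (m - 1) / m for d, m in stations)
    queue_len = [0.0] * len(stations)
    response = throughput = 0.0
    for n in range(1, users + 1):
        # An arriving request sees the queue left by a system with n-1 users.
        residence = [d * (1.0 + q) for d, q in zip(queueing, queue_len)]
        response = sum(residence) + delay
        throughput = n / (think_time + response)
        queue_len = [throughput * r for r in residence]
    return response, throughput


THINK_TIME = 3600 / 10  # assumption: 10 transactions per user per hour
# Hypothetical service demands in seconds (not from the article), 8 CPUs per server.
STATIONS = [(0.3, 8), (0.7, 8), (9.0, 8)]  # Web, Application, Database

for users in (1, 100, 200, 300, 400):
    response, throughput = solve_closed_model(users, THINK_TIME, STATIONS)
    print(f"{users:4d} users: response {response:7.2f} sec, throughput {throughput:.3f} trans/sec")
```

Even with made-up demands the sketch shows the characteristic pattern: response time stays nearly flat while throughput grows with the number of users, then climbs sharply once the busiest server saturates.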

The models in this article were analyzed using the TeamQuest solver
[http://teamquest.com/products/model/index.htm]. Model 1 predicts exponential
degradation of response time starting from 300 users (Figure 2). It also estimates that
for up to 250 users transaction time will stay under the 10 sec required by the SLA (the
vertical line on the chart).

[Chart not reproduced: transaction response time (sec) and system throughput (trans/sec) for 1, 100, 200, 300, and 400 users; vertical axis from 0 to 120; points A and B are marked on the chart]

Figure 2 Transaction response time and system throughput

System throughput (measured in the number of transactions per second or per hour)
grows linearly until it reaches a breaking point at 300 users, after which its growth
slows down (point A on the chart). At point A system throughput is 0.8 trans/sec * 3600
sec/hour = 2880 trans/hour (for convenient representation on the chart we have scaled
the system throughput line 100 times). At point B (where transaction time is still in line
with the SLA), system throughput is 0.65 trans/sec * 3600 sec/hour = 2340 trans/hour.

System throughput is a must-have parameter for cost estimation when the cloud’s
price policy requires customers to pay per transaction. Usage cost is calculated by the
formula:

Application Usage Cost = Cost of one transaction * System throughput
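
As a worked instance of the formula, the sketch below converts the throughput values quoted for points A and B into hourly figures and applies a per-transaction price. The price is a hypothetical placeholder; the article does not state one.

```python
SECONDS_PER_HOUR = 3600


def hourly_usage_cost(throughput_per_sec: float, price_per_transaction: float) -> float:
    """Application usage cost per hour = cost of one transaction * system throughput."""
    return price_per_transaction * throughput_per_sec * SECONDS_PER_HOUR


PRICE = 0.01  # hypothetical price per transaction, in arbitrary currency units

for label, throughput in (("point A", 0.8), ("point B", 0.65)):
    per_hour = throughput * SECONDS_PER_HOUR
    cost = hourly_usage_cost(throughput, PRICE)
    print(f"{label}: {per_hour:.0f} trans/hour, usage cost {cost:.2f} per hour")
```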

Per Figure 2, throughput increases as the number of users grows. As our system can
support no more than 250 users without SLA violation, we consider the throughput
supported by the system for 250 users to be the SLA-compliant maximum throughput. The
cloud provider, in order to receive the highest revenue, has to monitor workload and
dynamically allocate to the EA a volume of resources that keeps the system at the
SLA-compliant maximum throughput level. In Model 1 at that level the Database server is
efficiently utilized (Figure 3), the transaction time is in line with the SLA, and the cloud
provider has the highest return on investment. If the workload is fewer than 250
users, the Database server is underutilized and the cloud provider can reallocate it to
another EA.

[Chart not reproduced: utilization (percentage, 0 to 100) of the Database, Application, and Web servers for 1, 100, 200, 300, and 400 users]

Figure 3 Utilization of system servers

Price policy can be based not only on system throughput but also on hardware
utilization or on the specifications of allocated hardware (number of CPUs, their speed,
memory size, etc.). Whatever the price policy, queuing models provide data for
quantitative estimates of the provider’s return on investment as well as the customer’s
cost.

Monitoring business transactions

The most important EA performance indicator is transaction response time. It is
specified in the SLA and ought to be under IT monitoring and control 24/7.
Ironically, that is often not the case; many IT departments are instead obsessed
with hardware capacity monitoring and do not even have appropriate instruments and
policies for business transaction monitoring. In relentless pursuit of IT optimization
they strive to get the most out of hardware servers and appliances, often
compromising transaction time. Figures 2 and 3 show how such a policy can jeopardize
EA performance – beyond 250 users hardware utilization and system
throughput keep going up, but the price we pay is the exponential degradation of
transaction time.
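
Measuring transaction time does not require heavy tooling. The sketch below is one minimal way to time business transactions and flag SLA violations; the class, its methods, and the "Consolidation" label are hypothetical, and the 10 sec limit is the SLA value quoted for Model 1.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

SLA_SECONDS = 10.0  # SLA limit quoted for Model 1 in this article


class TransactionMonitor:
    """Collects business transaction times and reports SLA violations."""

    def __init__(self, sla_seconds: float = SLA_SECONDS):
        self.sla_seconds = sla_seconds
        self.samples = defaultdict(list)  # transaction name -> list of durations (sec)

    @contextmanager
    def measure(self, name: str):
        """Time the enclosed block and store the result, even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append(time.perf_counter() - start)

    def record(self, name: str, seconds: float) -> None:
        """Store a duration measured elsewhere (for example, parsed from a log)."""
        self.samples[name].append(seconds)

    def violations(self) -> dict:
        """Return {transaction name: worst observed time} for names that broke the SLA."""
        return {name: max(times) for name, times in self.samples.items()
                if max(times) > self.sla_seconds}


monitor = TransactionMonitor()
with monitor.measure("Consolidation"):
    time.sleep(0.01)                     # stands in for the real business transaction
monitor.record("Consolidation", 42.0)    # a slow execution reported by another probe
print(monitor.violations())
```

The point of the sketch is the alert condition: it is driven by transaction time against the SLA, independently of any hardware counter.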

Software bottlenecks

Transferring an EA to the Cloud puts the cloud provider in charge of detecting and
troubleshooting all kinds of performance bottlenecks. EAs may suffer from
two distinct groups of bottlenecks - hardware and software. The first group is
familiar to any IT department and cloud provider; the remediation prescriptions can be
found in textbooks on performance. Usually, when a hardware bottleneck is identified, it
can be fixed by either vertical or horizontal scaling.

Things are much trickier with software bottlenecks. Software bottlenecks are caused by
the settings of EA tuning parameters that limit the application’s ability to use available
hardware capacity to satisfy the workload. Examples of tuning parameters: Java Virtual
Machine heap size, number of Web server connections, number of database
connections, number of software threads used to execute a particular EA function, etc.
Bottlenecks of this group can be fixed by changing the values of tuning parameters;
their identification requires knowledge of EA functionality, which might be terra
incognita for a cloud provider. The inability of a cloud provider to detect and fix EA
software bottlenecks is equivalent to the provider’s failure to deliver service acceptable
to cloud customers. Queuing Model 2 demonstrates the impact of software bottlenecks on
EA performance.

Model 2 has the same topology as Model 1 presented in Figure 1, the same count of
CPUs for each server, and the same workload described in Table 1. The difference
between the two models is that Model 2 emulates database connection pooling. A
transaction has to acquire two database connections before it can be processed by the
database; after processing is completed the connections are returned to the pool and can
be allocated to another transaction. If the pool does not have idle connections, the
transaction waits until they become available, which increases its transaction time.
We start the model analysis with a pool size of ten connections. Model 2 forecasts
that from 250 users on transaction time starts to degrade exponentially, but the reason
for the degradation is not overutilization of system servers – the most loaded is the
Database server, and it is running at only up to 50% of its capacity for 300 users
(Figure 4).

[Chart not reproduced: transaction time (sec) and utilization (%) of the Database, Application, and Web servers for 1, 100, 200, and 300 users; vertical axis from 0 to 60]

Figure 4 Utilization of system servers and transaction time
(Database server has 8 CPUs, connection pool has 10 connections)

Increasing the Database server capacity by an additional 8 CPUs brings its total number
to 16 but does not fix the bottleneck; it reduces Database server utilization to 25% but
almost doubles transaction time (Figure 5).

[Figure 5: chart not reproduced; vertical axis from 0 to 120 (Database server has 16 CPUs, connection pool has 10 connections)]

This EA behavior is explained by the insufficient size of the database
connection pool, which prevents the EA from using the database server until idle
connections become available. Let’s increase the pool size to 30 for a Database
server with 8 CPUs. Model 2 predicts that the software bottleneck will be fixed (Figure 6)
and transaction time for 300 users will be down to 15 sec.

[Figure 6: chart not reproduced; vertical axis from 0 to 100 (Database server has 8 CPUs, connection pool has 30 connections)]

Apparently, a more powerful Database server is needed because the one with eight
CPUs is 90% utilized at 300 users. We solved Model 2 for a Database server with 16
CPUs and a connection pool with 30 connections (Figure 7).

[Figure 7: chart not reproduced; vertical axis from 0 to 35 (Database server has 16 CPUs, connection pool has 30 connections)]

This architecture eliminated both the software and the hardware bottlenecks, delivered
transaction time consistent across all numbers of users, and brought utilization of
the Database server under 35%.
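
The effect of the pool size in the four Model 2 configurations can be cross-checked with a back-of-the-envelope bound. The sketch below is not part of the queuing model; it assumes that each transaction admitted to the database keeps at most one CPU busy at a time, so the pool caps both the number of concurrent transactions and the achievable Database server utilization. The function names are hypothetical.

```python
def max_concurrent_in_database(pool_size: int, connections_per_transaction: int = 2) -> int:
    """Transactions the pool can admit at once (each needs two connections in Model 2)."""
    return pool_size // connections_per_transaction


def utilization_ceiling(pool_size: int, cpus: int, connections_per_transaction: int = 2) -> float:
    """Upper bound on Database server CPU utilization imposed by the connection pool.

    Assumption: an admitted transaction keeps at most one CPU busy at a time.
    """
    concurrent = max_concurrent_in_database(pool_size, connections_per_transaction)
    return min(1.0, concurrent / cpus)


for pool_size, cpus in ((10, 8), (10, 16), (30, 8), (30, 16)):
    admitted = max_concurrent_in_database(pool_size)
    ceiling = utilization_ceiling(pool_size, cpus)
    print(f"pool {pool_size:2d}, {cpus:2d} CPUs: at most {admitted:2d} concurrent "
          f"transactions, utilization ceiling {ceiling:.1%}")
```

With ten connections only five transactions can be in the database at once, so some of the 8 or 16 CPUs always sit idle regardless of demand; with thirty connections the pool no longer limits an 8-CPU server. The utilizations reported by Model 2 (50%, 25%, 90%, and under 35%) all stay below the corresponding ceilings.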

Takeaways from the article

Successful launch of EAs into the Cloud depends on the cloud provider’s ability to
perform the functions below, which extend the traditional system management framework.

Monitoring and characterizing EA workload

The only unvarying feature of an EA’s workload is its dynamism. A changing workload
requires changes in hardware capacity to keep EA transaction times in line with the SLA.
The estimates of hardware capacity and the values of software tuning parameters
needed to satisfy the SLA are based on EA modeling; input data for the models are
collected by workload monitoring.

Proactive evaluation of EA capacity and transaction time compliance with SLA using queuing models

Queuing models factor in cloud architecture, processing times on different servers,
and hardware specifications as well as user workload. They also assess the performance
implications of software parameters like the number of threads, connections to system
resources, etc. In addition, models provide the data needed to assess service
cost to a cloud’s customers and the provider’s revenue. Models convey an estimate of the
SLA-compliant maximum throughput – a parameter the cloud provider has to know in order
to receive the highest revenue. When the EA is fine-tuned and delivers the SLA-compliant
maximum throughput, three goals are achieved: 1) the cloud’s hardware capacity is
efficiently utilized; 2) transaction time is in line with the SLA; 3) the cloud provider has
the highest return on investment (assuming the customer pays per executed transaction).

Monitoring business transactions

Hardware performance counters have always been observed by IT, and they continue to
be on the dashboards of cloud providers. Unfortunately, transaction time
monitoring is predominantly neglected by system management applications, despite
the fact that it represents the most important SLA requirement. As we have
demonstrated using models, software bottlenecks lead to underutilization of hardware;
a provider that takes notice only of hardware utilization might come to the wrong
conclusion that the system has sufficient capacity and can process an even more intense
workload. By measuring business transaction times the provider knows when the SLA
is violated and can immediately start looking for hardware and software bottlenecks.

Identification and fixing software bottlenecks

Queuing models analyzed in this article have shown that software bottlenecks block EA
access to available hardware resources, increasing transaction time and keeping
hardware utilization low. If a software bottleneck is not identified and transaction time
is not monitored, a cloud provider can be under the impression that the system works
well, because hardware utilization does not exceed the predetermined critical levels
that trigger an alarm. Trying to fix software bottlenecks by increasing hardware capacity
brings hardware utilization even lower; such a solution also increases transaction times
as more transactions compete for a limited number of software threads or database
connections. The only efficient corrective action is to change the values of the
appropriate software parameters, which requires a cloud provider to be familiar with EA
functionality. When a cloud provider is in charge of a number of EAs, this holds true
for all of them.
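
The reasoning above suggests a simple triage rule that combines transaction time with hardware utilization. The sketch below is a hypothetical illustration of that rule, not a procedure given in the article: the function name, the 80% critical utilization level, and the 40 sec transaction time in the second example are assumptions, while the 10 sec SLA, the 15 sec transaction time, and the 90% and 50% utilizations come from the models discussed earlier.

```python
def diagnose(transaction_time, sla_seconds, utilizations, critical_utilization=0.80):
    """Classify a measurement taken under load.

    transaction_time     -- measured business transaction time in seconds
    sla_seconds          -- transaction time allowed by the SLA
    utilizations         -- {server name: utilization between 0 and 1}
    critical_utilization -- hypothetical alarm level for hardware utilization
    """
    if transaction_time <= sla_seconds:
        return "ok"
    saturated = sorted(name for name, u in utilizations.items() if u >= critical_utilization)
    if saturated:
        return "hardware bottleneck suspected on: " + ", ".join(saturated)
    # SLA is violated although no server is near its limit: look at tuning parameters.
    return "software bottleneck suspected"


# Pool of 30 connections, 8 CPUs: 15 sec at 300 users with the Database server at 90%.
print(diagnose(15.0, 10.0, {"Database": 0.90}))
# Pool of 10 connections, 8 CPUs: Database server at 50%; the 40 sec value is hypothetical.
print(diagnose(40.0, 10.0, {"Database": 0.50}))
```

A rule of this kind only works when transaction time is actually measured; with hardware counters alone the second case would raise no alarm at all.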

The cloud provider has to evaluate its ability to deal with the challenges noted in this
article while offering services to customers. For their part, customers have to be
vigilant and perform their own due diligence to ensure that EAs are handed over to
providers capable of successfully maintaining them in the Cloud.

About the author

For the last fifteen years, as an Oracle consultant, the author has been engaged
hands-on in performance tuning and sizing of enterprise applications for various
corporations (Dell, Citibank, Verizon, Clorox, Bank of America, AT&T, Best Buy, Aetna,
Halliburton, Pfizer, Astra Zeneca, Starbucks, etc.).
