Cost Based Performance Modeling – dealing with
                performance “uncertainties”
                                           Eugene Margulis
                                     Telus Health Solutions, Ottawa
                                      eugene_margulis@yahoo.ca

Abstract. Traditional Performance Evaluation methods (e.g. “big system test”, instrumentation) are costly
and do not deal well with inherent performance related “uncertainties” of modern systems. There are three
such uncertainties: requirements that are either unclear or change from deployment to deployment, 3rd-party
code that one has no direct access to, and the variable h/w platform. These uncertainties make exhaustive
testing impractical and the “worst case” testing results in engineering for the “worst impossible case” rather
than a realistic customer scenario. Creating a single model that is based on traceable and repeatable test
results of individual system components (transactions) saves a huge amount of effort/costs on performance
related engineering/QA - and provides an almost instantaneous "what if" analysis for product planning.



Introduction
The primary goal of performance/capacity activities is the ability to articulate/quantify the resource
requirements for a given system behavior on given h/w, subject to a number of constraints.

The resource requirements may apply to CPU, memory, network bandwidth, size of thread pools, heap sizes
of individual JVMs, disk sizes, etc. There are many different types of resources that the system uses while
processing its “payload”.

The behavior of the system defines the expected, quantifiable use patterns. For example, the number of
network elements a network controller is connected to and the frequency of events from those elements to
the controller represent one aspect of system behavior. A system can have several “behaviors” –
e.g. payload processing, upgrade, overload, etc.

The constraints are the additional sets of requirements that define what behaviors are “useful” (from the user
perspective). For example, it may not be useful to process 10 events per second if the response time per
event is greater than 20 seconds. Or it may not be useful to process 10 events per second if you cannot
retain event logs for 30 days.

The h/w defines the target h/w environment for system deployment. An 8-core, 32-“strand”, 1 GHz processor
system with 8 GB of memory may better fit a behavior pattern favoring throughput, whereas a single core
at 2 GHz might fit another behavior where response time is more important.

The goal is to create a quantifiable relationship between behaviors, constraints, resource requirements and h/w.
This can, of course, be accomplished using “brute force” testing, but the size of the “test space” makes this
approach impractical. Traditionally this has been addressed by testing for “small/medium/large”
configurations. Unfortunately, many deployments/behaviors do not scale easily; they are simply different,
not larger or smaller. (e.g. How do you compare deployment A, which requires 30-day historical event
retention and 3 events/sec, with deployment B, which requires 1-day retention but 20 events/sec?)

In addition, the traditional methods of performing most test measurements in a high capacity lab are often
costly and inflexible, and in most cases can only be performed at the tail end of the development process,
when discovered problems are very expensive to fix or mitigate. Cost based capacity modeling provides a lean
and flexible approach to performance and capacity evaluation resulting in a significant reduction of
performance related R&D costs.

In this paper we will describe cost based performance modeling, how it applies to the different
performance/capacity QA activities and outline the benefits of using this approach.




What are the challenges and uncertainties?
The key challenges during the development of such systems, from a performance perspective, are dealing with
three uncertainties:

         Behavior Uncertainty – because of multiple deployment scenarios, one is never quite certain of the
         conditions the system is going to be used in. If the product being developed is new and
         provides a new capability, there is little historical data on how it might be used.

         Code Uncertainty – in the past, when most of the code was developed “in-house”, one could always
         have direct access to the code, test it in isolation, contact the developers and understand the
         resource requirements. With 3rd-party code this is no longer the case. Such code can only
         be treated as a “black box”; one often cannot rebuild it with debug/optimization options.

         H/W Uncertainty – the underlying h/w architecture is no longer fixed. If it takes 6 months to develop a
         system, the deployment h/w may be very different from the h/w available in the lab.

Performance is no longer a verification exercise. Traditionally performance activities during system
development were viewed as a verification exercise. It was assumed that there was a set of requirements,
defined h/w and a stable system to test on. The uncertainties mentioned above make performance analysis
of a more exploratory nature, where the focus of performance activities is to discover and identify system
operating costs and limits rather than to validate a specific scenario. (Direct validation/verification of a
specific scenario/behavior is still the best option – provided that the scenario is known.)

Efficient articulation and sharing of performance information. In addition to these uncertainties there
are organizational challenges of sharing performance information between different groups of stakeholders.
The performance profile is usually a multi-dimensional problem (multiple resources, multiple constraints,
multiple behaviors). It is important to articulate and communicate this kind of complex information effectively
and consistently. A development group may span across continents, time zones, language boundaries –
designers can be in India, architects in Canada, testers in China and customers in Spain. It is important to
make sure that when someone refers to the “cost of event processing” everyone knows exactly what it
means and the implications.

Performance estimates available at any time. Given the length of the development cycle and the size of
the modern systems it is usually too late to identify/address performance issues after the code “freeze”. It is
important to be able to determine performance bounds as early as possible. For example, if the event-
processing component is ready to be tested before historical event reporting, there is no reason to wait
until everything is ready before determining the cost of event processing. The initial cost estimates of a given
component (e.g. event processing) should be determined as soon as the component has basic functionality
and should be refined throughout the development cycle. A quantifiable “best guess” of the system
performance should be available at any time and the accuracy and the confidence of such “guess” should
improve continuously throughout the development cycle.

Cost Based Transactional Model
3+ way view
The system performance and capacity relationship can be visualized as the following “3+ way” view, where the
Cost Model provides the mapping between system behavior (e.g. events per second), costs (e.g. cost per event)
and the total resource requirements. The relationship is subject to a number of constraints and h/w
characteristics:




[Figure: the “3+ way” view – the Cost Model maps Behavior and per-transaction Costs to Resource Requirements, subject to latency, h/w and other constraints.]

The cost model is the “glue” that provides quantification of resources based on quantifiable behavior
(requirements), measured costs and specific constraints. The model is used to drive the performance
analysis activities throughout the project to make sure that we perform the minimal amount of testing for the
expected deployment. Building and using the model early on ensures that the tests performed and data
collected are relevant for the intended system deployment.

The model is capable of accurately determining a fairly comprehensive capacity envelope even though it is
based on a much smaller set of test measurements than the traditional brute force multi-dimensional testing
normally required to produce such an envelope. The model forecasts (extrapolates or interpolates) most of
the behavior/constraint combinations so that explicit measurements are not necessary. The cost savings are
realized through the relatively small number of test measurements required to do this.

The following sections describe the process of building such a model.

Transaction as a unit of System Behavior
System behavior can be described in terms of processing a number of distinct transactions within a unit of
time. A transaction represents some unit of work offered to the system. System behavior is the offered
workload (that is, a set of transactions).

 For example, a system may be expected to process X events per second, update Y user GUI displays
every Z seconds, collect N performance management reports from M network elements every K minutes,
etc. Each of these examples can be viewed as a transaction associated with some frequency of execution:

         Transaction = (TransactionType, TransactionFrequency)
         SystemBehaviour = SetOf {Transactions}

Clearly, defining an exhaustive set of transactions for any large system is impractical. However, most
systems execute only a small subset of transactions most of the time, especially during steady-state
“payload” processing.
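
This definition maps directly onto code. The following Python sketch (with hypothetical transaction names and rates) shows one way to represent transactions and a behavior; in practice the paper implements the model in a spreadsheet:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Transaction:
        """A unit of work offered to the system at some frequency."""
        txn_type: str     # e.g. "event_arrival", "gui_update"
        frequency: float  # executions per second

    # SystemBehaviour = SetOf {Transactions}
    behaviour = {
        Transaction("event_arrival", 10.0),  # X events per second
        Transaction("gui_update", 0.5),      # Y displays every Z seconds
        Transaction("pm_report", 2 / 60),    # N reports every K minutes
    }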

This model can be extended by defining multiple system behaviors for different system states. For
example, steady-state processing can be associated with one set of transactions and the overload state with
another. From a performance/capacity perspective we need to identify and prioritize the set of system
behaviors to be assessed and evaluated.

Each transaction executed on the system results in the use of system resources – CPU, memory, disk, etc. The
total amount of resources required depends on the behavior (the set of transactions and their frequencies)
and on the constraints:

         SystemRequiredResourceUsage = Function (SystemBehavior, Constraints)

The resource cost of a given system behavior usually depends on the transaction cost and the frequency of
transactions; however, different resources may require different computations. For example, the “cost” of
disk (disk space) is related to the total disk “residue” of the transaction – e.g. the DB space required per
event record. Memory is the most difficult to trace, since memory utilization is affected by multiple layers
of memory-management policies.
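
To make the distinction concrete, the sketch below contrasts a rate-driven resource (CPU) with a residue-driven one (disk). The cost figures and the 30-day retention constraint are hypothetical:

    # CPU demand scales with how often a transaction runs; disk demand
    # scales with what each transaction leaves behind, retained for the
    # full log-retention period (a constraint).

    def cpu_mhz(rate_per_sec: float, mhz_per_txn: float) -> float:
        return rate_per_sec * mhz_per_txn

    def disk_gb(rate_per_sec: float, residue_bytes: float,
                retention_days: int) -> float:
        return rate_per_sec * residue_bytes * retention_days * 86400 / 1e9

    print(cpu_mhz(10, 307))       # event-processing CPU demand, MHz
    print(disk_gb(10, 2048, 30))  # ~53 GB: 2 KB/event retained for 30 days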




The cost based model implements the function above – namely the mapping of system behavior and
constraints to the system resources required to support that behavior. This mapping is based on cost
measurements and testing of individual transactions, rather than on direct testing of every behavior.

Transaction Linearity
Each Transaction is associated with resource “costs” – the price the system has to “pay” for performing the
transaction. For example, processing 1 event per second may require 10% of CPU on a 1500 MHz processor.
The CPU utilization may or may not be linear with respect to the frequency of the given transaction type.
That is, processing of 2 events per second might not be twice as expensive as processing of 1 event.
However, using a linear model as a starting point provides a good initial approximation:

         Most processes are linear within the “operating scope of the system” of 20%–70% of CPU.
         Above 70% CPU utilization, OS context switching results in non-linear CPU utilization (the actual
         maximum CPU cut-off may be higher than 70% for many systems). However, in that range we are usually
         not interested in the capacity of the system, since it is outside the operating scope (as long as we
         know that the utilization is over 70%). If the utilization is below, say, 20%, then the error due to
         non-linearity is fairly small and can be ignored.

         If the behavior of the system with respect to a given transaction is demonstrated (as a result of
         testing) to be non-linear within the operating scope of the system, then it may be possible to
         decompose the transaction into independent linear components. For example, event processing
         may include aggregating and storing events in a historical database. The database “write” (the most
         expensive part) is usually done in batches of multiple events, so the overall cost of event
         processing is non-linear. However, decomposing event processing into two transactions – 'event
         arrival' and 'event save' – with different frequencies results in a linear transaction set.

         If the system behavior is demonstrated to be non-linear with respect to transaction frequency,
         further decomposition is not possible or practical, and the linear assumption results in significant
         estimation errors, then a non-linear cost function must be developed based on capacity tests at
         various transaction rates.

Using the linear assumption above we can derive the system resource usage as follows:

                  SystemRequiredResourceUsage =
                         Function(
                                SumOf {Cost(TransactionType, TransactionFrequency)},
                                Constraints
                         ) + C

The C above is a constant representing the background resource utilization. In theory it is possible to
decompose C into individual transactions (OS activities, system management, etc.), but in
practice C is simply the background utilization in the absence of the defined transactions.
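
Under the linear assumption this computation reduces to a weighted sum. A minimal sketch, with illustrative MHz costs and a behavior expressed as a name-to-rate mapping:

    # Linear cost model: total CPU = sum of (cost per unit rate * rate)
    # over all transactions, plus the background constant C.
    COSTS_MHZ = {"event_arrival": 307.0, "gui_update": 120.0, "pm_report": 45.0}
    BACKGROUND_MHZ = 200.0  # the constant C: OS activities, system management

    def required_cpu_mhz(behaviour: dict) -> float:
        return (sum(COSTS_MHZ[t] * rate for t, rate in behaviour.items())
                + BACKGROUND_MHZ)

    print(required_cpu_mhz({"event_arrival": 10, "gui_update": 0.5,
                            "pm_report": 2 / 60}))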

Transaction Costs
A transaction may depend on a number of resources; therefore the maximum rate of a given transaction may
be reached even though a given resource is not exhausted. For example, assume event processing is a
single-threaded process that “costs” 5% of CPU per event per second on a 4-core processor. In this case
processing 5 events per second will result in 25% CPU. Processing 6 events per second will not be
possible, since the single-threaded process can never use more than 1 core of the 4-core CPU.
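
Using the numbers from this example, the threading cap on MaxRate can be computed directly:

    # A single-threaded transaction can use at most one core, so its
    # MaxRate is capped even though total CPU is far from exhausted.
    CORES = 4
    COST_PCT_PER_RPS = 5.0           # % of total CPU per event/sec
    MAX_USABLE_PCT = 100.0 / CORES   # one core = 25% of a 4-core box

    max_rate = MAX_USABLE_PCT / COST_PCT_PER_RPS
    print(max_rate)  # 5.0 -> 6 events/sec is impossible despite ~75% idle CPU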

The costs per transaction are measured by monitoring resource utilization and constraints (e.g. latency) for
various rates of the transaction. For example, the following charts show the test results of executing a
certain transaction at rates of 2, 4, 6, 8, 10 and 12 rps (requests per second):




[Figure: left – CPU% (total, usr, sys) traces over the duration of the test runs; right – averaged CPU% (with linear trend y = 0.048x + 0.006, R² = 0.9876) and latency (sec) vs. requests per second.]


The chart on the left shows the CPU utilization pattern throughout the test. The chart on the right shows the
averaged CPU utilization per test (blue) as well as transaction latency (pink) for different rates of
transactions.

One can see that at rates higher than 10 rps the latency increases dramatically (yet the CPU utilization
at this rate is only about 50%). Clearly the limiting factor for this transaction is not CPU but something else (in
this specific case it was JVM heap utilization). All of the tests were run by generating the requested rate of
transactions for the same amount of time. The CPU% trace on the left chart shows that during the 12 rps test
the system remained active for longer than during the other tests – the latency increased so much that the
system was not able to keep up with the given rate of transactions. Based on these results, the maximum
sustainable rate (MaxRate) for this transaction is 10 rps. The 12 rps rate is not sustainable and was therefore
not included in the average CPU values used for the trend estimation in the chart on the right.

It is also clear that the transaction costs are linear with respect to transaction frequency – the slope of the
trend line shows that the cost is approximately 4.8% of CPU per transaction per second. For model
computation purposes it is useful to express the CPU costs in terms of MHz per transaction (so that we can
map them from one processor to another). This test was performed on a Netra 440 (4 CPUs, 1.6 GHz each) –
therefore 4.8% is equivalent to 0.048 * 4 * 1600 ≈ 307 MHz per request.

The transaction cost therefore is described as (MHz, MaxRate) = (307, 10).
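
The arithmetic behind this conversion, using the numbers from the test above:

    # Convert the measured CPU% trend slope into MHz per transaction so
    # the cost can be mapped from one processor to another.
    SLOPE = 0.048                # CPU fraction per request/sec (trend line)
    CPUS, MHZ_PER_CPU = 4, 1600  # Netra 440: 4 CPUs at 1.6 GHz

    mhz_per_request = SLOPE * CPUS * MHZ_PER_CPU  # ~307
    MAX_RATE = 10                # rps; above this, latency became the limit

    print((round(mhz_per_request), MAX_RATE))     # (307, 10) = (MHz, MaxRate)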

In practice the cost of a transaction may depend on additional parameters. The cost of a DB insert will depend
on the number of records in the database. Other transactions may depend on multiple factors, but there are
usually very few parameters that transaction costs are most sensitive to. Once these parameters are
identified, the cost of the transaction is measured for the different values of the parameters: the MaxRate and
MHz for each combination of parameter values are measured and derived as above. The final transaction cost
profile is represented as a transaction cost matrix:

         DB Insert Cost

         ActiveRecords      MHz    MaxRate
                     0     12.0      125.0
                 10000     15.5       96.6
                 30000     16.5       90.8
                 60000     18.0       83.3
                100000     20.0       75.0
                200000     24.9       60.1



The table above shows the costs and maximum rates of the DB Insert transaction with respect to the value of
the ActiveRecords parameter.

A Transaction Cost Matrix description is unambiguous; it can be shared between groups, traced directly
to specific test runs, reproduced, tested and used consistently for projections and estimates.
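
A cost matrix in this form is also easy to consume programmatically. The sketch below looks up the DB Insert cost for an arbitrary ActiveRecords value; the linear interpolation between measured rows is our assumption, the model itself only requires the measured values:

    import bisect

    DB_INSERT = [  # (ActiveRecords, MHz, MaxRate) rows from the table above
        (0, 12.0, 125.0), (10000, 15.5, 96.6), (30000, 16.5, 90.8),
        (60000, 18.0, 83.3), (100000, 20.0, 75.0), (200000, 24.9, 60.1),
    ]

    def db_insert_cost(active_records: int) -> tuple:
        keys = [row[0] for row in DB_INSERT]
        i = bisect.bisect_left(keys, active_records)
        if i == 0:
            return DB_INSERT[0][1:]
        if i == len(keys):
            return DB_INSERT[-1][1:]
        (x0, m0, r0), (x1, m1, r1) = DB_INSERT[i - 1], DB_INSERT[i]
        f = (active_records - x0) / (x1 - x0)
        return (m0 + f * (m1 - m0), r0 + f * (r1 - r0))

    print(db_insert_cost(45000))  # interpolated (MHz, MaxRate)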

Model Implementation
This model lends itself to a fairly straightforward spreadsheet implementation, which allows for continuous
update, refinement and automatic import of test data and measurements. Once implemented, the model
allows immediate “what-if” analysis of the impact of a change in customer behavior or in a given cost
(e.g. what if we have twice as many events but half as many GUI clients? How many more clients can we
handle if we speed up the security function by a factor of 2?).
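
A sketch of such a what-if computation; the costs, capacity and behaviors below are illustrative placeholders for the measured values:

    # "What if we have twice as many events but half as many GUI clients?"
    COSTS = {"event": 307.0, "gui_client": 25.0, "security": 80.0}  # MHz per unit rate
    CAPACITY_MHZ = 4 * 1600.0
    BACKGROUND_MHZ = 200.0

    def headroom(behaviour: dict) -> float:
        used = sum(COSTS[t] * r for t, r in behaviour.items()) + BACKGROUND_MHZ
        return CAPACITY_MHZ - used

    base = {"event": 10, "gui_client": 40, "security": 5}
    what_if = {"event": 20, "gui_client": 20, "security": 5}
    print(headroom(base), headroom(what_if))  # positive fits, negative exceeds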

The primary “deliverable” of such a model is an operating envelope of the system with respect to the system
resources. The operating envelope can be graphically represented as a 2D or 3D chart showing system
operating limits with respect to key transactions and resources. An envelope can be dynamically
recomputed with respect to any behavior/scenario/hardware.

For example, the following figure shows the operating envelope of a system with respect to 3 parameters
(event rate, number of users and number of records). This particular system had approximately 30 different
“dimensions” (about 10 transaction types plus other constraints and parameters). The model was capable of
computing the operating envelope for any three parameters while fixing the values of the rest. The area below
the 3D surface represents the system operating range that meets the capacity and behavior constraints and
allows instantaneous “what if” analysis. Based on this chart, if the intended deployment calls for 30000 users
then the system will meet its performance/capacity constraints provided that the event rate is less than 14 per
sec and the number of records is less than 22000. The capability to compute this kind of operating envelope
greatly assists and simplifies the “what-if” analysis necessary to determine, articulate and communicate the
limits of a potential deployment.


[Figure: 3D operating envelope – EVENT RATE (10–18/sec) vs. NUSERS (25000–34000) and NRECORDS (10000–50000); the area below the surface is the acceptable operating range.]

In the example above, the operating envelope showed the general limitations of the system based on the
value of certain parameters and transaction frequencies.
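
Computationally, such an envelope can be produced by fixing all but a few dimensions and, at each grid point, solving for the largest value of the remaining dimension that still satisfies every resource limit. A simplified sketch, where the cost functions and limits are invented placeholders for the measured cost matrices:

    # For each (users, records) point, find the highest event rate that
    # still fits within every resource limit; the surface of these maxima
    # is the operating envelope.
    def max_event_rate(users: int, records: int) -> float:
        cpu_left = 6400.0 - 0.025 * users - 0.002 * records  # MHz remaining
        heap_left = 2048.0 - 0.01 * users                    # MB remaining
        per_event_cpu, per_event_heap = 307.0, 1.5
        return max(0.0, min(cpu_left / per_event_cpu,
                            heap_left / per_event_heap))

    envelope = {(u, r): max_event_rate(u, r)
                for u in range(25000, 34001, 1000)
                for r in range(10000, 50001, 4000)}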

Another example (below) shows an operating envelope chart that displays the specific resources that limit
the system capacity. In this case the system, a network controller, can support various combinations of two
network element types (NE1 and NE2). The white line (MAX) shows the overall operating limit of the system.
The support of NE2s maxes out at about 21 and is bounded by CPU (MAX_cpu). The limiting factor for NE1s
is the disk (MAX_ed), which maxes out at 300–400 depending on the number of NE2s. This chart clearly
shows which resource use needs to be optimized (or which resource needs to be increased) depending on
the potential deployment.


[Figure: operating envelope – #NE2 (0–80) vs. #NE1 (0–1200), with per-resource limit curves MAX, MAX_cpu, MAX_ed, MAX_ned, MAX_bw and MAX_aeps.]

As part of the total resource utilization computation for a given behavior, the model must compute the
expected resource utilization of each individual transaction (or of each application/component identified with a
transaction set). This information is valuable in identifying which feature/transaction is responsible for most
of the resource use under a particular system behavior. The next three charts show this type of breakdown for
CPU, memory and disk utilization on various systems:


[Figure: three resource-breakdown charts – CPU distribution by feature (500A x 15K cph), RAM top-10 users, and disk utilization by component (GB).]

For example, the top left chart above shows that the biggest CPU users are the “Collect Digits” and “RTDisplay”
components (and the transactions associated with them). If CPU is the limiting factor, then these two should be
the focus of improvement/optimization.




Model Accuracy – Start early, improve as you go
The model accuracy clearly depends on the accuracy of the underlying data. The underlying data broadly
belongs to two categories – tested/measured transaction costs and the quantification of expected system
behavior. As the system matures the tested/measured transaction costs become more and more accurate.
However, the expected behavior and expected use (especially for new systems) are often a best guess based on
perceived market need, similar systems, past experience or even wishful thinking. The benefit of the model
is that it allows focusing on the implications of such guesses and encourages Product Line Management /
Marketing to come up with better guesses and estimates.

A performance team should be able to provide best estimates of system performance at any stage of
development. The accuracy of such estimates should improve as the system matures.

The model allows order-of-magnitude estimates very early in the design cycle. One can usually start getting
initial sensitivity/assessment results even before a single line of code is tested, based on expected behavior
and educated guesses at transaction costs. Often these educated “guesses” of transaction costs can be
based on previous experience, past system analysis or simple baseline testing of similar functionality. This
early analysis is valuable for detecting “big” showstoppers and allows early adjustment of the architecture. As
the system is being developed and measurements are performed the model becomes progressively more
accurate.

The model is usually accurate to within 10% by the end of the development cycle for a new product, but it does
require final calibration and verification with a “Big System Test”, where resource use is measured under the
load of multiple transactions at a time.

For mature systems where this methodology has been applied over a number of years, the accuracy is such
that in some projects the model was productized as a provisioning and planning tool.

Improving communications and addressing uncertainties
Use of the model allows unambiguous and clear communication of test results. The results per individual
transaction are communicated in terms of costs (transaction cost matrices can be exchanged between
groups and tracked on a load-by-load and release-to-release basis). The results in terms of the overall behavior
impact are communicated pictorially as operating envelopes that show the performance and capacity
implications visually and explicitly.

For example, the following two charts show the operating envelope of the system with and without a certain
code optimization improving the processing of transactions SEC1 and SEC2. Since there is a cost associated
with deployment of the optimization/patch, it is important to demonstrate the impact of the change. The
charts clearly show which behaviors (deployments) will be affected and where the change needs to be
deployed. Deployments that have fewer than 15000 records and do not use SEC2 transactions will not
benefit from the patch.

[Figure: operating envelopes – NREC vs. SEC1 and SEC2 transaction rates, without the code change (left) and with it (right).]

The model can also be used to compute the operating envelope with respect to different h/w. For example,
the following two charts show the different ranges of system behavior that can be supported on different h/w.
From the two charts below it is clear that the V890 would be a better choice for handling high rates of TX1
transactions (right), but the T2 would be better at handling the mix of TX1 and TX2 transactions (left), specifically
with a high number of records (the TX2 implementation was able to take advantage of the multithreaded
architecture of the T2, whereas TX1 had a single-threaded implementation that benefited from the higher clock
speed of the V890).

[Figure: operating envelopes – NRECS vs. TX1/sec and TX2/sec on a Sun T2 (1000 MHz/core, 4 cores, 32 strands; left) and a Sun V890 (1800 MHz/core, 8 cores; right).]

The use of the cost based model as the basis for performance analysis activities throughout the project
addresses and mitigates the three uncertainties:

            Behavior. Even though the expected system behavior at deployment time may not be known,
            the model allows forecasting for ANY behavior. The ability to compute the operating envelope
            identifies and quantifies the range of acceptable behaviors/scenarios, which allows intelligent
            decisions on deployment, capacity limitations and trade-offs.

            Code. The transaction based cost model identifies costs with respect to customer-visible
            behavior (work/activity) rather than with a specific code module, process, etc. There is no need to
            access, instrument, rebuild or recompile 3rd-party code. All of the code is treated as a black box
            (although, depending on the granularity of the metrics collected, one can always map transactional
            resource utilization to a specific process or module).

            H/W. Using of cost model allows mapping of results from one h/w to another. It might be necessary
            to perform a small set of pilot test cases on different h/w platforms to identify the mapping; the
            model can be used to compute operating envelope for different target h/w throughout the
            development cycle. The h/w limitations are identified early that allows to make changes if
            necessary.
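To make the "what-if" forecasting and the h/w mapping concrete, the following is a minimal sketch of a
linear transaction cost model of the kind described above. All transaction names, costs and platform figures
are hypothetical, except the 307 MHz/request cost, which is taken from the measurement example earlier
in the paper; CPU costs are expressed in MHz per transaction/sec so that they can be mapped from one
processor to another:

    # Minimal sketch of a linear, transaction based cost model (Python).
    # All names and numbers are hypothetical except the 307 MHz/request
    # cost, taken from the measurement example earlier in the paper.

    # Transaction cost matrix: MHz per 1 transaction/sec, plus the measured
    # maximum sustainable rate (MaxRate) for each transaction.
    COSTS = {
        "event_processing": {"mhz_per_tps": 307.0, "max_rate": 10.0},
        "db_insert":        {"mhz_per_tps":  90.0, "max_rate": 60.0},
    }

    BACKGROUND_MHZ = 200.0  # the constant C: background utilization with no load

    def forecast_cpu(behavior, hw_total_mhz):
        """Forecast CPU% for a behavior (transaction -> rate) on a target h/w."""
        used_mhz = BACKGROUND_MHZ
        violations = []
        for tx, rate in behavior.items():
            cost = COSTS[tx]
            used_mhz += cost["mhz_per_tps"] * rate   # linear cost assumption
            if rate > cost["max_rate"]:
                violations.append(tx)                # outside the envelope
        return 100.0 * used_mhz / hw_total_mhz, violations

    # "What-if": the same behavior forecast on two hypothetical platforms.
    behavior = {"event_processing": 6.0, "db_insert": 20.0}
    for name, mhz in [("4 cores x 1600 MHz", 4 * 1600),
                      ("8 cores x 1800 MHz", 8 * 1800)]:
        cpu, bad = forecast_cpu(behavior, mhz)
        print(f"{name}: forecast CPU = {cpu:.1f}%, MaxRate violations: {bad}")

The same structure extends to other resources (memory, disk) by adding per-transaction cost entries; an
operating envelope is obtained by sweeping the behavior rates and recording where a resource limit or a
MaxRate is first exceeded.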

Cost Reduction
There is an inherent cost reduction in using the model to forecast the comprehensive operating envelope of
the system vs. brute force multi-dimensional testing. Suppose the system has N transactions, where
transaction n can have Rn distinct rates expected across deployment scenarios. Suppose this system also
has M parameters, with parameter m taking Pm distinct values. Then the total number of “brute force” tests
necessary to determine the operating envelope for the system is equivalent to the number of points within
this (N+M)-dimensional space, or the product of the numbers of distinct rates per transaction and distinct
values per parameter:

                                   BruteForceTests = R1 * R2 * R3 … * Rn * P1 * P2 * P3 … * Pm

On the other hand, if we use the model to forecast such an envelope, we only need to determine the cost
of each transaction individually and forecast the cost of any combination. Each transaction individually
depends on only a (small) subset of all the parameters, so the number of tests for transaction X is Rx * Ptx,
where Ptx is the product of the numbers of distinct values of the parameters that transaction X depends on.
Therefore the total number of tests is the sum, over all transactions, of the number of distinct rates times
the relevant parameter combinations:

                                   ModelTests = R1*Pt1 + R2*Pt2 + R3*Pt3 + … + Rn*Ptn
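As a worked illustration of the two formulas, the following sketch computes both test counts for a
hypothetical system with four transactions and three parameters (all counts are invented; only the
formulas come from the text):

    from math import prod

    rates_per_tx = [5, 5, 4, 6]        # Rn: distinct rates per transaction
    values_per_param = [3, 4, 2]       # Pm: distinct values per parameter

    # Indices of the parameters each transaction actually depends on; each
    # transaction depends on only a small subset, as argued above.
    params_of_tx = [[0], [0, 1], [2], [1]]

    brute_force = prod(rates_per_tx) * prod(values_per_param)   # 600 * 24
    model_tests = sum(r * prod(values_per_param[i] for i in deps)
                      for r, deps in zip(rates_per_tx, params_of_tx))

    print(f"BruteForceTests = {brute_force}")   # 14400
    print(f"ModelTests      = {model_tests}")   # 15 + 60 + 8 + 24 = 107

Even at this toy scale the model needs two orders of magnitude fewer tests; the gap grows multiplicatively
with every additional transaction or parameter.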

The reduction in the number of tests represents a significant direct cost reduction. There are additional cost
benefits related to the use of the cost based model:

         Test simplicity – individual transaction tests are much simpler to run and automate than the big
         system test required to test multiple transactions in combination.

         Hardware cost reduction – there is no need to run all tests on the target hardware all the time.
         Once the mapping is established, the results obtained on one hardware platform can be mapped to
         another. (The mapping needs to be re-tested/re-established periodically.)

There are, of course, costs associated with creating and maintaining such a model (and validating it with
BST-like tests), but in many cases these are far outweighed by the cost reductions associated with testing.

There are situations when direct testing makes more sense (and is more efficient) than modeling. If the
expected rate for each transaction and the specific value of each parameter are known in advance then,
according to the BruteForceTests formula above, only one test is required. However, this is rare and
signifies a system with no “uncertainties” as described here.

Practical experience
Using transaction based cost analysis is not new. A similar approach, Microsoft’s Transaction Cost Analysis,
is described in [1] and is used for evaluating web service costs. Transaction Aware Performance Modeling
[2] focuses on response time per individual transaction. We used this approach to focus on the capacity
profile with respect to multiple resource types (CPU, memory, threading, disk space, etc.) under a set of
well-defined constraints.

We have applied this approach in a number of projects in a telecommunications environment, on a variety
of systems including:
        Call Centre Server (WinNT platform, C++)
        Optical Switch (VxWorks, C, Java)
        Network Management System (Solaris, Mixed, 3rd party, Java)
        Management Platform Application (Solaris, Mixed, 3rd party, Java)

Each project presented different challenges (memory modeling in VxWorks, Java Heap sizing, Threading in
Solaris, etc.). Application of this method has resulted in a number of “indirect” benefits (apart from the direct
benefit of addressing uncertainties and enabling forecasting):

         Improved communication across groups – everyone speaks the same language (well-defined
         transactions/costs). The “framework” or “platform” groups were able to describe the cost of the
         services they provide to the rest of the system by publishing the costs of their key transactions. The
         users of these services were able to use these costs directly for their own estimations.

         Change of verification focus – the verification activity focuses on validation and calibration of the
         model, rather than on some specific scenario. This results in a more reliable and trustworthy model.

         “De-politicization” of performance engineering. We found that the visual representation of the
         complex information, as well as the instantaneous “what-if” capability, greatly reduced the negative
         political aspects of performance engineering. The charts and the underlying data were open and
         traceable to individual costs/tests, making all the numbers and trade-offs clear and quantifiable.

         Better requirements – requirements become quantifiable, and PLM/customers can see the value in
         quantifying behavior. The PLM and customers were also much more motivated to spend extra effort
         determining realistic requirements.




         Documentation reduction – engineering guides are replaced by the model; the performance related
         design documentation focuses on design/architecture improvements.

         Early problem detection – most performance problems are discovered before code “freeze” and the
         beginning of the official verification cycle.

         Cost Reduction – less need for BST-type tests/equipment, less effort to run PV, reduced
         “over-engineering”. The transaction costs can be measured on one h/w platform and mapped to
         another. The transaction costs are determined from automated tests that can often be executed on
         non-dedicated h/w by designers (although final tests must be done in a controlled environment).

         End-user capacity planning tools – the model can be directly used to develop end-user capacity
         planning and performance analysis tools.



Summary

Cost Based Modeling effectively addresses the key deployment uncertainties of performance evaluation in
modern systems by providing a quick and inexpensive method to estimate the performance impact of
changing system behavior and hardware platform. It is a “black box” based approach that does not require
access to “hidden” 3rd party code. In addition, cost based modeling provides the ability to obtain
performance and capacity estimates for key product functionality throughout the entire development cycle,
often even before the first line of code is written. It is conceptually simple and inexpensive to implement,
requiring no large-scale equipment. The approach improves the communication of performance/capacity
information in large projects and facilitates iterative feedback to project management and design groups.

Acknowledgements:

I would like to thank my former colleague, Robert Lieberman, for his advice and comments. Most of the
results described here are based on various performance projects at Nortel.

References:

[1] Using Transaction Cost Analysis for Site Capacity Planning, Microsoft,
http://technet.microsoft.com/en-us/commerceserver/bb608757.aspx

[2] Securing System Performance with Transaction Aware Performance Modelling, Michael Kok (Parts 1, 2, 3
in CMG MeasureIt), http://www.cmg.org/measureit/issues/mit61/m_61_17.html



About the author:

Eugene Margulis is a software performance lead at Telus Health Solutions. He has worked on
capacity/performance analysis and evaluation on multiple projects at Nortel over the last 15 years. During
this time he was involved in the design, architecture and QA of telecommunication systems – from hard
real-time call processing to network management.



