1. Using Load Tests to Automatically Compare the Subsystems of a Large Enterprise System
Haroon Malik, Bram Adams & Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
Queen’s University, Kingston, Canada
Parminder Flora & Gilbert Hamann
Performance Engineering
Research In Motion, Waterloo, Canada
2. Today's large-scale systems (LSS) are composed of many underlying subsystems.
These LSS grow rapidly in size to handle growing traffic, complex services and business-critical functionality.
Performance analysts face the challenge of dealing with performance bugs, as processing is spread across thousands of subsystems and millions of hardware nodes.
11. METHODOLOGY
[Figure: word-cloud panels labeled PC-1, PC-2 and PC-3 illustrating the large volume of raw counter data that our methodology condenses into a compact signature: lots of data → our methodology → signature.]
12. METHODOLOGY
[Figure: the same data-to-signature view, with the panels now labeled by subsystem: Database, Mail and Web.]
16. MEASURING THE PERFORMANCE
[Figure: baseline and Test-1 traces over intervals t1–t6, with the deviations predicted by the methodology (P) and the deviations that actually occurred (O) marked; PO = P ∩ O.]
Precision = |P ∩ O| / |P| = 1/4 = 0.25
Recall = |P ∩ O| / |O| = 1/3 = 0.33
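A minimal sketch of this precision/recall computation (the interval labels and set contents below are hypothetical, chosen only to reproduce the 0.25 and 0.33 figures on the slide):

# Deviations predicted by the methodology (P) and deviations that
# actually occurred (O); the t1..t6 labels are hypothetical.
predicted = {"t1", "t2", "t3", "t4"}
occurred = {"t1", "t5", "t6"}

true_positives = predicted & occurred               # P ∩ O
precision = len(true_positives) / len(predicted)    # 1/4 = 0.25
recall = len(true_positives) / len(occurred)        # 1/3 = 0.33
print(f"precision={precision:.2f}, recall={recall:.2f}")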
17. RESEARCH QUESTIONS
Can our methodology identify the subsystems of an LSS that have performance deviations relative to prior tests?
Can we save time by avoiding unnecessary load-test completion, through early identification of performance deviations across the different subsystems of an LSS?
How is the performance of our methodology affected by different sampling intervals?
18. Can our methodology identify the subsystems of an LSS that have performance deviations relative to prior tests?
RQ-1
19. APPROACH
4 load tests, 8 hours each
700 performance counters each
Monitoring interval: 15 sec (1922 instances)
Baseline test: 85% data reduction
Test-1: reproduction of the baseline test
Test-2: synthetic fault injection via mutation
Test-3: increased workload intensity (8×)
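The deck does not spell out how the 85% data reduction is achieved, so the following is only a hedged sketch of one plausible approach: standardizing the counter matrix and keeping the principal components that explain most of its variance (scikit-learn PCA). The matrix shape mirrors the slide's numbers, but the data are random placeholders and the 0.95 variance cutoff is an assumption, not the study's setting.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical counter matrix: 1922 observation instances x 700 counters,
# mirroring the numbers on the slide (values are random placeholders).
rng = np.random.default_rng(0)
counters = rng.normal(size=(1922, 700))

# Standardize so high-variance counters do not dominate, then keep the
# principal components explaining most of the variance. The 0.95 cutoff
# is an illustrative assumption.
scaled = StandardScaler().fit_transform(counters)
pca = PCA(n_components=0.95)
signature = pca.fit_transform(scaled)

reduction = 1 - signature.shape[1] / counters.shape[1]
print(f"kept {signature.shape[1]} components, reduction = {reduction:.0%}")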
21. FINDINGS
Our methodology helps performance analysts identify subsystems with performance deviations relative to prior tests.

Subsystem     | Test-A | Synthesized | 8× load
Database      | 0.997  | 0.732       | 0.826
Web Server-A  | 1.000  | 0.701       | 0.795
Web Server-B  | 1.000  | 0.700       | 0.790
Application   | 1.000  | 0.623       | 0.681
22. Can we save time by avoiding unnecessary load-test completion, through early identification of performance deviations across the different subsystems of an LSS?
RQ-2
25. APPROACH
[Figure: % CPU utilization over time (min) for the baseline and the stressed load test; the baseline stays around 38% while the CPU stress spikes utilization to about 88%.]
Two load tests, 2 hours each
Monitoring rate: 15 sec
CPU stress on the database server at the 60th minute, for 15 sec
Test comparison: removed a 12% sample (10 min; 6% + 6%)
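A minimal sketch of the early-termination idea behind this approach, assuming a simple window-by-window comparison of a counter against the baseline; the 20% tolerance, window size and synthetic traces are assumptions for illustration, not the paper's detection technique.

import numpy as np

def first_deviating_window(baseline, test, window=40, tolerance=0.20):
    """Compare a running load test against the baseline window by window
    (40 observations at a 15-sec monitoring rate = one 10-minute window)
    and return the index of the first window in which any observation
    deviates from the baseline by more than the tolerance, else None."""
    n_windows = min(len(baseline), len(test)) // window
    for i in range(n_windows):
        b = baseline[i * window:(i + 1) * window]
        t = test[i * window:(i + 1) * window]
        if np.any(np.abs(t - b) > tolerance * b):
            return i  # deviation found: the rest of the test could be skipped
    return None

# Synthetic 2-hour CPU-utilization traces at 15-sec intervals (480 points);
# the injected stress at the 60th minute pushes utilization from ~38% to ~88%.
rng = np.random.default_rng(1)
baseline = rng.normal(38, 2, 480)
test = baseline.copy()
test[240] = 88  # one 15-sec stress sample on the database server

print(first_deviating_window(baseline, test))  # -> 6 (the 60-70 min window)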
33. How is the performance of our methodology affected by different sampling intervals?
RQ-3
34. APPROACH
Two load tests, 2 hours each
Monitoring rate: 15 sec
Fault: stopped the load generators 10 times, for 15 sec each
Measured the performance of the methodology at different sampling intervals:
30 min – 4 samples
15 min – 8 samples
[Figure: baseline and Load Test-1 traces partitioned into 30-minute and then 15-minute comparison windows.]
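To make the sampling-interval setup concrete, here is a short sketch of how a 2-hour trace at a 15-sec monitoring rate splits into samples of different lengths; the resulting counts match the observation and sample columns of the findings table that follows.

# Number of observations in a 2-hour test at a 15-sec monitoring rate.
OBS_PER_MIN = 4                         # 60 sec / 15 sec
TEST_MINUTES = 120
total_obs = TEST_MINUTES * OBS_PER_MIN  # 480 observations

for interval_min in (30, 15, 10, 5):
    obs_per_sample = interval_min * OBS_PER_MIN
    n_samples = total_obs // obs_per_sample
    print(f"{interval_min:>2}-min interval: {obs_per_sample:>3} obs/sample, "
          f"{n_samples:>2} samples")
# 30-min: 120 obs, 4 samples; 15-min: 60 obs, 8 samples;
# 10-min: 40 obs, 12 samples; 5-min: 20 obs, 24 samples.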
37. FINDINGS
Small samples (short intervals) yield high RECALL; large samples (long intervals) yield high PRECISION.

Interval (min) | Obs/sample | Samples | Database R/P | Web Server-1 R/P | Web Server-2 R/P | Application R/P | System Average R/P
30             | 120        | 4       | 0.50 / 1.00  | 0.50 / 1.00      | 0.30 / 1.00      | 0.25 / 1.00     | 0.325 / 1.000
15             | 60         | 8       | 0.62 / 1.00  | 0.62 / 1.00      | 0.62 / 1.00      | 0.50 / 1.00     | 0.590 / 1.000
10             | 40         | 12      | 1.00 / 0.90  | 1.00 / 0.90      | 1.00 / 0.90      | 0.90 / 0.69     | 0.975 / 0.847
5              | 20         | 24      | 1.00 / 0.70  | 1.00 / 0.70      | 1.00 / 0.80      | 1.00 / 0.66     | 1.000 / 0.715
All            | -          | -       | 0.78 / 0.90  | 0.78 / 0.90      | 0.73 / 0.92      | 0.66 / 0.83     | 0.738 / 0.890

The methodology performs best at a 10-minute sampling interval, with a good balance of recall and precision.
Editor's Notes
Today's LSS, such as Google, eBay, Facebook and Amazon, are composed of many underlying components and subsystems.
These LSS grow rapidly in size to handle growing traffic, complex services and business-critical functionality.
This exponential growth increases each component's complexity and, hence, the complexity of the integration between the geographically distributed components.
- The performance of LSS is periodically measured to satisfy the high business demands on system quality, availability and responsiveness.
- Load testing is an important weapon in LSS development to uncover functional and performance problems of a system under load.
- The performance of the LSS is calibrated using load tests before problems become field or post-deployment issues.
- Performance problems include an application not responding fast enough, crashing or hanging under heavy load, or not meeting the desired service level agreements (SLAs).
Environment Setup:
The first and most important phase of load testing, since the most common load-test failures occur due to an improper environment setup.
The environment setup includes installing the applications and the load-testing tools on different machines, possibly running different operating systems.
The load generators, which emulate the users' interaction with the system, need to be carefully configured to match the real workload in the field.
Load Test Execution:
This involves starting the components of the system under test, i.e., starting the required services, hardware resources and tools (load generators and performance monitors).
Performance counters are recorded in this step too.
Load Test Analysis:
This step involves comparing the results of a load test against the results of other load tests, or against predefined thresholds used as baselines.
Unlike functional and unit testing, which yield a pass or fail classification for each test, load testing requires additional quantitative metrics such as response time, throughput and hardware resource utilization to summarize results.
The performance analyst selects a few of the important performance counters among the thousands collected. Based on experience and domain knowledge, the analyst manually compares the selected performance counters with those of past runs to look for evidence of performance deviations, for example using plots and correlation tests.
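As a hedged illustration of this manual practice (not our methodology), one selected counter can be compared against a past run with a correlation test; the counter values and the 0.8 cutoff below are made-up examples.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical values of one key counter (e.g. CPU utilization) sampled at
# the same points in the current run and in a past baseline run.
baseline_run = np.array([35.0, 37.0, 40.0, 42.0, 41.0, 39.0, 38.0, 36.0])
current_run  = np.array([36.0, 38.0, 41.0, 55.0, 60.0, 58.0, 39.0, 37.0])

r, p_value = pearsonr(baseline_run, current_run)

# A weak correlation is treated as evidence of a possible performance
# deviation worth investigating; the 0.8 cutoff is an arbitrary example.
if r < 0.8:
    print(f"possible deviation: r={r:.2f} (p={p_value:.3f})")
else:
    print(f"counter tracks the baseline: r={r:.2f}")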
Report Generation:
Includes filing the performance deviations, if any are found, based on the personal judgment of the analyst. The results produced are usually verified by an experienced analyst,
based on the extent of the performance deviation and its relevance to the team responsible for the affected subsystem (database, application, web server, etc.).
- Unfortunately, the current practice for analyzing load tests is costly, time consuming and error prone.
This is because load-test analysis practices have not kept pace with the rapid growth in size and complexity of large enterprise systems.
In practice, the dominant tools and techniques for analyzing large distributed systems have remained unchanged for over twenty years.
Most research has focused on the automatic generation of load-test suites rather than on load-test analysis.
- There are many challenges and limitations associated with the current practice of load-test analysis that remain unsolved.
- Load tests last from a couple of hours to several days.
They generate performance logs that can be terabytes in size.
Even logging all counters on a typical machine at 1 Hz generates about 8.6 million values in a single week.
A cluster of 12 machines over a week yields roughly 13 TB of performance-counter data, assuming a 64-bit representation for each counter value.
Analyzing such large counter logs is still a big challenge in load testing.
Performance analysts of LSS have only limited time to run and complete diagnostics on performance-counter logs and to make the necessary configuration changes.
Load testing is usually the last step in an already tight and usually delayed release schedule. Hence, managers are always eager to reduce the time allocated for performance testing.
The process is error prone because of the manual steps involved in analyzing performance-counter data in current practice.
It is impossible for an analyst to skim through such a large volume of log data; instead, analysts rely on a few key performance counters, known to them from past practice, performance experts and domain trends, as 'rules of thumb'.
With large-scale systems that continuously evolve by adding new functionality, applying the same rules of thumb can be misleading about performance issues.
Due to these challenges, we believe the current practice of load-test analysis is neither effective nor sufficient to uncover performance deviations accurately and within the limited time available.
1) The performance logs obtained from a load test do not suffice for direct analysis by our methodology. These logs need to be prepared to make them suitable for the statistical techniques our methodology employs. This step takes care of data sanitization (missing and incomplete counter variables) and data pre-treatment, such as standardization and scaling, to remove the bias affecting variance-dependent techniques.
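A minimal sketch of this preparation step, assuming the counter log is loaded into a pandas DataFrame; the forward-fill imputation and z-score standardization shown here are common choices, not necessarily the exact pre-treatment used by our methodology.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_counters(df: pd.DataFrame) -> pd.DataFrame:
    """Sanitize and pre-treat a raw performance-counter log so that
    variance-dependent statistical techniques are not biased."""
    # Sanitization: drop counters that were never collected and fill the
    # occasional missing observation by carrying the last value forward.
    df = df.dropna(axis="columns", how="all")
    df = df.ffill().bfill()

    # Pre-treatment: standardize each counter to zero mean and unit
    # variance so counters on large scales do not dominate the analysis.
    scaled = StandardScaler().fit_transform(df)
    return pd.DataFrame(scaled, index=df.index, columns=df.columns)

# Hypothetical raw log: counters with gaps and very different scales.
raw = pd.DataFrame({
    "cpu_pct": [35.0, np.nan, 40.0, 42.0],
    "disk_reads_per_sec": [1200.0, 1250.0, np.nan, 1400.0],
    "never_collected": [np.nan] * 4,
})
print(prepare_counters(raw))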