Automated Discovery of Performance Regressions in Enterprise Applications
1. Automated Discovery of Performance Regressions in Enterprise Applications
King Chun (Derek) Foo
Supervisors: Dr. Jenny Zou and Dr. Ahmed E. Hassan
Department of Electrical and Computer Engineering
2. Performance Regression
• Software changes over time
  – Bug fixes
  – Feature enhancements
  – Execution environments
• Performance regressions describe situations where the performance degrades compared to previous releases
3. Example of Performance Regression
[Figure: a load generator drives an application server backed by a data store, shown before and after applying SP 1]
• SP 1 introduces a new default policy to throttle the “# of RPC/min”
• Significant increase in job queue size and response time
• CPU utilization decreases
• Certification of a 3rd-party component
6. 1. Too Late in the Development Lifecycle
• Design changes are not evaluated until after the code is written
  – Evaluation happens at the last stage of an already delayed schedule
7. 2. Lots of Data
• Industrial case studies have more than 2,000 counters
• Time consuming to analyze
• Hard to compare more than 2 tests at once
8. 3. No Documented Behavior
• Analysts have different perceptions of performance regressions
• Analysis may be influenced by
  – Analyst’s knowledge
  – Deadlines
9. 4. Heterogeneous Environments
• Multiple labs to parallelize test executions
  – Hardware and software may differ
  – Tests from one lab may not be used to analyze tests from another lab
12. Evaluate Design Changes through Performance Modeling
• Analytical models are often not suitable for all stakeholders
  – Abstract mathematical and statistical concepts
• Simulation models can be implemented with the support of existing frameworks
  – Visualization
  – No systematic approach to construct models that can be used by different stakeholders
13. Layered Simulation Model
[Figure: the three layers of the simulation model and the questions each layer answers]
• World view layer – Can the current infrastructure support the projected growth of users?
• Component layer – Investigate the threading model
• Physical layer – Hardware resource utilization
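This layering maps naturally onto a discrete-event simulation. Below is a minimal sketch in Python using SimPy; the framework choice and all names (e.g. handle_request, THREAD_POOL_SIZE) are illustrative assumptions, not the implementation from the thesis. The physical layer is represented by CPU capacity, the component layer by a bounded thread pool, and the world view layer by the arrival process and the end-to-end response time.

```python
# Illustrative sketch only: a three-layer simulation in SimPy (assumed framework).
import random
import simpy

SIM_TIME = 600           # simulated seconds
ARRIVAL_RATE = 5.0       # world view layer: mean user requests per second
SERVICE_TIME = 0.15      # physical layer: mean CPU time per request (seconds)
THREAD_POOL_SIZE = 4     # component layer: the threading model under study

response_times = []

def handle_request(env, threads, cpu):
    """One request flowing through the component and physical layers."""
    start = env.now
    with threads.request() as worker:          # component layer: acquire a thread
        yield worker
        with cpu.request() as core:            # physical layer: acquire a CPU core
            yield core
            yield env.timeout(random.expovariate(1.0 / SERVICE_TIME))
    response_times.append(env.now - start)     # world view layer: end-to-end time

def workload(env, threads, cpu):
    """World view layer: users arrive with exponential inter-arrival times."""
    while True:
        yield env.timeout(random.expovariate(ARRIVAL_RATE))
        env.process(handle_request(env, threads, cpu))

env = simpy.Environment()
thread_pool = simpy.Resource(env, capacity=THREAD_POOL_SIZE)
cpu_cores = simpy.Resource(env, capacity=2)
env.process(workload(env, thread_pool, cpu_cores))
env.run(until=SIM_TIME)
print(f"mean response time: {sum(response_times) / len(response_times):.3f} s")
```

Varying THREAD_POOL_SIZE and the CPU capacity answers the component-layer and physical-layer questions, while the arrival rate and the reported response time correspond to the world-view question about projected user growth.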
14. Case Studies
• We conducted two case studies
  – RSS Cloud
    • Show the process of constructing the model
    • Derive the bottleneck of the application
  – Performance monitor for ULS systems
    • Evaluate whether or not an organization should re-architect the software
• Our model can be used to extract important information and aid in decision making
16. Challenges with Analyzing Performance Tests
• Lots of data
  – Industrial case studies have more than 2,000 counters
  – Time consuming to analyze
  – Hard to compare more than 2 tests at once
• No documented behavior
  – Analyst’s subjectivity
17. Performance Signatures
Intuition: Counter correlations are the same across tests
[Figure: performance signatures are extracted from the repository of prior tests]
Example signature (counter – expected level):
• Arrival Rate – Medium
• CPU Utilization – Medium
• Throughput – Medium
• RAM Utilization – Medium
• Job Queue Size – Low
• …
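The signature above can be read as “in a passing test, these counters normally sit at these levels together.” The sketch below is a simplified stand-in (plain Python; the thesis mines correlations between counters, whereas this sketch just learns the most common level per counter from prior tests) showing how such a signature could be derived and used to flag a new test.

```python
# Simplified sketch: derive a performance signature from prior tests
# and flag counters in a new test that deviate from it.
from collections import Counter

# Discretized counter levels observed in prior tests (assumed input format).
prior_tests = [
    {"Arrival Rate": "Medium", "CPU Utilization": "Medium",
     "Throughput": "Medium", "RAM Utilization": "Medium", "Job Queue Size": "Low"},
    {"Arrival Rate": "Medium", "CPU Utilization": "Medium",
     "Throughput": "Medium", "RAM Utilization": "Medium", "Job Queue Size": "Low"},
    {"Arrival Rate": "Medium", "CPU Utilization": "High",
     "Throughput": "Medium", "RAM Utilization": "Medium", "Job Queue Size": "Low"},
]

def build_signature(tests):
    """Expected level per counter = most frequent level across prior tests."""
    levels = {}
    for test in tests:
        for counter, level in test.items():
            levels.setdefault(counter, Counter())[level] += 1
    return {counter: counts.most_common(1)[0][0] for counter, counts in levels.items()}

def flag_regressions(signature, new_test):
    """Counters whose level in the new test violates the signature."""
    return [c for c, expected in signature.items() if new_test.get(c) != expected]

signature = build_signature(prior_tests)
new_test = {"Arrival Rate": "Medium", "CPU Utilization": "Low",
            "Throughput": "Low", "RAM Utilization": "Medium", "Job Queue Size": "High"}
print(flag_regressions(signature, new_test))
# ['CPU Utilization', 'Throughput', 'Job Queue Size']
```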
19. Case Studies
• 2 Open Source Applications
  – Dell DVD Store and JPetStore
  – Manually injected bugs to simulate performance regressions
• Enterprise Application
  – Compare counters flagged by our technique against the analysts’ reports
20. Case Study Results
• Open source applications:
  – Precision: 75% - 100%
  – Recall: 52% - 67%
• Enterprise application:
  – Precision: 93%
  – Recall: 100% (relative to the organization’s report)
  – Discovered new regressions that were not included in the analysis reports
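For reference, the precision and recall reported above follow the standard definitions, where TP, FP, and FN denote counters correctly flagged, incorrectly flagged, and missed, respectively:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]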
22. Heterogeneous Environments
• Different hardware and software configurations
• Performance tests conducted in different labs exhibit different behaviors
• Must distinguish performance regressions from performance differences caused by heterogeneous environments
23. Ensemble-based Approach
• Build a collection of models from the repository (see the sketch below)
  – Each model specializes in detecting performance regressions in a specific environment
• Reduces the risk of relying on a single model, which may contain conflicting behaviors
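A minimal sketch of the ensemble idea in plain Python (assumed structure, not the exact models from the thesis): one detector is trained per lab/environment in the repository, and their verdicts on a new test are combined by voting (bagging-style); a stacking variant would instead feed the per-model verdicts into a second-level combiner.

```python
# Illustrative sketch: one signature-based detector per environment,
# combined by majority voting over the counters each model flags.
from collections import Counter

def build_signature(tests):
    """Expected level per counter, learned from one environment's prior tests."""
    levels = {}
    for test in tests:
        for counter, level in test.items():
            levels.setdefault(counter, Counter())[level] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in levels.items()}

def flag(signature, new_test):
    return {c for c, expected in signature.items() if new_test.get(c) != expected}

def ensemble_flag(tests_by_environment, new_test, min_votes=2):
    """Counters flagged by at least `min_votes` of the per-environment models."""
    votes = Counter()
    for env_tests in tests_by_environment.values():
        votes.update(flag(build_signature(env_tests), new_test))
    return [c for c, n in votes.items() if n >= min_votes]

# Hypothetical repository: prior tests grouped by the lab that ran them.
repository = {
    "lab_A": [{"CPU Utilization": "Medium", "Throughput": "Medium"}],
    "lab_B": [{"CPU Utilization": "High", "Throughput": "Medium"}],
    "lab_C": [{"CPU Utilization": "Medium", "Throughput": "Medium"}],
}
new_test = {"CPU Utilization": "Medium", "Throughput": "Low"}
print(ensemble_flag(repository, new_test))   # ['Throughput']
```

Because each model only ever sees tests from one environment, no single model has to reconcile the conflicting counter behaviors that different hardware and software configurations produce.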
24. Case Studies
• 2 Open Source Applications
  – Dell DVD Store and JPetStore
  – Manually injected bugs and varied hardware/software resources
• Enterprise Application
  – Use existing tests conducted in different labs
25. Case Study Results
• Original approach
  – Precision: 80%
  – Recall: 50% (3-level discretization) - 60% (EW)
• Ensemble-based approach:
  – Precision: 80% (Bagging) - 100% (Stacking)
  – Recall: 80%
• The ensemble-based approach with stacking produces the best results in our experiments
26. Major Contributions
• An approach to build layered simulation models to evaluate design changes early
• An automated approach to detect performance regressions, allowing analysts to analyze large amounts of performance data while limiting subjectivity
• An ensemble-based approach to deal with performance tests conducted in heterogeneous environments, which is common in practice
28. Publication
K. C. Foo, Z. M. Jiang, B. Adams, A. E. Hassan, Y. Zou, P. Flora, "Mining Performance Regression Testing Repositories for Automated Performance Analysis," Proc. Int’l Conf. on Quality Softw. (QSIC), 2010.
29. Future Work
• Online analysis of performance tests
• Compacting the performance regression report
• Maintaining the training data for our automated analysis approach
• Using performance signatures to build performance models
32.
QN model | Types of applications suitable to be modeled
Open QN | Applications with jobs arriving externally; these jobs will eventually depart from the applications.
Closed QN | Applications with a fixed number of jobs circulating within the applications.
Mixed QN | Applications with jobs that arrive externally and jobs that circulate within the applications.
SQN-HQN |
SRN | Distributed applications with synchronous communication.
LQN | Distributed applications with synchronous or asynchronous communication.
Table 3-1: Summary of approaches based on QN models
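As a reminder of what an open QN predicts (a standard single-queue M/M/1 example, not a result from the thesis), the arrival rate λ and service rate μ determine utilization and mean response time:

\[ \rho = \frac{\lambda}{\mu}, \qquad R = \frac{1}{\mu - \lambda} \quad (\lambda < \mu) \]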
34.
Stakeholder | Performance Concerns
End user | Overall system performance for various deployment scenarios
Programmer | Organization and performance of system modules
System Engineer | Hardware resource utilization of the running application
System Integrator | Performance of each high-level component in the application
Table 4-1: Performance concerns of stakeholders
35.
Stakeholder | Layer in Our Simulation Model | 4+1 View Model
Architects, Managers, End users, Sales Representatives | World View Layer | Logical view
Programmers, System Integrators | Component Layer | Development view, Process view
System Engineers | Physical Layer | Physical view
All Stakeholders | Scenario | Scenario
Table 4-2: Mapping of our simulation models to the 4+1 view model
43.
Layer | Performance Data
World view layer | Response time, Transmission cost
Component layer | Thread utilization
Physical layer | CPU and RAM utilization
Table 4-5: Performance data collected per layer

CPU Util. | Low | OK | High | Very High
Range (%) | < 30 | 30 – 60 | 60 – 75 | > 75
Discretization | 0.25 | 0.5 | 0.75 | 1
Table 4-6: Categorization of CPU utilization

RAM Util. | Low | OK | High | Very High
Range (%) | < 25 | 25 – 50 | 50 – 60 | > 60
Discretization | 0.25 | 0.5 | 0.75 | 1
Table 4-7: Categorization of RAM utilization
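A small helper illustrating how the categorization in Tables 4-6 and 4-7 could be applied; the thresholds are copied from the tables, while the function name and structure are illustrative assumptions.

```python
# Map a utilization percentage to the discretization level used in the model
# (thresholds from Tables 4-6 and 4-7; function name is illustrative).
def discretize_utilization(value_pct, bounds):
    """Return 0.25 / 0.5 / 0.75 / 1 for Low / OK / High / Very High."""
    low, ok, high = bounds
    if value_pct < low:
        return 0.25          # Low
    if value_pct <= ok:
        return 0.5           # OK
    if value_pct <= high:
        return 0.75          # High
    return 1.0               # Very High

CPU_BOUNDS = (30, 60, 75)    # Table 4-6
RAM_BOUNDS = (25, 50, 60)    # Table 4-7

print(discretize_utilization(72, CPU_BOUNDS))   # 0.75 (High)
print(discretize_utilization(18, RAM_BOUNDS))   # 0.25 (Low)
```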
44.
Data Collection Frequency (Hz) | Layer | Data Broadcast Period (s) | Response time (s) | Cost ($) | Central Monitor Thread Util. (%) | Central Monitor CPU Util. (%) | Central Monitor RAM Util. (%)
0.1 | World View | 1 | 6.8 | 5.0 | 1.6 | 15.6 | 6.1
0.1 | Component | 1 | 6.8 | 5.0 | 1.6 | 15.6 | 6.1
0.1 | Physical | 1 | 6.8 | 5.0 | 1.6 | 15.6 | 6.1
0.2 | World View | 1 | 7.7 | 5.0 | 4.0 | 40.3 | 15.7
0.2 | Component | 1 | 7.7 | 5.0 | 4.0 | 40.3 | 15.7
0.2 | Physical | 7 | 8.9 | 5.3 | 2.3 | 23.4 | 9.2
0.3 | World View | 1 | 8.9 | 5.0 | 6.4 | 64.4 | 25.3
0.3 | Component | 1 | 8.9 | 5.0 | 6.4 | 64.4 | 25.3
0.3 | Physical | 3 | 9.2 | 5.0 | 5.6 | 56.0 | 21.9
Table 4-8: Simulation results for the performance monitor case study
45.
[Figure: (a) Overview of problematic regressions; (b) Details of performance regressions]
• Time series plots show the periods where performance regressions are detected.
• Box plots give a quick visual comparison between prior tests and the new test.
• Counters with performance regressions (underlined) are annotated with expected counter correlations.
47.
[Figure: (a) Original counter data; (b) Counter discretization — the shaded area corresponds to the Medium discretization level]
Figure 5-3: Counter normalization and discretization
48.
For each counter:
• Medium = median ± 1 standard deviation
• High = all values above the Medium range
• Low = all values below the Medium range
Figure 5-4: Definition of counter discretization levels
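A sketch of this discretization rule in Python; the input format (one counter’s time series as a list of floats) and the function name are illustrative assumptions.

```python
# Discretize one counter's values into Low / Medium / High
# using the rule in Figure 5-4: Medium = median +/- 1 standard deviation.
import statistics

def discretize_counter(values):
    median = statistics.median(values)
    stdev = statistics.pstdev(values)          # population standard deviation
    lower, upper = median - stdev, median + stdev
    levels = []
    for v in values:
        if v < lower:
            levels.append("Low")
        elif v > upper:
            levels.append("High")
        else:
            levels.append("Medium")
    return levels

cpu_util = [42.0, 45.5, 44.1, 43.8, 71.2, 44.9, 18.3]
print(discretize_counter(cpu_util))
# ['Medium', 'Medium', 'Medium', 'Medium', 'High', 'Medium', 'Low']
```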
50.
Application | # of test scenarios | Duration per test (hours) | Average precision | Average recall
DS2 | 4 | 1 | 100% | 52%
JPetStore | 2 | 0.5 | 75% | 67%
Enterprise Application | 13 | 8 | 93% | 100% (relative to organization’s original analysis)
Table 5-1: Average precision and recall
51.
Component | Counters collected
Load generator | % Processor Time, # Orders/minute, # Network Bytes Sent/second, # Network Bytes Received/second
Tomcat | % Processor Time, # Threads, # Virtual Bytes, # Private Bytes
MySQL | % Processor Time, # Private Bytes, # Bytes Written to Disk/second, # Context Switches/second, # Page Reads/second, # Page Writes/second, % Committed Bytes In Use, # Disk Reads/second, # Disk Writes/second, # I/O Read Bytes/second, # I/O Write Bytes/second
Table 5-2: Summary of counters collected for DS2
53.
Figure 5-6: Performance Regression Report for DS2 test D_4 (Increased Load)
54.
Test | Summary of the report submitted by the performance analyst | Our findings
E_1 | No performance problem found. | Our approach identified abnormal behaviors in the system arrival rate and throughput counters.
E_2 | Arrival rates from two load generators differ significantly. Abnormally high database transaction rate. High spikes in job queue. | Our approach flagged the same counters as the performance analyst’s analysis, with one false positive.
E_3 | Slight elevation of # database transactions/second. | No counter flagged.
Table 5-4: Summary of analysis for the enterprise application
55.
Model | Counters flagged as violations
R1 | CPU utilization, throughput
R2 | Memory utilization, throughput
R3 | Memory utilization, throughput
R4 | Database transactions/second
Table 6-1: Counters flagged in T5 by multiple rule sets
56.
Counter flagged as violation | # of times flagged
Throughput | 3
Memory utilization | 2
CPU utilization | 1
# Database transactions/second | 1
Table 6-2: Count of counters flagged as violations by individual rule set
57.
Configuration | Performance Testing Repository: T1 | Performance Testing Repository: T2 | New Test: T5
CPU | 2 GHz, 2 cores | 2 GHz, 2 cores | 2 GHz, 2 cores
Memory | 2 GB | 1 GB | 2 GB
Database Version | 1 | 2 | 1
OS Architecture | 32 bit | 64 bit | 64 bit
Table 6-3: Test configurations
58.
Table 6-4: Summary of performance of our approaches
(P represents precision, R represents recall, and F represents F-measure; values are rounded up to 1 significant digit.)