The document summarizes research on assessing the scalability of microservice architectures. It discusses how microservices introduce challenges for monitoring performance and reliability due to their decentralized nature. The researcher aims to develop approaches to identify bottlenecks, anomalies, and anti-patterns in microservices. The document outlines a framework called PPTAM that generates load tests to analyze the performance of different architectural configurations and identifies the most scalable option based on success rates under various workloads. Ongoing work also looks to recognize common anti-patterns that can degrade microservice performance.
The starting point
– "The term 'Microservice Architecture' has sprung up over the last few years to describe a particular way of designing software applications as suites of independently deployable services. While there is no precise definition of this architectural style, there are certain common characteristics around organization around business capability, automated deployment, intelligence in the endpoints, and decentralized control of languages and data."1

1 Martin Fowler 2014: Microservices, https://martinfowler.com/articles/microservices.html
Common representation of such architecture
[Figure: a front-end talks to a service registry and to four microservices (Microservice 1 to Microservice 4).]
Continuous delivery is key
– Automation is needed to operate such a "system of systems" (e.g., to use "blue/green deployment").
– Monitoring is crucial:
– failures in one service can cause degradation in another
– timing issues have a higher impact than in a monolith
– debugging becomes complicated
– Verifying and guaranteeing quality requirements become harder.
– Our research goal is to develop approaches to identify performance bottlenecks, detect anomalies, and recognize anti-patterns in a DevOps setting.
PPTAM: Overview (container diagram)2
[Figure: container diagram. PPTAM comprises a Driver, a Testbed, an Analysis container, a Repository, and a Dashboard, complemented by external APM Tools. The Driver reads the configuration to generate the test plan, deploys and queries the Testbed, and stores test results in the Repository; the Analysis retrieves them and records the pass/fail outcome of the performance tests, which the Dashboard visualizes. The Test Engineer records the performance tests' pass/fail outcome, the Product Manager performs quality assurance and release management, the Software Developer tries to isolate problems, and the Architect looks for architectural alternatives.]

2 https://github.com/pptam/pptam-tool
PPTAM: Overview (component diagram)
[Figure: component diagram of PPTAM, with interaction steps numbered 0 to 5. Within the Driver, a Test orchestrator uses the PPTAM configuration and a test-by-test configuration that describes the SUT; it configures the Testing framework, which uses a Load test template to drive the SUT deployed on a Docker swarm, while a Metrics collection component queries the SUT and stores the test results in the Repository. Within Analytics, an Extraction/Aggregation of SUT Data component reads system logs and APM Tools, a Test plan generator reads from the Repository to generate the test plan, and a Dashboarding component visualizes results for the Test Engineer, Product Manager, Software Developer, and Architect.]
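To make the orchestration flow concrete, below is a minimal sketch of the test-by-test loop the component diagram suggests. All names (TestConfiguration, deploy_sut, run_load_test, store_results) are hypothetical placeholders for illustration, not the actual pptam-tool API.

```python
# Minimal sketch of the orchestration loop suggested by the component
# diagram. All names are hypothetical placeholders, not pptam-tool API.
import json
from dataclasses import dataclass

@dataclass
class TestConfiguration:
    name: str          # architectural alternative under test
    workload: int      # number of concurrent users to simulate
    duration_s: int    # how long to apply the load

def deploy_sut(config: TestConfiguration) -> None:
    """Deploy the system under test (e.g., on a Docker swarm)."""
    print(f"deploying SUT for {config.name}")

def run_load_test(config: TestConfiguration) -> dict:
    """Drive the SUT via the testing framework and collect per-service metrics."""
    return {"service1": {"mean_rt": 0.015}, "service2": {"mean_rt": 2.009}}

def store_results(config: TestConfiguration, metrics: dict) -> None:
    """Persist the test results in the repository."""
    with open(f"{config.name}-{config.workload}.json", "w") as f:
        json.dump(metrics, f)

def orchestrate(configurations: list[TestConfiguration]) -> None:
    # Test-by-test loop: deploy, drive, measure, store.
    for config in configurations:
        deploy_sut(config)
        metrics = run_load_test(config)
        store_results(config, metrics)

orchestrate([TestConfiguration("alternative-1", workload=50, duration_s=300)])
```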
PPTAM: Process
[Figure: the process chains Analysis, Sampling, Experiment generation, and Experiment execution. First, workload situations are observed over time (e.g., the workload at 9 am) and condensed into an empirical distribution of workload situations (relative frequency per workload). A sample of workload situations is selected from this distribution and, together with the baseline requirements, the load test template, and the architectural alternatives, turned into a load test sequence. Finally, the test engine executes the sequence on the testbed and PPTAM records a metric per service, yielding results for each architectural alternative.]

The empirical distribution assigns each observed workload situation w a probability p(w), for example:

workload | p(w)
---------|-----
50       | .11
100      | .19
150      | .22
...      | ...
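To illustrate the sampling step, here is a minimal sketch that draws workload situations in proportion to their empirical probabilities. The distribution values come from the table above; the sample size of 10 is an arbitrary assumption.

```python
# Minimal sketch of the sampling step: draw load-test workloads from the
# empirical distribution of observed workload situations shown above.
import random

# Empirical distribution: workload situation -> relative frequency p(w)
# (remaining probability mass omitted on the slide).
distribution = {50: 0.11, 100: 0.19, 150: 0.22}

workloads = list(distribution.keys())
weights = list(distribution.values())  # random.choices normalizes the weights

# Draw a sample of workload situations to turn into a load test sequence.
sampled = random.choices(workloads, weights=weights, k=10)
print(sampled)  # e.g., [150, 100, 150, 50, ...]
```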
Analysis of the results
[Figure, built up incrementally across several slides: response time plotted against workload. The mean response time each service exhibits at the baseline workload defines its baseline (shown for service 1 and service 2). Following each service's response-time curve, the maximum tolerated workload for service 1 and for service 2 is the workload beyond which the service no longer meets the requirement derived from its baseline. The operating point marks the workload at which each service's measured mean response time is checked against that requirement.]
At the operating point, each service's measured mean response time x(lop) is checked against a requirement derived from its baseline; the numbers below are consistent with Req. = x(l0) + 3σ:

Variable  | Service 1 | Service 2
----------|-----------|----------
x(l0)     | 0.018     | 2.008
σ         | 0.008     | 0.003
Req.      | 0.042     | 2.017
x(lop)    | 0.015     | 2.009
Pass/fail | pass      | pass
Calls     | 20%       | 80%

Both services pass, so the success rate sums the call shares of all passing services:

Success rate = 20% + 80% = 100%
If service 1's mean response time at the operating point instead exceeds its requirement, that service fails, and only the calls of the passing services count:

Variable  | Service 1 | Service 2
----------|-----------|----------
x(l0)     | 0.018     | 2.008
σ         | 0.008     | 0.003
Req.      | 0.042     | 2.017
x(lop)    | 2.015     | 2.009
Pass/fail | fail      | pass
Calls     | 22%       | 78%

Success rate = 78%
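The pass/fail check and the success-rate computation can be sketched as follows, assuming the requirement is the baseline mean plus three standard deviations (which matches the Req. rows above). The data layout and function names are illustrative, not the pptam-tool API.

```python
# Sketch of the pass/fail analysis: a service passes if its mean response
# time at the operating point stays within the baseline requirement
# (baseline mean + 3 sigma, matching the "Req." rows above); the success
# rate is the share of calls covered by passing services.

def requirement(baseline_mean: float, sigma: float) -> float:
    return baseline_mean + 3 * sigma

def success_rate(services: dict) -> float:
    rate = 0.0
    for name, s in services.items():
        if s["x_lop"] <= requirement(s["x_l0"], s["sigma"]):
            rate += s["calls"]  # only passing services contribute
    return rate

# Numbers from the failing example above.
services = {
    "service1": {"x_l0": 0.018, "sigma": 0.008, "x_lop": 2.015, "calls": 0.22},
    "service2": {"x_l0": 2.008, "sigma": 0.003, "x_lop": 2.009, "calls": 0.78},
}
print(success_rate(services))  # 0.78: service 1 fails, only service 2's calls count
```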
Success rate for different workloads
[Figure: success rate (0.0 to 1.0) plotted against the sampled workload situations (50 to 200) for architectural alternative 1 and architectural alternative 2. Comparing the curves shows which alternative sustains a high success rate under heavier workloads, i.e., which is the more scalable option.]
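One plausible way to condense such curves into a single score per alternative is to weight each sampled workload situation's success rate by its empirical probability p(w). The sketch below uses the distribution from the process slide and made-up success rates; it is an illustration, not necessarily the exact aggregation PPTAM uses.

```python
# Sketch: condense a success-rate curve into one score per architectural
# alternative by weighting each workload situation by its empirical
# probability p(w). Success-rate values here are made up for illustration.

# p(w) from the empirical distribution (slide "PPTAM: Process").
p = {50: 0.11, 100: 0.19, 150: 0.22}

# Hypothetical success rates per workload for two alternatives.
alt1 = {50: 1.0, 100: 0.78, 150: 0.40}
alt2 = {50: 1.0, 100: 1.00, 150: 0.85}

def domain_score(success: dict, p: dict) -> float:
    # Probability-weighted success rate over the sampled workload situations.
    return sum(p[w] * success[w] for w in p)

print(domain_score(alt1, p))  # lower score: degrades earlier under load
print(domain_score(alt2, p))  # higher score: the more scalable alternative
```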
Ongoing work: identification of anti-patterns
– Application Hiccups: temporarily increased response times that later return to normal (see the detection sketch after this list).
– The Stifle: data is retrieved through many similar (or identical) database queries. Since each request causes considerable overhead, the large number of database requests leads to a performance problem.
– Traffic Jam: one problem causes a backlog of jobs, producing wide variability in response times that persists long after the problem itself has disappeared.
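As an illustration, the sketch below flags candidate Application Hiccups in a response-time series as contiguous runs above a threshold that later return to normal. The threshold and the series are assumptions, not the project's actual detector.

```python
# Illustrative sketch (not the project's detector): flag "application
# hiccups", i.e., windows where response times temporarily exceed a
# threshold and then return to normal.

def find_hiccups(response_times: list[float], threshold: float) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of contiguous above-threshold runs."""
    hiccups = []
    start = None
    for i, rt in enumerate(response_times):
        if rt > threshold and start is None:
            start = i                       # hiccup begins
        elif rt <= threshold and start is not None:
            hiccups.append((start, i - 1))  # times are back to normal
            start = None
    # A run still open at the end is sustained degradation, not a hiccup.
    return hiccups

series = [0.02, 0.02, 0.35, 0.40, 0.03, 0.02, 0.50, 0.02]
print(find_hiccups(series, threshold=0.1))  # [(2, 3), (6, 6)]
```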