Case study and guidelines for performance testing microservices. The talk clarifies the goals of performance testing, suggests tools to help you get started, and covers common pitfalls. Originally presented at Reactive Summit 2018.
7. How much load can our systems support?
1 million users? 10 million conversations? How do you come up with those numbers?

Pivotus use case:
- Agent portal: provide bank agents with a portal to access and manage conversations with their customers.
- Customer apps: provide bank customers with iOS/Android apps to interact with their bank's agents.
- Reactive microservices: a backend of 8 microservices that need to scale.
8. The plan for this session:
1. Define Performance Testing
2. Discuss Metrics Used
3. Reactive Specific
4. Lessons Learned
14. Defining success criteria
- Define the load-handling goals: most importantly, what is the scale we are targeting?
- Define and implement test scenarios: document the scenarios and implement tests using JMeter.
- Establish high watermarks: find the breaking points of the system, i.e., how much load it can sustain before breaking/erroring out.
- Monitor the results with each release: see how the watermarks shift.
15. In the absence of production data

                    Benchmark   Low Load   Target Load
  # Clients                 1          3            10
  Agents                   10        100         5,000
  Customers               100     10,000       250,000
  A/C ratio              1/10      1/100         1/500
  Messages per day       10^4       10^6          10^8

Come up with the goals for your system. In the absence of production data you need to define your own.
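To make numbers like these actionable in a test plan, it helps to convert messages-per-day targets into per-second request rates. A minimal sketch in Python using the table's figures; the 5x peak factor and all names here are our assumptions, not part of the original talk:

```python
# Convert the messages-per-day targets above into request rates
# that can be fed into a JMeter thread group.
SECONDS_PER_DAY = 24 * 60 * 60

targets = {
    "low_load":    {"customers": 10_000,  "messages_per_day": 10**6},
    "target_load": {"customers": 250_000, "messages_per_day": 10**8},
}

for name, t in targets.items():
    avg_rps = t["messages_per_day"] / SECONDS_PER_DAY
    peak_rps = avg_rps * 5  # assumption: size for peaks, not averages
    print(f"{name}: ~{avg_rps:,.0f} msg/s average, ~{peak_rps:,.0f} msg/s peak")
```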
16. 2. How we log the metrics
18. Where to run these tests?
Options: the production cluster, or a clone of the production cluster.
We used AWS for the examples discussed, but the approach is agnostic to your cloud provider or data center of choice.
21. JMeter
- Performs the load test
- Simulates requests to the target server
- Returns stats about performance

Flow: Start → JMeter creates requests to the target server → the server responds → JMeter saves all responses → JMeter gathers data → create report → End.
22.
InfluxDB is a time-series DB for performance-metrics storage.
Grafana is a visualization tool for viewing reports.
29. Microservices vs monolith vs UI testing
- Monolith: testing the qualities of the system as a whole.
- Microservices: each can be considered in separation.
- User interface: performance testing is important to establish the limits that browsers and apps can sustain.
30. Classical distributed systems vs Reactive
Distinctive characteristics of reactive microservices:
1. Responsive: the system responds fast, in a consistent way, and encourages the user's interaction.
2. Resilient: the system stays responsive in case of failure.
3. Elastic: the system stays responsive under varying workload.
4. Message-driven: communication happens by exchanging asynchronous messages.
32. Interesting finding #1
Measuring results with 1 vs 3 instances.
Investigation showed two reasons for degradation:
1. The amount of data in the DB.
2. WebSocket handling with multiple instances of services.
33. Interesting finding #2
The system did not properly recover after endurance testing.
Phases: normal operation → disaster occurrence → disruption and failure of operation → recovery process → reconstruction → normal operation.
The reason was an incorrect configuration for recovery mode.
34. Interesting finding #3
The mass messaging feature through the eyes of an agent: after sending a mass message, an agent can view the same message in all conversations where they are the primary agent.
35. Interesting finding #3 (continued)
(Diagram: an agent's mass message fans out to all of her customers.)
The mass messaging feature is intended to send the same message from an agent to all of her customers. It sent the messages in parallel.
We redesigned the feature to send messages sequentially instead of in parallel, and the performance-testing targets were adjusted accordingly. A sketch of the difference follows below.
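A minimal sketch of the parallel-versus-sequential difference, assuming an asynchronous delivery call; `send_message` and its sleep are hypothetical stand-ins for the real service call:

```python
import asyncio

async def send_message(customer_id: str, text: str) -> None:
    # Stand-in for the real delivery call to the messaging service.
    await asyncio.sleep(0.01)

async def mass_send_parallel(customers: list[str], text: str) -> None:
    # Original behavior: one task per customer, all fired at once.
    await asyncio.gather(*(send_message(c, text) for c in customers))

async def mass_send_sequential(customers: list[str], text: str) -> None:
    # Redesigned behavior: one send at a time, bounding the load spike.
    for c in customers:
        await send_message(c, text)

asyncio.run(mass_send_sequential([f"cust-{i}" for i in range(100)], "hello"))
```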
36. Interesting finding #4
On the server side we have two types of DB:
1. Read DB
2. Write DB
The system integrates with user-authentication providers such as OKTA. After creating a user we usually try to log in with that account, and we found there is a ~3-second delay between these operations before the new user becomes visible.
(Diagram: the client UI sends commands to the write API, which updates the domain model and the write DB; a sync process propagates changes to the read DB, which the read API serves via queries.)
A sketch of coping with this lag in a test follows below.
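A test (or client) that creates a user and immediately logs in has to tolerate that sync lag. A minimal sketch of polling the read side instead of assuming instant consistency; the endpoints, field names, and timings here are illustrative assumptions:

```python
import time
import requests  # third-party HTTP client

WRITE_API = "https://example.test/write/users"  # hypothetical endpoint
READ_API = "https://example.test/read/users"    # hypothetical endpoint

def create_user_and_wait(payload: dict, timeout_s: float = 10.0) -> dict:
    # Create the user through the write API (command side).
    user_id = requests.post(WRITE_API, json=payload).json()["id"]
    # Poll the read API (query side) until the sync catches up;
    # we observed roughly 3 seconds of lag in practice.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = requests.get(f"{READ_API}/{user_id}")
        if resp.status_code == 200:
            return resp.json()
        time.sleep(0.25)
    raise TimeoutError(f"user {user_id} not visible on read side")
```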
I can tell you all about startups, but that is another talk
AnitaB.org is great; you should join. I support women-in-tech efforts.
Those companies know scale
For the winners
Measure the capacity of the pipe
The goal of the talk is to answer two questions: when should you think about performance testing, and what should you keep in mind?
(Photo: Kerr Dam, Polson, Montana)
Load: the capacity of the pipe.
Stress: the pressure the pipe can withstand (before exploding).
Endurance: the quality of the pipe (before it rusts or dissolves).
Customers need this data, and you need it to decide when you will scale horizontally and when to purchase new hardware or add more services. Take customer input with a grain of salt.
Not affiliated with these tools
TestRail is the software we used to write the test cases in the proper format. It allows you to create test cases and corresponding groups; each item has a description and steps. We designed our test scenarios here as schemes of test cases, and these schemes are then used for scenario implementation.
The jobs are automated and run for multiple services … you can target any Jenkins interface.
Jenkins has one master and several slave machines. We installed JMeter with the corresponding libraries/plugins on the slave machines.
Since we had created a wrapper script to execute scenarios via the command line, we created a Jenkins job with the appropriate commands and parameterized it.
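A minimal sketch of what such a wrapper can look like, invoking JMeter in non-GUI mode; the paths and property names are our assumptions, while `-n`, `-t`, `-l`, and `-J` are standard JMeter CLI flags:

```python
import subprocess
from pathlib import Path

def run_scenario(plan: Path, results: Path, users: int, ramp_s: int) -> None:
    # Run a JMeter test plan headlessly so Jenkins can call it.
    subprocess.run(
        [
            "jmeter", "-n",       # non-GUI mode
            "-t", str(plan),      # the .jmx test plan
            "-l", str(results),   # where to write the .jtl results
            f"-Jusers={users}",   # user-defined properties, read inside
            f"-Jramp={ramp_s}",   # the plan via ${__P(users)} / ${__P(ramp)}
        ],
        check=True,  # fail the Jenkins build if JMeter exits non-zero
    )

run_scenario(Path("scenarios/mass_message.jmx"), Path("out/results.jtl"), 100, 60)
```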
After test execution we use Jenkins functionality to archive the test result as a build artifact. Jenkins stores it on the master machine, and we can view it as the build result of the job, then open and analyze the results.
JMeter supports load tests, performance-oriented business (functional) tests, regression tests, etc., on different protocols and technologies. JMeter simulates a group of users sending requests to a target server and returns statistics that show the performance/functionality of the target server/application via tables, graphs, etc.
InfluxDB is the time-series database used as the metrics storage. First of all, we need to install InfluxDB as a permanent storage space for our performance metrics. To push performance metrics from JMeter to InfluxDB we used the Backend Listener, which writes metrics directly to the database. We configured the Backend Listener for our InfluxDB host and used it for all scenario executions.
After the test execution is completed, we can check InfluxDB and verify that our metrics were reported there successfully.
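One way to do that check programmatically; a sketch assuming the InfluxDB 1.x Python client and the Backend Listener's default "jmeter" measurement name (the host and database names are ours, so verify them against your setup):

```python
from influxdb import InfluxDBClient  # pip install influxdb (1.x client)

client = InfluxDBClient(host="influx.internal", port=8086, database="jmeter")
# Count samples written in the last hour; non-empty output means the
# Backend Listener is reporting metrics successfully.
rows = client.query("SELECT count(count) FROM jmeter WHERE time > now() - 1h")
print(list(rows.get_points()))
```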
Grafana is an open-source platform for time-series analytics that lets you create real-time graphs from time-series data. Grafana allows you to store performance reports for as long as you want, and it also lets you customize your dashboards. Therefore, JMeter with Grafana is a reasonable way to monitor your performance scripts: a bit complicated on the one hand, but quite beneficial on the other.
High level: the overall health of the system.
Low level: the result of the test run.
(Sumo Logic dashboard)
Datadog dashboard
Datadog is a monitoring service for cloud-scale applications, providing monitoring of servers, databases, tools, and services through a SaaS-based data-analytics platform. It helps to see infrastructure-related data in one place; in particular, we used it to view the environment state during performance testing.
Reports documentation
How are they cross-dependent?
Classical distributed systems vs reactive
Our experience covers testing with single and multiple instances of each microservice.
It shows that sometimes the dedicated scaling mechanism did not give us the expected results.
The first suspect was the configuration used for scaling. This caused degradation of KPIs and, as a result, was reflected in the application's performance. Investigation revealed two reasons for the degradation: 1. the amount of data in the DB; 2. WebSocket handling with multiple instances of microservices. Here is how this affected our KPIs.
In our case the system did not recover automatically for a long time; ideally it should recover within 1 hour.
The reason: wrong configuration for recovery mode.
What looks nice on paper is not necessarily how it turns out.
The messages being sent were clogging the pipes.
The "correct" architecture didn't really show it all.
Maybe not the best architecture choice .. as performance engineers we are the ones who see the issue, where the assumption is that everything happens instantaneously.