Since its inception, the Performance Advisory Council has aimed to promote engagement among experts from around the world and to create relevant, value-added content shared between members, and, for Neotys, to strengthen our position as a thought leader in load and performance testing. During this event, 12 participants convened in Chamonix (France) to explore several topics on the minds of today’s performance testers, such as DevOps, Shift Left/Right, Test Automation, Blockchain and Artificial Intelligence.
5. Black-Box Testing
• Performance testing cannot be conducted in a black box
• At a minimum, we should know how the boxes interact with the other boxes in the landscape
So we have Paul here. I just thought Paul would be a common name so I picked it.
He is just getting started at a new client site for a performance testing project.
Let’s say the client is a major University and they are rolling out a new web portal for their students.
The website is pretty simple: it lets students view their information (personal details, classes, library loans, fees, etc.), and students can update their information.
Paul did the workload modelling, and he created some awesome scripts in his favourite load testing tool.
After a few weeks of performance testing, the results seemed fine, so the portal went live, and students started complaining about how slow the website was.
But he wasn’t sure what went wrong, or why he couldn’t catch the performance issues.
Well there are a lot of things that could have gone wrong, but for the context of this talk, let’s say this is the problem
It is rarely the case that the software system we are testing is a standalone system.
Businesses are complex, and they use software for a lot of purposes. They might have some software for handling CRM, some payment systems, some systems that take care of authentication, or middleware components that join all of these together.
These components constantly communicate with each other. So it is really important to understand how the software we are testing fits into the client’s landscape.
This is where analyzing the solution architecture comes in handy.
Performance testing cannot be treated as a black-box approach.
At a minimum, we should know how the system interacts with the other boxes in the landscape, and how those boxes are connected to each other.
Understanding and analyzing the solution architecture provides us with a wealth of information that shapes our performance test strategy. It saves a lot of time, especially when you are coming in from outside and need to understand a lot of systems in a short period of time.
Most of the information is already there, captured in SADs, architecture diagrams, or in people’s heads, so all you need to do is understand and capture the information that’s relevant to performance testing.
This is not a brand new concept; some of us do it, many of us don’t.
I just thought I’d share my approach, something I found helpful while getting started on a new engagement.
Five layers:
Context
Consumer
Infrastructure
Communication
Data
This is about understanding what has changed in the architecture of your solution within the context of the release or project you are testing.
Understanding what changes have been made as part of this release helps us decide where to focus our efforts.
Architectural changes could be new components that are added in this release,
or existing components that are modified.
It is really crucial to include these components because this is where most of the performance risk is. Especially if it’s a brand new component, there is no prior performance benchmark for this system.
It’s not just new and modified components; some existing components are crucial and need to be included in the performance testing.
This is a little bit tricky: although these components are not changed as part of this release, changes in one component could have an impact on the unchanged components, which therefore need to be included in the testing.
Let us take an example
This is a made up architecture diagram for our university.
We have students accessing the web portal, and the web portal accesses the core systems via the integration layers (which exposes a bunch of APIs)
As well, let’s just say we have a regulatory body that makes calls to the integration layer to get information about the student records or the courses offered – anything, I’m just making this up.
In this case, the web interface is brand new as marked in the image without any performance benchmark, so we definitely need to include this in our testing
On the other hand, the integration layer already exists, because the regulatory body has been consuming its APIs to access student information. However, in this release it has been extended with new APIs to accommodate the needs of the web portal. So there is performance risk with this layer too.
Finally, the core systems are not touched as part of this project. But when we look carefully, the load on these systems is going to change; these components are going to be used more than they used to be.
We can’t simply stub these components and say we are going to test only the web portal. It doesn’t work that way, because at the end of the day, from your users’ perspective it doesn’t matter where the slowness is. It could be in the integration layer, in the core systems (because they weren’t designed to handle this extra load coming in), or in the web portal itself; either way, the end users/customers are experiencing the slowness. So we need to take a holistic approach to address the risk.
So that’s all about this layer. This step essentially narrows down the scope of our testing, because there could be hundreds of components.
By analyzing the architectural changes and their impact on the other components, we can make sure to include all the key components that cannot be skipped from the performance testing.
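The impact analysis described above can be sketched as a simple downstream walk over a dependency map. This is a minimal sketch with made-up component names matching the university example; the `calls` map and the assumption that "impact flows to everything a changed component calls" are illustrative simplifications, not a real tool.

```python
from collections import deque

# Hypothetical call map for the university example: component -> components it calls.
calls = {
    "web_portal": ["integration_layer"],
    "regulatory_body": ["integration_layer"],
    "integration_layer": ["student_mgmt", "course_scheduling"],
    "student_mgmt": [],
    "course_scheduling": [],
}

def impacted(changed):
    """Walk downstream from the changed components to find everything
    whose load profile may change in this release."""
    seen, queue = set(changed), deque(changed)
    while queue:
        for callee in calls.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# The brand-new web portal pulls the untouched core systems into scope too.
print(sorted(impacted({"web_portal"})))
```

Even when only one component is new, the traversal surfaces the unchanged core systems as candidates for testing, which mirrors the reasoning above.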
Let’s move on to the next layer
This is about understanding the users of our solution under test.
This determines the channels/sources of incoming load, which is a key factor for determining the performance risk
Consumers can be anyone or anything who is consuming the services offered by your software.
Most often they are real people, like staff or students
Some consumers are not so obvious, they can be other software systems consuming the API exposed by your application
Or even batch processes or ETL jobs that consume the data and services of your software and put load on your system.
That’s why I named this layer “consumers”: it isn’t necessarily always the “users” of the system.
Going back to our university example one more time
Maybe when Paul was doing his load tests, he loaded this channel here (the load coming from the students directly hitting the web portal).
Maybe he missed simulating the load coming from the regulatory body here. There could be other channels too, like the staff using the course management system heavily on a day-to-day basis.
Without these other sources of load, maybe the student web portal performed really fast when he did the testing, which won’t be the case in production.
I’m not saying we need to load every single channel we see here, of course, unless it is significant in terms of volume. However, if we don’t look at the full picture, we might miss important channels of load and see unrealistic results that trick us into thinking everything is fine when it is not.
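One way to make the channel analysis concrete is to sum the arrival rates per target component. This is a minimal sketch; the channel names and volumes are invented for the university example, not real figures.

```python
# Hypothetical consumer channels: each has a made-up target and volume.
channels = [
    {"name": "students via web portal", "target": "integration_layer", "req_per_hour": 9000},
    {"name": "regulatory body API",     "target": "integration_layer", "req_per_hour": 3000},
    {"name": "staff via course mgmt",   "target": "course_scheduling", "req_per_hour": 1200},
]

def load_per_target(channels):
    """Aggregate incoming load per target component across all channels."""
    totals = {}
    for ch in channels:
        totals[ch["target"]] = totals.get(ch["target"], 0) + ch["req_per_hour"]
    return totals

print(load_per_target(channels))
```

In this made-up model, testing only the student channel would understate the load on the integration layer by a quarter, which is exactly the kind of gap Paul ran into.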
Let’s move on to the next layer
It’s about understanding where your components are deployed
Understanding where the system is physically hosted helps us understand the performance risk that won’t be visible in logical architecture diagrams
This is not an exhaustive list, but just to give an idea, this includes understanding where the software is deployed: on one big physical machine or on virtual machines,
and whether it is on-premise or in the cloud.
There are a number of ways this information can be useful when planning our tests:
It helps us understand potential performance risks. For example, if some components are hosted in the cloud and some on-premise, there could be performance risks with latency or bandwidth, or the gateways sitting in between could become a bottleneck.
If the two nodes in a cluster are in two different data centers, then there is a risk of latency between these two components.
If a component is not load balanced and no failover mechanism is in place, then it can be a single point of failure, which is a risk.
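Single points of failure can be spotted mechanically once the deployment map is written down. A minimal sketch, assuming a hypothetical component-to-nodes map (the names are made up for the university example):

```python
# Hypothetical deployment map: component -> nodes it runs on.
deployment = {
    "web_portal": ["web01", "web02"],
    "integration_layer": ["int01"],          # hosted on a single node
    "student_mgmt": ["app01", "app02"],
}

def single_points_of_failure(deployment):
    """A component deployed on exactly one node, with no failover,
    is a single-point-of-failure risk worth flagging before any test runs."""
    return [c for c, nodes in deployment.items() if len(nodes) == 1]

print(single_points_of_failure(deployment))  # ['integration_layer']
```

This is the kind of reliability risk the physical architecture review surfaces without running a single test.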
It also tells us what sort of monitoring will be available, and what the OS is (Windows or Linux).
(If it is in the cloud, we may have limited visibility during the tests when the infrastructure is managed by the vendor; we can raise this as a project risk upfront when drafting our performance testing plan.)
Components like firewalls and load balancers are usually only visible when we analyse the physical architecture of the solution.
Understanding what sort of load balancing policy is in place will help us make sure that the traffic we generate is not blocked by the load balancers and the results are not skewed.
Virtual hardware vs physical hardware
Cloud or On premise
Cloud – PaaS, SaaS, or IaaS
Active vs passive clusters
Load balancer policy
The physical architecture could be very different between the test environment and production
------
Identify the differences in the environment
Find any single point of failure
Any restrictions with load balancing policy that could affect load testing traffic
Any network latency issues (two nodes – if they are in different data centers – communication via network)
Components that may not be visible in the logical architecture, that need to be monitored
Now let’s see the physical architecture of the university solution.
As we can see, the integration layer is hosted on only one node. If this node goes down, there is no way for the external parties to communicate with the core systems. This is a reliability risk we can identify, and mitigate, without doing any testing.
In this diagram, the databases for the two core systems are hosted on the same physical machine. This means the performance of one system could impact the performance of the other, which is another performance risk we can identify and something to watch for when we do our testing.
We can also identify the difference between the production environment sizing and the performance test environment sizing, which is a key piece of information for doing realistic performance testing.
Another thing: I have seen a number of cases where some components in the performance test environment share infrastructure with production components. This is not ideal, but at least if we know the limitations, we can probably work around them, for example by running tests after hours when the production load is minimal.
All these issues we can identify and address in our performance test strategy.
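When the test environment is smaller than production, one common (if imperfect) workaround is to scale the target load in proportion to the most constrained tier. A minimal sketch with invented core counts; linear scaling is a deliberate simplification and should itself be flagged as a risk in the test plan:

```python
# Hypothetical sizing figures for the university example.
prod_cores = {"web_portal": 16, "integration_layer": 8}
test_cores = {"web_portal": 8, "integration_layer": 4}
prod_peak_req_per_hour = 12000

def scaled_target(prod, test, prod_load):
    """Scale the production peak load by the capacity ratio of the
    most constrained tier (assumes load scales roughly linearly)."""
    ratio = min(test[c] / prod[c] for c in prod)
    return int(prod_load * ratio)

print(scaled_target(prod_cores, test_cores, prod_peak_req_per_hour))  # 6000
```

Knowing the sizing gap up front is what makes a workaround like this possible at all.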
The next layer is the communication layer
As the name says it’s about understanding how the components in the landscape communicate
What sort of network protocol is used to communicate (especially between the client and the server): HTTPS, JMS or proprietary protocols. This information will be quite useful for assessing our tooling options.
What sort of network is used (whether the users use mobile data or University Wifi or LAN) is something useful when designing our performance tests
Communication is not necessarily between the client and the server (can be communication between any two components in the architecture)
Whether the components communicate in a synchronous fashion, like HTTP, or asynchronously, like messaging services or even polling.
If some activity happens in the background as an async task, without any users waiting on it, it is slightly lower risk.
In the example, synchronous communication is shown with solid arrows and asynchronous with dotted arrows.
The calls made by the web portal to the two core systems are synchronous. This means that when we have bursts in the usage of the web portal (say, when results are out, or on the last day to submit applications for the next semester), the course scheduling system and student management system are at risk. If the applications were queued and sent to the downstream systems in a controlled fashion, the performance risk would be lower.
Another scenario: because the core systems exchange information between them asynchronously, there is a chance that when a student fetches information from a core system, they may not be seeing real-time information.
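The sync-versus-async distinction above can be sketched with a queue: a synchronous downstream call blocks every caller during a burst, while an asynchronous hand-off lets callers return immediately and the backlog drain in a controlled fashion. All names here are hypothetical stand-ins for the example's application-submission burst.

```python
import queue
import threading

def sync_submit(application, downstream):
    downstream(application)        # caller blocks until downstream finishes

work = queue.Queue()

def async_submit(application):
    work.put(application)          # caller returns immediately; burst is absorbed

def worker(downstream):
    # downstream systems drain the queue at their own pace
    while True:
        app = work.get()
        if app is None:
            break
        downstream(app)
        work.task_done()

processed = []
t = threading.Thread(target=worker, args=(processed.append,))
t.start()
for i in range(5):
    async_submit(f"application-{i}")
work.join()                        # wait for the backlog to drain
work.put(None)                     # signal the worker to stop
t.join()
print(len(processed))  # 5
```

The callers finish as fast as they can enqueue, regardless of how slowly the worker drains the backlog, which is why queuing would have lowered the burst risk on the core systems.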
This step is about understanding the nature of business data exchanged between the components in the landscape
This goes hand in hand with the communication layer we have seen earlier. This is essentially the kind of business data exchanged between the components in the solution.
Understanding how these components are used by the business.
One student -> 10 different courses (this ratio could be crucial when setting up our test data); we simply can’t use dummy students who are not enrolled in any classes.
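Test data that respects the business ratio can be generated rather than hand-built. A minimal sketch assuming the roughly one-student-to-ten-courses ratio above; the course catalogue size, ID formats and seed are invented for illustration:

```python
import random

# Hypothetical course catalogue for the university example.
COURSES = [f"COURSE-{n:03d}" for n in range(40)]

def make_students(count, courses_per_student=10, seed=42):
    """Generate test students each enrolled in a realistic number of
    distinct courses, instead of empty dummy accounts."""
    rng = random.Random(seed)   # fixed seed keeps test runs repeatable
    return [
        {"id": f"STUDENT-{i:05d}",
         "courses": rng.sample(COURSES, courses_per_student)}
        for i in range(count)
    ]

students = make_students(100)
print(sum(len(s["courses"]) for s in students) / len(students))  # 10.0
```

A fixed seed is worth the small loss of variety: it makes a test run with surprising results reproducible.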
Sequence of calls (how a business transaction translates into a number of calls) -> opportunity for tuning
If you have sequence diagrams available, this information can be obtained from them.
Login -> a series of back-and-forth interactions between on-premise and cloud, and the downstream systems
Something to keep an eye on during our testing
When the library data takes longer to load, we can tell it is likely an issue with the call to the library system; the other components are fine.