Bots for continuous performance assessment are gaining use as a productivity tool. We discuss how and why open source projects use them and present an in-depth case study of the Nanosoldier bot used by the team behind the Julia programming language. Based on an analysis of the history of bot usage and on interviews with developers, we identify the lack of a shared platform for performance measurement as an obstacle to wider adoption of performance measurement bots. To address this, we propose PerfBot, a prototype implementation of such a platform.
Joint work with Florian Markusse and Philipp Leitner, presented at the 5th International Workshop on Bots in Software Engineering, co-located with ICSE 2023, Melbourne, Australia.
Towards Continuous Performance Assessment of Java Applications With PerfBot
1. USING BENCHMARKING BOTS FOR CONTINUOUS PERFORMANCE ASSESSMENT
FLORIAN MARKUSSE
PHILIPP LEITNER
ALEXANDER SEREBRENIK
Image: kjpargeter via Freepik, https://nl.freepik.com/vrije-photo/3d-geef-van-een-robot-met-een-traditionele-chroom-stopwatch_958078.htm
8. RQ1: How often are benchmarking bots adopted by Open Source Software projects?
RQ2: How does adoption of benchmarking bots affect the Open Source Software project?
9. RQ1: How often are benchmarking bots adopted by Open Source Software projects?
GitHub Search: "performance regression", "benchmark report", "performance result"
Projects with ≥ 10 pull requests
13. [Bar chart: number of developers (log scale) vs. number of interactions with The Nanosoldier. Counts for 0–9 interactions: 1004, 80, 44, 34, 31, 29, 26, 25, 21, 21.]
14.
15. In the PRs with the bot:
                          Julia  Rust        Swift       Fluent UI
    #participants higher  large  medium      small       medium
    #comments higher      large  medium      small       medium
    comments longer       large  medium      medium      small
    #reviews higher       small  small       negligible  n.s.
    #commits higher       small  negligible  small       negligible
Image: The Spruce / Steven Merkel, https://www.thespruce.com/collect-clean-and-store-chicken-eggs-3016828
Thanks for the opportunity to give this talk.
Florian, Alexander - Eindhoven; Philipp - Chalmers/Gothenburg
Paper
Bots are omnipresent in software development: they flag issues where the discussion has finished (stale bots), check whether pull requests or issues adhere to the rules defined in the project, and they can even contribute to a warm and welcoming community by detecting toxicity or onboarding newcomers.
However, so far, much less attention has been paid to the usage of bots for continuous performance assessment, even though software performance is a critical quality concern for many projects. This is understandable given that even experienced developers often struggle to correctly benchmark their systems, or judge whether an observed slowdown is more likely to be statistical noise or an actual performance regression.
This is why we ask the following research questions.
To answer the first research question, we searched for a series of performance-related strings to identify relevant pull requests. We manually checked that these pull requests indeed involved performance bots. We say that a project has adopted a benchmarking bot if more than 10 pull requests in a year use it.
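The adoption criterion above can be sketched as a small script (the function name and the sample data are hypothetical illustrations; the actual mining pipeline is not described here):

```python
from collections import Counter
from datetime import date

# Adoption criterion from the study: a project "adopts" a benchmarking
# bot if more than 10 pull requests in a single year use the bot.
ADOPTION_THRESHOLD = 10

def adopting_projects(bot_prs):
    """bot_prs: iterable of (repo, pr_date) pairs for pull requests in
    which a benchmarking bot was involved. Returns the set of repos
    that cross the threshold in at least one calendar year."""
    per_repo_year = Counter((repo, d.year) for repo, d in bot_prs)
    return {repo for (repo, year), n in per_repo_year.items()
            if n > ADOPTION_THRESHOLD}

# Hypothetical sample: 12 bot PRs in JuliaLang/julia during 2016,
# but only 2 in example/other during 2020.
sample = ([("JuliaLang/julia", date(2016, m, 1)) for m in range(1, 13)]
          + [("example/other", date(2020, 1, 1)),
             ("example/other", date(2020, 2, 1))])
print(adopting_projects(sample))  # {'JuliaLang/julia'}
```

Counting per repository *and* year matters: a project with a trickle of bot PRs spread over many years should not count as an adopter under this definition.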
In this way we identified 11 projects. Projects that adopt benchmarking bots tend to be large and, unsurprisingly, in performance-sensitive domains, such as programming languages (Julia, Swift, Rust) or frameworks for Web development (Salesforce LWC, Hexo, or Microsoft's Fluent UI). Four of the bots run as GitHub Actions. All are custom-built or heavily customized for use in their projects; as of today, no "standard benchmarking bot" has emerged.
We also observe that many projects in our list adopted their benchmarking bot as recently as 2020, with the notable exception of the Julia programming language, which has been using a benchmarking bot (The Nanosoldier) since early 2016. This is why we have decided to conduct a case study on Julia to answer our second research question, i.e., to understand the impact of adoption of the bot (The Nanosoldier) on the project itself (Julia).
In this way we have identified 11 projects. Projects that adopt benchmarking bots tend to be large, and, unsurprisingly, in performance-sensitive domains, such as programming languages (Julia, Swift, Rust) or frameworks for Web development (Salesforce LWC, Hexo, or Microsoft’s Fluent UI). We also observe that many projects in our list adopted their benchmarking bot as recently as 2020, with the notable exception of the Julia programming language, which has been using a benchmarking bot (The Nanosoldier) since early 2016. Finally, the benchmarking bots we observed in our study are all custom-built or heavily customized for usage in their projects. As of today, no “standard benchmarking bot” has emerged that sees usage across the open source ecosystem, akin to build or dependency management bots.
Firstly, there are two different ways to surface benchmarking results to developers: most bots post comments that link to detailed reports in pull request discussion threads; only three projects (simdjson, hexo, and ibis) surface benchmark results as GitHub Actions "checkmarks". This is surprising given that similar badges are commonly used for other project quality checks [4].
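The "checkmark" route can be sketched as follows. This is a minimal illustration, not any project's actual bot: the function name and the 5% regression threshold are assumptions, though the payload shape follows GitHub's commit-status API (`POST /repos/{owner}/{repo}/statuses/{sha}`).

```python
# Sketch: turn a benchmark comparison into a commit-status payload
# ("checkmark"). A real bot would POST this payload to the GitHub
# statuses endpoint with an authentication token.
REGRESSION_THRESHOLD = 1.05  # assumed: flag slowdowns of more than 5%

def benchmark_status(baseline_ns, candidate_ns):
    """Compare a candidate's runtime against the baseline and build a
    commit-status payload: red cross on regression, green check otherwise."""
    ratio = candidate_ns / baseline_ns
    state = "failure" if ratio > REGRESSION_THRESHOLD else "success"
    return {
        "state": state,
        "context": "benchmarks",
        "description": f"runtime ratio vs. baseline: {ratio:.2f}x",
    }

print(benchmark_status(100.0, 112.0)["state"])  # failure (12% slower)
print(benchmark_status(100.0, 101.0)["state"])  # success (within the assumed noise budget)
```

The threshold is the hard part in practice: set it too tight and measurement noise turns the checkmark red on harmless changes, which is exactly the judgement problem mentioned earlier.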
Secondly, despite being integrated into the CI pipeline, four projects elect not to execute their benchmarking bots automatically on every pull request. Instead, the bots have to be triggered manually by a developer, for instance by tagging the bot in the code review discussion of a pull request. Given that system benchmarking is well known to be time-consuming and resource-intensive [5], these projects are selective about how often they run their benchmarking bots. For example, in the Julia project, only 5% of pull requests are actually benchmarked, and only 8% of contributors have ever interacted with the benchmarking bot.
When we collected the data in June 2021, we observed that while most projects in our study had adopted their bots in 2020, Julia had adopted its benchmarking bot, The Nanosoldier, already in 2016. This is why we decided to conduct a case study on Julia to answer our second research question, i.e., to understand the impact of the adoption of the bot (The Nanosoldier) on the project itself (Julia). Ca. 7% of the PRs involve the bot.
Pull requests where The Nanosoldier is invoked involve more participants and more comments. Our research also indicates that discussions involving the bot require more commits and lead to longer comment threads, as measured by the total length of all comments taken together. The differences are statistically significant (p < 0.0001), with Cliff's δ effect sizes between small (0.28; number of commits) and large (0.5; number of participants). Hence, it appears that The Nanosoldier is predominantly involved in the discussion of complex issues.
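Cliff's δ, the effect size used above, can be computed straight from its definition; a self-contained sketch (the sample data below is made up purely for illustration):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#{x > y} - #{x < y}) / (len(xs) * len(ys)).
    Ranges from -1 to 1; |delta| below roughly 0.15 is often read as
    negligible, up to ~0.33 as small, ~0.47 as medium, above as large."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Made-up example: comment counts in bot PRs vs. other PRs.
bot_prs   = [12, 15, 9, 20]
other_prs = [3, 5, 4, 6]
print(cliffs_delta(bot_prs, other_prs))  # 1.0 -- every bot PR exceeds every other PR
```

Unlike a t-test statistic, Cliff's δ is non-parametric: it only compares ranks across all pairs, which suits skewed count data like comments and commits.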
Still, the number of developers having interacted with the Nanosoldier remains limited; benchmarking/performance seems to be a concern for a relatively small group.
Of course, the investigation so far has focused on Julia, while ideally we would like to generalise these findings to the other bots in our dataset. Unfortunately, this was not possible for most projects, as the number of data points was too low for statistical analysis. Still, we had enough data points for two other programming languages, Rust and Apple Swift, and for the Fluent UI framework.
We see that the hypotheses we have formulated based on the Julia case tend to hold for other projects (with one minor exception). At the same time, the effect size for other projects is usually smaller than for Julia.
However, it is unclear whether the bot causes long, complex discussions, or whether it is commonly brought into issues that are simply inherently more complex: this is a kind of chicken-and-egg problem. So we went back to Julia. To get further insight into the ways The Nanosoldier is used, we (a) manually analysed the pull requests using card sorting, and (b) conducted semi-structured interviews with two active and experienced Julia developers who have had ample interaction with The Nanosoldier.
Card sorting the pull requests yielded the process model on the slide. Arrows should be read as "followed by" rather than "causes". The process model is not particularly surprising, so I would like to highlight one element, namely the presence of backporting requests. We do not think The Nanosoldier causes backporting requests. Rather, we speculate that when a PR is in a state where it looks promising, a backport is requested by a contributor. This reasoning is similar to why The Nanosoldier is invoked in the first place; the difference is that The Nanosoldier tends to be invoked sooner than a backport is requested. Thus, the backport request follows the benchmark invocation.
Then we moved to the interviews.
Interviewees confirmed our findings that PRs with a contribution from The Nanosoldier have a greater number of interactions, particularly because the report from the bot explicitly triggers discussion.
In a way, developers see The Nanosoldier as a kind of dishwasher, something they are so used to, that one has to turn it off to really appreciate it. One of the interviewees said: “You think, oh, most things will stay mostly stable, what could go wrong [when it’s offline]? And then you turn it back on, and there’s like ten bugs.”
Our study indicates that developers appreciate the peace of mind that using a performance benchmarking bot gives them.
Why only those??
However, if The Nanosoldier is as magnificent as a dishwasher, why would developers not use it on every pull request? Only ca. 7% of the pull requests involve The Nanosoldier…
Unfortunately, having the bot profile a pull request takes a lot of time. "Our benchmarks just take too long. If they were really fast, we would just run them on CI. But a combination of the fact that we're using wall clock time and that we need to run it on a machine that's closer to the actual hardware and less susceptible to interrupts."
Thus, invoking The Nanosoldier is reserved for pull requests that are judged to be potentially high-impact.
But at the same time, The Nanosoldier is beneficial for the project even if it is invoked only on a limited number of pull requests.
RQ1: Rare; GitHub Actions are promising; dependabot4benchmarking?
RQ2: Significant positive impact, even if the use is limited due to speed