The document discusses biases in Mining Software Repositories (MSR) research. It shows that from 2004 to 2020, most MSR studies focused on open source software like GitHub (20%), Eclipse (8%), and Apache (7%), with 49% on other open source software. However, from 2021 onward, there was less focus on open source software and more on non-software work (16%) and Stack Overflow (19%). It suggests MSR work could benefit from exploring a more diverse range of topics beyond open source software.
First of all, I would like to thank Shaowei and Bram for inviting me to talk about biases in MSR. Rather than talking about the topics we study, as is customary in scientific presentations, in this talk I would like to reflect on what we as the MSR research community usually do not study, and what kind of stories we rarely hear.
There can be many different biases of course, but today I would like to focus on the three Ps: Projects we select, People we talk to and Problems we study.
Let us start with discussing the projects we select.
The figure comes from the analysis performed by Flint et al. of the papers published at MSR 2004-2020. The green wedge “Closed source” corresponds to merely 8% of the papers. By contrast, while different reports refer to 89% or even 97% of companies *using* OSS, TechBeacon reports that only ~35% of the code of an “average” application is OSS, so most of the code in such an application is not open source. This suggests that closed source software has traditionally been understudied in MSR. However, the data in the study of Flint et al. is historical, so maybe more recent studies are doing better?
https://techbeacon.com/security/state-open-source-commercial-apps-youre-using-more-you-think#:~:text=Open%20source%20code%20comprised%20more,as%20high%20as%2075%20percent.
This is why I checked the full papers of the 2021 edition of MSR. Closed source projects were still considered in only 2 papers, and in fact both of them used data from the same Dutch company, Adyen! GitHub is even more prominent than before, and additional data sources have been studied, such as communication logs or security vulnerabilities.
The problem seems to be that for many of us it is more difficult to get access to closed source software as it requires establishing contacts with companies, while OSS is “just there”, and in particular “just there on GitHub”. Ultimately most company-based studies seem to report data from a limited number of usually large companies such as Microsoft, Google or Adyen.
However, at least in the Netherlands, many software development companies are not that big, and focussing on large companies leaves out lots of small and medium-size companies. Moreover, not all companies developing software consider themselves software companies: on the right we see a couple of examples of companies that develop software but are not considered typical software companies: a bank, a company producing lithography machines and a company building trucks.
Companies building software are not the same as software development companies.
By the same token, people developing software are not the same as software developers, and while MSR has traditionally focussed on the latter, we have mostly overlooked the former. First of all, there are people developing software in a very different context, e.g., computational scientists or kids learning how to program. Moreover, even among professionals the title is contested: one of the people I interviewed indicated that they had only had bootcamp training, and that their university-trained peers made a point of saying that only university-trained people deserve the title of engineer.
Moreover, even when restricting our attention to software engineers, we tend to bias our results by conducting surveys and interviews solely in English. On the left we see a map of the software developer population, highlighting the importance of South America and Asia; these are developers we need to reach. On the right we see a map of English language proficiency. Comparing the two maps calls for multi-lingual surveys, including, for example, Spanish, Portuguese, Chinese and Japanese. The only example of a large-scale multi-lingual survey I am aware of is the Pandemic Programming work by Paul Ralph and his co-authors.
Moreover, what about developers not active on social media, or from countries we do not see (e.g., from countries under US sanctions that have no access to GitHub)?
And, of course, location and language are not the only demographic aspects we need to take into account: age, gender, disability, sexual orientation, socio-economic background and ethnicity all influence how people experience software development, and we should be more aware of these differences rather than assuming that the opinions of young Caucasian abled straight men from the middle class can be generalised to other developers. Disaggregation is, on the one hand, a must, but on the other hand it threatens deanonymisation on small samples.
Finally, let us talk about the problems we study. Since software engineering is an applied discipline, we are supposed to propose solutions that can benefit practitioners and most of our studies refer to “implications for developers”.
However, several researchers have argued that the problems we study are “not real”, that “real” software is much messier and more complex, and that by studying weird or simplified problems we do not really help the industry.
Picture from Lionel’s Facebook.
At the same time, this makes me wonder: if our goal is to help the industry, why should we be doing it with taxpayers’ money?