This is the fourth and final lecture in my series of four lectures on Evolving Software Ecosystems, presented during the NATO Marktoberdorf 2014 Summer School on Dependable Software Systems Engineering in Germany, August 2014.
9. July–August 2014, NATO Marktoberdorf Summer School, Dependable Software Systems Engineering
Measuring Diversity: Evenness
Based on Shannon’s notion of information entropy and the 2nd law of thermodynamics
$$H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)$$
where X = set of n distinct species x_i
p(x_i) = proportion of all individuals that belong to species x_i
Quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.
Claude Shannon, 1916–2001
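As a minimal sketch of the entropy measure on this slide, the following Python computes H(X) from a list of species labels. The function names are illustrative only, and the `evenness` helper assumes that evenness means entropy normalised by its maximum ln(n) (Pielou's evenness); the slide itself does not spell out the normalisation.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """H(X) = -sum_i p(x_i) ln p(x_i), with p estimated from label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def evenness(labels):
    """Entropy divided by its maximum ln(n); assumed normalisation (Pielou).

    By convention here, a dataset with fewer than two species has evenness 0.
    """
    n = len(set(labels))
    if n < 2:
        return 0.0
    return shannon_entropy(labels) / math.log(n)

# Hypothetical dataset: each entry is the species of one individual.
sample = ["a", "a", "a", "b", "c", "c"]
h = shannon_entropy(sample)
e = evenness(sample)
```

Entropy is highest when all species are equally common and drops to zero when every individual belongs to the same species.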
20. Econometrics: Pareto Principle
• Example for individual GNOME projects
  • Brasero
  • Evince
• Analysing different data sources
  • Commits in a version control repository
  • Mails in mailing lists
  • Issue reports in a bug tracker
• Pareto principle is confirmed in all cases
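One simple way to probe the Pareto principle on such data is to measure which share of the total activity comes from the most active 20% of contributors. The sketch below assumes activity is available as one count per contributor; the function name, the 20% default and the sample numbers are hypothetical and not taken from the study.

```python
import math

def top_share(activity_counts, top_fraction=0.2):
    """Share of total activity produced by the most active top_fraction of people."""
    counts = sorted(activity_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Number of contributors in the top group (at least one); the small
    # epsilon guards against floating-point noise such as 0.2 * 15.
    k = max(1, math.ceil(top_fraction * len(counts) - 1e-9))
    return sum(counts[:k]) / total

# Hypothetical commit counts for ten contributors.
commits_per_author = [120, 60, 8, 5, 3, 2, 1, 1, 0, 0]
share = top_share(commits_per_author)
```

The same function can be applied unchanged to mails per sender or bug report changes per person, which makes the three data sources directly comparable.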
Evidence for the Pareto principle
in Open Source Software Activity
Mathieu Goeminne and Tom Mens
Institut d’Informatique, Faculté des Sciences
Université de Mons – UMONS
Mons, Belgium
{ mathieu.goeminne | tom.mens }@umons.ac.be
Abstract—Numerous empirical studies analyse evolving open
source software (OSS) projects, and try to estimate the activity
and effort in these projects. Most of these studies, however, only
focus on a limited set of artefacts, being source code and defect
data. In our research, we extend the analysis by also taking into
account mailing list information. The main goal of this article
is to find evidence for the Pareto principle in this context, by
studying how the activity of developers and users involved in
OSS projects is distributed: it appears that most of the activity
is carried out by a small group of people. Following the GQM
paradigm, we provide evidence for this principle. We selected
a range of metrics used in economy to measure inequality in
distribution of wealth, and adapted these metrics to assess how
OSS project activity is distributed. Regardless of whether we
analyse version repositories, bug trackers, or mailing lists, and
for all three projects we studied, it turns out that the distribution
of activity is highly imbalanced.
Index Terms—software evolution, activity, software project,
data mining, empirical study, open source software, GQM, Pareto
I. INTRODUCTION
Numerous empirical studies aim to understand and model
how open source software (OSS) evolves over time [1]. In
order to gain a deeper understanding of this evolution, it
is essential to study not only the software artefacts that
evolve (e.g. source code, bug reports, and so on), but also
their interplay with the different project members (mainly
developers and users) that communicate (e.g., via mailing lists)
and collaborate in order to construct and evolve the software.
In this article, we wish to understand how activity is spread
over the different members of an OSS project, and how this
activity distribution evolves over time. Our hypothesis is that
the distribution of activity follows the Pareto principle, in the
sense that there is a small group of key persons that carry
out most of the activity, regardless of the type of considered
activity. To verify this hypothesis, we carry out an empirical
study based on the GQM paradigm [2]. We rely on concepts
borrowed from econometrics (the use of measurement in
economy), and apply them to the field of OSS evolution.
In particular, we apply indices that have been introduced
for measuring distribution (and inequality) of wealth, and
use them to measure the distribution of activity in software
development.
The remainder of this paper is structured as follows. Section II explains the methodology we followed and defines the metrics that we rely upon. Section III presents the experimental setup of our empirical study that we have carried out. Section IV presents the results of our analysis of activity distribution in three OSS projects. Section V discusses the evidence we found for the Pareto principle. Section VI presents related work, and Section VII concludes.
II. METHODOLOGY
A. GQM paradigm
To gain a deeper understanding of how OSS projects evolve, we follow the well-known Goal-Question-Metric (GQM) paradigm. Our main research Goal is to understand how activity is distributed over the different stakeholders (developers and users) involved in OSS projects. Once we have gained deeper insight in this issue, we will be able to exploit it to provide dedicated tool support to the OSS community, e.g., by helping newcomers to understand how the community is structured, by improving the way in which the community members communicate and collaborate, by trying to reduce the potential risk of the so-called bus factor¹, and so on.
To reach the aforementioned research goal, we raise the
following research Questions:
1) Is there a core group of OSS project members (developers and/or users) that are significantly more active than the other members?
2) How does the distribution of activity within an OSS community evolve over time?
3) Is there an overlap between the different types of activity (e.g., committing, mailing, submitting and changing bug reports) the community members contribute to?
4) How does the distribution of activity vary across different OSS projects?
As a third step, we need to select appropriate Metrics that will enable us to provide a satisfactory answer to each of the above research questions. For our empirical study, we will make use of basic metrics to compute the activity of OSS project members, and aggregate metrics that allow us to compare these basic metric values across members (to understand how activity is distributed), over time (to understand how they
¹ The bus factor refers to the total number of key persons (involved in the project) that would, if they were to be hit by a bus, lead the project into serious problems.
SQM 2011
26. Inequality Indices: Gini coefficient
ution profiles similar to the ones we observed
fortunately, the number of freely-available,
ystems developed in C# framework that met
criteria is rather limited. So, we began our
tems that were originally written in Java and
ed to the .NET platform in order to take ad-
the knowledge gained in the analysis of their
a counterparts.
ET metrics extraction, we used CLI [18], an
der library that provides access to both the
byte code. We added a small wrapper for the
f the Gini coefficients and stored the resulting
file for further processing with JSeat.
ed metrics data from four .NET systems:
NHibernate, SharpDevelop, and NAnt. The
ur 10 measures produced Gini coefficients
he ones determined for Java systems. How-
re also exceptions. We observed a shift ex-
i.e., individual Gini coefficients doubled in
most all measures in NAnt version 0.8.3-rc1.
fficients stayed high until version 0.84-rc1,
sumed “normal” values again. An inspection
per logs provided an explanation: in version
NAntContrib project was integrated into the
tion. This project defines a number of utili-
trics exhibit very uneven distribution profiles
changes do happen and may result in significant fluctuations
in Gini coefficients that warrant a deeper analysis (see
Figure 4 showing selected Gini profiles for 51 consecutive
releases of the Spring framework). But why do we see such
a remarkable stability of Gini coefficients?
Figure 4. Selected Gini profiles in Spring.
Developers accumulate system competence over time.
Proven techniques to solve a given problem prevail, where
untested or weak practices have little chance of survival.
If a team has historically built software in a certain way,
then it will continue to prefer a certain approach over others.
Moreover, we can expect that most problems in a given
domain are similar, hence the means taken to tackle them
would be similar, too. Tversky and Kahneman coined the
Vasa et al. Comparative analysis of evolving software systems using the Gini coefficient. ICSM 2009
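For readers unfamiliar with the Gini coefficient, here is a minimal sketch of one standard way to compute it from a list of non-negative values (for example a metric value per class, or an activity count per contributor). It uses the sorted-rank formula G = 2·Σ(i·x_i) / (n·Σx_i) − (n+1)/n; the function name and sample data are illustrative, and this is not necessarily the exact variant used by the cited authors.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 = perfectly equal,
    (n - 1) / n = all mass concentrated in a single element."""
    xs = sorted(values)          # ascending order, ranks i = 1..n
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0               # convention for empty or all-zero input
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical per-contributor commit counts.
equal = gini([10, 10, 10, 10])
skewed = gini([0, 0, 0, 40])
```

Tracking this value release by release, or month by month, gives the kind of Gini profile discussed on these slides.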
27. Inequality Indices: Gini coefficient
Vasa et al. Comparative analysis of evolving software systems using the Gini coefficient. ICSM 2009
[Figure: two line charts, Gnome/Brasero (Jan-06 to Jan-10) and Gnome/Evince (Jan-99 to Jan-10); y-axis from 0 to 1; series: commits, mails, bug report changes]
Goeminne et al. Evidence for the Pareto principle in open source software activity. SQM 2011
29. Inequality Indices: Theil index
Commits sent
E-mails sent
Bug reports modified
Evince
Brasero
Evolution of Theil index for 2 GNOME projects
SQM 2011
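The Theil index can be sketched in a few lines as well. The version below uses the common definition T = (1/n)·Σ (x_i/μ)·ln(x_i/μ), where μ is the mean and terms with x_i = 0 contribute nothing. The optional normalisation by ln(n), which maps the result to the range 0 to 1, is an assumption made here for comparability; the slides do not state whether the plotted values were normalised.

```python
import math

def theil(values, normalised=False):
    """Theil index of non-negative values; 0 means perfectly equal.

    The unnormalised maximum is ln(n), reached when one element holds everything.
    """
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    # Zero values are skipped: x * ln(x) tends to 0 as x tends to 0.
    t = sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n
    if normalised and n > 1:
        return t / math.log(n)
    return t

# Hypothetical monthly activity counts per contributor.
balanced = theil([5, 5, 5, 5])
concentrated = theil([0, 0, 0, 20])
```

Computing the index separately for commits, e-mails and bug report changes in each time window yields one curve per activity type.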
34. Case Study: GNOME
Some references
To appear in 2013 in Springer’s Empirical Software Engineering journal
On the variation and specialisation of workload – A
case study of the Gnome ecosystem community
Bogdan Vasilescu · Alexander Serebrenik ·
Mathieu Goeminne · Tom Mens
DOI: 10.1007/s10664-013-9244-1
Abstract Most empirical studies of open source software repositories focus on the analysis of isolated projects, or restrict themselves to the study of the relationships between technical artifacts. In contrast, we have carried out a case study that focuses on the actual contributors to software ecosystems, being collections of software projects that are maintained by the same community. To this aim, we defined a new series of workload and involvement metrics, as well as a novel approach, eT-graphs, for reporting the results of comparing multiple distributions. We used these techniques to statistically study how workload and involvement of ecosystem contributors varies across projects and across activity types, and we explored to which extent projects and contributors specialise in particular activity types. Using Gnome as a case study we observed that, next to coding, the activities of localization, development documentation and building are prevalent throughout the ecosystem. We also observed notable differences between frequent and occasional contributors in terms of the activity types they are involved in and the number of projects they contribute to. Occasional contributors and contributors that are involved in many different projects tend to be more involved in the localization activity, while frequent contributors tend to be more involved in the coding activity in a limited number of projects.
Keywords open source · software ecosystem · metrics · developer community ·
case study
B. Vasilescu and A. Serebrenik
MDSE, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Nether-
UMONS
Faculté des Sciences
Département d’Informatique
Understanding the Evolution of
Socio-technical Aspects in Open Source
Ecosystems: An Empirical Analysis of
GNOME
Mathieu Goeminne
A dissertation submitted in fulfillment of the requirements of
the degree of Docteur en Sciences
Advisor
Dr. TOM MENS, Université de Mons, Belgium
Jury
Dr. XAVIER BLANC, Université de Bordeaux 1, France
Dr. VÉRONIQUE BRUYÈRE, Université de Mons, Belgium
Dr. JESUS M. GONZALEZ-BARAHONA, Universidad Rey Juan Carlos, Spain
Dr. TOM MENS, Université de Mons, Belgium
Dr. ALEXANDER SEREBRENIK, Technische Universiteit Eindhoven, The Netherlands
Dr. JEF WIJSEN, Université de Mons, Belgium
June 2013
A historical dataset for GNOME contributors
Mathieu Goeminne, Maëlick Claes and Tom Mens
Software Engineering Lab, COMPLEXYS research institute, UMONS, Belgium
Abstract—We present a dataset of the open source
software ecosystem GNOME from a social point of view.
We have collected historical data about the contributors
to all GNOME projects stored on git.gnome.org, taking
into account the problem of identity matching, and associating
different activity types to the contributors. This
type of information is very useful to complement the
traditional, source-code related information one can obtain
by mining and analyzing the actual source code.
The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.
I. INTRODUCTION
In this paper, we present the process we have used to create a dataset containing the historical information related to contributors to the GNOME ecosystem. Our database and the tools and scripts used to create it can be found on a dedicated Bitbucket repository².
In contrast to many other datasets, we do not focus on source code, since a significant amount of files committed to GNOME’s project repositories do not even contain code (e.g., image files, web pages, documentation, localization and many more). Such type of information is often ignored in MSR research while it is very relevant to understand which types of activities contributors are
MSR 2013