MOD2014-Mens-Lecture4

Evolving Software Ecosystems
Marktoberdorf Summer School 2014
Lecture 4
Tom Mens
Software Engineering Lab
University of Mons
informatique.umons.ac.be/genlog

Ecosystem Measures
• The characteristics of a software ecosystem can be measured in different ways
– Using traditional software quality metrics
– Using ecological diversity metrics
– Using econometrics

Ecosystem Measures
Software Quality Metrics

July-August 2014, NATO Marktoberdorf Summer School, Dependable Software Systems Engineering
Ecosystem Measures
• Software product (code) metrics
– size metrics, e.g. LOC, NOM
– complexity metrics, e.g. cyclomatic complexity
– coupling and cohesion metrics, e.g. LCOM, CBO
– dependency metrics, e.g. fan-in, fan-out

Ecosystem Measures
• Side remark
• The distribution of most of these metrics is highly skewed
• Traditional aggregation measures (mean, median) are only reliable for centralised distributions
• We need other aggregation measures for skewed distributions
Mordal et al. "Software quality metrics aggregation in industry", J. Software: Evolution and Process (2012)

Ecosystem Measures
Measuring Diversity
Many different diversity metrics:
• species richness
– the number of different species represented in an ecological community
• species evenness (entropy)
– the relative abundance of the population of each species in the ecosystem
• Shannon diversity index (relative entropy)
– how specialised a given species is in relation to the species in the other level
• Simpson index
– the degree of concentration when individuals are classified into species

Measuring Diversity
Evenness
• Quantifies the relative abundance of the population of each species in the ecosystem
– Maximum evenness if all species are equally abundant (i.e., have the same number of individuals)
– Low evenness if some species dominate the others
• Can be measured using Shannon's notion of information entropy

Evenness
Based on Shannon's notion of information entropy and the 2nd law of thermodynamics:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)$$
where X = set of n distinct species x_i
p(x_i) = proportion of all individuals that belong to species x_i
Quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.
(Pictured: Claude Shannon, 1916-2001)
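The entropy formula for evenness can be tried out on a small example. The following is a minimal sketch in Python; the function name `shannon_entropy` and the two sample populations are illustrative assumptions, not part of the lecture.

```python
import math
from collections import Counter

def shannon_entropy(individuals):
    """H(X) = -sum p(x_i) ln p(x_i), where p(x_i) is the share of species x_i."""
    counts = Counter(individuals)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Made-up populations: four species, equally abundant vs. one dominant species
even = ["a", "b", "c", "d"] * 5
skewed = ["a"] * 17 + ["b", "c", "d"]

print(shannon_entropy(even))    # maximal for 4 species: ln 4
print(shannon_entropy(skewed))  # lower, because species "a" dominates
```

With n equally abundant species the entropy reaches its maximum ln n, which is why low values indicate low evenness.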

Evenness
Dual views in a software ecosystem
• Based on species analogy
✦ Contributors are species that thrive in their environment of projects
✦ Projects are species that thrive in their environment of contributors (human resources)
[Figure: bipartite contributor-project graph linking contributors to project 1, project 2 and project 3]

Evenness
Two dual measures of entropy
• Based on bipartite author (contributor) - module (project) graph
M = set of n distinct modules m_i
A = set of k distinct authors a_j
M_i = # commits to module m_i
A_j = # commits by author a_j
a_ij = (# commits to module m_i by author a_j) / A_j
m_ij = (# commits to module m_i by author a_j) / M_i
• Author diversity
$$H_{a_j} = -\sum_{i=1}^{n} a_{ij} \ln a_{ij}$$
• Module diversity
$$H_{m_i} = -\sum_{j=1}^{k} m_{ij} \ln m_{ij}$$
[Figure: bipartite graph linking authors to module 1, module 2 and module 3]
Posnett et al. Dual ecological measures of focus in software development. ICSE 2013
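The two dual entropies can be computed from a module-by-author commit matrix. The sketch below is only an illustration under stated assumptions: the matrix `commits` is made-up data, the helper `entropy` is hypothetical, and terms with a zero proportion are skipped (the usual convention that 0 ln 0 = 0).

```python
import math

def entropy(proportions):
    """-sum p ln p, skipping zero proportions (0 ln 0 is taken as 0)."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

# commits[i][j] = number of commits to module m_i by author a_j (made-up data)
commits = [
    [10, 0],
    [10, 1],
    [10, 0],
]
n, k = len(commits), len(commits[0])
M = [sum(row) for row in commits]                              # M_i: commits per module
A = [sum(commits[i][j] for i in range(n)) for j in range(k)]   # A_j: commits per author

# H_a[j] uses a_ij = commits[i][j] / A_j, summed over modules i
H_a = [entropy(commits[i][j] / A[j] for i in range(n)) for j in range(k)]
# H_m[i] uses m_ij = commits[i][j] / M_i, summed over authors j
H_m = [entropy(commits[i][j] / M[i] for j in range(k)) for i in range(n)]

print(H_a, H_m)
```

In this toy matrix the first author spreads commits evenly over three modules (entropy ln 3), while the second author touches a single module (entropy 0).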

Low diversity if an author dominates most commit activity
Low diversity if a module dominates most commit activity

Shannon's diversity index
Expresses how specialised a given species is in relation to the species in the other level
Using a notion of relative entropy
Taking into account the contributor-project duality
[Figure: bipartite graphs. One links contributors to Project 1, Project 2 and Project 3; the other links the names Thiruvalluvan, Douglas and Phillip to the modules avro.genavro, avro.io.parsing, avro.io, avro.generic, avro.reflect, avro.specific, avro, avro.file, avro.tool, avro.util, avro.mapred.tether, avro.mapred, default, avro.idl, avro.ipc, avro.ipc.trace and avro.ipc.stats]

Shannon's diversity index
Relative Entropy
Specialisation of a species relative to the species in the other level
Takes into account the interaction between authors and modules as well as the overall amount of activity per author or module.
– M_i and A_j defined as before
– m_ij and a_ij defined as before
– C = total # commits
• Author (contributor) specialisation
$$F_{a_j} = -\sum_{i=1}^{n} a_{ij} \ln \frac{a_{ij}}{M'_i}$$
• Module (project) specialisation
$$F_{m_i} = -\sum_{j=1}^{k} m_{ij} \ln \frac{m_{ij}}{A'_j}$$

Shannon's diversity index
Relative Entropy
• Attention focus = normalisation of specialisation by the theoretical maximum and minimum possible values
• Findings by Posnett et al.
– Project leaders and top contributors tend to exhibit lower attention focus than others.
– Narrowly focused developers introduce fewer defects.
– Increased module activity focus results in a greater number of defects.
Can be computed with R package 'bipartite'

Simpson index
» Measures the degree of concentration when individuals are classified into species
• I.e., the probability that two individuals taken at random from the dataset belong to the same species
• Is minimal when all species are equally abundant
• For small datasets (two individuals drawn without replacement):
$$\frac{\sum_{i=1}^{R} n_i (n_i - 1)}{N (N - 1)}$$
• For large datasets (two individuals drawn with replacement):
$$\sum_{i=1}^{R} \left( \frac{n_i}{N} \right)^2$$
• R = number of species types
• N = number of entities in the dataset
• n_i = number of entities belonging to the i-th species type
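A small sketch can make the two Simpson variants concrete. It assumes the usual textbook forms: the probability that two individuals drawn without replacement (small datasets) or with replacement (large datasets) belong to the same species. The function names and sample data are illustrative assumptions.

```python
from collections import Counter

def simpson_large(individuals):
    """Sum of squared proportions (n_i / N)^2: draws with replacement."""
    counts = Counter(individuals).values()
    n_total = sum(counts)
    return sum((c / n_total) ** 2 for c in counts)

def simpson_small(individuals):
    """Sum of n_i (n_i - 1) divided by N (N - 1): draws without replacement."""
    counts = Counter(individuals).values()
    n_total = sum(counts)
    return sum(c * (c - 1) for c in counts) / (n_total * (n_total - 1))

# Made-up populations with R = 4 species and N = 20 individuals
even = ["a", "b", "c", "d"] * 5
skewed = ["a"] * 17 + ["b", "c", "d"]

print(simpson_large(even), simpson_large(skewed))
print(simpson_small(even), simpson_small(skewed))
```

Both variants are smallest for the evenly distributed population and grow as one species dominates.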

Ecosystem Measures
Econometrics
• Econometrics are measures used in economics
• Well-known examples
– Pareto principle
– Inequality indices

Econometrics
Pareto Principle
• A.k.a. the 80-20 rule
• Roughly 80% of the effects come from 20% of the causes.
• Often coincides with a power law distribution
Examples
• 80% of land owned by 20% of the population
• 80% of sales come from 20% of clients
• 80% of crashes come from the 20% most reported bugs
(Pictured: Pareto)

Econometrics
Pareto Principle
Example in the GNOME ecosystem
• 20% of all contributors account for about 80% of the total workload in the GNOME code repository
[Figure: cumulative percentage of workload plotted against cumulative percentage of contributors, both axes from 0.0 to 1.0]
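One simple way to check an 80-20 claim on activity data is to rank contributors by workload and compute the share held by the top 20%. The sketch below is an assumption-laden illustration: `top_share` is a hypothetical helper and the commit counts are invented, not GNOME data.

```python
def top_share(workloads, fraction=0.2):
    """Share of the total workload contributed by the most active `fraction` of contributors."""
    ranked = sorted(workloads, reverse=True)
    top_n = max(1, round(fraction * len(ranked)))
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical commit counts for ten contributors
commits = [400, 400, 50, 40, 30, 30, 20, 15, 10, 5]

print(top_share(commits))  # share of commits made by the top 2 of 10 contributors
```

For a perfectly even distribution the top 20% would account for exactly 20% of the workload, so values far above 0.2 signal concentration.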

Econometrics
Pareto Principle
• Example for individual GNOME projects
– Brasero
– Evince
• Analysing different data sources
– Commits in a version control repository
– Mails in a mailing list
– Issue reports in a bug tracker
• Pareto principle is confirmed in all cases
Evidence for the Pareto principle in Open Source Software Activity
Mathieu Goeminne and Tom Mens
Institut d'Informatique, Faculté des Sciences, Université de Mons – UMONS, Mons, Belgium
{ mathieu.goeminne | tom.mens }@umons.ac.be
SQM 2011

Abstract: Numerous empirical studies analyse evolving open source software (OSS) projects, and try to estimate the activity and effort in these projects. Most of these studies, however, only focus on a limited set of artefacts, being source code and defect data. In our research, we extend the analysis by also taking into account mailing list information. The main goal of this article is to find evidence for the Pareto principle in this context, by studying how the activity of developers and users involved in OSS projects is distributed: it appears that most of the activity is carried out by a small group of people. Following the GQM paradigm, we provide evidence for this principle. We selected a range of metrics used in economy to measure inequality in distribution of wealth, and adapted these metrics to assess how OSS project activity is distributed. Regardless of whether we analyse version repositories, bug trackers, or mailing lists, and for all three projects we studied, it turns out that the distribution of activity is highly imbalanced.

Index Terms: software evolution, activity, software project, data mining, empirical study, open source software, GQM, Pareto

I. INTRODUCTION
Numerous empirical studies aim to understand and model how open source software (OSS) evolves over time [1]. In order to gain a deeper understanding of this evolution, it is essential to study not only the software artefacts that evolve (e.g. source code, bug reports, and so on), but also their interplay with the different project members (mainly developers and users) that communicate (e.g., via mailing lists) and collaborate in order to construct and evolve the software.
In this article, we wish to understand how activity is spread over the different members of an OSS project, and how this activity distribution evolves over time. Our hypothesis is that the distribution of activity follows the Pareto principle, in the sense that there is a small group of key persons that carry out most of the activity, regardless of the type of considered activity. To verify this hypothesis, we carry out an empirical study based on the GQM paradigm [2]. We rely on concepts borrowed from econometrics (the use of measurement in economy), and apply them to the field of OSS evolution. In particular, we apply indices that have been introduced for measuring distribution (and inequality) of wealth, and use them to measure the distribution of activity in software development.
The remainder of this paper is structured as follows. Section II explains the methodology we followed and defines the metrics that we rely upon. Section III presents the experimental setup of our empirical study that we have carried out. Section IV presents the results of our analysis of activity distribution in three OSS projects. Section V discusses the evidence we found for the Pareto principle. Section VI presents related work, and Section VII concludes.

II. METHODOLOGY
A. GQM paradigm
To gain a deeper understanding of how OSS projects evolve, we follow the well-known Goal-Question-Metric (GQM) paradigm. Our main research Goal is to understand how activity is distributed over the different stakeholders (developers and users) involved in OSS projects. Once we have gained deeper insight in this issue, we will be able to exploit it to provide dedicated tool support to the OSS community, e.g., by helping newcomers to understand how the community is structured, by improving the way in which the community members communicate and collaborate, by trying to reduce the potential risk of the so-called bus factor [1], and so on.
To reach the aforementioned research goal, we raise the following research Questions:
1) Is there a core group of OSS project members (developers and/or users) that are significantly more active than the other members?
2) How does the distribution of activity within an OSS community evolve over time?
3) Is there an overlap between the different types of activity (e.g., committing, mailing, submitting and changing bug reports) the community members contribute to?
4) How does the distribution of activity vary across different OSS projects?
As a third step, we need to select appropriate Metrics that will enable us to provide a satisfactory answer to each of the above research questions. For our empirical study, we will make use of basic metrics to compute the activity of OSS project members, and aggregate metrics that allow us to compare these basic metric values across members (to understand how activity is distributed), over time (to understand how they ...

[1] Footnote: The bus factor refers to the total number of key persons (involved in the project) that would, if they were to be hit by a bus, lead the project into serious problems.

Econometrics
Pareto Principle
[Figure: two panels, Brasero and Evince, both axes from 0 to 1; series: commits, mails, bug report changes]

Econometrics
Lorenz curve
• A graphical representation for a cumulative distribution of values
• Example for income/wealth distribution
– A point (x,y) on the graph indicates that the poorest x% of persons have a total cumulative income of y%.
• Example for ecology/biodiversity
– The cumulative proportion of species is plotted against the cumulative proportion of individuals.
• Can be used to check the Pareto principle
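The points of a Lorenz curve can be computed by sorting the values in ascending order and accumulating them. This is a minimal sketch; the function name `lorenz_points` and the income values are illustrative assumptions.

```python
def lorenz_points(values):
    """Return (x, y) points: the poorest share x of the population holds share y of the total."""
    ordered = sorted(values)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for rank, value in enumerate(ordered, start=1):
        running += value
        points.append((rank / len(ordered), running / total))
    return points

incomes = [40, 10, 30, 20]  # made-up values
for x, y in lorenz_points(incomes):
    print(f"poorest {x:.0%} hold {y:.0%} of the total")
```

Reading the curve at x = 0.8 shows what the least active 80% contribute, so the Pareto principle holds roughly when that value is near 0.2.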

Econometrics
Inequality Indices
• Are used to measure the amount of inequality in a statistical distribution
– Examples: Gini, Theil, Hoover, Kolm, Atkinson, …
• Values typically range between 0 and 1
– 0 = perfect equality
– 1 = maximal inequality
• Are useful for skewed distributions, where use of mean and median as aggregation measure is not very meaningful
• Are all correlated, in practice …

Econometrics
Inequality Indices
• Examples (and definitions): Gini, Theil, Atkinson, Hoover, Kolm
• Inequality indices have been used in empirical software engineering to study the evolution of software metrics

Inequality Indices
Gini coefficient
• The Gini coefficient measures the inequality among values of a frequency distribution
– 0 = perfect equality
– 1 - 1/n = maximal inequality
• Is computed based on the areas above and below the Lorenz curve:
Gini = A / (A+B)
where A is the area between the line of perfect equality and the Lorenz curve, and B is the area below the Lorenz curve
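The area-based definition Gini = A / (A+B) can be sketched directly. The code assumes that B is the area under the piecewise-linear Lorenz curve and that A + B = 1/2 (the triangle under the line of perfect equality), so Gini = 1 - 2B. The function name `gini` and the sample data are illustrative assumptions.

```python
def gini(values):
    """Gini = A / (A + B) = 1 - 2B, with B the area under the Lorenz curve."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    cumulative = 0
    area_b = 0.0
    for v in ordered:
        # Trapezoid of width 1/n between two consecutive Lorenz points
        area_b += (2 * cumulative + v) / (2 * total * n)
        cumulative += v
    return 1 - 2 * area_b

print(gini([5, 5, 5, 5]))  # perfect equality
print(gini([0, 0, 0, 1]))  # one individual holds everything: 1 - 1/n with n = 4
```

The second call illustrates why the maximum for n values is 1 - 1/n rather than 1.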

Inequality Indices
Gini coefficient
[Slide shows a cropped two-column excerpt from Vasa et al.; the left column is cut off mid-line. The legible right column reads:]
"[...] changes do happen and may result in significant fluctuations in Gini coefficients that warrant a deeper analysis (see Figure 4 showing selected Gini profiles for 51 consecutive releases of the Spring framework). But why do we see such a remarkable stability of Gini coefficients?
Figure 4. Selected Gini profiles in Spring.
Developers accumulate system competence over time. Proven techniques to solve a given problem prevail, where untested or weak practices have little chance of survival. If a team has historically built software in a certain way, then it will continue to prefer a certain approach over others. Moreover, we can expect that most problems in a given domain are similar, hence the means taken to tackle them would be similar, too. Tversky and Kahneman coined the [...]"
Vasa et al. Comparative analysis of evolving software systems using the Gini coefficient. ICSM 2009

[Figure: two time-series panels, Gnome/Brasero and Gnome/Evince; y-axis from 0 to 1; series: commits, mails, bug report changes]
Goeminne et al. Evidence for the Pareto principle in open source software activity. SQM 2011

Inequality Indices
Theil index
• Is defined as
$$T = \frac{1}{N} \sum_{i=1}^{N} \frac{x_i}{\bar{x}} \ln \frac{x_i}{\bar{x}}$$
(x_i = value of the i-th of N individuals, \bar{x} = mean value)
and gives a value between 0 and ln N
• Corresponds to the notion of redundancy in information theory
• Normalised Theil index is obtained by dividing by ln N and gives values between 0 and 1
– 0 = equal distribution
– 1 = maximally unequal distribution
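A short sketch of the Theil index and its normalised form follows. It assumes the standard Theil T definition, T = (1/N) Σ (x_i / mean) ln(x_i / mean), which is the variant bounded by ln N; terms with x_i = 0 are skipped (0 ln 0 taken as 0). Function names and data are illustrative assumptions.

```python
import math

def theil(values):
    """Theil T index; ranges from 0 (equal distribution) to ln N."""
    n = len(values)
    mean = sum(values) / n
    return sum((v / mean) * math.log(v / mean) for v in values if v > 0) / n

def theil_normalised(values):
    """Theil index divided by ln N, giving a value between 0 and 1."""
    return theil(values) / math.log(len(values))

print(theil_normalised([3, 3, 3, 3]))  # equal distribution
print(theil_normalised([0, 0, 0, 1]))  # one individual holds everything
```

Normalising by ln N makes values comparable across datasets of different sizes, for example across projects with different numbers of contributors.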

Inequality Indices
Theil index
[Figure: Evince panel; series: commits sent, e-mails sent, bug reports modified]
[Figure: Brasero panel]
Evolution of Theil index for 2 GNOME projects (SQM 2011)

Econometrics
Inequality Indices
Example: Comparison of (evolution of) inequality indices for Evince
[Figure: time series from Apr-99 to Aug-08, y-axis from 0 to 1; series: Gini, Hoover, Theil (normalised)]

Software Ecosystems
Case Study: GNOME
Vasilescu et al. On the variation and specialisation of workload: A case study of the GNOME ecosystem community. Emp. Softw. Eng. 2014

Overall goal revisited
Improve support (tools/guidelines/models/…) for dealing with changes in open source software ecosystems
– Improve chance of survival of a project within its ecosystem
– Improve resilience of an ecosystem as a whole
– Allow changes to be made more effectively (e.g. higher productivity, faster reaction to/implementation of change/bug requests)
– Increase accuracy of effort/cost estimation models, defect prediction models and so on

Case Study: GNOME
Observation: existing generic support does not take the specificities of the ecosystem into account, making the support suboptimal.
Assumption: specialised ecosystem-specific change support will be more effective.
Consequence: We need to understand the socio-technical specificities of the ecosystem under study in order to provide more effective change support.
This is what we will do for the GNOME ecosystem.

Case Study: GNOME
Some references

On the variation and specialisation of workload – A case study of the Gnome ecosystem community
Bogdan Vasilescu · Alexander Serebrenik · Mathieu Goeminne · Tom Mens
To appear in 2013 in Springer's Empirical Software Engineering journal. DOI: 10.1007/s10664-013-9244-1
Abstract: Most empirical studies of open source software repositories focus on the analysis of isolated projects, or restrict themselves to the study of the relationships between technical artifacts. In contrast, we have carried out a case study that focuses on the actual contributors to software ecosystems, being collections of software projects that are maintained by the same community. To this aim, we defined a new series of workload and involvement metrics, as well as a novel approach, T̃-graphs, for reporting the results of comparing multiple distributions. We used these techniques to statistically study how workload and involvement of ecosystem contributors varies across projects and across activity types, and we explored to which extent projects and contributors specialise in particular activity types. Using Gnome as a case study we observed that, next to coding, the activities of localization, development documentation and building are prevalent throughout the ecosystem. We also observed notable differences between frequent and occasional contributors in terms of the activity types they are involved in and the number of projects they contribute to. Occasional contributors and contributors that are involved in many different projects tend to be more involved in the localization activity, while frequent contributors tend to be more involved in the coding activity in a limited number of projects.
Keywords: open source · software ecosystem · metrics · developer community · case study
B. Vasilescu and A. Serebrenik: MDSE, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Nether-

Understanding the Evolution of Socio-technical Aspects in Open Source Ecosystems: An Empirical Analysis of GNOME
Mathieu Goeminne
UMONS, Faculté des Sciences, Département d'Informatique
A dissertation submitted in fulfillment of the requirements of the degree of Docteur en Sciences, June 2013
Advisor: Dr. Tom Mens (Université de Mons, Belgium)
Jury: Dr. Xavier Blanc (Université de Bordeaux 1, France), Dr. Véronique Bruyère (Université de Mons, Belgium), Dr. Jesus M. Gonzalez-Barahona (Universidad Rey Juan Carlos, Spain), Dr. Tom Mens, Dr. Alexander Serebrenik (Technische Universiteit Eindhoven, The Netherlands), Dr. Jef Wijsen

A historical dataset for GNOME contributors
Mathieu Goeminne, Maëlick Claes and Tom Mens
Software Engineering Lab, COMPLEXYS research institute, UMONS, Belgium
MSR 2013
Abstract: We present a dataset of the open source software ecosystem GNOME from a social point of view. We have collected historical data about the contributors to all GNOME projects stored on git.gnome.org, taking into account the problem of identity matching, and associating different activity types to the contributors. This type of information is very useful to complement the traditional, source-code related information one can obtain by mining and analyzing the actual source code. The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.
I. INTRODUCTION
In this paper, we present the process we have used to create a dataset containing the historical information related to contributors to the GNOME ecosystem. Our database and the tools and scripts used to create it can be found on a dedicated Bitbucket repository.
In contrast to many other datasets, we do not focus on source code, since a significant amount of files committed to GNOME's project repositories do not even contain code (e.g., image files, web pages, documentation, localization and many more). Such type of information is often ignored in MSR research while it is very relevant to understand which types of activities contributors are ...

Case Study: GNOME
Characteristics
Open source desktop environment for Linux
• > 16 years of activity (1997 → …)
• Projects (Git repositories stored at http://git.gnome.org)
– > 1400 projects
• Contributors
– > 11000 contributor accounts
– after identity merging, > 5800 contributors
– after filtering code activity, > 4300 coders
• Commits and file touches
– > 1.3M commits (of which > 0.6M code commits)
– > 12M file touches (of which > 6M code file touches)

Case Study: GNOME
Characteristics
[Figure: Gnome use case]
Case Study: GNOME
Programming language
Relation between programming language used and code size
Mainly C/C++ and Python
[Figure: scatter plot of LOC against number of files per programming language, logarithmic axes; languages shown: C, Java, Objective C, Python, Lisp, JS, ASP.Net, C/C++ Header, C++, Perl, yacc, C#, IDL, Haskell, Objective C++, lex, Assembly, Visual Basic, PHP, Ruby, Tcl/Tk]

Case Study: GNOME
Characteristics
Dataset shared on
https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge/downloads
FLOSSMetrics compliant MySQL database
Goeminne et al. "A historical dataset for GNOME contributors", MSR 2013

Case Study: GNOME
Characteristics
Bipartite contributor-project graph
> 5800 contributors (> 4300 coders)
> 1400 projects
[Figure: bipartite graph linking contributors to project 1, project 2 and project 3]

Case Study: GNOME
Workload Distribution
How is workload distributed over different authors and projects?

Case Study: GNOME
How is workload distributed over different authors and projects per activity type?
[Figure: activity types shown: Image, Code, Documentation, Translation]
Case Study: GNOME
How is workload distributed over different authors and projects per activity type?
Two dual views (cf. bipartite contributor-project graph)
– Distribution of workload over different projects per activity type
– Distribution of workload over different authors per activity type

Case Study: GNOME
– Extract file information for each commit in the git repository of each GNOME project
– Associate a unique activity type t to each file
– Count the number of file touches
Based on [Robles2006]
[Figure: each file (e.g. /foo/bar.c) is matched against a list of rules (e.g. .*.c -> CODE) to obtain its activity type (here CODE)]

Case Study: GNOME
Basic workload metric:
APTW(a,p,t) = number of file touches of an author a for a given project p and activity type t
Derived metrics: sum and Gini coefficient
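The three steps (extract touched files, map each file to one activity type, count file touches) can be sketched as follows. Everything concrete here is an assumption for illustration: the rule list `RULES`, the activity names other than CODE, the helper `activity_type`, and the sample touches are made up; the actual GNOME rule set is not reproduced on the slides.

```python
import re
from collections import Counter

# Hypothetical classification rules; the first matching pattern wins
RULES = [
    (re.compile(r".*\.c$"), "CODE"),
    (re.compile(r".*\.po$"), "TRANSLATION"),
    (re.compile(r".*\.png$"), "IMAGE"),
]

def activity_type(path):
    """Map a file path to exactly one activity type."""
    for pattern, activity in RULES:
        if pattern.match(path):
            return activity
    return "UNKNOWN"

# Made-up file touches: one (author, project, path) tuple per touched file per commit
touches = [
    ("alice", "evince", "/foo/bar.c"),
    ("alice", "evince", "/foo/baz.c"),
    ("alice", "evince", "/po/fr.po"),
    ("bob", "brasero", "/icons/logo.png"),
]

# APTW(a, p, t): number of file touches of author a in project p for activity type t
aptw = Counter((a, p, activity_type(path)) for a, p, path in touches)

# Derived metric by summation: total workload per author
author_workload = Counter()
for (a, _p, _t), count in aptw.items():
    author_workload[a] += count

print(aptw)
print(author_workload)
```

Summing APTW over other dimensions (per project, per activity type) follows the same pattern, and the resulting distributions can then be fed to an inequality index such as Gini.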

Case Study: GNOME
Workload Metrics
How does the workload vary from one GNOME project to another?
How does the workload vary from one GNOME contributor to another?
Activity measure used: the number of modifications made to files.

Case Study: GNOME
Workload Metrics
Main findings
Workload is log-normally distributed over GNOME projects

Case Study: GNOME
Workload Metrics
Main findings
The majority of GNOME authors are involved in a very low number of file touches.
[Figure: histogram of log(AW) (x-axis, 0 to 12) against number of authors (y-axis, 0 to 600); annotations: 50% of authors have < 14 changes (occasional authors); 185,874 changes (frequent authors)]

Case Study: GNOME
Workload Metrics
Main findings
Highest workload is represented by coding activity, followed by activities of development documentation, translation/internationalisation, and build file creation.
[Figure: TW(t) per activity type]

Case Study: GNOME
Relative importance of activity types
What are the favourite activity types for GNOME?
Two dual views
- Relative importance of each activity type per author
- Relative importance of each activity type per project

Case Study: GNOME
What are the favourite activity types for GNOME?
Approach
• Use statistical tests to compare distributions
• Verify if a data set corresponding to an activity type tends to have higher values than a data set corresponding to another activity type

Case Study: GNOME
Examples of statistical comparison tests
• (Wilcoxon-)Mann–Whitney U test
• Kruskal-Wallis test
Problems with traditional statistical tests:
• Not robust to populations of unequal sizes
• Different tests can be inconsistent with each other
• Pairwise comparison of all activity types requires 78 different combinations (12 * 13 / 2)
• Traditional tests are not transitive
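
The count of 78 is the number of unordered pairs among 13 activity types. The sketch below enumerates those pairs and, as an illustration of what one pairwise comparison measures, computes the Mann–Whitney U statistic directly from its definition. The type labels and the function name are hypothetical, and no p-value is computed.

```python
from itertools import combinations
from math import comb

# Hypothetical labels for 13 activity types; only the count matters here.
activity_types = [f"type{i}" for i in range(1, 14)]

# Every unordered pair of activity types that a pairwise procedure must compare.
pairs = list(combinations(activity_types, 2))


def mann_whitney_u(xs, ys):
    """U statistic of xs versus ys: each pair with x > y counts 1, each tie 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
```

`len(pairs)` equals `comb(13, 2)`, and `mann_whitney_u(xs, ys) + mann_whitney_u(ys, xs)` always equals `len(xs) * len(ys)`, so a value of U near that product indicates that `xs` tends to have the higher values.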

Case Study: GNOME
Solution:
• Use a single test that respects transitivity
• T̃ procedure [Konietschke et al. 2012]

Case Study: GNOME
T̃ procedure
Pair  Low    High
B-A   -0.56  -0.44
C-A   -0.50  -0.31
D-A   -0.32  -0.03
C-B   -0.01   0.24
D-B    0.24   0.47
D-C    0.09   0.40
Resulting graph: A→B, A→C, A→D, D→B, D→C
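
The table can be read as follows, under the assumption (consistent with the five arrows listed, but not stated on the slide) that an arrow X→Y is drawn whenever the interval for a pair excludes zero, pointing from the group with the higher values to the one with the lower values. The function names are hypothetical; the computation of the intervals themselves by the T̃ procedure is not reproduced here.

```python
# Confidence intervals copied from the table: pair (X, Y) stands for "X-Y" -> (low, high).
intervals = {
    ("B", "A"): (-0.56, -0.44),
    ("C", "A"): (-0.50, -0.31),
    ("D", "A"): (-0.32, -0.03),
    ("C", "B"): (-0.01, 0.24),
    ("D", "B"): (0.24, 0.47),
    ("D", "C"): (0.09, 0.40),
}


def dominance_edges(intervals):
    """Return edges (higher, lower) for every interval that excludes zero."""
    edges = []
    for (x, y), (low, high) in intervals.items():
        if high < 0:    # X-Y entirely negative: Y tends to be higher than X
            edges.append((y, x))
        elif low > 0:   # X-Y entirely positive: X tends to be higher than Y
            edges.append((x, y))
    return edges        # intervals containing zero (here C-B) yield no edge


def is_transitive(edges):
    """Check that a->b and b->d always imply a->d."""
    es = set(edges)
    return all((a, d) in es for (a, b) in es for (c, d) in es if b == c)
```

Applied to the table, `dominance_edges(intervals)` reproduces the five arrows, and `is_transitive` confirms that the resulting relation is transitive for this example.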

Case Study: GNOME
[Figure: by author]

Case Study: GNOME
[Figure panels: by author; by project]

Case Study: GNOME
GNOME projects
and authors are
code-centric

Case Study: GNOME
GNOME projects and authors are mainly involved in 4 activity types
!

Case Study: GNOME
Heterogeneous communities
Does the relative importance of activity types differ between frequent and occasional authors?
Idea
Split the authors into two bins of roughly equal size, based on the author workload:
• about 50% of all authors were involved in < 14 file touches
[Figure: histogram of log(AW) (x-axis, 0 to 12) against number of authors (y-axis, 0 to 600); 50% of the authors made < 14 changes]
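
A minimal sketch of this split, assuming the total workload per author is already known. The author names and workloads are made up; only the threshold of 14 file touches comes from the slide.

```python
# Hypothetical author workloads AW(a): total number of file touches per author.
author_workload = {"ann": 1, "ben": 3, "cat": 13, "dan": 14, "eve": 250, "fay": 4000}


def split_authors(workload, threshold):
    """Split authors into (occasional, frequent) around a workload threshold."""
    occasional = {a for a, w in workload.items() if w < threshold}
    frequent = {a for a, w in workload.items() if w >= threshold}
    return occasional, frequent


# The slide reports that a threshold of 14 file touches splits GNOME authors roughly in half.
occasional, frequent = split_authors(author_workload, threshold=14)
```

On another data set the threshold would have to be chosen again, for example near the median workload, to keep the two bins of comparable size.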

Case Study: GNOME
[Figure: occasional authors]

Case Study: GNOME
[Figure panels: occasional authors; frequent authors]

Case Study: GNOME
[Figure panels: occasional authors; frequent authors]
Frequent authors
are mostly coders,
occasional authors
are mostly
translators.

Case Study: GNOME
Observations
• Coders have a higher workload and are involved in fewer projects
• Translators are less active but are involved in more projects
Can be explained in part by the use of Damned Lies, a Web application used to manage the localisation (l10n) activities of the GNOME project

Case Study: GNOME
Sylvia Neu et al. “Telling stories about GNOME with Complicity”, VISSOFT 2011
Complicity is a web-based application supporting software ecosystem analysis by means of interactive visualizations.
Affectional bond view:
- size of rectangle = author’s lifetime in days
- color = number of projects

Case Study: GNOME
Unverified assumptions:
1. Authors contributing a lot to few projects are likely to be developers (D)
2. Authors contributing less often to more projects are likely to be translators (T)
3. Authors tend to have an affectional bond to either development or translation work

Case Study: GNOME
Our work confirms these assumptions
Potential misclassifications in Neu et al.

Case Study: GNOME
Relative Workload
How strongly do authors focus on specific activities?
Basic measures:
• RATW(a,t) = % of the total workload of author a dedicated to activity type t
• RAWS(a) = author specialisation = Gini index of inequality of RATW(a,t) aggregated over all activity types
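
The two measures can be sketched as below. This is an illustration under stated assumptions: the function names, the sample authors and the four activity types are hypothetical, RATW is returned as a fraction rather than a percentage, and the Gini index is computed from the mean absolute difference, one of several equivalent textbook formulations.

```python
def gini(values):
    """Gini index via the mean absolute difference between all pairs of values."""
    n, s = len(values), sum(values)
    if n == 0 or s == 0:
        return 0.0
    diff_sum = sum(abs(x - y) for x in values for y in values)
    return diff_sum / (2 * n * s)


def ratw(author_touches):
    """RATW(a, t): share of an author's total workload spent on each activity type."""
    total = sum(author_touches.values())
    return {t: n / total for t, n in author_touches.items()} if total else {}


def raws(author_touches, activity_types):
    """RAWS(a): Gini index of RATW(a, t) over all activity types, unused ones included."""
    shares = ratw(author_touches)
    return gini([shares.get(t, 0.0) for t in activity_types])


# Hypothetical example with four activity types.
types = ["CODE", "DOC", "TRANSLATION", "BUILD"]
specialist = {"TRANSLATION": 40}              # all workload in a single type
generalist = {t: 10 for t in types}           # workload spread evenly
```

With these inputs `raws(specialist, types)` is 0.75, the maximum for four types, while `raws(generalist, types)` is 0. Activity types an author never touched must be included as zeros, otherwise a single-type author would appear perfectly even.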

Case Study: GNOME
Relative Workload
How strongly do authors focus?
max Gini for n = 14: 0.9285
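
The quoted maximum can be checked by hand, assuming the standard (not bias-corrected) Gini index over n values. The maximum is reached when the whole workload falls on a single activity type, i.e. for the sorted vector x = (0, ..., 0, 1):

```latex
G = \frac{2\sum_{i=1}^{n} i\,x_i}{n\sum_{i=1}^{n} x_i} - \frac{n+1}{n}
  = \frac{2n}{n} - \frac{n+1}{n}
  = \frac{n-1}{n},
\qquad
G_{\max}(14) = \frac{13}{14} \approx 0.92857
```

This agrees with the value on the slide, which appears to be truncated to four decimals.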

Case Study: GNOME
Relative Workload
Occasional authors tend to focus
on a single activity type.

Case Study: GNOME
Relative Workload
Frequent authors tend to focus
on few activity types.

Case Study: GNOME
Main observations for GNOME ecosystem:
• Workload is unevenly distributed over projects and
authors
• Clear distinction between frequent and occasional
authors
• Authors form heterogeneous subcommunities (coding
versus translation)
• GNOME is code-centric, i.e., most of the workload is in
code-related activities (coding, build files, development
documentation)

Case Study: GNOME
Next steps
Observation: existing generic support does not take the specificities of the ecosystem into account, making the support suboptimal.
Having gained a better understanding of the GNOME ecosystem specificities, we hope to come up with better change support mechanisms
• Dedicated to specific subcommunities, e.g. Damned Lies application for translation community
• Estimation (of cost or effort) and prediction models (e.g. of defects) could be improved
• Tools should be able to focus on those activities/projects a contributor is interested in (based on his historic activity profile)
