MOD2014-Mens-Lecture4


This is my fourth and final lecture in a series of 4 lectures on the topic of Evolving Software Ecosystems, presented during the NATO Marktoberdorf 2014 Summer School on Dependable Software System Engineering in Germany, August 2014.

Published in: Education

Transcript

  • 1. Evolving Software Ecosystems. Marktoberdorf Summer School 2014, Lecture 4. Tom Mens, Software Engineering Lab, University of Mons. informatique.umons.ac.be/genlog
  • 2. Ecosystem Measures
  • 3. Ecosystem Measures
      • The characteristics of a software ecosystem can be measured in different ways
        – Using traditional software quality metrics
        – Using ecological diversity metrics
        – Using econometrics
  • 4. Ecosystem Measures: Software Quality Metrics
  • 5. July-August 2014, NATO Marktoberdorf Summer School, Dependable Software Systems Engineering. Ecosystem Measures: Software Quality Metrics
      • Software product (code) metrics
        – size metrics, e.g. LOC, NOM
        – complexity metrics, e.g. cyclomatic complexity
        – coupling and cohesion metrics, e.g. LCOM, CBO
        – dependency metrics, e.g. fan-in, fan-out
  • 6. Ecosystem Measures: Software Quality Metrics
      • Side remark
        – Distribution of most of these metrics is highly skewed
        – Traditional aggregation measures (mean, median) are only reliable for centralised distributions
        – We need other aggregation measures for skewed distributions
      Mordal et al. "Software quality metrics aggregation in industry", J. Software: Evolution and Process (2012)
  • 7. Ecosystem Measures: Measuring Diversity
      Many different diversity metrics:
      • species richness
        – the number of different species represented in an ecological community
      • species evenness (entropy)
        – the relative abundance of the population of each species in the ecosystem
      • Shannon diversity index (relative entropy)
        – how specialised is a given species in relation to the species in the other level
      • Simpson index
        – the degree of concentration when individuals are classified into species
  • 8. Measuring Diversity: Evenness
      • Quantifies the relative abundance of the population of each species in the ecosystem
        – Maximum evenness if all species are equally abundant (i.e., have the same number of individuals)
        – Low evenness if some species dominate the others
      • Can be measured using Shannon's notion of information entropy
  • 9. Measuring Diversity: Evenness
      Based on Shannon's notion of information entropy and the 2nd law of thermodynamics:
      $$H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)$$
      where X = set of n distinct species x_i, and p(x_i) = proportion of all individuals that belong to species x_i.
      Quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.
      Claude Shannon (1916-2001)
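The entropy formula on this slide can be tried out on a small example. The following is a minimal sketch, not code from the lecture: the function name `shannon_entropy` and the sample data are assumptions made for illustration. It treats a list of species labels as the dataset and evaluates H(X) with the natural logarithm.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """H(X) = -sum_i p(x_i) ln p(x_i), where p(x_i) is the share of species x_i."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical datasets with four species each.
even = ["a", "b", "c", "d"] * 5          # all species equally abundant
skewed = ["a"] * 17 + ["b", "c", "d"]    # one species dominates

print(shannon_entropy(even))    # maximum evenness: equals ln(4)
print(shannon_entropy(skewed))  # lower value: low evenness
```

With n equally abundant species the value reaches its maximum ln(n), which matches the "maximum evenness" case described on the previous slide.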
  • 10. Measuring Diversity: Evenness
      Dual views in a software ecosystem
      • Based on species analogy
        ✦ Contributors are species that thrive in their environment of projects
        ✦ Projects are species that thrive in their environment of contributors (human resources)
      [Figure: bipartite contributor-project graph with project 1, project 2, project 3]
  • 11. Measuring Diversity: Evenness
      Two dual measures of entropy
      • Based on bipartite author (contributor) - module (project) graph
        M = set of n distinct modules m_i
        A = set of k distinct authors a_j
        M_i = # commits to module m_i
        A_j = # commits by author a_j
        a_ij = (# commits to module m_i by author a_j) / A_j
        m_ij = (# commits to module m_i by author a_j) / M_i
      • Author diversity
        $$H_{a_j} = -\sum_{i=1}^{n} a_{ij} \ln a_{ij}$$
      • Module diversity
        $$H_{m_i} = -\sum_{j=1}^{k} m_{ij} \ln m_{ij}$$
      Posnett et al. Dual ecological measures of focus in software development. ICSE 2013
      [Figure: bipartite graph with module 1, module 2, module 3]
  • 12. Measuring Diversity: Evenness
      Two dual measures of entropy (same definitions and formulas as on the previous slide), with two annotations added:
      • Low diversity if author dominates most commit activity
      • Low diversity if module dominates most commit activity
      Posnett et al. Dual ecological measures of focus in software development. ICSE 2013
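To make the two dual entropies concrete, here is a minimal sketch that computes them from a matrix of commit counts. The matrix layout (`commits[i][j]` = number of commits to module m_i by author a_j), the function names, and the sample numbers are assumptions for illustration; the sketch also assumes every author and every module has at least one commit.

```python
import math

def entropy(proportions):
    """-sum p ln p, skipping zero proportions (0 ln 0 is taken as 0)."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

def author_diversity(commits):
    """H_{a_j} for each author j, using a_ij = commits[i][j] / A_j."""
    n, k = len(commits), len(commits[0])
    result = []
    for j in range(k):
        a_total = sum(commits[i][j] for i in range(n))  # A_j
        result.append(entropy(commits[i][j] / a_total for i in range(n)))
    return result

def module_diversity(commits):
    """H_{m_i} for each module i, using m_ij = commits[i][j] / M_i."""
    result = []
    for row in commits:
        m_total = sum(row)  # M_i
        result.append(entropy(c / m_total for c in row))
    return result

# Hypothetical data: 3 modules (rows), 2 authors (columns).
commits = [
    [10, 0],
    [10, 5],
    [10, 0],
]
print(author_diversity(commits))  # author 0 spreads evenly (ln 3); author 1 touches one module (0)
print(module_diversity(commits))  # modules 0 and 2 have a single author (0); module 1 is shared
```

An author whose commits all go to one module gets entropy 0, and so does a module that receives commits from a single author.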
  • 13. Measuring Diversity: Shannon's diversity index
      Expresses how specialised a given species is in relation to the species in the other level
      • Using a notion of relative entropy
      • Taking into account the contributor-project duality
      [Figure: bipartite contributor-project graph (Projet 1, Projet 2, Projet 3), and an example linking contributors Thiruvalluvan, Douglas, Phillip to modules avro.genavro, avro.io.parsing, avro.io, avro.generic, avro.reflect, avro.specific, avro, avro.file, avro.tool, avro.util, avro.mapred.tether, avro.mapred, default, avro.idl, avro.ipc, avro.ipc.trace, avro.ipc.stats]
  • 14. Shannon's diversity index: Relative Entropy
      Specialisation of a species relative to the species in the other level
      Takes into account the interaction between authors and modules as well as the overall amount of activity per author or module.
        – M_i and A_j defined as before
        – m_ij and a_ij defined as before
        – C = total # commits
      • Author (contributor) specialisation
        $$F_{a_j} = -\sum_{i=1}^{n} a_{ij} \ln \frac{a_{ij}}{M'_i}$$
      • Module (project) specialisation
        $$F_{m_i} = -\sum_{j=1}^{k} m_{ij} \ln \frac{m_{ij}}{A'_j}$$
  • 15. Shannon's diversity index: Relative Entropy
      • Attention focus = normalisation of specialisation by the theoretical maximum and minimum possible values
        [formula image not included in the transcript]
      • Findings by Posnett et al.
        – Project leaders and top contributors tend to exhibit lower attention focus than others.
        – Narrowly focused developers introduce fewer defects.
        – Increased module activity focus results in a greater number of defects.
      Can be computed with R package 'bipartite'
  • 16. Measuring Diversity: Simpson index
      Measures the degree of concentration when individuals are classified into species
      • I.e., the probability that two individuals taken at random from the dataset belong to the same species
      • Is minimal when all species are equally abundant
      • For small datasets: [formula image not included in the transcript]
      • For large datasets: [formula image not included in the transcript]
      where
      • R = number of species types
      • N = number of entities in the dataset
      • n_i = number of entities belonging to the i-th species type
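The two formulas did not survive in the transcript. The sketch below uses the textbook forms of the Simpson index, on the assumption that these are what the slide showed: for small datasets, the sum over species of n_i(n_i - 1) divided by N(N - 1) (two individuals drawn without replacement); for large datasets, the sum of (n_i / N)^2. The function names and sample counts are illustrative.

```python
def simpson_small(counts):
    """Probability that two individuals drawn without replacement share a species."""
    total = sum(counts)  # N
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))

def simpson_large(counts):
    """Large-dataset approximation: sum of squared species proportions."""
    total = sum(counts)  # N
    return sum((n / total) ** 2 for n in counts)

# Hypothetical species counts n_i with R = 4 species types and N = 20 entities.
even = [5, 5, 5, 5]
skewed = [17, 1, 1, 1]

print(simpson_large(even), simpson_large(skewed))
print(simpson_small(even), simpson_small(skewed))
```

For the evenly distributed counts the large-dataset value is 1/R, which is the minimum mentioned on the slide; the skewed counts give a much higher concentration.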
  • 17. Ecosystem Measures: Econometrics
      • Econometrics are measures used in economics
      • Well-known examples
        – Pareto principle
        – Inequality indices
  • 18. Econometrics: Pareto Principle
      Pareto Principle
      • A.k.a. 80–20 rule
      • Roughly 80% of the effects come from 20% of the causes.
      • Often coincides with power law distribution
      Examples
      • 80% of land owned by 20% of the population
      • 80% of sales come from 20% of clients
      • 80% of crashes come from the 20% most reported bugs
      [Image: Pareto]
  • 19. Econometrics: Pareto Principle
      Example in the GNOME ecosystem
      • 20% of all contributors account for about 80% of the total workload in the GNOME code repository
      [Figure: cumulative percentage of workload plotted against cumulative percentage of contributors, both axes from 0.0 to 1.0]
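A check of this kind can be sketched in a few lines: sort contributors by workload and compute the share carried by the most active 20%. The function name `top_share` and the commit counts below are made up for illustration and are not GNOME data.

```python
def top_share(workloads, top_fraction=0.2):
    """Share of the total workload carried by the most active `top_fraction` of contributors."""
    ordered = sorted(workloads, reverse=True)
    top_n = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:top_n]) / sum(ordered)

# Hypothetical number of commits per contributor (10 contributors, 1000 commits).
commits_per_contributor = [500, 300, 50, 40, 30, 20, 20, 20, 10, 10]

share = top_share(commits_per_contributor)
print(f"Top 20% of contributors account for {share:.0%} of the commits")
```

A share close to 80% for the top 20% is what the Pareto principle predicts; a share close to 20% would indicate an evenly spread workload.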
  • 20. Econometrics: Pareto Principle
      • Example for individual GNOME projects
        – Brasero
        – Evince
      • Analysing different data sources
        – Commits in a version control repository
        – Mails in a mailing list
        – Issue reports in a bug tracker
      • Pareto principle is confirmed in all cases
      Paper excerpt shown on the slide (SQM 2011):
      Evidence for the Pareto principle in Open Source Software Activity. Mathieu Goeminne and Tom Mens, Institut d'Informatique, Faculté des Sciences, Université de Mons – UMONS, Mons, Belgium. { mathieu.goeminne | tom.mens }@umons.ac.be
      Abstract: Numerous empirical studies analyse evolving open source software (OSS) projects, and try to estimate the activity and effort in these projects. Most of these studies, however, only focus on a limited set of artefacts, being source code and defect data. In our research, we extend the analysis by also taking into account mailing list information. The main goal of this article is to find evidence for the Pareto principle in this context, by studying how the activity of developers and users involved in OSS projects is distributed: it appears that most of the activity is carried out by a small group of people. Following the GQM paradigm, we provide evidence for this principle. We selected a range of metrics used in economy to measure inequality in distribution of wealth, and adapted these metrics to assess how OSS project activity is distributed. Regardless of whether we analyse version repositories, bug trackers, or mailing lists, and for all three projects we studied, it turns out that the distribution of activity is highly imbalanced.
      Index Terms: software evolution, activity, software project, data mining, empirical study, open source software, GQM, Pareto
      I. INTRODUCTION
      Numerous empirical studies aim to understand and model how open source software (OSS) evolves over time [1]. In order to gain a deeper understanding of this evolution, it is essential to study not only the software artefacts that evolve (e.g. source code, bug reports, and so on), but also their interplay with the different project members (mainly developers and users) that communicate (e.g., via mailing lists) and collaborate in order to construct and evolve the software. In this article, we wish to understand how activity is spread over the different members of an OSS project, and how this activity distribution evolves over time. Our hypothesis is that the distribution of activity follows the Pareto principle, in the sense that there is a small group of key persons that carry out most of the activity, regardless of the type of considered activity. To verify this hypothesis, we carry out an empirical study based on the GQM paradigm [2]. We rely on concepts borrowed from econometrics (the use of measurement in economy), and apply them to the field of OSS evolution. In particular, we apply indices that have been introduced for measuring distribution (and inequality) of wealth, and use them to measure the distribution of activity in software development. The remainder of this paper is structured as follows. Section II explains the methodology we followed and defines the metrics that we rely upon. Section III presents the experimental setup of our empirical study that we have carried out. Section IV presents the results of our analysis of activity distribution in three OSS projects. Section V discusses the evidence we found for the Pareto principle. Section VI presents related work, and Section VII concludes.
      II. METHODOLOGY
      A. GQM paradigm
      To gain a deeper understanding of how OSS projects evolve, we follow the well-known Goal-Question-Metric (GQM) paradigm. Our main research Goal is to understand how activity is distributed over the different stakeholders (developers and users) involved in OSS projects. Once we have gained deeper insight in this issue, we will be able to exploit it to provide dedicated tool support to the OSS community, e.g., by helping newcomers to understand how the community is structured, by improving the way in which the community members communicate and collaborate, by trying to reduce the potential risk of the so-called bus factor (footnote 1), and so on. To reach the aforementioned research goal, we raise the following research Questions:
        1) Is there a core group of OSS project members (developers and/or users) that are significantly more active than the other members?
        2) How does the distribution of activity within an OSS community evolve over time?
        3) Is there an overlap between the different types of activity (e.g., committing, mailing, submitting and changing bug reports) the community members contribute to?
        4) How does the distribution of activity vary across different OSS projects?
      As a third step, we need to select appropriate Metrics that will enable us to provide a satisfactory answer to each of the above research questions. For our empirical study, we will make use of basic metrics to compute the activity of OSS project members, and aggregate metrics that allow us to compare these basic metric values across members (to understand how activity is distributed), over time (to understand how they
      Footnote 1: The bus factor refers to the total number of key persons (involved in the project) that would, if they were to be hit by a bus, lead the project into serious problems.
  • 21. Econometrics: Pareto Principle
      [Figure: two panels, Brasero and Evince, each with curves for commits, mails and bug report changes; both axes run from 0 to 1]
  • 22. Econometrics: Lorenz curve
      • A graphical representation for a cumulative distribution of values
        – Example for income/wealth distribution: a point (x, y) on the graph indicates that the poorest x% of persons have a total cumulative income of y%.
        – Example for ecology/biodiversity: cumulative proportion of species is plotted against cumulative proportion of individuals.
      • Can be used to check Pareto principle
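The points of a Lorenz curve are straightforward to compute: sort the values in ascending order and accumulate. The sketch below is illustrative only; the function name `lorenz_points` and the sample workloads are assumptions, not data from the lecture.

```python
def lorenz_points(values):
    """Return (x, y) pairs: the lowest x share of entities holds a y share of the total."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / n, running / total))
    return points

# Hypothetical workload per contributor.
workloads = [500, 300, 50, 40, 30, 20, 20, 20, 10, 10]
for x, y in lorenz_points(workloads):
    print(f"{x:.1f} {y:.3f}")
```

Reading the curve at x = 0.8 gives the share of workload carried by the least active 80% of contributors; the remaining share belongs to the top 20%, which is how the curve can be used to check the Pareto principle.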
  • 23. Econometrics: Inequality Indices
      • Are used to measure the amount of inequality in a statistical distribution
        – Examples: Gini, Theil, Hoover, Kolm, Atkinson, …
      • Values typically range between 0 and 1
        – 0 = perfect equality
        – 1 = maximal inequality
      • Are useful for skewed distributions, where use of mean and median as aggregation measure is not very meaningful
      • Are all correlated, in practice …
  • 24. Econometrics: Inequality Indices
      • Examples (and definitions): Gini, Theil, Atkinson, Hoover, Kolm [formula images not included in the transcript]
      • Inequality indices have been used in empirical software engineering to study the evolution of software metrics
  • 25. Inequality Indices: Gini coefficient
      • Gini coefficient measures the inequality among values of a frequency distribution
        – 0 = perfect equality
        – 1 - 1/n = maximal inequality
      • Is computed based on the areas above and below the Lorenz curve: Gini = A / (A + B)
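The area-based definition can be turned into a short computation. This is a sketch under stated assumptions: B is taken as the area under the piecewise-linear Lorenz curve (summed with trapezoids of width 1/n) and A as the area between that curve and the diagonal, so A + B = 0.5. The function name `gini` and the sample values are illustrative.

```python
def gini(values):
    """Gini = A / (A + B), with B the area under the Lorenz curve and A + B = 0.5."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    cumulative = 0.0
    area_b = 0.0
    for value in ordered:
        previous = cumulative
        cumulative += value / total
        area_b += (previous + cumulative) / (2 * n)  # trapezoid of width 1/n
    area_a = 0.5 - area_b
    return area_a / (area_a + area_b)

print(gini([3, 3, 3, 3]))   # perfect equality
print(gini([0, 0, 0, 10]))  # one entity holds everything, n = 4
```

With n values and a single entity holding the whole total, this computation yields 1 - 1/n, which is the maximal inequality stated on the slide.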
  • 26. Inequality Indices: Gini coefficient
      Paper excerpt shown on the slide (the left column is cropped at its left edge):
      ution profiles similar to the ones we observed fortunately, the number of freely-available, ystems developed in C# framework that met criteria is rather limited. So, we began our tems that were originally written in Java and ed to the .NET platform in order to take ad- the knowledge gained in the analysis of their a counterparts. ET metrics extraction, we used CLI [18], an der library that provides access to both the byte code. We added a small wrapper for the f the Gini coefficients and stored the resulting file for further processing with JSeat. ed metrics data from four .NET systems: NHibernate, SharpDevelop, and NAnt. The ur 10 measures produced Gini coefficients he ones determined for Java systems. How- re also exceptions. We observed a shift ex- i.e., individual Gini coefficients doubled in most all measures in NAnt version 0.8.3-rc1. fficients stayed high until version 0.84-rc1, sumed “normal” values again. An inspection per logs provided an explanation: in version NAntContrib project was integrated into the tion. This project defines a number of utili- trics exhibit very uneven distribution profiles
      changes do happen and may result in significant fluctuations in Gini coefficients that warrant a deeper analysis (see Figure 4 showing selected Gini profiles for 51 consecutive releases of the Spring framework). But why do we see such a remarkable stability of Gini coefficients?
      Figure 4. Selected Gini profiles in Spring.
      Developers accumulate system competence over time. Proven techniques to solve a given problem prevail, where untested or weak practices have little chance of survival. If a team has historically built software in a certain way, then it will continue to prefer a certain approach over others. Moreover, we can expect that most problems in a given domain are similar, hence the means taken to tackle them would be similar, too. Tversky and Kahneman coined the
      Vasa et al. Comparative analysis of evolving software systems using the Gini coefficient. ICSM 2009
  • 27. Inequality Indices: Gini coefficient
      Vasa et al. Comparative analysis of evolving software systems using the Gini coefficient. ICSM 2009 (same paper excerpt as on the previous slide)
      [Figure: two panels, Gnome/Brasero and Gnome/Evince, each showing the evolution over time of three Gini coefficient curves on a scale from 0 to 1]
      Goeminne et al. Evidence for the Pareto principle in open source software activity. SQM 2011
  • 28. Inequality Indices: Theil index
      • Is defined as [formula image not included in the transcript] and gives a value between 0 and ln N
      • Corresponds to the notion of redundancy in information theory
      • Normalised Theil index is obtained by dividing by ln N and gives values between 0 and 1
        – 0 = equal distribution
        – 1 = unequal distribution
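The definition itself is missing from the transcript. The sketch below assumes the standard Theil T index, T = (1/N) Σ (x_i / μ) ln(x_i / μ) with μ the mean of the values, which is consistent with the stated range of 0 to ln N. Function names and sample values are illustrative, and zero values are skipped (0 ln 0 is taken as 0).

```python
import math

def theil(values):
    """Theil T index: mean of (x/mu) * ln(x/mu), assuming the standard definition."""
    n = len(values)
    mean = sum(values) / n
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n

def theil_normalised(values):
    """Theil index divided by ln N, giving a value between 0 and 1."""
    return theil(values) / math.log(len(values))

print(theil_normalised([2, 2, 2, 2]))  # equal distribution
print(theil_normalised([0, 0, 0, 8]))  # one entity holds everything
```

Under this definition an equal distribution gives 0 and a distribution where a single entity holds everything gives ln N, so the normalised value reaches 1.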
  • 29. Inequality Indices: Theil index
      Evolution of Theil index for 2 GNOME projects (Evince and Brasero)
      [Figure: curves for commits sent, e-mails sent and bug reports modified]
      Goeminne and Mens. Evidence for the Pareto principle in Open Source Software Activity. SQM 2011
  • 30. Econometrics: Inequality Indices
      Example: Comparison of (evolution of) inequality indices for Evince
      [Figure: Gini, Hoover and Theil (normalised) indices on a scale from 0 to 1, plotted from Apr-99 to Aug-08]
  • 31. Software Ecosystems. Case Study: GNOME
      Vasilescu et al. On the variation and specialisation of workload: A case study of the GNOME ecosystem community. Emp. Softw. Eng. 2014
  • 32. Overall goal revisited
      Improve support (tools/guidelines/models/…) for dealing with changes in open source software ecosystems
        – Improve chance of survival of a project within its ecosystem
        – Improve resilience of an ecosystem as a whole
        – Allow to make changes more effectively (e.g. higher productivity, faster reaction to/implementation of change/bug requests)
        – Increase accuracy of effort/cost estimation models, defect prediction models and so on
  • 33. Case Study: GNOME
      Observation: existing generic support does not take the specificities of the ecosystem into account, making the support suboptimal.
      Assumption: specialised ecosystem-specific change support will be more effective.
      Consequence: We need to understand the socio-technical specificities of the ecosystem under study in order to provide more effective change support.
      This is what we will do for the GNOME ecosystem.
  • 34. Case Study: GNOME. Some references
      • Bogdan Vasilescu, Alexander Serebrenik, Mathieu Goeminne, Tom Mens. On the variation and specialisation of workload – A case study of the Gnome ecosystem community. To appear in 2013 in Springer's Empirical Software Engineering journal. DOI: 10.1007/s10664-013-9244-1
        Abstract: Most empirical studies of open source software repositories focus on the analysis of isolated projects, or restrict themselves to the study of the relationships between technical artifacts. In contrast, we have carried out a case study that focuses on the actual contributors to software ecosystems, being collections of software projects that are maintained by the same community. To this aim, we defined a new series of workload and involvement metrics, as well as a novel approach, eT-graphs, for reporting the results of comparing multiple distributions. We used these techniques to statistically study how workload and involvement of ecosystem contributors varies across projects and across activity types, and we explored to which extent projects and contributors specialise in particular activity types. Using Gnome as a case study we observed that, next to coding, the activities of localization, development documentation and building are prevalent throughout the ecosystem. We also observed notable differences between frequent and occasional contributors in terms of the activity types they are involved in and the number of projects they contribute to. Occasional contributors and contributors that are involved in many different projects tend to be more involved in the localization activity, while frequent contributors tend to be more involved in the coding activity in a limited number of projects.
        Keywords: open source · software ecosystem · metrics · developer community · case study
        B. Vasilescu and A. Serebrenik: MDSE, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Nether-
      • Mathieu Goeminne. Understanding the Evolution of Socio-technical Aspects in Open Source Ecosystems: An Empirical Analysis of GNOME. A dissertation submitted in fulfillment of the requirements of the degree of Docteur en Sciences. UMONS, Faculté des Sciences, Département d'Informatique, June 2013.
        Advisor: Dr. Tom Mens (Université de Mons, Belgium)
        Jury: Dr. Xavier Blanc (Université de Bordeaux 1, France), Dr. Véronique Bruyère (Université de Mons, Belgium), Dr. Jesus M. Gonzalez-Barahona (Universidad Rey Juan Carlos, Spain), Dr. Tom Mens (Université de Mons, Belgium), Dr. Alexander Serebrenik (Technische Universiteit Eindhoven, The Netherlands), Dr. Jef Wijsen (Université de Mons, Belgium)
      • Mathieu Goeminne, Maëlick Claes and Tom Mens. A historical dataset for GNOME contributors. Software Engineering Lab, COMPLEXYS research institute, UMONS, Belgium. MSR 2013
        Abstract: We present a dataset of the open source software ecosystem GNOME from a social point of view. We have collected historical data about the contributors to all GNOME projects stored on git.gnome.org, taking into account the problem of identity matching, and associating different activity types to the contributors. This type of information is very useful to complement the traditional, source-code related information one can obtain by mining and analyzing the actual source code. The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.
        I. INTRODUCTION
        In this paper, we present the process we have used to create a dataset containing the historical information related to contributors to the GNOME ecosystem. Our database and the tools and scripts used to create it can be found on a dedicated Bitbucket repository. In contrast to many other datasets, we do not focus on source code, since a significant amount of files committed to GNOME's project repositories do not even contain code (e.g., image files, web pages, documentation, localization and many more). Such type of information is often ignored in MSR research while it is very relevant to understand which types of activities contributors are
  • 35. July-August 2014, NATO Marktoberdorf Summer School, Dependable Software Systems Engineering. Case Study: GNOME. Characteristics: Open source desktop environment for Linux. > 16 years of activity (1997 → …). Projects (Git repositories stored at http://git.gnome.org): > 1400 projects. Contributors: > 11000 contributor accounts; after identity merging, > 5800 contributors; after filtering code activity, > 4300 coders. Commits and file touches: > 1.3M commits (of which > 0.6M code commits); > 12M file touches (of which > 6M code file touches).
  • 36. Case Study: GNOME. Characteristics. [Figure: Gnome use case]
  • 37. Case Study: GNOME. Programming language: relation between programming language used and code size. Mainly C/C++ and Python. [Figure: plot of Files against LOC per programming language, on logarithmic axes. Languages shown: C, Java, Objective C, Python, Lisp, JS, ASP.Net, C/C++ Header, C++, Perl, yacc, C#, IDL, Haskell, Objective C++, lex, Assembly, Visual Basic, PHP, Ruby, Tcl/Tk]
  • 38. Case Study: GNOME. Characteristics: dataset shared on https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge/downloads as a FLOSSMetrics-compliant MySQL database. Goeminne et al., “A historical dataset for GNOME contributors”, MSR 2013.
  • 39. Case Study: GNOME. Characteristics: bipartite contributor-project graph connecting > 5800 contributors (> 4300 coders) to > 1400 projects. [Figure: contributors linked to project 1, project 2, project 3]
  • 40. Case Study: GNOME. Workload Distribution: How is workload distributed over different authors and projects?
  • 41. Case Study: GNOME. Workload Distribution: How is workload distributed over different authors and projects per activity type? [Figure: example activity types Image, Code, Documentation, Translation]
  • 42. Case Study: GNOME. Workload Distribution: How is workload distributed over different authors and projects per activity type? Two dual views (cf. bipartite contributor-project graph): (1) distribution of workload over different projects per activity type; (2) distribution of workload over different authors per activity type.
  • 43. Case Study: GNOME. Workload Distribution, based on [Robles2006]: (1) Extract file information for each commit in the git repository of each GNOME project. (2) Associate a unique activity type t to each file. (3) Count the number of file touches. [Figure: files (e.g. /foo/bar.c) are matched against rules (e.g. .*.c -> CODE) to obtain an activity (e.g. CODE)]
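The three steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: only the `.c -> CODE` rule appears on the slide, and the other rules, the `UNKNOWN` fallback, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical rule set. Only the ".c -> CODE" rule is taken from the slide;
# the other two rules are illustrative assumptions.
RULES = [
    (re.compile(r".*\.c$"), "CODE"),
    (re.compile(r".*\.po$"), "TRANSLATION"),
    (re.compile(r".*\.png$"), "IMAGE"),
]


def activity_type(path):
    """Return the activity type of the first rule matching the file path."""
    for pattern, activity in RULES:
        if pattern.match(path):
            return activity
    return "UNKNOWN"  # assumed fallback when no rule matches


def count_file_touches(commits):
    """Count file touches per activity type.

    `commits` is an iterable of lists of file paths, one list per commit.
    """
    touches = Counter()
    for files in commits:
        for path in files:
            touches[activity_type(path)] += 1
    return touches


if __name__ == "__main__":
    commits = [["/foo/bar.c", "po/fr.po"], ["/foo/bar.c"]]
    print(dict(count_file_touches(commits)))
```

Taking the first matching rule is one simple way to give each file a unique activity type, as step (2) requires.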
  • 44. Case Study: GNOME. Workload Distribution: (1) Extract file information for each commit in the git repository of each GNOME project. (2) Associate a unique activity type t to each file. (3) Count the number of file touches. Basic workload metric: APTW(a,p,t) = number of file touches of an author a for a given project p and activity type t. Derived metrics: sum and Gini coefficient.
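As a rough sketch of the basic metric and its summed variants: APTW counts file touches per (author, project, activity type) triple, and sums over the other dimensions give aggregated workloads. The names `AW` and `TW` appear on later slides; reading them as APTW summed per author and per activity type is an assumption here, as are the function names and sample data.

```python
from collections import defaultdict


def aptw(touches):
    """APTW(a, p, t): number of file touches per (author, project, type).

    `touches` is an iterable of (author, project, activity_type) tuples,
    one tuple per file touch.
    """
    counts = defaultdict(int)
    for author, project, activity in touches:
        counts[(author, project, activity)] += 1
    return dict(counts)


def author_workload(counts):
    """Assumed AW(a): APTW summed over all projects and activity types."""
    totals = defaultdict(int)
    for (author, _project, _activity), n in counts.items():
        totals[author] += n
    return dict(totals)


def type_workload(counts):
    """Assumed TW(t): APTW summed over all authors and projects."""
    totals = defaultdict(int)
    for (_author, _project, activity), n in counts.items():
        totals[activity] += n
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    touches = [("alice", "gtk", "CODE")] * 3 + [("bob", "gtk", "TRANSLATION")]
    counts = aptw(touches)
    print(author_workload(counts))
    print(type_workload(counts))
```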
  • 45. Case Study: GNOME. Workload Metrics: How does workload vary from one GNOME project to another? How does workload vary from one GNOME contributor to another? Activity measure used: the number of modifications made to files.
  • 46. Case Study: GNOME. Workload Metrics. Main findings: workload is log-normally distributed over GNOME projects.
  • 47. Case Study: GNOME. Workload Metrics. Main findings: the majority of GNOME authors are involved in a very low number of file touches. [Figure: histogram of log(AW) against number of authors, with annotations "50% < 14 changes", "185,874 changes", "frequent authors" and "occasional authors"]
  • 48. Case Study: GNOME. Workload Metrics. Main findings: highest workload is represented by coding activity, followed by activities of development documentation, translation/internationalisation, and build file creation. [Figure: TW(t)]
  • 49. Case Study: GNOME. Relative importance of activity types: What are the favourite activity types for GNOME? Two dual views: (1) relative importance of each activity type per author; (2) relative importance of each activity type per project.
  • 50. Case Study: GNOME. Relative importance of activity types: What are the favourite activity types for GNOME? Approach: use statistical tests to compare distributions; verify if a data set corresponding to an activity type tends to have higher values than a data set corresponding to another activity type.
  • 51. Case Study: GNOME. Relative importance of activity types. Examples of statistical comparison tests: (Wilcoxon-)Mann–Whitney U test; Kruskal-Wallis test. Problems with traditional statistical tests: not robust to populations of unequal sizes; different tests can be inconsistent with each other; pairwise comparison of all activity types requires 78 different combinations (12 * 13 / 2); traditional tests are not transitive.
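The figure of 78 pairwise comparisons is the number of unordered pairs among 13 items, 13 * 12 / 2. A quick check, assuming 13 compared activity types (a count inferred from the arithmetic on the slide) and using placeholder type names:

```python
from itertools import combinations

# Placeholder names; the count of 13 is inferred from "12 * 13 / 2" on the slide.
activity_types = [f"type_{i}" for i in range(1, 14)]

# Every unordered pair of distinct activity types needs its own comparison.
pairs = list(combinations(activity_types, 2))

print(len(pairs))  # → 78
```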
  • 52. Case Study: GNOME. Relative importance of activity types. Solution: use a single test that respects transitivity, the T̃-procedure [Konietschke et al. 2012].
  • 53. Case Study: GNOME. Relative importance of activity types. T̃-procedure. Pair: [Low, High]. B-A: [-0.56, -0.44]; C-A: [-0.50, -0.31]; D-A: [-0.32, -0.03]; C-B: [-0.01, 0.24]; D-B: [0.24, 0.47]; D-C: [0.09, 0.40]. Edges shown: A→B, A→C, A→D, D→B, D→C.
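One way to read this slide is that each pair X-Y carries a confidence interval [Low, High], and that an edge is drawn only when the interval excludes zero. The sketch below follows that reading, which is an assumption: an interval entirely below zero for X-Y gives Y→X, one entirely above zero gives X→Y, and one containing zero gives no edge. The function name is hypothetical; the interval values are those on the slide.

```python
# Intervals copied from the slide: key (X, Y) stands for the pair "X-Y".
INTERVALS = {
    ("B", "A"): (-0.56, -0.44),
    ("C", "A"): (-0.50, -0.31),
    ("D", "A"): (-0.32, -0.03),
    ("C", "B"): (-0.01, 0.24),
    ("D", "B"): (0.24, 0.47),
    ("D", "C"): (0.09, 0.40),
}


def t_graph_edges(intervals):
    """Derive directed edges from intervals (assumed reading of the slide)."""
    edges = []
    for (x, y), (low, high) in intervals.items():
        if high < 0:
            edges.append((y, x))  # interval below zero: Y tends to be higher
        elif low > 0:
            edges.append((x, y))  # interval above zero: X tends to be higher
        # An interval containing zero yields no edge (e.g. C-B).
    return edges


if __name__ == "__main__":
    for source, target in t_graph_edges(INTERVALS):
        print(f"{source}→{target}")
```

Under this reading the sketch yields the five edges listed on the slide, and the pair C-B, whose interval contains zero, produces none.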
  • 54. Case Study: GNOME. Relative importance of activity types. [Figure: by author]
  • 55. Case Study: GNOME. Relative importance of activity types. [Figures: by author; by project]
  • 56. Case Study: GNOME. Relative importance of activity types. GNOME projects and authors are code-centric. [Figures: by author; by project]
  • 57. Case Study: GNOME. Relative importance of activity types. GNOME projects and authors are mainly involved in 4 activity types. [Figures: by author; by project]
  • 58. Case Study: GNOME. Heterogeneous communities: Does the relative importance of activity types differ between frequent and occasional authors? Idea: split the authors into two bins of more or less equal size, based on the author workload; about 50% of all authors were involved in < 14 file touches. [Figure: histogram of log(AW) against number of authors, annotated "50% < 14 changes"]
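The split into two bins of roughly equal size can be sketched as a cut at the median author workload. This is an illustration under assumptions: the slide only reports that about 50% of authors have fewer than 14 file touches, so using the median as the threshold, and the function name and sample data, are choices made for the example.

```python
from statistics import median


def split_authors(workload):
    """Split authors into occasional and frequent bins at the median workload.

    `workload` maps each author to a total number of file touches.
    Authors strictly below the median are treated as occasional.
    """
    threshold = median(workload.values())
    occasional = {a for a, w in workload.items() if w < threshold}
    frequent = {a for a, w in workload.items() if w >= threshold}
    return threshold, occasional, frequent


if __name__ == "__main__":
    # Made-up workloads for illustration only.
    sample = {"a": 1, "b": 5, "c": 14, "d": 200}
    threshold, occasional, frequent = split_authors(sample)
    print(threshold, sorted(occasional), sorted(frequent))
```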
  • 59. Case Study: GNOME. Heterogeneous communities. [Figure: occasional authors]
  • 60. Case Study: GNOME. Heterogeneous communities. [Figures: occasional authors; frequent authors]
  • 61. Case Study: GNOME. Heterogeneous communities. Frequent authors are mostly coders, occasional authors are mostly translators. [Figures: occasional authors; frequent authors]
  • 62. Case Study: GNOME. Heterogeneous communities. Observations: coders have a higher workload and are involved in fewer projects; translators are less active but are involved in more projects. This can be explained in part by the use of Damned Lies, a Web application used to manage the localisation (l10n) activities of the GNOME project.
  • 63. Case Study: GNOME. Heterogeneous communities. Sylvia Neu et al., “Telling stories about GNOME with Complicity”, VISSOFT 2011. Complicity is a web-based application supporting software ecosystem analysis by means of interactive visualizations. Affectional bond view: size of rectangle = author's lifetime in days; color = number of projects.
  • 64. Case Study: GNOME. Heterogeneous communities. Unverified assumptions: 1. Authors contributing a lot to few projects are likely to be developers (D). 2. Authors contributing less often to more projects are likely to be translators (T). 3. Authors tend to have an affectional bond to either development or translation work.
  • 65. Case Study: GNOME. Heterogeneous communities. Our work confirms these assumptions. Potential misclassifications in Neu et al.
  • 66. Case Study: GNOME. Relative Workload: How strongly do authors focus on specific activities? Basic measures: RATW(a,t) = % of the total workload of author a dedicated to activity type t; RAWS(a) = author specialisation = Gini index of inequality of RATW(a,t) aggregated over all activity types.
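The two measures can be sketched as below. This is an illustration under assumptions: RATW is computed as a fraction rather than a percentage, the Gini coefficient uses the standard sorted-rank formula without small-sample correction, and the function names and the list of 14 placeholder activity types are made up for the example.

```python
def ratw(author_counts):
    """RATW(a, t) as a fraction: share of an author's touches per type.

    `author_counts` maps activity type to that author's file touches.
    """
    total = sum(author_counts.values())
    if total == 0:
        return {t: 0.0 for t in author_counts}
    return {t: n / total for t, n in author_counts.items()}


def gini(values):
    """Gini coefficient of non-negative values (no small-sample correction)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def raws(author_counts, all_types):
    """RAWS(a): Gini coefficient of RATW(a, t) over all activity types."""
    shares = ratw(author_counts)
    return gini([shares.get(t, 0.0) for t in all_types])


if __name__ == "__main__":
    types = [f"type_{i}" for i in range(14)]  # placeholder type names
    specialist = {"type_0": 10}               # all touches in a single type
    generalist = {t: 5 for t in types}        # touches spread evenly
    print(raws(specialist, types), raws(generalist, types))
```

An author whose touches all fall in one type gets the highest possible value, while an evenly spread author gets 0.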
  • 67. Case Study: GNOME. Relative Workload: How strongly do authors focus? max Gini for n = 14: 0.9285
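The maximum quoted on this slide is consistent with the standard Gini coefficient without small-sample correction; whether the slides use exactly this variant is an assumption. For sorted values $x_1 \le \dots \le x_n$, the extreme case is an author whose whole workload $s$ falls in a single activity type:

```latex
G = \frac{2\sum_{i=1}^{n} i\,x_i}{n\sum_{i=1}^{n} x_i} - \frac{n+1}{n},
\qquad
x_1 = \dots = x_{n-1} = 0,\; x_n = s
\;\Longrightarrow\;
G_{\max} = \frac{2ns}{ns} - \frac{n+1}{n} = \frac{n-1}{n}.
```

For $n = 14$ activity types this gives $13/14 \approx 0.92857$, which matches the slide's 0.9285 when truncated to four decimals.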
  • 68. Case Study: GNOME. Relative Workload: How strongly do authors focus? Occasional authors tend to focus on a single activity type.
  • 69. Case Study: GNOME. Relative Workload: How strongly do authors focus? Frequent authors tend to focus on few activity types.
  • 70. Case Study: GNOME. Workload Distribution. Main observations for GNOME ecosystem: workload is unevenly distributed over projects and authors; clear distinction between frequent and occasional authors; authors form heterogeneous subcommunities (coding versus translation); GNOME is code-centric, i.e., most of the workload is in code-related activities (coding, build files, development documentation).
  • 71. Case Study: GNOME. Next steps. Observation: existing generic support does not take the specificities of the ecosystem into account, making the support suboptimal. Having gained better understanding of the GNOME ecosystem specificities, we hope to come up with better change support mechanisms: dedicated to specific subcommunities (e.g. Damned Lies application for translation community); estimation (of cost or effort) and prediction models (e.g. of defects) could be improved; tools should be able to focus on those activities/projects a contributor is interested in (based on his historic activity profile).
