Studying Evolving Software Ecosystems Inspired by Ecological Models


Research in progress presented by Tom Mens and Maelick Claes (Software Engineering Lab, University of Mons) at the SATToSE 2013 software evolution research seminar at the University of Bern, 9 July 2013


Studying Evolving Software Ecosystems Inspired by Ecological Models

  1. Studying Evolving Software Ecosystems inspired by ecological models. Tom Mens, Maelick Claes (Service de Génie Logiciel); Philippe Grosjean (Service d’écologie numérique des milieux aquatiques). informatique.umons.ac.be/genlog/projects/ecos
  2. 9 July 2013 - SATToSE, Bern. Collaborators
  3. Long-term goals
     • Determine the main factors that drive the success or failure of OSS projects within their ecosystem
     • Investigate new techniques and mechanisms to predict and improve survivability of OSS projects
       – Inspired by research in biological ecology
     • Use these insights to help
       – the developer community to improve upon their practices
       – companies and users to compare and adopt OSS projects
  4. Standing on the shoulders of giants: Lamarck, Lotka, Volterra, Darwin, Pareto
  5. Terminology: Biological ecosystem
     Definitions
     • Ecology: the scientific study of the interactions that determine the distribution and abundance of organisms
     • Ecosystem: the physical and biological components of an environment considered in relation to each other as a unit
       – combines all living organisms (plants, animals, micro-organisms) and physical components (light, water, soil, rocks, minerals)
     Example: coral reefs
     • High biodiversity: polyps, sea anemones, fish, mollusks, sponges, algae
  6. Terminology: Software ecosystem
     Business-oriented view
     • “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.” (Jansen et al. 2009)
     Examples
     • Eclipse
     • Android and iOS app store
  7. Software ecosystem
     Development-centric view
     • “a collection of software products that have some given degree of symbiotic relationships.” (Messerschmitt & Szyperski 2003)
     • “a collection of software projects that are developed and evolve together in the same environment.” (Lungu 2008)
     Examples
     • Gnome, KDE
     • Debian, Ubuntu
     • R’s CRAN
     • Apache
  8. Comparison
  9. Biological evolution AND BY A DUMMY
  10. Ecological theories of evolution of species
     • Lamarckism
       - animal organs and behaviour can change according to the way they are used
       - those characteristics can be transmitted from one generation to the next to reach a greater level of perfection
     • Example
       - giraffes’ necks have become longer while trying to reach the upper leaves of a tree
     Jean-Baptiste Lamarck (1744–1829)
  11. Ecological theories of evolution of species
     • Darwinism
       - all species of life have descended over time from common ancestors
       - this branching pattern of evolution resulted from natural selection, similar to artificial selection in selective breeding
     • Example
       – 13 types of Galapagos finches, same habits and characteristics, but different beaks
     Charles Darwin (1809–1882)
  12. Ecological theories of evolution of species
     Hologenome theory
     • The unit of natural selection is the holobiont: the organism together with its associated microbial communities, which live together in symbiosis.
     • The holobiont can adapt to changing environmental conditions far more rapidly than by genetic mutation and selection alone.
     Competition vs cooperation
     • While Darwin’s theory emphasises competition (survival of the fittest), hologenome theory also includes cooperation (through symbiosis)
  13. Evolution History: Ecology
     Darwin (1837)
     • Evolution history of species can be represented by a phylogenetic tree.
     • Describes the evolutionary relationships among species assuming that they share a common ancestor.
  14. Evolution History: Ecology
     Reticulate evolution
     • Unlike in the Darwinian model, evolution history is represented using a graph structure
     • When reticulation of species occurs, two or more evolutionary lineages are combined at some level of biological organization.
     • Causes
       – hybrid speciation (two lineages recombine to create a new one)
       – horizontal gene transfer (genes are transferred across species)
  15. Evolution History: Software
  16. Trophic web (food chain) in natural ecosystems
  17. Trophic web in software ecosystems
     Producer-consumer relation
     Onion model: Users, Peripheral developers, Core developers
     • TOP-DOWN: change requests & bug reports
     • BOTTOM-UP: changes in core projects and architecture
  18. Core Architecture - or Why developers are polyps
     Coral reef ecosystem
     • Scleractinian coral polyps are responsible for creating the coral reef structure
     • This coral reef is required for the other species of the ecosystem to thrive.
     Software ecosystem
     • Core developers are responsible for creating the core software architecture
     • Based on this core architecture, other developers and third parties can create other projects, services, and so on.
  19. Ecosystem Dynamics
     Predator-prey relationship
     • An instance of the consumer-resource relationship
     • Predators (hunting animals) feed upon their prey (attacked animals)
     Dynamic model
     • Two mutually dependent parametric differential equations (Lotka-Volterra 1925/1926)
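The slide names the Lotka-Volterra model without writing it out. In its usual textbook form, with prey density x and predator density y, it reads dx/dt = αx − βxy and dy/dt = δxy − γy. The following is a minimal sketch of how the two coupled equations could be integrated numerically; the function names, the parameter values in the usage note, and the choice of a fourth-order Runge-Kutta integrator are assumptions made for illustration, not something taken from the talk.

```python
def lotka_volterra(x, y, alpha, beta, delta, gamma):
    """Right-hand side of the textbook predator-prey equations.

    x is the prey density, y the predator density (hypothetical names).
    """
    dx = alpha * x - beta * x * y    # prey grow, and are eaten
    dy = delta * x * y - gamma * y   # predators grow by eating, and die
    return dx, dy


def simulate(x0, y0, alpha, beta, delta, gamma, dt=0.01, steps=1000):
    """Integrate with classical 4th-order Runge-Kutta.

    Returns a list of (time, prey, predators) tuples of length steps + 1.
    """
    def f(x, y):
        return lotka_volterra(x, y, alpha, beta, delta, gamma)

    x, y = x0, y0
    trajectory = [(0.0, x, y)]
    for i in range(1, steps + 1):
        k1x, k1y = f(x, y)
        k2x, k2y = f(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
        k3x, k3y = f(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
        k4x, k4y = f(x + dt * k3x, y + dt * k3y)
        x += dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        y += dt / 6.0 * (k1y + 2 * k2y + 2 * k3y + k4y)
        trajectory.append((i * dt, x, y))
    return trajectory
```

Calling, for example, `simulate(10.0, 5.0, alpha=1.0, beta=0.1, delta=0.075, gamma=1.5)` yields the familiar oscillation of both populations around the equilibrium x = γ/δ, y = α/β. The parameter values are arbitrary.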
  20. Software Ecosystem Dynamics
     Analogies based on predator-prey relationship
     • Debuggers are predators, software defects are prey
       [Calzolari et al. “Maintenance and testing effort modeled by linear and nonlinear dynamic systems,” Information and Software Technology, 43(8): 477–486, 2001]
     • Developers are predators, the information they seek is prey
       [Lawrance et al. “Scents in programs: Does information foraging theory apply to program maintenance?” VL/HCC 2007, pp. 15–22]
  21. Software Ecosystem Dynamics
     Analogies based on predator-prey relationship
     • Dual views in a software ecosystem
       – Developers are predators, the projects they work on are prey
       – Projects are predators that feed upon the cognitive resources of their developers
     Bipartite developer-project graph (project 1, project 2, project 3)
  22. Other desirable ecosystem characteristics
     • Stability: the capacity to maintain an equilibrium over longer periods of time
     • Resistance: the ability to withstand environmental changes without too much disturbance of its biological communities
     • Resilience: the ability to return to an equilibrium after a disturbance
     • Higher biodiversity favours these characteristics
     Paper shown on slide: Uzma Raja and Marietta J. Tretter, “Defining and Evaluating a Measure of Open Source Project Survivability,” IEEE Transactions on Software Engineering, vol. 38, no. 1, January/February 2012, p. 163. The paper defines Project Viability, a measure of OSS project survivability with three dimensions (vigor, resilience, and organization) that are combined into a Viability Index and validated on archival data of projects hosted at SourceForge.net.
  23. Ecosystem Biodiversity
     • Biodiversity: the degree of variation of species within a given ecosystem
     Measuring diversity
     • Based on Shannon’s notion of entropy and the 2nd law of thermodynamics
     • Species diversity: $H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)$
       – X = set of n distinct species x_i
       – p(x_i) = proportion of all individuals that belong to species x_i
     • Interpretation
       - Maximum diversity if all species have the same number of individuals
       - Low diversity if a particular species dominates the others
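The species diversity formula on this slide can be sketched in a few lines of Python. The function name and the input format (one individual count per species) are assumptions chosen for illustration.

```python
import math


def shannon_diversity(counts):
    """Shannon diversity H(X) = -sum p(x_i) ln p(x_i).

    counts holds the number of individuals per species (hypothetical input
    format); species with zero individuals contribute nothing to the sum.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total          # proportion of individuals in this species
            h -= p * math.log(p)
    return h
```

With four equally abundant species the result is ln 4, the maximum for n = 4. When one species accounts for nearly all individuals, the value approaches 0, which matches the interpretation given on the slide.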
  24. Evolution of diversity in open source software
     Econometric indices
     • Gini and Theil are measures of inequality in a distribution
     • Gini = A/(A+B)
     Paper shown on slide: Rajesh Vasa, Markus Lumpe, Philip Branch (Swinburne University of Technology) and Oscar Nierstrasz (University of Bern), “Comparative Analysis of Evolving Software Systems Using the Gini Coefficient.” The paper proposes analyzing software metrics, which are typically heavily skewed, with the Gini coefficient, a statistic widely used in economics to study the distribution of wealth. It reports that many metrics display remarkably high Gini values, that these values are remarkably consistent as a project evolves, and that Gini coefficients normally change little between adjacent releases.
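The slide gives only Gini = A/(A+B). In the standard definition, A is the area between the line of perfect equality and the Lorenz curve, and B is the area under the Lorenz curve. A minimal sketch under that assumption follows; the function name and the trapezoidal construction of the Lorenz curve are illustrative choices, not taken from the talk or the paper.

```python
def gini(values):
    """Gini coefficient computed as A / (A + B) from the Lorenz curve.

    values are non-negative amounts (e.g. commits per developer).
    B is the area under the piecewise-linear Lorenz curve, and
    A = 0.5 - B is the area between that curve and the diagonal.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cumulative = 0.0
    area_b = 0.0
    for x in xs:
        previous = cumulative
        cumulative += x / total
        # trapezoid under the Lorenz curve over one step of width 1/n
        area_b += (previous + cumulative) / (2 * n)
    area_a = 0.5 - area_b
    return area_a / (area_a + area_b)
```

A perfectly even distribution gives 0. If a single member out of n accounts for everything, this discrete form gives (n − 1)/n, which tends to 1 as n grows.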
  25. Evolution of diversity in open source software
     Econometric indices
     - Theil index: corresponds to Shannon’s notion of entropy
     [Figure: plots for Evince and Brasero; panels: commits sent, e-mails sent, bug reports modified]
     Paper shown on slide: Mathieu Goeminne and Tom Mens (Université de Mons – UMONS), “Evidence for the Pareto principle in Open Source Software Activity.” Following the GQM paradigm, the paper adapts econometric metrics of inequality in the distribution of wealth to assess how activity is distributed over the members of OSS projects. For version repositories, bug trackers, and mailing lists alike, and for all three projects studied, the distribution of activity turns out to be highly imbalanced: most of the activity is carried out by a small group of people.
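The link between the Theil index and Shannon entropy mentioned on the slide can be made concrete. In its standard form, the Theil index is T = (1/n) Σ (x_i/μ) ln(x_i/μ), which equals ln n − H, where H is the Shannon entropy of the activity shares x_i/Σx. The sketch below assumes that standard form; the function name and the input format are hypothetical.

```python
import math


def theil_index(values):
    """Theil T index: (1/n) * sum((x_i / mean) * ln(x_i / mean)).

    values are non-negative activity counts per person (hypothetical input);
    terms with x_i == 0 are taken as 0, following the limit of x ln x.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    mean = total / n
    t = 0.0
    for x in values:
        if x > 0:
            ratio = x / mean
            t += ratio * math.log(ratio)
    return t / n
```

The index is 0 when everyone is equally active and reaches ln n when one person carries out all the activity.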
  26. Software Ecosystem Biodiversity
     • Uses notion of biodiversity to measure developer activity focus and module activity focus (cf. bipartite author-module graph)
     • Based on notion of relative entropy
     • More details: see results of hackathon.
     Paper shown on slide: Daryl Posnett, Raissa D’Souza, Premkumar Devanbu and Vladimir Filkov (University of California Davis), “Dual Ecological Measures of Focus in Software Development,” ICSE 2013, San Francisco, CA, USA. The paper likens the developer-artifact contribution network to a predator-prey food web and derives two normalized measures, related to cross-entropy and Kullback-Leibler divergence: module activity focus (MAF) and developer’s attention focus (DAF). It finds that more focused developers introduce fewer defects than defocused developers, whereas files that receive narrowly focused activity are more likely to contain defects than other files.
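To illustrate what "based on relative entropy" can mean here, the sketch below computes, for each developer, the Kullback-Leibler divergence between that developer's distribution of contributions over modules and the overall distribution of activity over modules. This is a simplified, unnormalized stand-in written for illustration: it is not the normalized DAF/MAF definition from the paper, and the data layout and function names are assumptions.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum p_i ln(p_i / q_i), skipping p_i == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def developer_focus(contrib):
    """Unnormalized focus score per developer (illustrative only).

    contrib maps developer -> {module: number of contributions}.
    A score of 0 means the developer spreads work like the project as a
    whole; larger values mean work concentrated on fewer modules.
    """
    module_totals = {}
    for row in contrib.values():
        for module, count in row.items():
            module_totals[module] = module_totals.get(module, 0) + count
    grand_total = sum(module_totals.values())

    focus = {}
    for dev, row in contrib.items():
        dev_total = sum(row.values())
        if dev_total == 0:
            continue
        touched = [m for m, c in row.items() if c > 0]
        p = [row[m] / dev_total for m in touched]             # developer's own mix
        q = [module_totals[m] / grand_total for m in touched]  # project-wide mix
        focus[dev] = kl_divergence(p, q)
    return focus
```

Swapping the roles of developers and modules in the input gives the dual, module-side score, in the spirit of the bipartite author-module graph mentioned on the slide.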
  27. Migration in software ecosystems: Gnome case study
  28. Migration in software ecosystems
     • How do software projects evolve?
       – Analogy to “gene transfer” in reticulate evolution
     • Transfer of knowledge
       – Sharing or migration of contributors across projects
     • Transfer of code
       – Copy-paste reuse and code cloning
       – Branching and merging of code repositories
  29. Migration in software ecosystems: Gnome case study
     • 16 years of activity
     • > 1400 projects
     • > 5800 contributors (> 4300 coders)
     • > 1.3M commits (> 0.6M code commits)
     • > 12M file touches (> 6M code file touches)
     [Figure: files versus LOC per programming language: C, Java, Objective C, Python, Lisp, JS, ASP.Net, C/C++ Header, C++, Perl, yacc, C#, IDL, Haskell, Objective C++, lex, Assembly, Visual Basic, PHP, Ruby, Tcl/Tk]
  30. Migration in software ecosystems: Gnome case study
     Hierarchical clustering of projects
     • Developers tend to collaborate more if they use the same programming language
  31. Migration in software ecosystems: Gnome case study
  32. 32. Migration in software ecosystems: Gnome case study
Timeline (6-month intervals) of joiners to Gnome projects
- Black = local joiners from other Gnome projects
- Red = global joiners from outside of Gnome
- Blue = stayers
[Figure: three timeline panels (Evolution, Gimp, GTK+); x-axis: time, 1997 to 2013; y-axis: number of joiners.]
Excerpt from the book chapter shown on the slide:
Global joiners are incoming coders in the considered project that were not active in any of the GNOME projects during the preceding period. A similar definition holds for the local and global leavers. Formally, the metrics are defined as follows. Let $p$ be a GNOME project, $t$ a 6-month activity period (and $t-1$ the previous period), $c$ a coder, $Gnome$ the set of GNOME's code projects, and $isDev(c,t,p)$ a predicate that is true if and only if $c$ made a code commit in $p$ during $t$:
$localLeavers(p,t) = \{c \mid isDev(c,t-1,p) \land \lnot isDev(c,t,p) \land \exists p_2\,(p_2 \in Gnome \land isDev(c,t,p_2))\}$
$globalLeavers(p,t) = \{c \mid isDev(c,t-1,p) \land \forall p_2\,(p_2 \in Gnome \Rightarrow \lnot isDev(c,t,p_2))\}$
$localJoiners(p,t) = \{c \mid isDev(c,t,p) \land \lnot isDev(c,t-1,p) \land \exists p_2\,(p_2 \in Gnome \land isDev(c,t-1,p_2))\}$
$globalJoiners(p,t) = \{c \mid isDev(c,t,p) \land \forall p_2\,(p_2 \in Gnome \Rightarrow \lnot isDev(c,t-1,p_2))\}$
Fig. 1.11 Historical evolution (timeline) of the number of local (black solid) and global (red dashed) joiners (y-axis) for three GNOME projects (evolution, gtk+, gimp). We did not find any general trend; the patterns of intake and loss of coders are highly project-specific.
Figure 1.11 illustrates the evolution of the number of local and global joiners for some of the more important GNOME projects (the figures for leavers are very similar). For some projects (e.g., evolution) we do not observe a big difference between the number of local and global joiners. These projects seem to attract new developers both from within and outside of GNOME. Other projects, like gimp, attract most of their incoming developers from outside GNOME. A third category of projects attracts most of its incoming developers from other GNOME projects. This is the case for gtk+, glib and libgnome, which can be considered as belonging to the core of GNOME. This observation seems to [...]
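The set definitions of local and global joiners and leavers on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tooling: the commit log is a hypothetical list of (coder, period, project) triples, periods are consecutive integers standing in for 6-month windows, and the function and variable names are invented for the example. The project p is assumed to be a member of the project set.

```python
# Hypothetical commit log: (coder, period, project). Periods are consecutive
# integers standing in for 6-month activity windows (an assumption).
commits = [
    ("ann", 0, "gtk+"), ("ann", 1, "gimp"),
    ("bob", 1, "gimp"),
    ("cat", 0, "gimp"), ("cat", 1, "gimp"),
    ("dan", 0, "gimp"),
]
projects = {"gtk+", "gimp"}  # stands in for the set Gnome


def build_index(commit_log):
    """Map (period, project) to the set of coders who committed there."""
    index = {}
    for coder, period, project in commit_log:
        index.setdefault((period, project), set()).add(coder)
    return index


def devs(index, t, p):
    """All c such that isDev(c, t, p) holds."""
    return index.get((t, p), set())


def active_anywhere(index, t, all_projects):
    """All c such that isDev(c, t, p2) holds for some p2 in all_projects."""
    result = set()
    for p2 in all_projects:
        result |= devs(index, t, p2)
    return result


def local_joiners(index, all_projects, p, t):
    new_in_p = devs(index, t, p) - devs(index, t - 1, p)
    return new_in_p & active_anywhere(index, t - 1, all_projects)


def global_joiners(index, all_projects, p, t):
    return devs(index, t, p) - active_anywhere(index, t - 1, all_projects)


def local_leavers(index, all_projects, p, t):
    gone_from_p = devs(index, t - 1, p) - devs(index, t, p)
    return gone_from_p & active_anywhere(index, t, all_projects)


def global_leavers(index, all_projects, p, t):
    return devs(index, t - 1, p) - active_anywhere(index, t, all_projects)


index = build_index(commits)
gimp_local_joiners = local_joiners(index, projects, "gimp", 1)
gimp_global_joiners = global_joiners(index, projects, "gimp", 1)
gimp_global_leavers = global_leavers(index, projects, "gimp", 1)
gtk_local_leavers = local_leavers(index, projects, "gtk+", 1)
```

In this toy log, ann moves from gtk+ to gimp (a local joiner of gimp and a local leaver of gtk+), bob arrives from outside (a global joiner), and dan disappears from the ecosystem (a global leaver). Precomputing the per-period index keeps each metric a plain set operation.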
  33. 33. Migration in software ecosystems: Gnome case study
Timeline (6-month intervals) of leavers from Gnome projects
- Black = local joiners from other Gnome projects
- Red = global joiners from outside of Gnome
- Blue = stayers
[Figure: three timeline panels (Evolution, Gimp, GTK+); x-axis: time, 1997 to 2013; y-axis: number of leavers.]
Excerpt from the book chapter shown on the slide (p. 28; Tom Mens, Maëlick Claes, Philippe Grosjean and Alexander Serebrenik):
[...] project that were not active in this project during the preceding 6-month period, but that were involved in some activity in other GNOME projects instead. Global joiners are incoming coders in the considered project that were not active in any of the GNOME projects during the preceding period. A similar definition holds for the local and global leavers. Formally, the metrics are defined as follows. Let $p$ be a GNOME project, $t$ a 6-month activity period (and $t-1$ the previous period), $c$ a coder, $Gnome$ the set of GNOME's code projects, and $isDev(c,t,p)$ a predicate that is true if and only if $c$ made a code commit in $p$ during $t$:
$localLeavers(p,t) = \{c \mid isDev(c,t-1,p) \land \lnot isDev(c,t,p) \land \exists p_2\,(p_2 \in Gnome \land isDev(c,t,p_2))\}$
$globalLeavers(p,t) = \{c \mid isDev(c,t-1,p) \land \forall p_2\,(p_2 \in Gnome \Rightarrow \lnot isDev(c,t,p_2))\}$
$localJoiners(p,t) = \{c \mid isDev(c,t,p) \land \lnot isDev(c,t-1,p) \land \exists p_2\,(p_2 \in Gnome \land isDev(c,t-1,p_2))\}$
$globalJoiners(p,t) = \{c \mid isDev(c,t,p) \land \forall p_2\,(p_2 \in Gnome \Rightarrow \lnot isDev(c,t-1,p_2))\}$
  34. 34. Migration in software ecosystems: Gnome case study
[Figure: three timeline panels (Evolution, Gimp, GTK+); x-axis: time, 1997 to 2013; y-axis: number of leavers.]
CF(p) = collaboration factor for Gnome project p = percentage of coders in p having contributed to other Gnome projects
- CF(Gimp) = 65.3% (low collaboration)
- CF(GTK+) = 94.8% (very high collaboration)
- CF(Evolution) = 85.1% (high collaboration)
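As a rough illustration of the collaboration factor, the sketch below computes CF(p) as the share of a project's coders who also appear in at least one other project. The slide does not say over which time span contributions are counted, so counting over the whole history is an assumption here; the data, the coder names, and the function name `collaboration_factor` are all hypothetical.

```python
def collaboration_factor(contributions, project):
    """Percentage of coders in `project` who also contributed elsewhere.

    `contributions` maps a project name to the set of its coders
    (assumed to cover the whole observed history).
    """
    coders = contributions.get(project, set())
    if not coders:
        return 0.0  # convention chosen for this sketch: empty project -> 0%
    elsewhere = set()
    for name, members in contributions.items():
        if name != project:
            elsewhere |= members
    return 100.0 * len(coders & elsewhere) / len(coders)


# Toy data, not taken from the GNOME dataset.
contributions = {
    "gimp": {"ann", "bob", "cat", "dan"},
    "gtk+": {"ann", "eve"},
    "evolution": {"bob", "eve"},
}
cf_gimp = collaboration_factor(contributions, "gimp")
cf_gtk = collaboration_factor(contributions, "gtk+")
```

With this toy data, two of the four gimp coders also work elsewhere, while both gtk+ coders do, which mirrors the low versus very high ordering reported on the slide without reproducing its actual figures.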
  35. 35. Migration in software ecosystems: Gnome case study
  36. 36. Some references
- Bogdan Vasilescu, Alexander Serebrenik, Mathieu Goeminne, Tom Mens. On the variation and specialisation of workload: A case study of the Gnome ecosystem community. To appear in 2013 in Springer's Empirical Software Engineering journal. DOI: 10.1007/s10664-013-9244-1
- Tom Mens, Maëlick Claes, Philippe Grosjean and Alexander Serebrenik. Chapter 10: Studying Evolving Software Ecosystems based on Ecological Models [to appear in 2014].
Abstract: Research on software evolution is very active, but evolutionary principles, models and theories that properly explain why and how software systems evolve over time are still lacking. Similarly, more empirical research is needed to understand how different software projects co-exist and co-evolve, and how contributors collaborate within their encompassing software ecosystem. In this chapter, we explore the differences and analogies between natural ecosystems and biological evolution on the one hand, and software ecosystems and software evolution on the other hand. The aim is to learn from research in ecology to advance the understanding of evolving software ecosystems. Ultimately, we wish to use such knowledge to derive diagnostic tools aiming to analyse and optimise the fitness of software projects in their environment, and to help software project communities in managing their projects better.
Affiliations: Tom Mens, Maëlick Claes and Philippe Grosjean, COMPLEXYS Research Institute, University of Mons, Belgium, e-mail: {tom.mens, maelick.claes, philippe.grosjean}@umons.ac.be. Alexander Serebrenik, Eindhoven University of Technology, The Netherlands, e-mail: a.serebrenik@tue.nl
Acknowledgement: This work has been partially supported by F.R.S-F.N.R.S. research grant BSS-2012/V 6/5/015 and ARC research project AUWB-12/17-UMONS-3, "Ecological Studies of Open Source Software Ecosystems", financed by the Ministère de la Communauté française - Direction générale de l'Enseignement non obligatoire et de la Recherche scientifique, Belgium.
- Mathieu Goeminne. Understanding the Evolution of Socio-technical Aspects in Open Source Ecosystems: An Empirical Analysis of GNOME. Dissertation submitted in fulfillment of the requirements of the degree of Docteur en Sciences, UMONS, Faculté des Sciences, Département d'Informatique, June 2013.
Advisor: Dr. Tom Mens (Université de Mons, Belgium).
Jury: Dr. Xavier Blanc (Université de Bordeaux 1, France), Dr. Véronique Bruyère (Université de Mons, Belgium), Dr. Jesus M. Gonzalez-Barahona (Universidad Rey Juan Carlos, Spain), Dr. Tom Mens (Université de Mons, Belgium), Dr. Alexander Serebrenik (Technische Universiteit Eindhoven, The Netherlands), Dr. Jef Wijsen (Université de Mons, Belgium).
- Mathieu Goeminne, Maëlick Claes and Tom Mens (Software Engineering Lab, COMPLEXYS research institute, UMONS, Belgium). A historical dataset for GNOME contributors. MSR 2013.
Abstract: We present a dataset of the open source software ecosystem GNOME from a social point of view. We have collected historical data about the contributors to all GNOME projects stored on git.gnome.org, taking into account the problem of identity matching, and associating different activity types to the contributors. This type of information is very useful to complement the traditional, source-code related information one can obtain by mining and analyzing the actual source code. The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.
I. Introduction: In this paper, we present the process we have used to create a dataset containing the historical information related to contributors to the GNOME ecosystem. Our database and the tools and scripts used to create it can be found on a dedicated Bitbucket repository. In contrast to many other datasets, we do not focus on source code, since a significant amount of files committed to GNOME's project repositories do not even contain code (e.g., image files, web pages, documentation, localization and many more). Such type of information is often ignored in MSR research while it is very relevant to understand which types of activities contributors are [...]
