This is my second in a series of 4 lectures on the topic of Evolving Software Ecosystems, presented during the NATO Marktoberdorf 2014 Summer School on Dependable Software System Engineering in Germany, August 2014.
7. July-August 2014 - NATO Marktoberdorf Summer School - Dependable Software Systems Engineering
February 2014 - CSMR-WCRE Software Evolution Week, Antwerp, Belgium
Software Evolution
Relevant Books
Mens, Tom; Serebrenik, Alexander; Cleve, Anthony (Eds.)
2014, XXIII, 404 p.
Springer, ISBN 978-3-642-45398-4
Chapter 10
Studying Evolving Software Ecosystems
based on Ecological Models
Tom Mens, Maëlick Claes, Philippe Grosjean and Alexander Serebrenik
Research on software evolution is very active, but evolutionary principles, models
and theories that properly explain why and how software systems evolve over time
are still lacking. Similarly, more empirical research is needed to understand how
different software projects co-exist and co-evolve, and how contributors collaborate
within their encompassing software ecosystem.
In this chapter, we explore the differences and analogies between natural
ecosystems and biological evolution on the one hand, and software ecosystems and
software evolution on the other hand. The aim is to learn from research in ecology to
advance the understanding of evolving software ecosystems. Ultimately, we wish
to use such knowledge to derive diagnostic tools aiming to analyse and optimise
the fitness of software projects in their environment, and to help software project
communities in managing their projects better.
32. Technical Challenges
Identity merging
• several merge algorithms exist
• the "noisier" the data, the worse they perform!
• simple algorithms have higher precision and recall than more complex ones
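Precision and recall for identity merging are commonly computed over pairs of identities. The following is a minimal sketch under that assumption, not the evaluation code used in the paper; the function names and the toy data are hypothetical.

```python
from itertools import combinations


def merged_pairs(clusters):
    """Return the set of unordered identity pairs placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def precision_recall(predicted, reference):
    """Pairwise precision and recall of predicted clusters against a reference."""
    pred = merged_pairs(predicted)
    ref = merged_pairs(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(ref) if ref else 1.0
    return precision, recall


# Hypothetical toy data: identities are plain strings.
reference = [{"jdoe", "john.doe@example.org", "John Doe"}, {"asmith"}]
predicted = [{"jdoe", "john.doe@example.org"}, {"John Doe", "asmith"}]
print(precision_recall(predicted, reference))
```

In this toy case one of the two predicted pairs is correct and one of the three reference pairs is found, so a wrong merge lowers precision while a missed merge lowers recall.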
A Comparison of Identity Merge
Algorithms for Software Repositories
Mathieu Goeminne*, Tom Mens*
Institut d'Informatique, Faculté des Sciences, Université de Mons
Abstract
Software repository mining research extracts and analyses data originating from
multiple software repositories to understand the historical development of
software systems, and to propose better ways to evolve such systems in the future.
Of particular interest is the study of the activities and interactions between the
persons involved in the software development process. The main challenge with
such studies lies in the ability to determine the identities (e.g., logins or e-mail
accounts) in software repositories that represent the same physical person. To
achieve this, different identity merge algorithms have been proposed in the past.
This article provides an objective comparison of identity merge algorithms,
including some improvements over existing algorithms. The results are validated
on a selection of large ongoing open source software projects.
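To make the problem concrete, here is a minimal sketch of a simple merge heuristic: two identities are merged when their normalised name or e-mail prefix coincide. It is an illustration in the spirit of the "simple" algorithms, not one of the algorithms compared in the article; all names and sample identities are hypothetical.

```python
import re


def normalize(text):
    """Lower-case the text and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def keys(name, email):
    """Comparison keys for one (name, email) identity."""
    result = set()
    if normalize(name):
        result.add(normalize(name))
    prefix = normalize(email.split("@")[0])
    if prefix:
        result.add(prefix)
    return result


def merge_identities(identities):
    """Group identities that share at least one key (union-find)."""
    parent = list(range(len(identities)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # key -> index of the first identity seen with that key
    for index, (name, email) in enumerate(identities):
        for key in keys(name, email):
            if key in owner:
                parent[find(index)] = find(owner[key])
            else:
                owner[key] = index

    groups = {}
    for index, identity in enumerate(identities):
        groups.setdefault(find(index), []).append(identity)
    return list(groups.values())


# Hypothetical identities as they might appear in different repositories.
identities = [
    ("John Doe", "jdoe@example.org"),
    ("jdoe", "john@example.com"),
    ("John  Doe", "john.doe@example.net"),
    ("Alice Smith", "asmith@example.org"),
]
for group in merge_identities(identities):
    print(group)
```

Exact key matching keeps false merges rare but misses aliases that differ by typos or reordered name parts, which is where noisier data starts to hurt.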
Keywords: software repository mining, empirical software engineering,
identity merging, open source, software evolution, comparison
Science of Computer Programming 78(8), August 2013
33. Technical Challenges
Identity merging
Alternative automated approach
• Use of Latent Semantic Analysis (LSA)
• as good as other algorithms in the average case
• better performance in the worst case
parameters, we first performed a sensitivity analysis by
fixing 3 and varying the remaining. After the sensitivity
analysis we restricted the range of minLen to {2, 3, 4},
levThr to {0.5, 0.75}, cosThr to {0.65, 0.70, 0.75}, and k
was fixed to half of the number of terms. In the average
case, for each of the ten repetitions, training was performed
on one tenth of the GNOME aliases (≈ 860), and testing on
ten random subsets with the same size from the remaining
aliases. Samples were chosen instead of the entire remaining
data for computational efficiency reasons. In the worst case,
because of fewer aliases in the dataset (673), for each of the
ten repetitions, training was performed on one third of the
data and testing on the other two thirds. All algorithms, as
well as the data, can be made available upon request.
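The average-case sampling protocol described in this excerpt can be sketched as follows. This is an illustrative reconstruction of the split only, not the authors' code; the function name, the seed, and the placeholder alias list (8600 entries, chosen so that one tenth is 860) are assumptions.

```python
import random


def repeated_splits(aliases, repetitions=10, train_fraction=0.1,
                    test_subsets=10, seed=0):
    """Yield (training, test_samples) pairs for the average-case protocol."""
    rng = random.Random(seed)
    train_size = int(len(aliases) * train_fraction)
    for _ in range(repetitions):
        shuffled = list(aliases)
        rng.shuffle(shuffled)
        training = shuffled[:train_size]
        remaining = shuffled[train_size:]
        # Random test subsets, each as large as the training set,
        # drawn only from the aliases not used for training.
        tests = [rng.sample(remaining, train_size) for _ in range(test_subsets)]
        yield training, tests


# Placeholder data standing in for the GNOME aliases.
aliases = [f"alias{i}" for i in range(8600)]
for training, tests in repeated_splits(aliases):
    print(len(training), len(tests), len(tests[0]))
```

For the worst-case setting described in the excerpt, one would instead train on one third of the data and test on the whole remaining two thirds rather than on samples.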
Who's who in GNOME: using LSA to merge software repository identities
Erik Kouters, Bogdan Vasilescu*, Alexander Serebrenik, Mark G. J. van den Brand
Technische Universiteit Eindhoven,
Den Dolech 2, P.O. Box 513,
5600 MB Eindhoven, The Netherlands
e.t.m.kouters@student.tue.nl, {b.n.vasilescu, a.serebrenik, m.g.j.v.d.brand}@tue.nl
Abstract: Understanding an individual's contribution to an
ecosystem often necessitates integrating information from
multiple repositories corresponding to different projects within
the ecosystem or different kinds of repositories (e.g., mail
archives and version control systems). However, recognising
that different contributions belong to the same contributor is
challenging, since developers may use different aliases.
It is known that existing identity merging algorithms are
sensitive to large discrepancies between the aliases used by
the same individual: the noisier the data, the worse their
performance. To assess the scale of the problem for a large
software ecosystem, we study all GNOME Git repositories,
classify the differences in aliases, and discuss robustness of
existing algorithms with respect to these types of differences.
We then propose a new identity merging algorithm based on
Latent Semantic Analysis (LSA), designed to be robust against
more types of differences in aliases, and evaluate it empirically
by means of cross-validation on GNOME Git authors. Our
results show a clear improvement over existing algorithms in
terms of precision and recall on worst-case input data.
Keywords: identity merging; Gnome; latent semantic analysis
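The core idea of LSA-based merging can be sketched as below: build a term-by-alias matrix, truncate its singular value decomposition to rank k, and treat aliases whose latent vectors have a high cosine similarity as merge candidates. This is a minimal sketch assuming NumPy is available; it omits the paper's preprocessing (the minLen and levThr parameters are not modelled), and the tokenisation, function name, and sample aliases are hypothetical. Only the cosine threshold and the choice of k as half the number of terms follow the values quoted in the excerpt.

```python
import numpy as np


def lsa_similar_pairs(aliases, cos_thr=0.75):
    """Return index pairs of aliases whose LSA vectors are close."""
    # Terms: lower-cased alphanumeric tokens from name and e-mail prefix.
    docs = []
    for name, email in aliases:
        text = (name + " " + email.split("@")[0]).lower()
        tokens = "".join(c if c.isalnum() else " " for c in text).split()
        docs.append(tokens)
    terms = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(terms)}

    # Term-by-alias occurrence matrix (rows: terms, columns: aliases).
    matrix = np.zeros((len(terms), len(docs)))
    for j, doc in enumerate(docs):
        for t in doc:
            matrix[index[t], j] += 1.0

    # Rank-k truncation of the SVD; k is half the number of terms,
    # capped by the number of available singular values.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(max(1, len(terms) // 2), len(s))
    vectors = (np.diag(s[:k]) @ vt[:k]).T  # one row per alias

    # Cosine similarity between alias vectors in the latent space.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = vectors / norms
    cosine = unit @ unit.T
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if cosine[i, j] >= cos_thr]


# Hypothetical aliases in the (name, email) form found in VCS logs.
aliases = [
    ("John Doe", "john.doe@example.org"),
    ("John Doe", "jdoe@example.com"),
    ("Alice Smith", "asmith@example.org"),
]
print(lsa_similar_pairs(aliases))
```

With so few aliases the truncation keeps every singular value, so the latent space only starts to differ from plain term matching on larger inputs where k is smaller than the matrix rank.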
I. INTRODUCTION
One of the challenges when mining software repositories
is identity merging [5]. To study contributors to software
projects or software ecosystems, one often tries to integrate
information about their contributions in different software
repositories, such as version control systems, bug trackers, or
mailing lists. However, developers may use different aliases
To integrate information about individual contributions,
we therefore need a unique identity representing the
same contributor across different repositories and different
projects. To this end, we need to use an identity merging
algorithm [1, 3, 5, 8, 9]. However, performance of existing
approaches degrades sharply in presence of "noisy" data, i.e.,
data containing large discrepancies between the aliases used
by the same individual: "the more noisy and complex [...]
project data is, the worse the merge algorithms behave" [...].
In this paper we concentrate on aliases used by developers
in version control systems (VCS); here the term "alias"
refers to a ⟨name, email⟩ tuple, typically available in VCS
logs. Even for a single repository type such as VCS, the
same contributor may use different aliases at different times
or in different projects within the ecosystem. Our goal
is to design an identity merge algorithm with improved
robustness with respect to noisy data, common in ecosystems
maintained by large developer communities. We start by
extracting commit authorship information from all GNOME
Git repositories, and discuss differences in the aliases used
by GNOME developers in Section II. Next, we evaluate
robustness of two state of the art identity merging algorithms
with respect to types of differences in aliases in Section [...].
Based on lessons learned from existing approaches, we
propose a new identity merging algorithm using Latent
Semantic Analysis (LSA) [6] in Section IV, and evaluate
it empirically by means of cross-validation in Section [...].
Our results show equally-good performance as the state [...]
ICSM 2012 ERA track