MOD2014-Mens-Lecture2

Evolving Software Ecosystems
Marktoberdorf Summer School 2014
Lecture 2
Tom Mens
Software Engineering Lab
University of Mons
informatique.umons.ac.be/genlog

Software Evolution
Lehman’s Laws
• Manny Lehman (1925-2010)
– Studied 30-year evolution of IBM OS/360 mainframe
– Proposed “laws” that reflect established observations based on empirical evidence
– EPSRC-funded FEAST project
• Additional evidence on more industrial software projects
Lehman and Belady (1985). Software Evolution –
Processes of Software Change. Academic Press.
Lehman (1997). Laws of Software Evolution
Revisited. Springer LNCS 1149, pp. 108-124

Software Evolution
Lehman’s Laws
• Continuing change
– A […] program that is used in a real-world environment must be continually adapted, else it becomes progressively less satisfactory.
• Increasing complexity
– As a program is evolved its complexity increases unless work is done to maintain or reduce it.
• Continuing growth
– Functional content of a program must be continually increased to maintain user satisfaction over its lifetime.
• Declining quality
– […] programs will be perceived as of declining quality unless rigorously maintained and adapted to a changing operational environment.
• Feedback system
– […] programming processes constitute multi-loop, multi-level feedback systems and must be treated as such to be successfully modified or improved.

July-August 2014 - NATO Marktoberdorf Summer School - Dependable Software Systems Engineering
February 2014 - CSMR-WCRE Software Evolution Week, Antwerp, Belgium
Software Evolution
Relevant Books
2006
Consider software evolution process as a multi-loop multi-level feedback system
- Reports on results from the EPSRC-funded FEAST project
- Supporting empirical evidence for Lehman’s laws of software evolution

Software Evolution
Relevant Books
2008
Relevant chapters
- Analyzing Software Repositories to Understand Software Evolution (D’Ambros et al.)
- Predicting Bugs From History (Zimmermann et al.)
- Empirical Studies of Open Source Evolution (Fernandez-Ramil et al.)

Software Evolution
Relevant Books
Mens, Tom; Serebrenik, Alexander; Cleve, Anthony (Eds.)
2014, XXIII, 404 p.
Springer, ISBN 978-3-642-45398-4
Chapter 10
Studying Evolving Software Ecosystems based on Ecological Models
Tom Mens, Maëlick Claes, Philippe Grosjean and Alexander Serebrenik
Research on software evolution is very active, but evolutionary principles, models and theories that properly explain why and how software systems evolve over time are still lacking. Similarly, more empirical research is needed to understand how different software projects co-exist and co-evolve, and how contributors collaborate within their encompassing software ecosystem.
In this chapter, we explore the differences and analogies between natural ecosystems and biological evolution on the one hand, and software ecosystems and software evolution on the other hand. The aim is to learn from research in ecology to advance the understanding of evolving software ecosystems. Ultimately, we wish to use such knowledge to derive diagnostic tools aiming to analyse and optimise the fitness of software projects in their environment, and to help software project communities in managing their projects better.

Software Ecosystems
Definitions

Software Ecosystems
Relevant Books
MIT Press, 2005
2013

Software Ecosystems
Relevant PhD Dissertations
Reverse Engineering Software Ecosystems
Doctoral dissertation submitted to the Faculty of Informatics of the University of Lugano in partial fulfillment of the requirements for the degree of Doctor of Philosophy, presented by Mircea F. Lungu under the supervision of Michele Lanza, September 2009
Social Aspects of Collaboration in Online Software Communities
Bogdan Vasilescu, Eindhoven University of Technology, 2014

Software Ecosystems
Definitions
• Messerschmitt & Szyperski, 2003 [book]
– “a collection of software products that have some given degree of symbiotic relationships.”
• Lungu, 2008 [dissertation]
– “a collection of software projects that are developed and evolve together in the same environment.”
• Jansen et al., 2013 [book]
– “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.”

Software Ecosystems
Definitions
Business-oriented view
• “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.”
Examples
• Eclipse
• Android and iOS app store

Software Ecosystems
Definitions
Development-centric view
• “a collection of software products that have some given degree of symbiotic relationships.”
• “a collection of software projects that are developed and evolve together in the same environment.”
Examples
• Gnome, KDE
• Debian, Ubuntu
• R’s CRAN
• Apache

Software Ecosystems
Definitions
Socio-technical view
• a community of persons (end-users, developers, debuggers, …) contributing to a collection of projects
[Figure labels: Project 1, Project 2, Project 3]

Software Ecosystems
Definitions
Ecosystem <> System of systems (cf. John McDermid)
An ecosystem is a set of systems that is “designed as a whole”.
These systems
• cannot function in isolation (symbiotic relationships)
• are usually very diverse
• function together as a unit
• are evolved together towards a common (but evolving) goal

Software Ecosystem Analysis
Challenges
Empirically analysing software ecosystems involves many challenges
• Technical challenges
• Scientific challenges
• Practical challenges
• Ethical challenges
• …

Challenges
Technical challenges
• Extracting and combining data from different sources
• Identity merging
• Dealing with inconsistent and incomplete data
• Big data analytics
– special skills and tools needed to store, process and analyse huge amounts of data
[Figure labels: Project 1, Project 2, Project 3]

Challenges
Scientific challenges
• Accessibility of data
– E.g. many apps in Google Play are proprietary and historical information is not accessible
– Focus on open source software
• Reproducibility of results
• Generalisability of results
• Which research methodology, which metrics, which statistical tools, …

Challenges
Practical challenges
• How can we share our big data with other researchers?
– Different formats, different tools, storage problems, …
• How can we make our research results useful to practitioners and development communities?
• How can we build tools and dashboards that integrate our findings?

Challenges
Ethical challenges
• Privacy issues
– Can we use and combine information about actual developers?
– Can we make these results freely available?
• How to reconcile privacy with reproducibility?
[Figure labels: Privacy, Reproducibility]

Technical Challenges
Extracting data from different sources
• Source code and other commits stored in version control repositories
– E.g., Subversion, Git
• Developer mailing lists and user mailing lists
• Bug reports and change requests stored in issue tracking systems
– E.g., Bugzilla, JIRA
• Question and Answer websites
– E.g., StackOverflow

Extracting data from different sources
Using open source MetricsGrimoire tool suite (https://github.com/MetricsGrimoire)
• CVSAnalY
– extracts information from SVN or Git source code repository logs and stores it into relational database
• MailingListStats
– extracts mailing list information from mbox format
• Bicho
– extracts information from issue tracking systems such as Bugzilla and JIRA
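The idea of turning repository logs into a relational database can be sketched in a few lines. This is only an illustration and not how CVSAnalY itself works: the `commits` table layout, the `|`-separated line format and the sample commits are all assumptions made for the example. Lines in that shape could be obtained with `git log --pretty=format:"%H|%an|%ae|%aI"`.

```python
import sqlite3

# Hypothetical sample: one commit per line, "hash|author name|author email|date".
SAMPLE_LOG = """\
a1b2c3|John Doe|john@doe.org|2014-07-30T10:15:00+02:00
d4e5f6|Doe, John|john.doe@gmail.com|2014-08-01T09:00:00+02:00
"""

def load_commits(log_text, connection):
    """Store each log line as one row of an (assumed) 'commits' table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS commits "
        "(hash TEXT PRIMARY KEY, author_name TEXT, author_email TEXT, date TEXT)"
    )
    rows = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # The hash is the first field; email and date are the last two,
        # so a "|" inside the author name does not break the parsing.
        commit_hash, rest = line.split("|", 1)
        name, email, date = rest.rsplit("|", 2)
        rows.append((commit_hash, name, email, date))
    connection.executemany(
        "INSERT OR REPLACE INTO commits VALUES (?, ?, ?, ?)", rows
    )
    connection.commit()
    return len(rows)

def commits_per_email(connection):
    """Count commits per author e-mail address, sorted by address."""
    cursor = connection.execute(
        "SELECT author_email, COUNT(*) FROM commits "
        "GROUP BY author_email ORDER BY author_email"
    )
    return cursor.fetchall()
```

Note that the two sample commits end up under two different e-mail addresses although they plausibly belong to one person, which is exactly the identity merging problem.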

Identity merging
The same contributor may use different aliases
• Euphegenia Doubtfire, euphegenia@hotmail.com
• Robin Williams, robinw@gmail.com

Identity merging
[Figure: contributors and the aliases they use across repositories]
Repositories: source code repository, mailing list, bug tracker
Names and aliases shown: john; John Smith; john <js@gmail.com>; john@doe.org; johnny; "John, Doe"; "Doe, John"; john.doe@gmail.com; john_doe@hotmail.com; jdoe@gmail.com; John W. Doe; Jane

Ordering
– Rajesh Sola | Sola Rajesh
Spelling: misspelling, diacritics, punctuation
– Rene Engelhard | Fene Engelhard
– Démurget | Demurget
– J. A. M. Carneiro | J A M Carneiro
Middle initials, patronyms, nicknames, additional surnames, incomplete names
– Daniel M. Mueth | Daniel Mueth
– Alexander Alexandrov Shopov | Alexander Shopov
– Carlos Garnacho Parro | Carlos Garnacho
– Jacob “Ulysses” Berkman | Jacob Berkman
– A S Alam | Amanpreet Singh Alam
Name variants: transliteration, diminutives
– Γιωργοσ | Georgios
– Mike Gratton | Michael Gratton
Software-specific: usernames, projects, tooling artefacts
– mrhappypants | Aaron Brown
– Arturo Tena/libole2 | Arturo Tena
– (16:06) Alex Roberts | Alex Roberts
Mix
– Any combination of those

Identity merging
id = 17
{ John Doe, Doe John, john@doe.org, john_doe@hotmail.com, john.doe@gmail.com }
Semi-automatic approach:
• eliminate specific quirks observed during extraction
– Example: “(16:06) Alex Roberts”
• compute similarity between each pair of aliases (based on Levenshtein distance)
• cluster together aliases with high similarity
• post-process manually
– rely on external information (websites)
– precise but labor-intensive
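The automatic part of this approach (clean, compare, cluster) could look roughly like the sketch below. Several choices are assumptions made for illustration and do not come from the slides: the regular expression used to strip the timestamp quirk, the rescaling of the distance into a similarity between 0 and 1, the threshold of 0.75, and the use of single-linkage clustering.

```python
import re
from itertools import combinations

def clean(alias):
    """Remove extraction quirks such as a leading "(16:06) " timestamp."""
    alias = re.sub(r"^\(\d{1,2}:\d{2}\)\s*", "", alias)
    return " ".join(alias.lower().split())

def levenshtein(a, b):
    """Minimal number of single-character additions, deletions, replacements."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # add cb
                               previous[j - 1] + (ca != cb)))  # replace
        previous = current
    return previous[-1]

def similarity(a, b):
    """Levenshtein distance rescaled to [0, 1]; 1 means identical (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def merge_identities(aliases, threshold=0.75):
    """Group aliases whose cleaned forms are similar enough (single linkage)."""
    cleaned = [clean(a) for a in aliases]
    parent = list(range(len(aliases)))  # union-find forest over alias indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(aliases)), 2):
        if similarity(cleaned[i], cleaned[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, alias in enumerate(aliases):
        clusters.setdefault(find(i), []).append(alias)
    return list(clusters.values())
```

With such a sketch, "(16:06) Alex Roberts" and "Alex Roberts" end up in one cluster, whereas a username such as "mrhappypants" stays on its own, which is why the manual post-processing step remains necessary.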

Identity merging
Levenshtein distance (1965):
• Computes the minimal distance between 2 strings in terms of single character edits (deletion, addition or replacement)
• Example: lev(“Mike”, “Michael”) = 4
– “Mike” => “Mice” => “Miche” => “Michae” => “Michael”
• Side note
– Damerau-Levenshtein distance also considers transposition of adjacent characters
– Applied in biology for DNA sequence alignment
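Both distances can be computed with the classic dynamic-programming table. The sketch below is a minimal illustration: the function name and the `transpositions` flag are made up for this example, and the transposition variant implemented here is the restricted form (also known as optimal string alignment), which is an assumption about which variant is meant.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance between strings a and b.

    With transpositions=True, swapping two adjacent characters also counts
    as a single edit (restricted Damerau-Levenshtein variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i characters of a
    # and the first j characters of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # add all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # replacement (or match)
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # swap of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For the slide's example, `edit_distance("Mike", "Michael")` gives 4. For a hypothetical typo such as "Jonh" versus "John", the plain distance is 2 while the variant with transpositions gives 1.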

Identity merging
• several merge algorithms exist
• the “noisier” the data, the worse they perform!
• simple algorithms have higher precision and recall than more complex ones
A Comparison of Identity Merge Algorithms for Software Repositories
Mathieu Goeminne, Tom Mens
Institut d’Informatique, Faculté des Sciences, Université de Mons
Abstract
Software repository mining research extracts and analyses data originating from multiple software repositories to understand the historical development of software systems, and to propose better ways to evolve such systems in the future. Of particular interest is the study of the activities and interactions between the persons involved in the software development process. The main challenge with such studies lies in the ability to determine the identities (e.g., logins or e-mail accounts) in software repositories that represent the same physical person. To achieve this, different identity merge algorithms have been proposed in the past. This article provides an objective comparison of identity merge algorithms, including some improvements over existing algorithms. The results are validated on a selection of large ongoing open source software projects.
Keywords: software repository mining, empirical software engineering, identity merging, open source, software evolution, comparison
Science of Computer Programming 78(8), August 2013
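To make "precision and recall" of a merge algorithm concrete, one possible way to compute them is to compare the pairs of aliases that an algorithm merges with the pairs merged in a manually built reference. The pairwise definition and the function names below are assumptions made for illustration and are not necessarily the evaluation procedure used in the article.

```python
from itertools import combinations

def merged_pairs(clusters):
    """All unordered pairs of aliases that a clustering puts together."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def precision_recall(predicted, reference):
    """Pairwise precision and recall of a predicted clustering."""
    predicted_pairs = merged_pairs(predicted)
    reference_pairs = merged_pairs(reference)
    correct = predicted_pairs & reference_pairs
    # Conventions for the empty cases are an assumption of this sketch.
    precision = len(correct) / len(predicted_pairs) if predicted_pairs else 1.0
    recall = len(correct) / len(reference_pairs) if reference_pairs else 1.0
    return precision, recall
```

For example, if the reference groups "John Doe", "Doe John" and "john@doe.org" together and leaves "Jane" alone, an algorithm that merges only "John Doe" with "Doe John" but wrongly merges "john@doe.org" with "Jane" has one correct pair out of two predicted (precision 0.5) and one out of three reference pairs (recall 1/3).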

Identity merging
Alternative automated approach
• Use of Latent Semantic Analysis (LSA)
– equally good as other algorithms in average case
– better performance in worst case
Who’s who in GNOME: using LSA to merge software repository identities
Erik Kouters, Bogdan Vasilescu, Alexander Serebrenik, Mark G. J. van den Brand
Technische Universiteit Eindhoven, Den Dolech 2, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
e.t.m.kouters@student.tue.nl, {b.n.vasilescu, a.serebrenik, m.g.j.v.d.brand}@tue.nl
ICSM 2012 ERA track
Abstract: Understanding an individual’s contribution to an ecosystem often necessitates integrating information from multiple repositories corresponding to different projects within the ecosystem or different kinds of repositories (e.g., mail archives and version control systems). However, recognising that different contributions belong to the same contributor is challenging, since developers may use different aliases. It is known that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. To assess the scale of the problem for a large software ecosystem, we study all GNOME Git repositories, classify the differences in aliases, and discuss robustness of existing algorithms with respect to these types of differences. We then propose a new identity merging algorithm based on Latent Semantic Analysis (LSA), designed to be robust against more types of differences in aliases, and evaluate it empirically by means of cross-validation on GNOME Git authors. Our results show a clear improvement over existing algorithms in terms of precision and recall on worst-case input data.
Keywords: identity merging; Gnome; latent semantic analysis
I. INTRODUCTION
One of the challenges when mining software repositories is identity merging [5]. To study contributors to software projects or software ecosystems, one often tries to integrate information about their contributions in different software repositories, such as version control systems, bug trackers, or mailing lists. However, developers may use different aliases […]
To integrate information about individual contributions we therefore need a unique identity representing the same contributor across different repositories and different projects. To this end, we need to use an identity merging algorithm [1, 3, 5, 8, 9]. However, performance of existing approaches degrades sharply in presence of “noisy” data, i.e., data containing large discrepancies between the aliases used by the same individual: “the more noisy and complex the project data is, the worse the merge algorithms behave” […]. In this paper we concentrate on aliases used by developers in version control systems (VCS); here the term “alias” refers to a ⟨name, email⟩ tuple, typically available in VCS logs. Even for a single repository type such as VCS, the same contributor may use different aliases at different times or in different projects within the ecosystem. Our goal is to design an identity merge algorithm with improved robustness with respect to noisy data, common in ecosystems maintained by large developer communities. We start by extracting commit authorship information from all GNOME Git repositories, and discuss differences in the aliases used by GNOME developers in Section II. Next, we evaluate robustness of two state of the art identity merging algorithms with respect to types of differences in aliases in Section […]. Based on lessons learned from existing approaches, we propose a new identity merging algorithm using Latent Semantic Analysis (LSA) [6] in Section IV, and evaluate it empirically by means of cross-validation in Section […]. Our results show equally-good performance as the state […]
Excerpt from the same paper:
[…] parameters, we first performed a sensitivity analysis by fixing 3 and varying the remaining. After the sensitivity analysis we restricted the range of minLen to {2, 3, 4}, levThr to {0.5, 0.75}, cosThr to {0.65, 0.70, 0.75}, and k was fixed to half of the number of terms. In the average case, for each of the ten repetitions, training was performed on one tenth of the GNOME aliases (≈ 860), and testing on ten random subsets with the same size from the remaining aliases. Samples were chosen instead of the entire remaining data for computational efficiency reasons. In the worst case, because of fewer aliases in the dataset (673), for each of the ten repetitions, training was performed on one third of the data and testing on the other two thirds. All algorithms, as well as the data, can be made available upon request.

Research challenges
Accessibility
Focus on open-source software
• Free access to source code, defect data, developer and user communication
• Historical data available in open repositories
– Observable communities
– Observable activities
• Increasing popularity for personal and commercial use
• A huge range of community and software sizes
