How Clones are Maintained
2007 - 2017
Luigi Cerulo
Max Di Penta
Lerina Aversano
University of Sannio
Chapter 1 - How everything started
Chapter 2 - The follow-up
Chapter 3 - The impact
Chapter 4 - Take-aways
Chapter Zero
Prologue
… SE Prophets envisioned a
new future
Clone genealogies
(ESEC/FSE 2005)
Figure 1: The relationship among evolution patterns (Same, Shift, Add, Subtract, Consistent Change, Inconsistent Change)
… traces code clones in consecutive versions using a metric-based clone detector and classifies clones into four categories: new clones, modified clones, never modified clones, and deleted clones. Their analysis does not address how elements in a group of code clones change with respect to other elements in the group. To the best of our knowledge, our clone genealogy extractor (detailed in Section 4) is the first tool that systematically analyzes clone evolution patterns by monitoring how a clone group evolves.
Change coupling and clones

(FASE 2006)
Relation of Code Clones and Change Couplings
Fig. 2. Description of the metrics used in the visualization (Number of Couplings, Clone Coverage, Coupling Coverage, Length of Clone).
“Cloning considered harmful” considered harmful
(WCRE 2006)
Somebody was analyzing
source code line trails (ldiff)…
MSR
2007
Track the lifetime of software
entities
… ldiff’s ability to identify moved line blocks and thus its ability to track a software entity when its position in a file changes. To this end, we randomly generated new releases of 100 source code files selected from two open source projects (PostgreSQL and OpenSSH) by randomly moving code fragments within the source code file. The fragments varied from 1 line to a maximum of 1/10 of the total number of lines. We assessed the algorithm in terms of precision and recall:
precision = number of correctly detected moves / …
… extracted change sets from the ArgoUML Concurrent Versions System (CVS) repository, representing different types of changes, such as bug fixing, refactoring, or enhancement. We assessed the tool’s precision by manually identifying false positives in classifications the algorithm made. The 11 change sets affected from 11 to 72 files (median 19) and from 32 to 401 lines (median 42). Figure 3b shows the median ldiff and Unix diff accuracy and the interquartile range (between the third and first quartile). (For the ldiff syntax, see the “Ldiff: A Support Tool” sidebar.)
[Figure: entity tracking over snapshots extracted from a Concurrent Versions System/Subversion archive. LDA(1,2), LDA(2,3), …, LDA(n – 1, n) link each pair of consecutive snapshots (Snapshot 1 … Snapshot n), and lines are labeled ADD, CHG, or DEL. Timeline events: Entity A added, Entity B changed, Entity A changed, Entity B deleted.]
Example: three revisions of foo
/*
 * foo (revision 1.3)
 */
int foo(float a, int b) {
  return a;
}
// foo (revision 1.4)
float foo(int a, int b) {
  if (b!=0)
    return (float)a/b;
  else
    return 0;
}
// foo (revision 1.5)
float foo(int a, int b) {
  int c=0;
  if (b!=0)
    return (float)a/b;
  return c;
}
IEEE Software 26(1), 2009
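The ADD/CHG/DEL labeling between two revisions can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's standard difflib stands in for ldiff, so moved blocks are not detected, and the function name label_line_changes is hypothetical. The input is revisions 1.4 and 1.5 of foo from the figure.

```python
import difflib


def label_line_changes(old_lines, new_lines):
    """Label differences between two revisions as ADD, CHG, or DEL.

    Assumption: plain difflib matching is used in place of ldiff, so a
    moved block shows up as a DEL plus an ADD instead of being tracked.
    """
    labels = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            labels.append(("CHG", old_lines[i1:i2], new_lines[j1:j2]))
        elif tag == "delete":
            labels.append(("DEL", old_lines[i1:i2], []))
        elif tag == "insert":
            labels.append(("ADD", [], new_lines[j1:j2]))
        # "equal" runs are unchanged lines and get no label
    return labels


rev_1_4 = [
    "float foo(int a, int b) {",
    "if (b!=0)",
    "return (float)a/b;",
    "else",
    "return 0;",
    "}",
]
rev_1_5 = [
    "float foo(int a, int b) {",
    "int c=0;",
    "if (b!=0)",
    "return (float)a/b;",
    "return c;",
    "}",
]

changes = label_line_changes(rev_1_4, rev_1_5)
for label, old, new in changes:
    print(label, old, "->", new)
```

Chaining such labels over consecutive snapshots is what lets an entity be followed through its lifetime; the real tool additionally recognizes lines that only changed position.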
Somebody else
used to study clone evolution
Nice surprise! We got a
grant on software evolution
Ok… that was
not so much money…
Chapter One
How Everything Started
What we wanted to
study…
Are software clones
devils?
To what extent can they
be regarded as (bad/
good?) software
engineering practices?
Measure how clones
are maintained
Tracking clone changes
Clone class A
Clone class B
Snap 1 Snap 2 Snap 3 Snap 4 Snap 5 Snap 6 Sn
Consistent change
Late propagation
Independent evolution
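The three patterns can be told apart with a small sketch like the one below. It is a simplification under loud assumptions, not the classification used in the study: each fragment is given as its (normalized) text at every snapshot, "still clones" is approximated by text equality, and the name classify_pair is hypothetical.

```python
def classify_pair(history_a, history_b):
    """Classify how a pair of cloned fragments evolved across snapshots.

    history_a, history_b: text of each fragment at snapshot 1, 2, ..., n.
    Assumption: the fragments start out identical, and two fragments
    count as clones of each other only while their text is equal.
    """
    changed = any(
        cur != prev
        for hist in (history_a, history_b)
        for prev, cur in zip(hist, hist[1:])
    )
    if not changed:
        return "unchanged"

    in_sync = [a == b for a, b in zip(history_a, history_b)]
    if all(in_sync):
        # Every change hit both fragments in the same snapshot.
        return "consistent change"
    if in_sync[-1]:
        # The fragments diverged for a while, then the change caught up.
        return "late propagation"
    return "independent evolution"


consistent = classify_pair(["x", "x", "y", "y"], ["x", "x", "y", "y"])
late = classify_pair(["x", "y", "y", "y"], ["x", "x", "x", "y"])
independent = classify_pair(["x", "y", "y", "z"], ["x", "x", "w", "w"])
print(consistent, late, independent, sep=" | ")
```

Real clones are rarely textually identical, so a practical version would replace the equality test with the clone detector's own notion of similarity.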
The Work
Only two projects
One clone detector
Automated clone tracking
Manual classification
Some findings
Late propagation often due to different
schedules; it caused bugs only in a few cases
Class-level clones mostly changed consistently.
Not the case for method- and block-level clones
Between 13% and 16% of late propagation
13%-32% of independent evolution
We got the Paper!
How Clones are Maintained: An Empirical Study
Lerina Aversano, Luigi Cerulo, Massimiliano Di Penta
RCOST — Research Centre on Software Technology
Department of Engineering - University of Sannio
Viale Traiano - 82100 Benevento, Italy
{aversano, lcerulo, dipenta}@unisannio.it
Abstract
Despite the conventional wisdom concerning the risks related to the use of source code cloning as a software development strategy, several studies appeared in literature indicated that this is not true. In most cases clones are properly maintained and, when this does not happen, is because cloned code evolves independently.
Stemming from previous works, this paper combines clone detection and co–change analysis to investigate how clones are maintained when an evolution activity or a bug fixing impact a source code fragment belonging to a clone class. The two case studies reported confirm that, either for bug fixing or for evolution purposes, most of the cloned code is consistently maintained during the same co–change or during temporally close co–changes.
Keywords: Clone detection, software evolution, mining software repositories
1. Introduction
Several recent studies contradict the common wisdom that cloning constitutes a risky practice: as found by Kim et al. [16]. As shown in a paper by Kasper and Godfrey [15], source code clones are not necessarily to be considered harmful but, many times, as a way to develop software creating, for example, new features starting for existing, similar ones. Whilst this creates duplications, it also permits the use of stable, already tested and used code.
This paper aims to report results from an empirical study aiming to investigate how clones, detected in a given release of a software system, are affected by maintenance intervention. The analysis is performed by intersecting cloned classes with data from Modification Transactions (MTs) mined from source code repositories. A MT identifies groups of source code lines co-changed in the same time window. The work is built upon the idea of clone patterns described by Kasper and Godfrey and of clone evolution patterns described by Kim et al., and investigates whether clones (i) are updated consistently during the same MT or near MTs, confirming the correlation between MTs and clones, as experienced by Geiger et al. [10]; (ii) evolve independently; or (iii) are subject to up…
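The idea of a Modification Transaction, changes that fall in the same time window, can be sketched as below. The excerpt does not spell out the grouping rule, so everything specific here is an assumption made for illustration: the same-author criterion, the 200-second window, the Change record, and the function name group_into_transactions.

```python
from collections import namedtuple

# Hypothetical record for one file-level change mined from a repository.
Change = namedtuple("Change", "author timestamp path")


def group_into_transactions(changes, window_seconds=200):
    """Group changes into Modification Transactions (MTs).

    Assumption: a change joins an author's latest MT when it follows the
    previous change of that MT by at most window_seconds; otherwise it
    opens a new MT.
    """
    transactions = []
    latest_mt = {}  # author -> most recent MT opened by that author
    for change in sorted(changes, key=lambda c: c.timestamp):
        current = latest_mt.get(change.author)
        if current and change.timestamp - current[-1].timestamp <= window_seconds:
            current.append(change)
        else:
            current = [change]
            transactions.append(current)
            latest_mt[change.author] = current
    return transactions


log = [
    Change("dev1", 0, "A.java"),
    Change("dev1", 50, "B.java"),
    Change("dev2", 60, "C.java"),
    Change("dev1", 1000, "A.java"),
]
mts = group_into_transactions(log)
mt_paths = [[c.path for c in mt] for mt in mts]
print(mt_paths)
```

Intersecting each MT with the line ranges of a clone class then tells whether the fragments of that class changed in the same MT, in a nearby one, or not at all.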
Submit where?
WCRE?
Sorry! I’m WCRE
PC co-chair
Let’s try CSMR,
it is in
Amsterdam!
We got
accepted!
Amsterdam, we’re coming
From: Massimiliano Di Penta <dipenta@unisannio.it>
Subject: [Fwd: CSMR 2007 Notification]
Date: 30 Nov 2006 15:28:59 CET
To: Lerina Aversano <aversano@unisannio.it>, "Luigi Cerulo"
<lcerulo@unisannio.it>
Great... here are the reviews... I actually can’t tell which of the first and the third is the more
negative (maybe the first)
The first reviewer’s criticism is fair on the whole, in the sense that it considers the
work good even though many things were already known (as with Godfrey’s paper,
which despite an A had received some similar comments at
WCRE) and this is yet another study... (maybe with some more level of detail)...
to be explained better in the camera-ready copy
…
Look at this: if people had to follow this rule, nothing would ever
get published, not even in TSE...!!
General advice: Please submit your paper to a workshop to discuss the setup
of your experiments. A submission for a conference should analyse more (>=
10) thoroughly selected software systems. As you suggest, your clone
detection tool is very conservative, and you should perform the analyses
with several different tools. Only then, your claim would be sufficiently
supported.
….
Ciao
Max
Amsterdam
The Conference
The talk
Chapter Two
The follow-up
We need to do much
better… the
classification is not fully
automated yet
Folks, one reviewer was upset!
We also need to enlarge the
study. More systems, …
more…
It would be great to get
a student to help us on
the project
A young student wrote to us asking to
spend a few months in our lab…
Suresh
Thummalapenta
at the time a PhD student at NCSU
with Tao Xie
now with Microsoft Research
This is great!
Let’s ask Suresh to join
forces with us on this project
Fine-level automated
tracking approach
1. Identification of clone section pairs evolution
2. Identification of clone fragment pairs evolution
3. Identification of clone class evolution
[Diagram: a clone class with clone fragments CF1, CF2, CF3 and clone sections CS1, CS2; the pairs (1,2), (2,3), (1,3) are labeled LP or CO]
The Study
Four projects, C and Java
Both token-based and AST-based detectors
Relation of clone evolution patterns with
• Clone granularity
• Clone radius
• Defect-proneness
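Evolution Patterns
[Bar chart: share of clone evolution patterns (Consistent, Indep. Evolution, Late Propagation, Unknown) in ArgoUML, JBoss, OpenSSH, and PostgreSQL; y-axis from 0% to 80%]
Bar labels, in extraction order: 0%, 0%, 3%, 4%, 16%, 4%, 5%, 7%, 39%, 24%, 52%, 34%, 38%, 71%, 40%, 55%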
Late Propagation
Two PostgreSQL functions containing clones
The first underwent a bug fix
The second changed six months later:

“...I had previously fixed the identical bug in
oper_select_candidate, but didn't realize that the
same error was repeated over here...”
Independent Evolution
ArgoUML Classes GeneratorJava and
GeneratorDisplay containing cloned
methods
GeneratorDisplay starts to implement
enhanced visualization features
After that, both change independently
(no more clones)
Other Findings
Clone radius and granularity do not influence
evolution patterns
Late propagation more correlated to defects
than other evolution patterns
The EMSE Paper
Empir Software Eng (2010) 15:1–34
DOI 10.1007/s10664-009-9108-x
An empirical study on the maintenance
of source code clones
Suresh Thummalapenta · Luigi Cerulo ·
Lerina Aversano · Massimiliano Di Penta
Published online: 25 March 2009
© Springer Science + Business Media, LLC 2009
Editor: Murray Wood
Abstract Code cloning has been very often indicated as a bad software development
practice. However, many studies appearing in the literature indicate that this is not
always the case. In fact, either changes occurring in cloned code are consistently
propagated, or cloning is used as a sort of templating strategy, where cloned
source code fragments evolve independently. This paper (a) proposes an automatic
approach to classify the evolution of source code clone fragments, and (b) reports
a fine-grained analysis of clone evolution in four different Java and C software
systems, aimed at investigating to what extent clones are consistently propagated or
they evolve independently. Also, the paper investigates the relationship between the
presence of clone evolution patterns and other characteristics such as clone radius,
clone size and the kind of change the clones underwent, i.e., corrective maintenance
or enhancement.
Keywords Software clones · Software maintenance · Mining software repositories ·
Clone evolution
Chapter Three:
The Impact
People
Topics
Late Propagation
Clone changes
Clones and bugs
Tracking Entities
Tracking Design Patterns
An Empirical Study on the Evolution of Design Patterns
Lerina Aversano, Gerardo Canfora, Luigi Cerulo, Concettina Del Grosso, Massimiliano Di Penta
RCOST – Research Centre on Software Technology, University of Sannio
Via Traiano, 82100 Benevento, Italy
aversano@unisannio.it, canfora@unisannio.it, lcerulo@unisannio.it, tina.delgrosso@unisannio.it, dipenta@unisannio.it
ABSTRACT
Design patterns are solutions to recurring design problems, conceived to increase benefits in terms of reuse, code quality and, above all, maintainability and resilience to changes.
This paper presents results from an empirical study aimed at understanding the evolution of design patterns in three open source systems, namely JHotDraw, ArgoUML, and Eclipse-JDT. Specifically, the study analyzes how frequently patterns are modified, to what changes they undergo and what classes co-change with the patterns. Results show how patterns more suited to support the application purpose tend to change more frequently, and that different kind of changes have a different impact on co-changed classes and a different capability of making the system resilient to changes.
Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools And Techniques - Object-oriented design methods
General Terms
Design, Experimentation, Measurement
Keywords
Design patterns, Software Evolution, Mining Software Repositories, Empirical Software Engineering
1. INTRODUCTION
… some aspect of system structure vary independently of other aspects, thereby making a system more robust to a particular kind of change”. Advantages of design patterns include decoupling a request from specific operations (Chain of Responsibility and Command), making a system independent from software and hardware platforms (Abstract Factory and Bridge), independent from algorithmic solutions (Iterator, Strategy, Visitor), or avoid modifying implementations (Adapter, Decorator, Visitor). Further discussion on design pattern advantages, and extensive pattern catalogues can be found in books such as [11] or [9].
While many benefits related to the use of design patterns have been stated, a little has been done to empirically investigate pattern change proneness [3] or whether there is a relationships between the presence of defects in the source code and the use of design patterns [24]. In particular, there is lack of empirical studies aimed at analyzing what kind of changes each type of pattern undergoes during software evolution, and whether such a change can be related to changes contextually made on other classes not belonging to the pattern. The availability of source repositories for many object-oriented open source systems realized making use of design patterns, of techniques for identifying change sets [10] (i.e., sets of artifacts changed together by the same author) from source code repositories, and of design pattern detection techniques and tools [1, 8, 15, 19, 23], triggers opportunities for this kind of studies.
This paper reports and discusses results from an empirical study aimed at analyzing how design patterns change during a software system lifetime, and to what extent such changes cause modifications to other classes not part of the …
Tracking Design
Pattern Evolution
JHotDraw
• Patterns: Observer, Composite
• Used for: Model View Controller of draws; handling composite figures
• Purpose of change: adding new draw elements
ArgoUML
• Patterns: Adapter-Command, Decorator, Factory
• Used for: adapting/decorating UML objects to different views; executing menu actions
• Purpose of change: adding new menu actions and presentations
Eclipse-JDT
• Patterns: Visitor
• Used for: visiting the Java AST
• Purpose of change: adding new code analyses
Patterns with More
Co-Changed Code
[Bar chart, Eclipse-JDT: number of lines added/removed in co-changed classes (axis from 0 to 16000) per pattern: Visitor, Template, State-Strategy, Singleton, Prototype, Observer, Factory, Decorator, Composite, Adapter-Command]
Tracking Vulnerabilities
The life and death of statically detected vulnerabilities: An empirical study
Massimiliano Di Penta (a,*), Luigi Cerulo (b), Lerina Aversano (a)
(a) Dept. of Engineering, University of Sannio, Via Traiano, 82100 Benevento, Italy
(b) Dept. of Biological and Environmental Studies, University of Sannio, Via Port’Arsa, 11 – 82100 Benevento, Italy
Information and Software Technology xxx (2009) xxx–xxx
Keywords: Software vulnerabilities, Mining software repositories, Empirical study
Abstract
Vulnerable statements constitute a major problem for developers and maintainers of networking systems. Their presence can ease the success of security attacks, aimed at gaining unauthorized access to data and functionality, or at causing system crashes and data loss. Examples of attacks caused by source code vulnerabilities are buffer overflows, command injections, and cross-site scripting.
This paper reports on an empirical study, conducted across three networking systems, aimed at observing the evolution and decay of vulnerabilities detected by three freely available static analysis tools. In particular, the study compares the decay of different kinds of vulnerabilities, characterizes the decay likelihood through probability density functions, and reports a quantitative and qualitative analysis of the reasons for vulnerability removals. The study is performed by using a framework that traces the evolution of source code fragments across subsequent commits.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Vulnerable instructions are, very often, the cause of serious problems such as security attacks, system failures or crashes. In his Ph.D. thesis [1] Krsul defined a software vulnerability as “an instance of an error in the specification, development, or configuration of software such that its execution can violate the security policy”. For business-critical systems, the presence of vulnerable instructions in the source code is often the cause of security attacks or, in other cases, of system failures or crashes. The problem is particularly rel…
Detecting the presence of such instructions is therefore crucial to ensure high security and reliability. Indeed, security advisories are regularly published – see for example those of Linux distributions, Microsoft, those published by CERT, or by securityfocus. These advisories, however, are posted when a problem already occurred in the application, a problem that was very often caused by the introduction in the source code of vulnerable statements. This highlights the needs to identify potential problems when they are introduced, and to keep track of them during the software system lifetime, as it is done, for example for source code clones [2].
Vulnerability Decay
Buffer Overflows
Memory Problems
Code Siblings and Licensing
Code siblings: technical and legal implications of copying code between applications
Daniel M. German†, Massimiliano Di Penta‡, Yann-Gaël Guéhéneuc⋆, and Giuliano Antoniol⋆
† University of Victoria, Victoria, BC, Canada
‡ RCOST–University of Sannio, Benevento, Italy
⋆ PTIDEJ Team–SOCCER Lab., DGIGL, École Polytechnique de Montréal, QC, Canada
dmg@uvic.ca, dipenta@unisannio.it, yann-gael.gueheneuc@polymtl.ca, antoniol@ieee.org
Abstract
Source code cloning does not happen within a single system only. It can also occur between one system and another. We use the term code sibling to refer to a code clone that evolves in a different system than the code from which it originates. Code siblings can only occur when the source code copyright owner allows it and when the conditions imposed by such license are not incompatible with the license of the destination system. In some situations copying of source code fragments are allowed (legally) in one direction, but not in the other.
In this paper, we use clone detection, license mining and classification, and change history techniques to understand how code siblings (under different licenses) flow in one direction or the other between Linux and two BSD Unixes, …
… different operating systems and environments. In all cases, cross-system clones are introduced.
Usually, source code is distributed according to the terms of a software license. Once the developer chooses to distribute her work with a particular license, she explicitly imposes limits on what can be done with the code: if and how it can be used, modified, copied, distributed, and extended.
Software licenses may prevent or favor the migration of code fragments in one or the other direction, or both. Once having migrated, code fragments evolve constrained by the new environment. In the following, we use the term sibling to refer to a fragment of code that has been cloned from one file in one system to another file in a different system. In some cases, a sibling may span an entire file.
Then, we propose an analysis process to identify siblings …
Code Siblings and Licensing
[Diagram: cloned fragments in FreeBSD and in Linux, linked as siblings, with an arrow for the migration direction]
Preferential migration from OSes with a
permissive license (FreeBSD, OpenBSD)
towards Linux (mainly GPL)
Migration From
Third-Party Code
commit a9474917099e007c0f51d5474394b5890111614f
Author: Sean Hefty <sean.hefty@intel.com>
Date: Mon Jul 14 23:48:43 2008 -0700
RDMA: Fix license text
The license text for several files references a third software license
that was inadvertently copied in. Update the license to what was
intended. This update was based on a request from HP. [..]
Blame-based tracking
Distinguishing Copies from Originals in Software Clones
Jens Krinke, Nicolas Gold, Yue Jia
King’s College London, Centre for Research on Evolution, Search and Testing (CREST)
{jens.krinke,nicolas.gold,yue.jia}@kcl.ac.uk
David Binkley
Loyola University Maryland, Baltimore, MD, USA
binkley@cs.loyola.edu
ABSTRACT
Cloning is widespread in today’s systems where automated assistance is required to locate cloned code. Although the evolution of clones has been studied for many years, no attempt has been made so far to automatically distinguish the original source code leading to cloned copies. This paper presents an approach to classify the clones of a clone pair based on the version information available in version control systems. This automatic classification attempts to distinguish the original from the copy. It allows for the fact that the clones may be modified and thus consist of lines coming from different versions. An evaluation, based on two case studies, shows that when comments are ignored and a small tolerance is accepted, for the majority of clone pairs the proposed approach can automatically distinguish between the original and the copy.
Categories and Subject Descriptors
D.2.9 [Software Engineering]: Management—Software config-
uration management; D.2.13 [Software Engineering]: Reusable
Software—Reusable libraries
General Terms
Algorithms
Keywords
Clone detection, mining software archives, software evolution
1. INTRODUCTION
The duplication of code is a common practice to make software
existing code. However, such practices can complicate software
maintenance so it has been suggested that too much cloned code is
a risk, albeit the practice itself is not generally harmful [16]. Be-
cause of these problems, many approaches to detecting cloned code
have been developed [2, 3, 8, 15, 18–20, 24, 26]. While methods to
identify clones automatically and efficiently are to some extent un-
derstood, it is still disputable whether the presence of clones is a
risk. To better understand why and how code is cloned, recent em-
pirical studies of cloned code have focused mainly on examining
the evolution of clones, such as whether cloned code is more stable
or changed consistently [1,10,12,17,21,22,27].
A lot of research has been done on finding and identifying soft-
ware clones, but without additional information it is impossible to
distinguish the original from the copy. Most of the above men-
tioned previous empirical studies used version control systems to
extract limited information about the discovered clones; for exam-
ple, when a clone appears in some previous version. However, so
far there has been no general approach proposed to distinguish orig-
inals from copies except for a study done by German et al. [11] who
tracked when clones appeared in the version history to identify the
clone of a pair that appeared first. This paper presents an approach
that uses line-by-line version information available from version
control systems to distinguish the original from the copied code
clone in a clone pair.
Most version control systems have a ‘blame’ command which
shows author and version information for each line in a file. This
information, which includes the version when the line was added or
last modified, can be used as a line age: if all lines in one clone have
older versions than the lines in the other clone of a clone pair, then
the clone with the older lines may be the original and the other may
be the copy (assuming that the clone with the oldest lines existed
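The line-age idea in the excerpt above can be illustrated with a minimal sketch. It assumes that each clone has already been reduced to the list of revision numbers a blame command reports for its lines; the function name `classify_pair` and the integer revision encoding are hypothetical and not taken from the paper. The sketch applies only the strict rule quoted above (all lines of one clone older than all lines of the other) and does not model the comment filtering or tolerance mentioned in the abstract.

```python
from typing import Sequence


def classify_pair(revs_a: Sequence[int], revs_b: Sequence[int]) -> str:
    """Guess which clone of a pair is the original from per-line revisions.

    revs_a / revs_b hold, for every line of clone A / clone B, the revision
    in which that line was added or last modified (smaller means older).
    Returns "a" or "b" for the likely original, or "unknown" when the
    line ages overlap and the strict rule cannot decide.
    """
    if not revs_a or not revs_b:
        return "unknown"
    # Every line of A is older than every line of B: A may be the original.
    if max(revs_a) < min(revs_b):
        return "a"
    # Symmetric case: every line of B is older than every line of A.
    if max(revs_b) < min(revs_a):
        return "b"
    # Interleaved ages, e.g. both clones were modified after the copy.
    return "unknown"


if __name__ == "__main__":
    print(classify_pair([3, 5, 4], [9, 12, 9]))   # clone A has the older lines
    print(classify_pair([3, 15, 4], [9, 12, 9]))  # ages overlap
```

The "unknown" outcome corresponds to the situation the excerpt warns about, where clones consist of lines coming from different versions.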
Cloning and Copying between GNOME Projects
Jens Krinke, Nicolas Gold, Yue Jia
King’s College London,
Centre for Research on Evolution, Search and Testing (CREST)
{jens.krinke,nicolas.gold,yue.jia}@kcl.ac.uk
David Binkley
Loyola University Maryland,
Baltimore, MD, USA
binkley@cs.loyola.edu
Abstract—This paper presents an approach to automatically
distinguish the copied clone from the original in a pair of clones.
It matches the line-by-line version information of a clone to the
pair’s other clone. A case study on the GNOME Desktop Suite
revealed a complex flow of reused code between the different
subprojects. In particular, it showed that the majority of larger
clones (with a minimal size of 28 lines or higher) exist between
the subprojects and more than 60% of the clone pairs can be
automatically separated into original and copy.
I. INTRODUCTION
The duplication of code is a common practice to make
software development faster, to enable “experimental” devel-
is most likely the original and the other the copy. However,
usually, it is not that simple because the original and the copy
may have been modified in turn after the copy was created.
This paper makes the following contributions:
• It extends previous work [19] to automatically distinguish
between copy and original by allowing the clones of a
clone pair to be in different systems.
• A case study on the GNOME Desktop Suite subprojects
shows that the majority of larger clones (with a minimal
size of 28 lines or higher) exist between the subprojects
and more than 60% of the clone pairs can be automatically
separated into original and copied
Smell Evolution
When and Why Your Code Starts to Smell Bad
(and Whether the Smells Go Away)
Michele Tufano1, Fabio Palomba2, Gabriele Bavota3
Rocco Oliveto4, Massimiliano Di Penta5, Andrea De Lucia2, Denys Poshyvanyk1
1The College of William and Mary, Williamsburg, VA, USA 2University of Salerno, Fisciano (SA), Italy,
3Università della Svizzera italiana (USI), Switzerland, 4University of Molise, Pesche (IS), Italy,
5University of Sannio, Benevento (BN), Italy
mtufano@email.wm.edu, fpalomba@unisa.it, gabriele.bavota@usi.ch
rocco.oliveto@unimol.it, dipenta@unisannio.it, adelucia@unisa.it, denys@cs.wm.edu
Abstract—Technical debt is a metaphor introduced by Cunningham to indicate “not quite right code which we postpone making it right”.
One noticeable symptom of technical debt is represented by code smells, defined as symptoms of poor design and implementation
choices. Previous studies showed the negative impact of code smells on the comprehensibility and maintainability of code. While
the repercussions of smells on code quality have been empirically assessed, there is still only anecdotal evidence on when and
why bad smells are introduced, what is their survivability, and how they are removed by developers. To empirically corroborate such
anecdotal evidence, we conducted a large empirical study over the change history of 200 open source projects. This study required the
development of a strategy to identify smell-introducing commits, the mining of over half a million commits, and the manual analysis
and classification of over 10K of them. Our findings mostly contradict common wisdom, showing that most of the smell instances are
introduced when an artifact is created and not as a result of its evolution. At the same time, 80% of smells survive in the system. Also,
among the 20% of removed instances, only 9% are removed as a direct consequence of refactoring operations.
Index Terms—Code Smells, Empirical Study, Mining Software Repositories
1 INTRODUCTION
The technical debt metaphor introduced by Cunningham [23]
explains well the trade-offs between delivering
the most appropriate but still immature product,
removed [14]. This represents an obstacle for an effec-
tive and efficient management of technical debt. Also,
understanding the typical life-cycle of code smells and
the actions undertaken by developers to remove them
is of paramount importance in the conception of recom-
Smell-introducing Commits
[Figure: metric value (axis 100 to 500) plotted over commits c1 to c8]
When Are Smells Introduced
[Figure: commits required for a class to become smelly; axis 0, 25, 50, 75, 100]
Generally, blobs affect a
class since its creation
There are several cases in which a blob is
introduced during maintenance activities
Why are smells introduced?
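[Figure: goals of smell-introducing commits per smell type; axis 0, 25, 50, 75, 100]
Smell types: Blob (BLOB), Class Data Should Be Private (CDSBP), Complex Class (CC), Functional Decomposition (FD), Spaghetti Code (SC)
Commit goals: Bug Fixing (BF), Enhancement (E), New Feature (NF), Refactoring (R)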
Smell Removal
[Figure: bar chart of smell-removal operations (Code Removal, Code Replacement, Code Insertion, Refactoring, Major Restructuring); axis 0% to 40%; bar values 4%, 9%, 15%, 33%, 40%]
Clone changes
Clones and bugs
Tracking Entities
Late Propagation
Late Propagation in Software Clones
Liliane Barbour, Foutse Khomh, Ying Zou
Department of Electrical and Computer Engineering
Queen’s University
Kingston, ON
{l.barbour, foutse.khomh, ying.zou}@queensu.ca
Abstract—Two similar code segments, or clones, form a clone
pair within a software system. The changes to the clones over
time create a clone evolution history. In this work we study
late propagation, a specific pattern of clone evolution. In late
propagation, one clone in the clone pair is modified, causing
the clone pair to become inconsistent. The code segments
are then re-synchronized in a later revision. Existing work
has established late propagation as a clone evolution pattern,
and suggests that the pattern is related to a high number
of faults. In this study we examine the characteristics of
late propagation in two long-lived software systems using the
Simian and CCFinder clone detection tools. We define 8 types
of late propagation and compare them to other forms of clone
evolution. Our results not only verify that late propagation
is more harmful to software systems, but also establish that
some specific cases of late propagations are more harmful than
others. Specifically, two cases are most risky: (1) when a clone
experiences inconsistent changes and then a re-synchronizing
change without any modification to the other clone in a
clone pair; and (2) when two clones undergo an inconsistent
modification followed by a consistent change that modifies both
the clones in a clone pair.
Keywords-clone genealogies; late propagation; fault-
proneness.
I. INTRODUCTION
A code segment is labeled as a code clone if it is identical
or highly similar to another code segment. Similar code
segments form a clone pair. Clone pairs can be introduced
into systems deliberately (e.g., “copy and paste” actions)
or inadvertently by a developer during development and
the new context. For example, if a driver is required for a
new printer model, a developer could copy the driver code
from an older printer model and then modify it. Inconsistent
changes can also occur accidentally. A developer may be
unaware of a clone pair, and cause an inconsistency by only
changing one half of the clone pair. This inconsistency could
cause a software fault. If a fault is found in one clone and
fixed, but not propagated to the other clone in the clone pair,
the fault remains in the system. For example, a fault might
be found in the old printer driver code and fixed, but the fix
is not propagated to the new printer driver. For these reasons,
previous studies [1] have argued that accidental inconsistent
changes make code clones more prone to faults.
Late propagation occurs when a clone pair undergoes
one or more inconsistent changes followed by a re-synchronizing
change [2]. The re-synchronization of the
code clones indicates that the gap in consistency is acci-
dental. Since accidental inconsistencies are considered risky
[3], the presence of late propagation in clone genealogies
can be an indicator of risky, fault-prone code.
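The definition above can be made concrete with a small sketch. It assumes the history of one clone pair has already been reduced to a list of booleans, one per revision, saying whether the two clones are in sync at that revision; this encoding and the name `find_late_propagations` are illustrative assumptions, not the tooling used in the study. An episode is reported only when a consistent pair diverges and is later re-synchronized.

```python
from typing import List, Sequence, Tuple


def find_late_propagations(in_sync: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (diverged_at, resynced_at) revision indices for each episode.

    in_sync[i] is True when both clones of the pair are consistent at
    revision i. An episode starts when a consistent pair becomes
    inconsistent and ends at the first later revision where it is
    consistent again. Divergence that is never repaired is not reported.
    """
    episodes: List[Tuple[int, int]] = []
    diverged_at = None
    for i in range(1, len(in_sync)):
        if in_sync[i - 1] and not in_sync[i]:
            # Inconsistent change: only one clone was modified.
            diverged_at = i
        elif not in_sync[i - 1] and in_sync[i] and diverged_at is not None:
            # Re-synchronizing change closes the episode.
            episodes.append((diverged_at, i))
            diverged_at = None
    return episodes


if __name__ == "__main__":
    history = [True, False, False, True, True, False]
    print(find_late_propagations(history))
```

The difference between the two indices of an episode gives the delay, in revisions, between the inconsistent change and the re-synchronizing change.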
Many studies have been performed on the evolution of
clones. A few (e.g., [2], [3]) have studied late propagation,
and indicated that late propagation genealogies are more
fault-prone than other clone genealogies. Thummalapenta et
al. began the initial work in examining the characteristics of
late propagation. The authors measured the delay between
an inconsistent change and a re-synchronizing change and
related the delay to software faults. In our work, we examine
More Detailed Genealogy
Propagation always occurs
Propagation may not occur
Propagation never occurs
Breakdown
[Figure: percentage of all LP occurrences (axis 0% to 80%) per LP type, LP1 to LP8; series: ArgoUML - Simian, ArgoUML - CCFinder, Ant - Simian, Ant - CCFinder; LP types grouped under "May not occur" and "Never occurs"]
Faults by LP Type
[Figure: percentage of fault occurrences (axis 0% to 80%) per LP type, LP1 to LP8; series: Ant - Simian, ArgoUML - CCFinder, Ant - CCFinder]
LP In Type-3 Clones
Late propagation of Type-3 Clones
Saman Bazrafshan
Universität Bremen
saman.bazrafshan@informatik.uni-bremen.de
Abstract
Type-3 clones are duplicated source code fragments
that span two or more identical sequences of tokens
(whitespace and comments are ignored) that form a
contiguous source code fragment interrupted by non-
identical token sequences. Several studies on the evo-
lution of code clones have been conducted to detect
patterns that can help to manage clones [3,6]. One of
those patterns that is assumed to be of special inter-
est is late propagation [1,2,4]. In this paper, ways of
detecting late propagation in the evolution of type-3
clones are proposed and discussed.
1 Introduction
During the last years, different studies focused on detecting
clone patterns that are considered to have
a negative impact on code quality and therefore on
maintainability of software. Missing or inconsistent
propagation of changes to clones is identified as one
pattern that may introduce new defects or prevent the
removal of existing ones. To find these clone patterns
and enable clone management, a series of tools have
been introduced—including clone detectors and clone
genealogy extractors. Clones reported by a clone de-
tector are generally distinguished according to their
level of similarity. Clones that are identical except for
comments and whitespaces are called type-1 clones.
Type-2 clones extend type-1 clones by tolerating dif-
intentionally changed inconsistently [1,2,4].
2 Late Propagation of Near-Miss
Clones
The definition of a late propagation regarding identi-
cal clones is straightforward: an inconsistent modifica-
tion of an identical clone causing the fragments to be
non-identical until another inconsistent change to the
fragments makes them identical again. However, the
definition is not suitable for near-miss clones because
they are not completely identical–changes between the
identical and the non-identical parts have to be dif-
ferentiated. The challenging question that arises from
this fact is:
What are the essential characteristics of a
change that makes an inconsistent change to
a near-miss clone consistent at a later point
of time?
One way to define the late propagation pattern for
near-miss clones is to focus exclusively on the identical
parts of a clone disregarding the gaps as the gaps are
already not common between the cloned fragments.
In this case, we would regard a near-miss clone to
be changed consistently if the identical parts undergo
the same modifications and continue to be identical–
analogously to the definition of a late propagation of
identical clones. Hence, to recognize an inconsistent
ECEASST
Late Propagation in Near-Miss Clones: An Empirical Study
Manishankar Mondal1, Chanchal K. Roy2, Kevin A. Schneider3
1 mshankar.mondal@usask.ca, https://homepage.usask.ca/~mam815/
2 croy@cs.usask.ca, http://www.cs.usask.ca/~croy/
3 kevin.schneider@usask.ca, http://www.cs.usask.ca/~kas/
University of Saskatchewan, Canada
Abstract:
If two or more code fragments in the code-base of a software system are exactly
or nearly similar to one another, we call them code clones. It is often important
that updates (i.e., changes) in one clone fragment should be propagated to the other
similar clone fragments to ensure consistency. However, if there is a delay in this
propagation because of unawareness, the system might behave inconsistently. This
delay in propagation, also known as late propagation, has been investigated by a
number of existing studies. However, the existing studies did not investigate the
intensity as well as the effect of late propagation in different types of clones separately.
Also, late propagation in Type 3 clones is yet to be investigated. In this research
work we investigate late propagation in three types of clones (Type 1, Type 2, and
More late propagations in type-3
clones than in others
Late propagations occur in small
(block-size) clones
A Study of Consistent and Inconsistent Changes to Code Clones
Jens Krinke
FernUniversität in Hagen, Germany
krinke@acm.org
Abstract
Code Cloning is regarded as a threat to software main-
tenance, because it is generally assumed that a change to
a code clone usually has to be applied to the other clones
of the clone group as well. However, there exists little
empirical data that supports this assumption. This paper
presents a study on the changes applied to code clones in
open source software systems based on the changes between
versions of the system. It is analyzed if changes to code
clones are consistent to all code clones of a clone group or
not. The results show that usually half of the changes to
code clone groups are inconsistent changes. Moreover, the
study observes that when there are inconsistent changes to
a code clone group in a near version, it is rarely the case
that there are additional changes in later versions such that
the code clone group then has only consistent changes.
1 Introduction
Duplicated code is common in all kinds of software systems.
Although cut-copy-paste (-and-adapt) techniques are
considered bad practice, every programmer uses them.
Since these practices involve both duplication and mod-
ification, they are collectively called code cloning. While
the duplicated code is called a code clone. A clone group
whether or not the above mentioned problems are relevant
in practice. Kim et al. [15] investigated the evolution of
code clones and provided a classification for evolving code
clones. Their work already showed that during the evolution
of the code clones, consistent changes to the code clones
of a group are fewer than anticipated. Aversano et al. [4]
did a similar study and they state “that the majority of clone
classes is always maintained consistently.” Geiger et al. [10]
studied the relation of code clone groups and change cou-
plings (files which are committed at the same time, by the
same author, and with the same modification description),
but could not find a (strong) relation. Therefore, this work
will present an empirical study that verifies the following
hypothesis:
During the evolution of a system, code clones of
a clone group are changed consistently.
Of course, a system may contain bugs where a change
has been applied to some code clones, but has been forgot-
ten for other code clones of the clone group. For stable
systems it can be assumed that such bugs will be resolved
at a later time. This results in a second hypothesis:
During the evolution of a system, if code clones
of a clone group are not changed consistently, the
missing changes will appear in a later version.
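As a rough illustration of the first hypothesis, the sketch below labels one change to a clone group as consistent or inconsistent. It is a simplification under stated assumptions: clones are identified by hypothetical string ids, and a change counts as consistent when every clone of the group was touched between two versions, without checking that the modifications themselves are equivalent. The function name `classify_change` is not from the paper.

```python
from typing import Set


def classify_change(group: Set[str], changed: Set[str]) -> str:
    """Classify a change between two versions with respect to one clone group.

    group   -- ids of all clones in the clone group
    changed -- ids of all clones modified between the two versions
    """
    touched = group & changed
    if not touched:
        return "unchanged"
    # Simplification: "consistent" here only means all clones were touched.
    return "consistent" if touched == group else "inconsistent"


if __name__ == "__main__":
    group = {"a.c:10-40", "b.c:55-85", "c.c:7-37"}
    print(classify_change(group, {"a.c:10-40", "b.c:55-85", "c.c:7-37"}))
    print(classify_change(group, {"a.c:10-40"}))
```

Counting these labels over all changes in a version history gives the kind of ratio between consistent and inconsistent changes that the study reports.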
ECEASST
Studying Late Propagations in Code Clone Evolution Using
Software Repository Mining
Hsiao Hui Mui1, Andy Zaidman1 and Martin Pinzger1
1 hsiaomui@gmail.com, a.e.zaidman@tudelft.nl
Software Engineering Research Group
Delft University of Technology, the Netherlands
2 martin.pinzger@aau.at
Software Engineering Research Group
University of Klagenfurt, Austria
Abstract: In the code clone evolution community, the Late Propagation (LP) has
been identified as one of the clone evolution patterns that can potentially lead to
software defects. An LP occurs when instances of a clone pair are changed consis-
tently, but not at the same time. The clone instance, which receives the update at a
later time, might exhibit unintended behavior if the modification was a bugfix. In
this paper, we present an approach to extract LPs from software repositories. Sub-
Inconsistent? LP?
Consistent changes occur half of the time
During the evolution of a system, if code clones
of a clone group are not changed consistently, the
missing changes will appear in a later version.
ECEASST
Studying Late Propagations in Code Clone Evolution Using
Software Repository Mining
Hsiao Hui Mui1, Andy Zaidman1 and Martin Pinzger1
1 hsiaomui@gmail.com, a.e.zaidman@tudelft.nl
Software Engineering Research Group
Delft University of Technology, the Netherlands
2 martin.pinzger@aau.at
Software Engineering Research Group
University of Klagenfurt, Austria
Abstract: In the code clone evolution community, the Late Propagation (LP) has
been identified as one of the clone evolution patterns that can potentially lead to
software defects. An LP occurs when instances of a clone pair are changed consis-
tently, but not at the same time. The clone instance, which receives the update at a
later time, might exhibit unintended behavior if the modification was a bugfix. In
this paper, we present an approach to extract LPs from software repositories. Sub-
LP seldom occurs, and most LPs resynchronize
within one day
Consistent changes occur half of the time
Inconsistent? LP?
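The three outcomes named on this slide (consistent change, inconsistent change, late propagation) can be told apart with a small sketch. The Python below is only an illustration under stated assumptions, not the extraction approach of Mui et al. or the analysis used by Krinke: the function names `classify_change` and `propagation_lag`, the encoding of a change as the version index at which each instance of a clone pair received it, and the label strings are all hypothetical.

```python
from typing import Optional


def classify_change(version_a: Optional[int], version_b: Optional[int]) -> str:
    """Classify how one change reached the two instances of a clone pair.

    version_a / version_b: index of the version in which each instance
    received the change, or None if it never did (hypothetical encoding).
    """
    if version_a is None and version_b is None:
        return "unchanged"
    if version_a is None or version_b is None:
        return "inconsistent"      # only one instance was ever changed
    if version_a == version_b:
        return "consistent"        # both instances changed in the same version
    return "late propagation"      # same change, but in different versions


def propagation_lag(version_a: int, version_b: int) -> int:
    """Number of versions by which the later instance lagged behind."""
    return abs(version_a - version_b)


if __name__ == "__main__":
    for pair in [(4, 4), (4, 7), (4, None), (None, None)]:
        print(pair, "->", classify_change(*pair))
```

Under this encoding, a late propagation is simply a consistent change whose two halves landed in different versions, which is why the slide asks whether an observed inconsistency is truly inconsistent or an LP that has not resynchronized yet.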
Clones and bugs
Tracking Entities
Late Propagation
Clone changes
Release Level Analysis
Science of Computer Programming
An empirical study on inconsistent changes to code clones at the
release level
Nicolas Bettenburg, Weiyi Shang, Walid M. Ibrahim, Bram Adams, Ying Zou,
Ahmed E. Hassan
Queen’s University, Kingston, Ontario, Canada
Article info
Keywords:
Software engineering
Maintenance management
Reuse models
Clone detection
Maintainability
Software evolution
Abstract
To study the impact of code clones on software quality, researchers typically carry out
their studies based on fine-grained analysis of inconsistent changes at the revision level.
As a result, they capture much of the chaotic and experimental nature inherent in any on-
going software development process. Analyzing highly fluctuating and short-lived clones
is likely to exaggerate the ill effects of inconsistent changes on the quality of the released
software product, as perceived by the end user. To gain a broader perspective, we perform
an empirical study on the effect of inconsistent changes on software quality at the release
level. Based on a case study on three open source software systems, we observe that
only 1.02%–4.00% of all clone genealogies introduce software defects at the release level,
as opposed to the substantially higher percentages reported by previous studies at the
revision level. Our findings suggest that clones do not have a significant impact on the
post-release quality of the studied systems, and that the developers are able to effectively
manage the evolution of cloned code.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
Code clones are the source of heated debates among software maintenance researchers. Developers typically clone (copy)
existing pieces of code in order to jumpstart the development of a new feature, or to reuse robust parts of the source code
for new development. However, unless a clone is reused as is, developers quickly lose track of the link between the clone
and the cloned piece of code, especially after some local modifications. Losing the links between clones increases the risk
of inconsistent changes. These are code changes that are applied to only one clone, whereas they should propagate to all
clones, such as defect fixing changes.
There is no consensus on whether the positive traits of cloning, such as effective reuse, outweigh its drawbacks, such as
increased risk of deteriorated software quality. Many researchers consider clones to be harmful [3,6,14,21,22,27,36], due to
the belief that inconsistent changes increase both maintenance effort and the likelihood of introducing defects. Yet, other
researchers do not find empirical evidence of harm [39,47], or even establish cloning as a valuable software engineering
method to overcome language limitations or to specialize common parts of the code [10,24–26]. It is not yet clear which of
these two visions prevails, or whether the right vision depends on the software system at hand [15,43,47].
Empirical studies on code clones almost exclusively focus on the impact of cloning on developers, such as the developers’
ability to keep track of all related clones in a clone group and their ability to consistently propagate changes to all clones.
Many studies analyze inconsistent changes to clones and the general evolution (genealogy) of clone groups across very small
Evaluating Code Clone Genealogies at Release Level: An Empirical Study
Ripon K. Saha, Muhammad Asaduzzaman, Minhaz F. Zibran, Chanchal K. Roy, and Kevin A. Schneider
Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada S7N 5C9
{ripon.saha, md.asad, minhaz.zibran, chanchal.roy, kevin.schneider}@usask.ca
Abstract
Code clone genealogies show how clone groups
evolve with the evolution of the associated software
system, and thus could provide important insights on
the maintenance implications of clones. In this paper,
we provide an in-depth empirical study for evaluating
clone genealogies in evolving open source systems at
the release level. We develop a clone genealogy
extractor, examine 17 open source C, Java, C++ and
C# systems of diverse varieties and study different
dimensions of how clone groups evolve with the
evolution of the software systems. Our study shows that
majority of the clone groups of the clone genealogies
either propagate without any syntactic changes or
change consistently in the subsequent releases, and
that many of the genealogies remain alive during the
evolution. These findings seem to be consistent with the
findings of a previous study that clones may not be as
detrimental in software maintenance as believed to be
(at least by many of us), and that instead of
aggressively refactoring clones, we should possibly
focus on tracking and managing clones during the
evolution of software systems.
an essential part of software maintenance. However,
due to the intense use of template-based programming
[12], a certain amount of clones are likely acceptable.
Previous studies were highly influenced by the idea
that clones are harmful and can be removed through
refactoring [15]. This notion has been challenged by
the work of Kim et al. [15]. They provided a clone
genealogy model and analyzed the clone genealogies
of two open source software systems. While a clone
group consists of a set of code fragments in a particular
version of a software that are clones to each other, a
genealogy of a clone group describes how the code
fragments of that clone group propagate during the
evolution of the subject system. Each clone genealogy
consists of a set of clone lineages that originate from
the same clone group (source). A clone lineage is a
directed acyclic graph that describes the evolution
history of a clone group from the beginning to the final
release of the software system. The empirical study
described by Kim et al. on code clone genealogy
reveals that clones are not always harmful.
Programmers intentionally practice code cloning to
achieve certain benefits [12, 13]. During the
development of a software system, many clones are
short lived. Refactoring them aggressively can
Most inconsistent changes to clones are not
visible at the release level
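Why an inconsistency seen at the revision level can vanish at the release level can be sketched as follows. This is an illustration only and not the method of Bettenburg et al. or Saha et al.: the `Interval` encoding (the revision where an inconsistency appears and the revision where it is resolved, or `None` if it never is), the function name `visible_at_release`, and the example revision numbers are all assumptions made here.

```python
from typing import Iterable, Optional, Tuple

# (revision where the inconsistency appeared,
#  revision where it was resolved, or None if still unresolved)
Interval = Tuple[int, Optional[int]]


def visible_at_release(interval: Interval, releases: Iterable[int]) -> bool:
    """True if some release snapshot falls inside the inconsistent interval.

    The inconsistency is assumed to be present in revisions
    start, start + 1, ..., end - 1 (hypothetical convention).
    """
    start, end = interval
    return any(start <= r and (end is None or r < end) for r in releases)


if __name__ == "__main__":
    releases = [100, 200]  # made-up revision numbers of two releases
    for interval in [(110, 120), (90, 105), (150, None), (210, None)]:
        print(interval, "->", visible_at_release(interval, releases))
```

A short-lived inconsistency such as `(110, 120)` is introduced and repaired between two releases, so a release-level study never observes it, whereas a revision-level study counts it.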
Risks for Clone Changes
Frequency and Risks of Changes to Clones
Nils Göde
University of Bremen
Bremen, Germany
nils@informatik.uni-bremen.de
Rainer Koschke
University of Bremen
Bremen, Germany
koschke@informatik.uni-bremen.de
ABSTRACT
Code Clones (duplicated source fragments) are said to increase
maintenance effort and to facilitate problems caused by
inconsistent changes to identical parts. While this is certainly
true for some clones and certainly not true for others, it is
unclear how many clones are real threats to the system’s quality
and need to be taken care of. Our analysis of clone evolution in
mature software projects shows that most clones are rarely
changed and the number of unintentional inconsistent changes to
clones is small. We thus have to carefully select the clones to
be managed to avoid unnecessary effort managing clones with no
risk potential.
Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance,
and Enhancement—restructuring, reverse engineering, and
reengineering
General Terms
Experimentation, Measurement
Keywords
Software maintenance, clone detection, clone evolution
1. INTRODUCTION
Code clones are similar fragments of source code. There
are many problems caused by the presences of clones. Among
others, the source code becomes larger, change effort in-
There certainly exist clones that are true threats to software
maintenance. Nevertheless, recent research [19, 20] doubts the
harmfulness of clones in general and lists numerous situations
in which clones are a reasonable design decision. From the clone
management perspective, it is desirable to detect and manage only
the harmful clones, because managing clones that have no negative
effects creates only additional effort.
Unfortunately, state-of-the-art clone tools detect and classify
clones based only on similar structures in the source code or one
of its various representations. When it comes to clone-related
problems, however, the most important characteristic of a clone
is its change behavior and not its structure. Only if a clone
changes, it causes additional change effort. Only if a clone
changes, unintentional inconsistencies can arise. If, on the
other hand, a clone never changes, there are no additional costs
induced by propagating changes and there is no risk of unwanted
inconsistencies.
Our hypothesis is that many clones detected by state-of-the-art
tools are “structurally interesting” but irrelevant to software
maintenance because they never change during their lifetime.
Up-to-date clone detectors can efficiently process and detect
clones within huge amounts of source code, consequently
delivering huge numbers of clones. In contrast, clone assessment
and deciding how to proceed can be very costly even for
individual clones as we have experienced with clones in our own
code [11]. Hence, having many unproblematic clones in the
detection results creates enormous overhead for assessing and
managing clones that do not threaten maintenance because they
never change.
Inconsistent changes are often intentional
Not worth planning clone maintenance where
it is not needed
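The idea that change behavior, rather than structure, decides which clones deserve attention can be sketched as a simple triage step. The code below is a hypothetical illustration, not the analysis of Göde and Koschke: the function name `split_by_change_behavior`, the clone identifiers, and the change counts are invented for the example.

```python
from typing import Dict, List


def split_by_change_behavior(change_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Separate clones that never changed from clones that did.

    change_counts maps a clone identifier to the number of changes it
    received during its lifetime (hypothetical input format).
    """
    never_changed = sorted(c for c, n in change_counts.items() if n == 0)
    # Most frequently changed clones first; ties broken by identifier.
    changed = sorted(
        (c for c, n in change_counts.items() if n > 0),
        key=lambda c: (-change_counts[c], c),
    )
    return {"never_changed": never_changed, "changed": changed}


if __name__ == "__main__":
    counts = {"c1": 0, "c2": 3, "c3": 1, "c4": 0}  # made-up data
    print(split_by_change_behavior(counts))
```

Only the clones in the `changed` bucket can cause propagation effort or unintentional inconsistencies, so they are the natural candidates for management; the rest can be set aside.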
Tracking Entities
Late Propagation
Clone changes
Clones and bugs
Empir Software Eng (2012) 17:503–530
DOI 10.1007/s10664-011-9195-3
Clones: what is that smell?
Foyzur Rahman · Christian Bird ·
Premkumar Devanbu
Published online: 24 December 2011
© Springer Science+Business Media, LLC 2011
Editors: Jim Whitehead and Tom Zimmermann
Abstract Clones are generally considered bad programming practice in software
engineering folklore. They are identified as a bad smell (Fowler et al. 1999) and a
major contributor to project maintenance difficulties. Clones inherently cause code
bloat, thus increasing project size and maintenance costs. In this work, we try to
validate the conventional wisdom empirically to see whether cloning makes code
more defect prone. This paper analyses the relationship between cloning and defect
proneness. For the four medium to large open source projects that we studied, we
find that, first, the great majority of bugs are not significantly associated with clones.
Second, we find that clones may be less defect prone than non-cloned code. Third,
we find little evidence that clones with more copies are actually more error prone.
Fourth, we find little evidence to support the claim that clone groups that span more
than one file or directory are more defect prone than collocated clones. Finally, we
find that developers do not need to put a disproportionately higher effort to fix
clone dense bugs. Our findings do not support the claim that clones are really a
“bad smell” (Fowler et al. 1999). Perhaps we can clone, and breathe easily, at the
same time.
Keywords Empirical software engineering · Software maintenance ·
Software clone · Software quality · Software evolution
Most defect-prone code (>80%)
does not contain clones
Large clones have lower defect density
Amount of changes to fix bugs is
smaller for clones
Duplicate bugs in clones
Bug Replication in Code Clones: An Empirical
Study
Judith F. Islam Manishankar Mondal Chanchal K. Roy
Department of Computer Science, University of Saskatchewan, Canada
{judith.islam, mshankar.mondal, chanchal.roy}@usask.ca
Abstract—Code clones are exactly or nearly similar code
fragments in the code-base of a software system. Existing studies
show that clones are directly related to bugs and inconsistencies
in the code-base. Code cloning (making code clones) is suspected
to be responsible for replicating bugs in the code fragments.
However, there is no study on the possibilities of bug-replication
through cloning process. Such a study can help us discover ways
of minimizing bug-replication. Focusing on this we conduct an
empirical study on the intensities of bug-replication in the code
clones of the major clone-types: Type 1, Type 2, and Type 3.
According to our investigation on thousands of revisions of
six diverse subject systems written in two different programming
languages, C and Java, a considerable proportion (i.e., up to
10%) of the code clones can contain replicated bugs. Both Type
2 and Type 3 clones have higher tendencies of having replicated
bugs compared to Type 1 clones. Thus, Type 2 and Type 3 clones
are more important from clone management perspectives. The
extent of bug-replication in the buggy clone classes is generally
very high (i.e., 100% in most of the cases). We also find that
overall 55% of all the bugs experienced by the code clones can
be replicated bugs. Our study shows that replication of bugs
through cloning is a common phenomenon. Clone fragments
having method-calls and if-conditions should be considered for
refactoring with high priorities, because such clone fragments
have high possibilities of containing replicated bugs. We believe
that our findings are important for better maintenance of software
systems, in particular, systems with code clones.
I. INTRODUCTION
If two or more code fragments in a software system’s code-
base are exactly or nearly similar to one another we call them
code clones [44], [45]. A group of similar code fragments
forms a clone class. Code clones are mainly created because
of the frequent copy/paste activities of the programmers during
software development and maintenance. Whatever may be the
reasons behind cloning, code clones are of great importance
from the perspectives of software maintenance and evolution
[44].
If a code fragment contains a bug and a programmer copies that code
fragment to several other places in the code-base without the
knowledge of the existing bug, the bug in the original fragment
gets replicated. Fixing of such replicated bugs may require
increased maintenance effort and cost for software systems.
However, although cloning is suspected to be responsible for
replicating bugs, there is no study on the possibilities of
bug-replication through cloning. Such a study can provide us
helpful insights for minimizing bug-replication as well as for
prioritizing code clones for refactoring or tracking. Focusing
on this we conduct an in-depth empirical study regarding bug-
replication in the code clones of the major clone-types: Type
1, Type 2, Type 3.
We conduct our empirical study on thousands of revisions
of six diverse subject systems written in two different program-
ming languages (Java and C). We detect code clones from
each of the revisions of a subject system using the NiCad
[6] clone detector, analyze the evolution history of these code
clones, and investigate whether and to what extent they contain
replicated bugs. We answer four important research questions
(Table I) regarding the intensity and cause of bug-replication
through our investigation. According to our investigation in-
volving rigorous manual analysis we can state that:
(1) A considerable percentage of the code clones can be
related to bug-replication. According to our observation up
to 10% of the code clones in a software system can contain
replicated bugs.
(2) Both Type 2 and Type 3 clones have higher possibilities
of containing replicated bugs compared to Type 1 clones. Thus,
Type 2 and Type 3 clones should be given higher priorities for
management.
(3) A considerable proportion (around 55%) of the bugs
occurred in code clones can be replicated bugs.
(4) Most of the replicated bugs are related to the method-
2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering
Over half of bugs occurring in clones are
duplicated bugs
Chapter Four:
Take-Aways
Late propagations for type-3 clones
Actually, it does not happen so often
Many clone genealogies
Consistent if we look at release level
Late propagation is highly correlated with defects
But no more than defects in non-cloned code
We now have data, infrastructure and
computational power for larger, better studies
Comparing Approaches
Comparative Stability of Cloned and Non-cloned Code: An
Empirical Study
Manishankar Mondal¹, Chanchal K. Roy¹, Md. Saidur Rahman¹, Ripon K. Saha¹, Jens Krinke², Kevin A. Schneider¹
¹ Department of Computer Science, University of Saskatchewan, Canada
² University College London, UK
¹ {mshankar.mondal, chanchal.roy, saeed.cs, ripon.saha, kevin.schneider}@usask.ca
² j.krinke@ucl.ac.uk
ABSTRACT
Code cloning is a controversial software engineering practice
due to contradictory claims regarding its effect on software
maintenance. Code stability is a recently introduced mea-
surement technique that has been used to determine the
impact of code cloning by quantifying the changeability of a
code region. Although most of the existing stability analy-
sis studies agree that cloned code is more stable than non-
cloned code, the studies have two major flaws: (i) each study
only considered a single stability measurement (e.g., lines of
code changed, frequency of change, age of change); and, (ii)
only a small number of subject systems were analyzed and
these were of limited variety.
In this paper, we present a comprehensive empirical study
on code stability using three different stability measuring
methods. We use a recently introduced hybrid clone detec-
tion tool, NiCAD, to detect the clones and analyze their
stability in four dimensions: by clone type, by measuring
method, by programming language, and by system size and
age. Our four-dimensional investigation on 12 diverse sub-
ject systems written in three programming languages consid-
ering three clone types reveals that: (i) Type-1 and Type-2
clones are unstable, but Type-3 clones are not; (ii) clones
in Java and C systems are not as stable as clones in C#
systems; (iii) a system’s development strategy might play a
key role in defining its comparative code stability scenario;
and, (iv) cloned and non-cloned regions of a subject system
do not follow a consistent change pattern.
Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance,
and Enhancement—Restructuring, Reverse Engineering and
Keywords
Code Stability; Modification Frequency; Average Last Change
Date; Average Age; Clone Types
1. INTRODUCTION
Frequent copy-paste activity by programmers during soft-
ware development is common. Copying a code fragment
from one location and pasting it to another location with
or without modifications cause multiple copies of exact or
closely similar code fragments to co-exist in software sys-
tems. These code fragments are known as clones. Whatever
may be the reasons behind cloning, the impact of clones on
software maintenance and evolution is of great concern.
The common belief is that, the presence of duplicate code
poses additional challenges to software maintenance by making
inconsistent changes more difficult, introducing bugs and
as a result increasing maintenance efforts. From this point of
view, some researchers have identified clones as “bad smells”
and their studies showed that clones have negative impact on
software quality and maintenance [7, 14, 15]. On the other
hand, there has been a good number of empirical evidence
in favour of clones concluding that clones are not harmful
[1, 6, 9, 10, 18]. Instead, clones can be useful from different
points of views [8].
A widely used term to assess the impact of clones on soft-
ware maintenance is stability [6, 11, 12, 14]. Because if
cloned code is more stable (changes less frequently) as com-
pared to non-cloned code during software evolution, it can
be concluded that cloned code does not significantly increase
maintenance efforts. Different researchers have defined and
evaluated stability from different viewpoints which can be
broadly divided into two categories:
(1) Stability measurement in terms of changes:
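As a rough illustration of what "changes less frequently" could mean for cloned versus non-cloned code, the sketch below compares how often lines in each region are touched per commit. It is a minimal Python sketch and not the measurement used in the cited study: the function name `change_rate`, the representation of commits as sets of changed line numbers, and the toy data are all assumptions made for illustration.

```python
from typing import List, Set


def change_rate(changed_lines_per_commit: List[Set[int]],
                region_lines: Set[int]) -> float:
    """Average fraction of a region's lines touched per commit.

    Hypothetical metric: a lower value means the region is more stable.
    """
    if not region_lines or not changed_lines_per_commit:
        return 0.0
    # Count, over all commits, how many changed lines fall inside the region.
    touched = sum(len(changed & region_lines)
                  for changed in changed_lines_per_commit)
    return touched / (len(region_lines) * len(changed_lines_per_commit))


# Toy data (assumed): lines 1-4 are cloned, lines 5-12 are not.
cloned = {1, 2, 3, 4}
non_cloned = set(range(5, 13))
commits = [{1, 5, 6}, {6, 7}, {2, 8, 9, 10}]

print("cloned:", change_rate(commits, cloned))
print("non-cloned:", change_rate(commits, non_cloned))
```

With this toy history the cloned region has the lower change rate, which is the situation the excerpt describes as cloned code being more stable.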
Genealogy Extractors
An Automatic Framework for Extracting and
Classifying Near-Miss Clone Genealogies
Ripon K. Saha Chanchal K. Roy Kevin A. Schneider
Department of Computer Science, University of Saskatchewan, Canada
{ripon.saha, chanchal.roy, kevin.schneider}@usask.ca
Abstract—Extracting code clone genealogies across multiple
versions of a program and classifying them according to their
change patterns underlies the study of code clone evolution.
While there are a few studies in the area, the approaches do
not handle near-miss clones well and the associated tools are
often computationally expensive. To address these limitations,
we present a framework for automatically extracting both exact
and near-miss clone genealogies across multiple versions of a
program and for identifying their change patterns using a few key
similarity factors. We have developed a prototype clone genealogy
extractor, applied it to three open source projects including the
Linux Kernel, and evaluated its accuracy in terms of precision
and recall. Our experience shows that the prototype is scalable,
adaptable to different clone detection tools, and can automatically
identify evolution patterns of both exact and near-miss clones by
constructing their genealogies.
Index Terms—clone genealogy extractor; mapping; clone evo-
lution.
I. INTRODUCTION
The investigation and analysis of code clones has attracted
considerable attention from the software engineering research
community in recent years. Researchers have presented ev-
idence that code clones have both positive [10], [22] and
negative [16] consequences for maintenance activities and
thus, in general, code clones are neither good nor bad. It is
also not possible or practical to eliminate certain clone classes
from a software system [10]. Consequently, the identification
logs provided by source code repositories such as svn. In
the third approach [15], [6], clones are mapped during clone
detection based on source code changes between revisions. A
combination of the first and second approaches has also been
used in some studies [3].
Although intuitive, each of these approaches has some
limitations. In the first approach, a number of the similarity
metrics used to map clones have quadratic time complexities
[9]. In addition, if a clone fragment changes significantly
in the next version and goes beyond the given similarity
threshold of the clone genealogy extractor, a mapping may not
be identified. In the second approach, only clones identified
in the first version are mapped. Therefore, we do not know
what happens to clones introduced in later versions. The
third approach (“incremental approach”) avoids some of the
limitations of the previous two approaches by combining
detection and mapping, and works well for mapping clones
in many versions. By integrating clone detection and clone
mapping this approach can be faster than the approaches that
require clone detection to be conducted separately for each
version. Although this incremental approach is fast enough
both for detection and mapping for a given set of revisions,
it might not be as beneficial at the release level [6] because
there might be a significant difference between the releases.
Furthermore, in the sole available incremental tool, iClones
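The first approach mentioned in the excerpt, mapping clones between versions by similarity with a threshold, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the extractor from the paper: `map_clones`, the fragment identifiers, and the 0.7 threshold are hypothetical, and the similarity measure is simply `difflib.SequenceMatcher.ratio()` from the standard library.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def map_clones(old: Dict[str, str], new: Dict[str, str],
               threshold: float = 0.7) -> Dict[str, Optional[str]]:
    """Map each old clone fragment to its most similar new fragment.

    A fragment whose best match falls below `threshold` is mapped to None,
    mirroring the limitation noted in the excerpt: a fragment that changes
    too much between versions is lost.
    """
    mapping: Dict[str, Optional[str]] = {}
    for old_id, old_text in old.items():
        best_id, best_score = None, 0.0
        # All-pairs comparison: simple, but costly on large systems.
        for new_id, new_text in new.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score > best_score:
                best_id, best_score = new_id, score
        mapping[old_id] = best_id if best_score >= threshold else None
    return mapping


# Toy fragments (assumed) from two consecutive versions.
version_1 = {
    "f1": "int add(int a, int b) { return a + b; }",
    "f2": "zzzzzzzz",
}
version_2 = {
    "g1": "int add(int a, int b) { return a + b + 0; }",
    "g2": "void log(char *msg) { puts(msg); }",
}
print(map_clones(version_1, version_2))
```

Here `f1` is mapped to `g1`, while `f2` has no sufficiently similar counterpart and is mapped to `None`.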
Tooling
Clone Detection in Modern IDEs
https://blogs.msdn.microsoft.com/zainnab/2012/06/28/visual-studio-2012-new-features-code-clone-analysis/
Clone Tracking Should Also Be
Put Into Practice
Clone Region Descriptors: Representing and
Tracking Duplication in Source Code
EKWA DUALA-EKOKO and MARTIN P. ROBILLARD
McGill University
Source code duplication, commonly known as code cloning, is considered an obstacle to software
maintenance because changes to a cloned region often require consistent changes to other regions
of the source code. Research has provided evidence that the elimination of clones may not always
be practical, feasible, or cost-effective. We present a clone management approach that describes
clone regions in a robust way that is independent from the exact text of clone regions or their
location in a file, and that provides support for tracking clones in evolving software. Our technique
relies on the concept of abstract clone region descriptors (CRDs), which describe clone regions using
a combination of their syntactic, structural, and lexical information. We present our definition of
CRDs, and describe a clone tracking system capable of producing CRDs from the output of dif-
ferent clone detection tools, notifying developers of modifications to clone regions, and supporting
updates to the documented clone relationships. We evaluated the performance and usefulness
of our approach across three clone detection tools and five subject systems, and the results in-
dicate that CRDs are a practical and robust representation for tracking code clones in evolving
software.
Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance,
and Enhancement
General Terms: Design, Experimentation
Additional Key Words and Phrases: Source code duplication, code clones, clone detection, refactor-
ing, clone management
ACM Reference Format:
Duala-Ekoko, E. and Robillard, M. P. 2010. Clone region descriptors: Representing and tracking
duplication in source code. ACM Trans. Softw. Eng. Methodol. 20, 1, Article 3 (June 2010), 31 pages.
DOI = 10.1145/1767751.1767754 http://doi.acm.org/10.1145/1767751.1767754
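To make the idea of a location-independent description concrete, the sketch below describes a clone region by where it sits in the program structure rather than by line numbers, so the region can be found again after it moves within a file. It is a minimal Python sketch and not the CRD definition from the paper: the class names, every field name, and the `relocate` helper are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegionDescriptor:
    """Hypothetical descriptor: structural context instead of line numbers."""
    file: str
    class_name: str
    method_name: str
    block_kind: str    # e.g. "if", "for" (assumed vocabulary)
    block_anchor: str  # e.g. the condition text of the block


@dataclass(frozen=True)
class Region:
    descriptor: RegionDescriptor
    start_line: int
    end_line: int


def relocate(descriptor: RegionDescriptor,
             regions: List[Region]) -> List[Region]:
    """Find regions in a newer version that match the descriptor."""
    return [r for r in regions if r.descriptor == descriptor]


# A region recorded at lines 10-20 is found again at lines 42-52.
tracked = RegionDescriptor("Parser.java", "Parser", "parse",
                           "if", "token == null")
new_version = [
    Region(tracked, 42, 52),
    Region(RegionDescriptor("Parser.java", "Parser", "reset",
                            "for", "i < n"), 60, 70),
]
print(relocate(tracked, new_version))
```

Matching on structural fields means an edit that only shifts the region up or down in the file does not break tracking; an edit that renames the method would, which is one reason a real descriptor needs to be more forgiving than this exact-equality sketch.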
Applying Clone Change Notification System into an
Industrial Development Process
Yuki Yamanaka ∗, Eunjong Choi ∗, Norihiro Yoshida †, Katsuro Inoue ∗, Tateki Sano ‡
∗ Graduate School of Information Science and Technology, Osaka University, Japan
{y-yuuki, ejchoi, inoue}@ist.osaka-u.ac.jp
† Graduate School of Information Science, Nara Institute of Science and Technology, Japan
yoshida@is.naist.jp
‡ Software Process Innovation and Standardization Division, NEC Corporation, Japan
t-sano@cp.jp.nec.com
Abstract—Programmers tend to write code clones unintention-
ally even in the case that they can easily avoid them. Clone change
management is one of crucial issues in open source software
(OSS) development as well as in industrial software development
(e.g., development of social infrastructure, financial system, and
medical equipment). When an industrial developer fixes a defect,
he/she has to find the code clones corresponding to the code
fragment including it. So far, several studies performed on the
analysis of clone evolution in OSS. However, to our knowledge,
a few researches have been reported on an application of a clone
change notification system to industrial development process.
In this paper, we introduce a system for notifying creation
and change of code clones, and then report on the experience
with 40-days application of it into a development process in
NEC Corporation. In the industrial application, a developer
successfully identified ten unintentionally-developed clones that
should be refactored.
Index Terms—Code Clone, Software Maintenance, Refactoring
I. INTRODUCTION
A code clone is a code fragment that has similar or identical
code fragments in source code. Many code clone detection
Because the team plans long-time maintenance as well as
reuse for other system developments, the developers are highly
motivated to merge code clones into a single module.
However, the cost of refactoring cannot be ignored espe-
cially in industry. Regression test after refactoring takes much
cost to preserve behavior after refactoring. The development
team at NEC also considers the cost of refactoring. Basically,
they do not touch source code after large-scale system test for
releasing major version of the software because refactoring
after large-scale test leads the re-performance of such costly
test. Therefore, they need to know newly-appeared clones
regularly, especially before large-scale system test.
In this paper, we present clone change notification system
Clone Notifier (see Figure 3) for the promotion of efficient
clone management (e.g., refactoring, simultaneous editing).
Clone Notifier notifies newly-appeared and changed clones
regularly to developers. As an industrial application, we ap-
plied Clone Notifier into the process of the web application
software development at NEC. The result shows 119 newly-
AMIC (Automatic Mining of Important Clones)
http://sr-p2irc-big2.usask.ca/amic/
Above all, in Continuous Integration [Duvall et al., 2007]
[Figure: continuous integration workflow (Duvall et al., 2007). Actors: Developers, SCM Server, CI Server. Arrows: Push changes, Poll, Feedback. CI steps: Compile, Test, Integrate, Check, Deploy, …]
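One way a clone check could sit in the "Check" step of a continuous integration pipeline is sketched below. It is a minimal Python sketch under stated assumptions, not a tool from the talk: it only finds exact (Type-1 style) repeats of a few consecutive lines, and the names `duplicated_ratio` and `ci_check`, the window of 3 lines, and the 20% limit are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def duplicated_ratio(files: Dict[str, List[str]], window: int = 3) -> float:
    """Fraction of non-blank lines lying in a repeated run of `window` lines.

    Lines are compared after stripping whitespace; only exact repeats count.
    """
    seen = defaultdict(list)
    total = 0
    for name, lines in files.items():
        code = [line.strip() for line in lines if line.strip()]
        total += len(code)
        # Record every window of consecutive lines and where it occurs.
        for i in range(len(code) - window + 1):
            seen[tuple(code[i:i + window])].append((name, i))
    duplicated = set()
    for occurrences in seen.values():
        if len(occurrences) > 1:
            for name, i in occurrences:
                duplicated.update((name, i + k) for k in range(window))
    return len(duplicated) / total if total else 0.0


def ci_check(files: Dict[str, List[str]], max_ratio: float = 0.2) -> bool:
    """Return True when duplication stays within the (assumed) limit."""
    return duplicated_ratio(files) <= max_ratio


# Toy input (assumed): two files sharing three consecutive lines.
sources = {
    "a.py": ["x = 1", "y = 2", "z = x + y", "print(z)"],
    "b.py": ["x = 1", "y = 2", "z = x + y", "return z"],
}
print(duplicated_ratio(sources), ci_check(sources))
```

In a real pipeline the boolean would typically be turned into the exit status of the check step, so that the feedback loop to developers reports newly introduced duplication alongside compile and test results.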
Survey in ING NL
[Figure: bar chart, "Metrics Collected to Monitor Source Code Quality". Axis: % of respondents (0% to 100%). Metrics listed: Amount of duplicated code, Cyclomatic complexity, Number of function parameters, Lines of Code (LOC), Comment words, Number of source files, Other. Bar values as extracted: 15%, 16%, 18%, 44%, 51%, 69%, 78%.]
Cloning From Forums…
Tomorrow 9:00AM:
Stack Overflow: A Code Laundering Platform?
Le An, Ons Mlouki, Foutse Khomh and Giuliano Antoniol
Conclusion
Tracking clone changes
Clone class A
Clone class B
Snap 1 Snap 2 Snap 3 Snap 4 Snap 5 Snap 6 Snap 6
Consistent change
Late propagation
Independent evolution
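The three patterns named on this slide can be told apart, in a simplified form, by checking at each snapshot whether two fragments of a clone class are still in sync. The sketch below is a minimal Python illustration under stated assumptions, not the procedure of the original study (which relied on co-change analysis and manual classification): `classify_pair` is a hypothetical helper, each history is assumed to hold a fragment's normalized text per snapshot, and "in sync" is reduced to text equality.

```python
from typing import List


def classify_pair(history_a: List[str], history_b: List[str]) -> str:
    """Classify how two clone fragments evolve across aligned snapshots."""
    if len(history_a) != len(history_b) or not history_a:
        raise ValueError("histories must be non-empty and aligned")

    # True at snapshot i when both fragments have the same text.
    in_sync = [a == b for a, b in zip(history_a, history_b)]
    # True when at least one fragment changed between two snapshots.
    changed = any(
        history_a[i] != history_a[i - 1] or history_b[i] != history_b[i - 1]
        for i in range(1, len(history_a))
    )

    if not changed:
        return "unchanged"
    if all(in_sync):
        return "consistent change"      # both changed in the same snapshot
    if in_sync[-1]:
        return "late propagation"       # diverged, then re-aligned later
    return "independent evolution"      # diverged and stayed different


print(classify_pair(["x", "y", "y"], ["x", "y", "y"]))
print(classify_pair(["x", "y", "y"], ["x", "x", "y"]))
print(classify_pair(["x", "y", "z"], ["x", "x", "x"]))
```

Note that a pair classified as independent evolution here might still turn into a late propagation if a later snapshot brings the fragments back in sync; the sketch only sees the history it is given.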
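Evolution Patterns
[Figure: bar chart of clone evolution patterns (Consistent, Indep. Evolution, Late Propagation, Unknown) for ArgoUML, JBoss, OpenSSH, and PostgreSQL. Vertical axis: 0% to 80%. Bar values as extracted: 0%, 0%, 3%, 4%, 16%, 4%, 5%, 7%, 39%, 24%, 52%, 34%, 38%, 71%, 40%, 55%.]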
Tracking Entities
Late Propagation
Clone changes
Clones and bugs

Most Influential Paper - SANER 2017

  • 1.
    How Clones areMaintained 2007 - 2017 Luigi Cerulo Max Di Penta Lerina Aversano University of Sannio
  • 2.
    How Clones areMaintained 2007 - 2017 Luigi Cerulo Max Di Penta Lerina Aversano University of Sannio
  • 3.
    Chapter 1 -How everything started Chapter 2 - The follow-up Chapter 3 - The impact Chapter 4 - Take-aways
  • 4.
    Chapter 1 -How everything started Chapter 2 - The follow-up Chapter 3 - The impact Chapter 4 - Take-aways
  • 5.
    Chapter 1 -How everything started Chapter 2 - The follow-up Chapter 3 - The impact Chapter 4 - Take-aways
  • 6.
    Chapter 1 -How everything started Chapter 2 - The follow-up Chapter 3 - The impact Chapter 4 - Take-aways
  • 7.
  • 10.
    … SE Prophetsenvisioned a new future
  • 11.
    … SE Prophetsenvisioned a new future Clone genealogies (ESEC/FSE 2005) SAME SHIFT INCONSISTENT CHANGE ADD CONSISTENT CHANGE SUBTRACT Figure 1: The relationship among evolution patter traces code clones in consecutive versions using a metr based clone detector and classifies clones into four cat gories: new clones, modified clones, never modified clon and deleted clones. Their analysis does not address how e ments in a group of code clones change with respect to oth elements in the group. To the best of our knowledge, o clone genealogy extractor (detailed in Section 4) is the fir tool that systematically analyzes clone evolution patter by monitoring how a clone group evolves. Techniques for Analyzing Structural Changes
  • 12.
    … SE Prophetsenvisioned a new future Clone genealogies (ESEC/FSE 2005) SAME SHIFT INCONSISTENT CHANGE ADD CONSISTENT CHANGE SUBTRACT Figure 1: The relationship among evolution patter traces code clones in consecutive versions using a metr based clone detector and classifies clones into four cat gories: new clones, modified clones, never modified clon and deleted clones. Their analysis does not address how e ments in a group of code clones change with respect to oth elements in the group. To the best of our knowledge, o clone genealogy extractor (detailed in Section 4) is the fir tool that systematically analyzes clone evolution patter by monitoring how a clone group evolves. Techniques for Analyzing Structural Changes Change coupling and clones
 (FASE 2006) Relation of Code Clones and Change Couplings 7 Number of Couplings Clone Coverage CouplingCoverage Length of Clone Fig. 2. Description of the metrics used in the visualization.
  • 13.
    … SE Prophetsenvisioned a new future “Cloning considered harmful” considered harmful (WCRE 2006) Clone genealogies (ESEC/FSE 2005) SAME SHIFT INCONSISTENT CHANGE ADD CONSISTENT CHANGE SUBTRACT Figure 1: The relationship among evolution patter traces code clones in consecutive versions using a metr based clone detector and classifies clones into four cat gories: new clones, modified clones, never modified clon and deleted clones. Their analysis does not address how e ments in a group of code clones change with respect to oth elements in the group. To the best of our knowledge, o clone genealogy extractor (detailed in Section 4) is the fir tool that systematically analyzes clone evolution patter by monitoring how a clone group evolves. Techniques for Analyzing Structural Changes Change coupling and clones
 (FASE 2006) Relation of Code Clones and Change Couplings 7 Number of Couplings Clone Coverage CouplingCoverage Length of Clone Fig. 2. Description of the metrics used in the visualization.
  • 14.
    Somebody was analyzing sourcecode line trails (ldiff)…
  • 15.
    Somebody was analyzing sourcecode line trails (ldiff)… MSR 2007
  • 16.
    Track the lifetimeof software entities ldiff’s ability to identify moved line blocks and thus its ability to track a software entity when its position in a file changes. To this end, we ran- domly generated new releases of 100 source code files selected from two open source projects (Post- greSQL and openSSH) by randomly moving code fragments within the source code file. The frag- ments varied from 1 line to a maximum of 1/10 of the total number of lines. We assessed the algo- rithm in terms of precision and recall: precision=numberofcorrectlydetectedmoves/ extracted change sets from the ArgoUML Con- current Versions System (CVS) repository, repre- senting different types of changes, such as bug fix- ing, refactoring, or enhancement. We assessed the tool’s precision by manually identifying false posi- tives in classifications the algorithm made. The 11 change sets affected from 11 to 72 files (median 19) and from 32 to 401 lines (median 42). Figure 3b shows the median ldiff and Unix diff accuracy and the interquartile range (between the third and first quartile). (For the ldiff syntax, see the “Ldiff: A Support Tool” sidebar.) /* * foo(revision 1.3) */ int foo(float a, int b) { return a; } Snapshots extracted from Concurrent Versions System/ Subversion archive Entity A added Entity B changed Entity A changed Entity B deleted Time Snapshot 1 Entity A tracking Entity B tracking Snapshot 2 LDA(1,2) LDA(2,3) Snapshot 3 LDA(3,4) Snapshot 4 LDA(4,5) Snapshot 5 LDA(n – 1, n) Snapshot n DEL CHG DEL CHG CHG CHG CHG CHG ADD ADD CHG ADD ADD ADD DEL CHG CHG CHG// foo (revision 1.4) float foo(int a, int b) { if (b!=0) return (float)a/b; else return 0; } // foo (revision 1.5) float foo(int a, int b) { int c=0 if (b!=0) return (float)a/b; return c; } IEEESoftware26.1(2009)
  • 17.
    Somebody else used tostudy clone evolution
  • 18.
    Nice surprise! Wegot a grant on software evolution
  • 19.
    Ok… that was notso much money…
  • 20.
  • 21.
    What we wantedto study… Software clones are devils? To what extend they can be assimilated as (bad/ good?) software engineering practices?
  • 22.
  • 23.
    Tracking clone changes: Clone class A, Clone class B
  • 24.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn
  • 25.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn
  • 26.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn
  • 27.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn; Consistent change
  • 28.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn; Consistent change
  • 29.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn; Consistent change
  • 30.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn; Consistent change, Late propagation
  • 31.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn; Consistent change, Late propagation
  • 32.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn; Consistent change, Late propagation
  • 33.
    Tracking clone changes: Clone class A, Clone class B; Snap 1, Snap 2, Snap 3, Snap 4, Snap 5, Snap 6, Sn; Consistent change, Late propagation, Independent evolution
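    The three patterns named on this slide can be illustrated with a small sketch. This is not the tooling used in the study: the function name `classify_pair`, the input encoding (one set of changed fragments per snapshot transition), and the rule that a one-sided change counts as propagated once the other fragment is later changed on its own are all simplifying assumptions made for illustration.

```python
from typing import List, Set

def classify_pair(changes: List[Set[str]]) -> str:
    """Classify the evolution of a hypothetical clone pair {"A", "B"}.

    changes[i] lists the fragments modified between snapshot i+1 and i+2.
    Assumption: a one-sided change is considered propagated as soon as the
    other fragment is modified on its own in a later transition.
    """
    pending = None      # fragment holding a change not yet propagated
    late = False        # True once a delayed propagation was observed
    any_change = False
    for changed in changes:
        touched = changed & {"A", "B"}
        if not touched:
            continue
        any_change = True
        if len(touched) == 2:
            continue                    # both fragments changed together
        (fragment,) = touched
        if pending is None:
            pending = fragment          # one-sided change starts a divergence
        elif pending != fragment:
            late = True                 # the other fragment catches up later
            pending = None
    if pending is not None:
        return "independent evolution"
    if late:
        return "late propagation"
    return "consistent change" if any_change else "unchanged"

# Example histories over four snapshots (three transitions)
print(classify_pair([{"A", "B"}, set(), {"A", "B"}]))
print(classify_pair([{"A"}, set(), {"B"}]))
print(classify_pair([{"A"}, {"A"}, set()]))
```

    A real classifier would compare fragment contents rather than only record which fragment changed; the sketch only shows how the three labels differ in timing.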
  • 34.
  • 36.
  • 37.
    Only two projects. One clone detector.
  • 38.
    Only two projects. One clone detector. Automated clone tracking.
  • 39.
    Only two projects. One clone detector. Automated clone tracking. Manual classification.
  • 40.
  • 41.
    Some findings: Class-level clones mostly consistently changed. Not the case for method and block.
  • 42.
    Some findings: Class-level clones mostly consistently changed. Not the case for method and block. 13%-32% of independent evolution.
  • 43.
    Some findings: Class-level clones mostly consistently changed. Not the case for method and block. Between 13% and 16% of late propagation. 13%-32% of independent evolution.
  • 44.
    Some findings: Class-level clones mostly consistently changed. Not the case for method and block. Between 13% and 16% of late propagation. 13%-32% of independent evolution.
  • 45.
    Some findings: Late propagation often due to different schedule, caused bugs only in few cases. Class-level clones mostly consistently changed. Not the case for method and block. Between 13% and 16% of late propagation. 13%-32% of independent evolution.
  • 46.
    We got the Paper!
    How Clones are Maintained: An Empirical Study
    Lerina Aversano, Luigi Cerulo, Massimiliano Di Penta
    RCOST - Research Centre on Software Technology, Department of Engineering - University of Sannio, Viale Traiano - 82100 Benevento, Italy
    {aversano, lcerulo, dipenta}@unisannio.it
    Abstract: Despite the conventional wisdom concerning the risks related to the use of source code cloning as a software development strategy, several studies appeared in literature indicated that this is not true. In most cases clones are properly maintained and, when this does not happen, is because cloned code evolves independently. Stemming from previous works, this paper combines clone detection and co-change analysis to investigate how clones are maintained when an evolution activity or a bug fixing impact a source code fragment belonging to a clone class. The two case studies reported confirm that, either for bug fixing or for evolution purposes, most of the cloned code is consistently maintained during the same co-change or during temporally close co-changes.
    Keywords: Clone detection, software evolution, mining software repositories
    1. Introduction: Several recent studies contradict the common wisdom that cloning constitutes a risky practice: as found by Kim et al. [16]. As shown in a paper by Kapser and Godfrey [15], source code clones are not necessarily to be considered harmful but, many times, as a way to develop software creating, for example, new features starting for existing, similar ones. Whilst this creates duplications, it also permits the use of stable, already tested and used code. This paper aims to report results from an empirical study aiming to investigate how clones, detected in a given release of a software system, are affected by maintenance intervention. The analysis is performed by intersecting cloned classes with data from Modification Transactions (MTs) mined from source code repositories. A MT identifies groups of source code lines co-changed in the same time window. The work is built upon the idea of clone patterns described by Kapser and Godfrey and of clone evolution patterns described by Kim et al., and investigates whether clones (i) are updated consistently during the same MT or near MTs, confirming the correlation between MTs and clones, as experienced by Geiger et al. [10]; (ii) evolve independently; or (iii) are subject to up-
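    The paper excerpt defines a Modification Transaction (MT) as a group of source code lines co-changed in the same time window. A minimal sketch of one way such groups could be recovered from a change log follows. The `Change` record, the function `group_into_mts`, grouping by author, and the 200-second sliding window are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Change:
    author: str      # who committed the change (assumed grouping criterion)
    timestamp: int   # commit time in seconds
    path: str        # file (or line block) touched by the change

def group_into_mts(changes: List[Change], window: int = 200) -> List[List[Change]]:
    """Group changes into MTs: same author, gaps of at most `window` seconds."""
    mts: List[List[Change]] = []
    open_mt: Dict[str, List[Change]] = {}    # most recent MT of each author
    for change in sorted(changes, key=lambda c: c.timestamp):
        current = open_mt.get(change.author)
        if current is not None and change.timestamp - current[-1].timestamp <= window:
            current.append(change)           # still inside the sliding window
        else:
            current = [change]               # gap too large: start a new MT
            mts.append(current)
            open_mt[change.author] = current
    return mts

# Hypothetical change log: two developers, three resulting MTs
history = [
    Change("dev1", 0, "a.c"),
    Change("dev1", 100, "b.c"),
    Change("dev2", 150, "c.c"),
    Change("dev1", 1000, "a.c"),
]
for mt in group_into_mts(history):
    print([c.path for c in mt])
```

    Intersecting each MT with the locations of detected clone fragments would then show whether the fragments of a clone class were changed in the same MT, in nearby MTs, or separately.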
  • 47.
    Submit where? How Clones are Maintained: An Empirical Study [paper first page, as on slide 46] WCRE?
  • 48.
    Submit where? How Clones are Maintained: An Empirical Study [paper first page, as on slide 46] Sorry! I’m WCRE PC co-chair
  • 49.
    Submit where? How Clones are Maintained: An Empirical Study [paper first page, as on slide 46] Let’s try with CSMR, it is in Amsterdam!
  • 50.
  • 51.
    We got accepted! Amsterdam we’re coming
    From: Massimiliano Di Penta <dipenta@unisannio.it>
    Subject: [Fwd: CSMR 2007 Notification]
    Date: 30 Nov 2006 15:28:59 CET
    To: Lerina Aversano <aversano@unisannio.it>, "Luigi Cerulo" <lcerulo@unisannio.it>
    great... here are the reviews ... I actually don't know which of the first and the third is the more negative (maybe the first). The first reviewer's criticism is, all in all, fair, in the sense that it considers the work good even though many things were already known (as was the case, after all, for Godfrey's paper, which despite an A had received some similar comments at WCRE) and this is yet another study.. (maybe with a somewhat greater level of detail)... to be explained better in the camera ready copy …
    Look at this: if people had to follow this rule, nothing would ever get published, not even in TSE ... !!
    General advice: Please submit your paper to a workshop to discuss the setup of your experiments. A submission for a conference should analyse more (>= 10) throughly selected software systems. As you suggest, your clone detection tool is very conservative, and you should perform the analyses with several different tools. Only then, your claim would be sufficiently supported. ….
    Ciao Max
    Amsterdam
  • 54.
  • 55.
  • 60.
  • 61.
    We need to do much better… the classification is not fully automated yet
  • 62.
    Folks, one reviewer was upset! We also need to enlarge the study. More systems, … more…
  • 63.
    It would be great to get a student to help us on the project
  • 64.
    One young student wrote to us asking to spend a few months in our lab..
  • 65.
    Suresh Thummalapenta: at the time a PhD student at NCSU with Tao Xie, now with Microsoft Research
  • 66.
    This is great! Let’s ask Suresh to join forces on this project
  • 67.
    Fine-level automated tracking approach: 1. Identification of clone section pairs evolution; 2. Identification of clone fragment pairs evolution; 3. Identification of clone class evolution. [Diagram: a clone class with clone fragments CF1, CF2, CF3, each split into clone sections CS 1 and CS 2; section pairs and fragment pairs (1,2), (2,3), (1,3) are labeled LP or CO.]
  • 68.
    The Study: Four projects, C and Java. Both token-based and AST-based detectors. Relation of clone evolution patterns with: • Clone granularity • Clone radius • Defect-proneness
  • 69.
    Evolution Patterns [Bar chart, 0% to 80%: share of Consistent, Indep. Evolution, Late Propagation, and Unknown patterns for ArgoUML, JBoss, OpenSSH, and PostgreSQL. Bar labels as extracted: 0%, 0%, 3%, 4%, 16%, 4%, 5%, 7%, 39%, 24%, 52%, 34%, 38%, 71%, 40%, 55%.]
  • 70.
    Late Propagation: Two PostgreSQL functions containing clones. The first underwent a bug fix. The second changed six months later:
 “...I had previously fixed the identical bug in oper_select_candidate, but didn't realize that the same error was repeated over here...”
  • 71.
    Independent Evolution: ArgoUML classes GeneratorJava and GeneratorDisplay contain cloned methods. GeneratorDisplay starts to implement enhanced visualization features. After that, both change independently (no more clones).
  • 72.
    Other Findings: Clone radius and granularity do not influence evolution patterns. Late propagation more correlated to defects than other evolution patterns.
  • 73.
    The EMSE Paper
    Empir Software Eng (2010) 15:1–34, DOI 10.1007/s10664-009-9108-x
    An empirical study on the maintenance of source code clones
    Suresh Thummalapenta · Luigi Cerulo · Lerina Aversano · Massimiliano Di Penta
    Published online: 25 March 2009. © Springer Science + Business Media, LLC 2009. Editor: Murray Wood
    Abstract: Code cloning has been very often indicated as a bad software development practice. However, many studies appearing in the literature indicate that this is not always the case. In fact, either changes occurring in cloned code are consistently propagated, or cloning is used as a sort of templating strategy, where cloned source code fragments evolve independently. This paper (a) proposes an automatic approach to classify the evolution of source code clone fragments, and (b) reports a fine-grained analysis of clone evolution in four different Java and C software systems, aimed at investigating to what extent clones are consistently propagated or they evolve independently. Also, the paper investigates the relationship between the presence of clone evolution patterns and other characteristics such as clone radius, clone size and the kind of change the clones underwent, i.e., corrective maintenance or enhancement.
    Keywords: Software clones · Software maintenance · Mining software repositories · Clone evolution
  • 74.
  • 75.
  • 77.
  • 79.
    Late Propagation, Clone changes, Clones and bugs, Tracking Entities
  • 80.
    Late Propagation, Clone changes, Clones and bugs, Tracking Entities
  • 81.
    Tracking Design Patterns
    An Empirical Study on the Evolution of Design Patterns
    Lerina Aversano, Gerardo Canfora, Luigi Cerulo, Concettina Del Grosso, Massimiliano Di Penta
    RCOST – Research Centre on Software Technology, University of Sannio, Via Traiano, 82100 Benevento, Italy
    aversano@unisannio.it, canfora@unisannio.it, lcerulo@unisannio.it, tina.delgrosso@unisannio.it, dipenta@unisannio.it
    ABSTRACT: Design patterns are solutions to recurring design problems, conceived to increase benefits in terms of reuse, code quality and, above all, maintainability and resilience to changes. This paper presents results from an empirical study aimed at understanding the evolution of design patterns in three open source systems, namely JHotDraw, ArgoUML, and Eclipse-JDT. Specifically, the study analyzes how frequently patterns are modified, to what changes they undergo and what classes co-change with the patterns. Results show how patterns more suited to support the application purpose tend to change more frequently, and that different kind of changes have a different impact on co-changed classes and a different capability of making the system resilient to changes.
    Categories and Subject Descriptors: D.2.2 [Software Engineering]: Design Tools And Techniques, Object-oriented design methods
    General Terms: Design, Experimentation, Measurement
    Keywords: Design patterns, Software Evolution, Mining Software Repositories, Empirical Software Engineering
    1. INTRODUCTION: … some aspect of system structure vary independently of other aspects, thereby making a system more robust to a particular kind of change”. Advantages of design patterns include decoupling a request from specific operations (Chain of Responsibility and Command), making a system independent from software and hardware platforms (Abstract Factory and Bridge), independent from algorithmic solutions (Iterator, Strategy, Visitor), or avoid modifying implementations (Adapter, Decorator, Visitor). Further discussion on design pattern advantages, and extensive pattern catalogues can be found in books such as [11] or [9]. While many benefits related to the use of design patterns have been stated, a little has been done to empirically investigate pattern change proneness [3] or whether there is a relationships between the presence of defects in the source code and the use of design patterns [24]. In particular, there is lack of empirical studies aimed at analyzing what kind of changes each type of pattern undergoes during software evolution, and whether such a change can be related to changes contextually made on other classes not belonging to the pattern. The availability of source repositories for many object-oriented open source systems realized making use of design patterns, of techniques for identifying change sets [10] (i.e., sets of artifacts changed together by the same author) from source code repositories, and of design pattern detection techniques and tools [1, 8, 15, 19, 23], triggers opportunities for this kind of studies. This paper reports and discusses results from an empirical study aimed at analyzing how design patterns change during a software system lifetime, and to what extent such changes cause modifications to other classes not part of the
  • 82.
    Tracking Design Pattern Evolution
    JHotDraw. Patterns: Observer, Composite. Used for: Model View Controller of draws; handling composite figures. Purpose of change: adding new draw elements.
    ArgoUML. Patterns: Adapter-Command, Decorator, Factory. Used for: adapting/decorating UML objects to different views; executing menu actions. Purpose of change: adding new menu actions and presentations.
    Eclipse-JDT. Patterns: Visitor. Used for: visiting Java AST. Purpose of change: adding new code analyses.
  • 83.
    Patterns with More Co-Changed Code [Chart for Eclipse-JDT: number of lines added/removed in co-changed classes (axis 0 to 16000) per pattern: Visitor, Template, State-Strategy, Singleton, Prototype, Observer, Factory, Decorator, Composite, Adapter-Command.]
  • 84.
    Tracking Vulnerabilities
    The life and death of statically detected vulnerabilities: An empirical study
    Massimiliano Di Penta a,*, Luigi Cerulo b, Lerina Aversano a
    a Dept. of Engineering, University of Sannio, Via Traiano, 82100 Benevento, Italy
    b Dept. of Biological and Environmental Studies, University of Sannio, Via Port’Arsa, 11 – 82100 Benevento, Italy
    Information and Software Technology (2009), article in press. © 2009 Elsevier B.V. All rights reserved.
    Keywords: Software vulnerabilities, Mining software repositories, Empirical study
    Abstract: Vulnerable statements constitute a major problem for developers and maintainers of networking systems. Their presence can ease the success of security attacks, aimed at gaining unauthorized access to data and functionality, or at causing system crashes and data loss. Examples of attacks caused by source code vulnerabilities are buffer overflows, command injections, and cross-site scripting. This paper reports on an empirical study, conducted across three networking systems, aimed at observing the evolution and decay of vulnerabilities detected by three freely available static analysis tools. In particular, the study compares the decay of different kinds of vulnerabilities, characterizes the decay likelihood through probability density functions, and reports a quantitative and qualitative analysis of the reasons for vulnerability removals. The study is performed by using a framework that traces the evolution of source code fragments across subsequent commits.
    1. Introduction: Vulnerable instructions are, very often, the cause of serious problems such as security attacks, system failures or crashes. In his Ph.D. thesis [1] Krsul defined a software vulnerability as “an instance of an error in the specification, development, or configuration of software such that its execution can violate the security policy”. For business-critical systems, the presence of vulnerable instructions in the source code is often the cause of security attacks or, in other cases, of system failures or crashes. The problem is particularly rel… Detecting the presence of such instructions is therefore crucial to ensure high security and reliability. Indeed, security advisories are regularly published – see for example those of Linux distributions, Microsoft, those published by CERT, or by securityfocus. These advisories, however, are posted when a problem already occurred in the application, a problem that was very often caused by the introduction in the source code of vulnerable statements. This highlights the needs to identify potential problems when they are introduced, and to keep track of them during the software system lifetime, as it is done, for example for source code clones [2].
  • 85.
  • 86.
  • 87.
  • 88.
    Code Siblings and Licensing
    Code siblings: technical and legal implications of copying code between applications
    Daniel M. German (University of Victoria, Victoria, BC, Canada), Massimiliano Di Penta (RCOST–University of Sannio, Benevento, Italy), Yann-Gaël Guéhéneuc and Giuliano Antoniol (PTIDEJ Team–SOCCER Lab., DGIGL, École Polytechnique de Montréal, QC, Canada)
    dmg@uvic.ca, dipenta@unisannio.it, yann-gael.gueheneuc@polymtl.ca, antoniol@ieee.org
    Abstract: Source code cloning does not happen within a single system only. It can also occur between one system and another. We use the term code sibling to refer to a code clone that evolves in a different system than the code from which it originates. Code siblings can only occur when the source code copyright owner allows it and when the conditions imposed by such license are not incompatible with the license of the destination system. In some situations copying of source code fragments are allowed (legally) in one direction, but not in the other. In this paper, we use clone detection, license mining and classification, and change history techniques to understand how code siblings (under different licenses) flow in one direction or the other between Linux and two BSD Unixes, …
    [Introduction, excerpt] … different operating systems and environments. In all cases, cross-system clones are introduced. Usually, source code is distributed according to the terms of a software license. Once the developer chooses to distribute her work with a particular license, she explicitly imposes limits on what can be done with the code: if and how it can be used, modified, copied, distributed, and extended. Software licenses may prevent or favor the migration of code fragments in one or the other direction, or both. Once having migrated, code fragments evolve constrained by the new environment. In the following, we use the term sibling to refer to a fragment of code that has been cloned from one file in one system to another file in a different system. In some cases, a sibling may span an entire file. Then, we propose an analysis process to identify siblings
  • 89.
    Code Siblings and Licensing [Diagram: cloned fragments in FreeBSD and cloned fragments in Linux linked as siblings, with the migration direction.]
  • 90.
    Preferential migration from OSes with a permissive license (FreeBSD, OpenBSD) towards Linux (mainly GPL)
  • 91.
    Migration From Third-Party Code
    commit a9474917099e007c0f51d5474394b5890111614f
    Author: Sean Hefty <sean.hefty@intel.com>
    Date: Mon Jul 14 23:48:43 2008 -0700
    RDMA: Fix license text
    The license text for several files references a third software license that was inadvertently copied in. Update the license to what was intended. This update was based on a request from HP. [..]
  • 92.
    Blame-based tracking
    Distinguishing Copies from Originals in Software Clones
    Jens Krinke, Nicolas Gold, Yue Jia (King’s College London, Centre for Research on Evolution, Search and Testing (CREST), {jens.krinke,nicolas.gold,yue.jia}@kcl.ac.uk); David Binkley (Loyola University Maryland, Baltimore, MD, USA, binkley@cs.loyola.edu)
    ABSTRACT: Cloning is widespread in today’s systems where automated assistance is required to locate cloned code. Although the evolution of clones has been studied for many years, no attempt has been made so far to automatically distinguish the original source code leading to cloned copies. This paper presents an approach to classify the clones of a clone pair based on the version information available in version control systems. This automatic classification attempts to distinguish the original from the copy. It allows for the fact that the clones may be modified and thus consist of lines coming from different versions. An evaluation, based on two case studies, shows that when comments are ignored and a small tolerance is accepted, for the majority of clone pairs the proposed approach can automatically distinguish between the original and the copy.
    Categories and Subject Descriptors: D.2.9 [Software Engineering]: Management, Software configuration management; D.2.13 [Software Engineering]: Reusable Software, Reusable libraries
    General Terms: Algorithms
    Keywords: Clone detection, mining software archives, software evolution
    1. INTRODUCTION: The duplication of code is a common practice to make software … existing code. However, such practices can complicate software maintenance so it has been suggested that too much cloned code is a risk, albeit the practice itself is not generally harmful [16]. Because of these problems, many approaches to detecting cloned code have been developed [2, 3, 8, 15, 18–20, 24, 26]. While methods to identify clones automatically and efficiently are to some extent understood, it is still disputable whether the presence of clones is a risk. To better understand why and how code is cloned, recent empirical studies of cloned code have focused mainly on examining the evolution of clones, such as whether cloned code is more stable or changed consistently [1,10,12,17,21,22,27]. A lot of research has been done on finding and identifying software clones, but without additional information it is impossible to distinguish the original from the copy. Most of the above mentioned previous empirical studies used version control systems to extract limited information about the discovered clones; for example, when a clone appears in some previous version. However, so far there has been no general approach proposed to distinguish originals from copies except for a study done by German et al. [11] who tracked when clones appeared in the version history to identify the clone of a pair that appeared first. This paper presents an approach that uses line-by-line version information available from version control systems to distinguish the original from the copied code clone in a clone pair. Most version control systems have a ‘blame’ command which shows author and version information for each line in a file. This information, which includes the version when the line was added or last modified, can be used as a line age: if all lines in one clone have older versions than the lines in the other clone of a clone pair, then the clone with the older lines may be the original and the other may be the copy (assuming that the clone with the oldest lines existed
    Cloning and Copying between GNOME Projects
    Jens Krinke, Nicolas Gold, Yue Jia (King’s College London, Centre for Research on Evolution, Search and Testing (CREST), {jens.krinke,nicolas.gold,yue.jia}@kcl.ac.uk); David Binkley (Loyola University Maryland, Baltimore, MD, USA, binkley@cs.loyola.edu)
    Abstract: This paper presents an approach to automatically distinguish the copied clone from the original in a pair of clones. It matches the line-by-line version information of a clone to the pair’s other clone. A case study on the GNOME Desktop Suite revealed a complex flow of reused code between the different subprojects. In particular, it showed that the majority of larger clones (with a minimal size of 28 lines or higher) exist between the subprojects and more than 60% of the clone pairs can be automatically separated into original and copy.
    I. INTRODUCTION: The duplication of code is a common practice to make software development faster, to enable “experimental” devel… is most likely the original and the other the copy. However, usually, it is not that simple because the original and the copy may have been modified in turn after the copy was created. This paper makes the following contributions: • It extends previous work [19] to automatically distinguish between copy and original by allowing the clones of a clone pair to be in different systems. • A case study on the GNOME Desktop Suite subprojects shows that the majority of larger clones (with a minimal size of 28 lines or higher) exist between the subprojects and more than 60% of the clone pairs can be automatically separated automatically into original and copied
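    The line-age idea quoted above (the clone whose lines are all older may be the original) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function name `likely_original` is hypothetical, line ages are assumed to be integer revision numbers as a blame command might report them (smaller means older), and the tolerance the abstract mentions is not modeled.

```python
from typing import List, Optional

def likely_original(ages_a: List[int], ages_b: List[int]) -> Optional[str]:
    """Guess which clone of a pair is the original from per-line revisions.

    ages_a / ages_b: revision number that last touched each line of clone
    A / clone B. Returns "A" or "B" when every line of one clone is older
    than every line of the other, and None when the ages overlap.
    """
    if not ages_a or not ages_b:
        return None                      # nothing to compare
    if max(ages_a) < min(ages_b):
        return "A"                       # all lines of A predate B
    if max(ages_b) < min(ages_a):
        return "B"                       # all lines of B predate A
    return None                          # interleaved ages: undecided

# Hypothetical blame output for two clone pairs
print(likely_original([12, 12, 15], [40, 41, 40]))
print(likely_original([12, 50, 15], [40, 41, 40]))
```

    The second call returns None because one line of the first clone was modified after the second clone appeared, which is the situation the excerpt describes as "not that simple".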
  • 93.
    Smell Evolution
    When and Why Your Code Starts to Smell Bad (and Whether the Smells Go Away)
    Michele Tufano (1), Fabio Palomba (2), Gabriele Bavota (3), Rocco Oliveto (4), Massimiliano Di Penta (5), Andrea De Lucia (2), Denys Poshyvanyk (1)
    (1) The College of William and Mary, Williamsburg, VA, USA; (2) University of Salerno, Fisciano (SA), Italy; (3) Università della Svizzera italiana (USI), Switzerland; (4) University of Molise, Pesche (IS), Italy; (5) University of Sannio, Benevento (BN), Italy
    mtufano@email.wm.edu, fpalomba@unisa.it, gabriele.bavota@usi.ch, rocco.oliveto@unimol.it, dipenta@unisannio.it, adelucia@unisa.it, denys@cs.wm.edu
    Abstract: Technical debt is a metaphor introduced by Cunningham to indicate “not quite right code which we postpone making it right”. One noticeable symptom of technical debt is represented by code smells, defined as symptoms of poor design and implementation choices. Previous studies showed the negative impact of code smells on the comprehensibility and maintainability of code. While the repercussions of smells on code quality have been empirically assessed, there is still only anecdotal evidence on when and why bad smells are introduced, what is their survivability, and how they are removed by developers. To empirically corroborate such anecdotal evidence, we conducted a large empirical study over the change history of 200 open source projects. This study required the development of a strategy to identify smell-introducing commits, the mining of over half a million of commits, and the manual analysis and classification of over 10K of them. Our findings mostly contradict common wisdom, showing that most of the smell instances are introduced when an artifact is created and not as a result of its evolution. At the same time, 80% of smells survive in the system. Also, among the 20% of removed instances, only 9% are removed as a direct consequence of refactoring operations.
    Index Terms: Code Smells, Empirical Study, Mining Software Repositories
    1 INTRODUCTION: The technical debt metaphor introduced by Cunningham [23] explains well the trade-offs between delivering the most appropriate but still immature product, … removed [14]. This represents an obstacle for an effective and efficient management of technical debt. Also, understanding the typical life-cycle of code smells and the actions undertaken by developers to remove them is of paramount importance in the conception of recom…
  • 94.
  • 95.
    When Are Smells Introduced [Chart: commits required for a class to become smelly; axis ticks 0, 25, 50, 75, 100] Generally, blobs affect a class since its creation. There are several cases in which a blob is introduced during maintenance activities.
  • 96.
    Why are smells introduced? [Chart; axis ticks 0, 25, 50, 75, 100] Smells: Blob (BLOB), Class Data Should Be Private (CDSBP), Complex Class (CC), Functional Decomposition (FD), Spaghetti Code (SC). Commit goals: Bug Fixing (BF), Enhancement (E), New Feature (NF), Refactoring (R).
  • 97.
    Smell Removal [Bar chart; axis 0% to 40%] Code Removal 40%, Code Replacement 33%, Code Insertion 15%, Refactoring 9%, Major Restructuring 4%
  • 98.
    Clone changes, Clones and bugs, Tracking Entities, Late Propagation
  • 100.
    Late Propagation in Software Clones. Liliane Barbour, Foutse Khomh, Ying Zou, Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, {l.barbour, foutse.khomh, ying.zou}@queensu.ca
Abstract: Two similar code segments, or clones, form a clone pair within a software system. The changes to the clones over time create a clone evolution history. In this work we study late propagation, a specific pattern of clone evolution. In late propagation, one clone in the clone pair is modified, causing the clone pair to become inconsistent. The code segments are then re-synchronized in a later revision. Existing work has established late propagation as a clone evolution pattern, and suggests that the pattern is related to a high number of faults. In this study we examine the characteristics of late propagation in two long-lived software systems using the Simian and CCFinder clone detection tools. We define 8 types of late propagation and compare them to other forms of clone evolution. Our results not only verify that late propagation is more harmful to software systems, but also establish that some specific cases of late propagations are more harmful than others. Specifically, two cases are most risky: (1) when a clone experiences inconsistent changes and then a re-synchronizing change without any modification to the other clone in a clone pair; and (2) when two clones undergo an inconsistent modification followed by a consistent change that modifies both the clones in a clone pair.
Keywords: clone genealogies; late propagation; fault-proneness.
I. INTRODUCTION A code segment is labeled as a code clone if it is identical or highly similar to another code segment. Similar code segments form a clone pair. Clone pairs can be introduced into systems deliberately (e.g., “copy and paste” actions) or inadvertently by a developer during development … and the new context. For example, if a driver is required for a new printer model, a developer could copy the driver code from an older printer model and then modify it. Inconsistent changes can also occur accidentally. A developer may be unaware of a clone pair, and cause an inconsistency by only changing one half of the clone pair. This inconsistency could cause a software fault. If a fault is found in one clone and fixed, but not propagated to the other clone in the clone pair, the fault remains in the system. For example, a fault might be found in the old printer driver code and fixed, but the fix is not propagated to the new printer driver. For these reasons, previous studies [1] have argued that accidental inconsistent changes make code clones more prone to faults. Late propagation occurs when a clone pair undergoes one or more inconsistent changes followed by a re-synchronizing change [2]. The re-synchronization of the code clones indicates that the gap in consistency is accidental. Since accidental inconsistencies are considered risky [3], the presence of late propagation in clone genealogies can be an indicator of risky, fault-prone code. Many studies have been performed on the evolution of clones. A few (e.g., [2], [3]) have studied late propagation, and indicated that late propagation genealogies are more fault-prone than other clone genealogies. Thummalapenta et al. began the initial work in examining the characteristics of late propagation. The authors measured the delay between an inconsistent change and a re-synchronizing change and related the delay to software faults. In our work, we examine …
More Detailed Genealogy
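The definition of late propagation quoted above (one or more inconsistent changes followed by a re-synchronizing change) can be illustrated with a small Python sketch. It is a simplification made for this transcript, not the detection procedure of the cited study: the function name `late_propagations` and the input encoding (one boolean per revision, True when the two clones of the pair are consistent) are assumptions.

```python
from typing import List, Tuple


def late_propagations(consistent: List[bool]) -> List[Tuple[int, int]]:
    """Find late-propagation episodes in the history of one clone pair.

    consistent[i] is True when the two clones are consistent at revision i.
    Returns (diverged_at, resynced_at) index pairs: the revision where the
    pair first became inconsistent and the later revision where it was
    re-synchronized. A divergence that is never repaired is not reported.
    """
    episodes: List[Tuple[int, int]] = []
    diverged_at = None
    for i, ok in enumerate(consistent):
        if not ok and diverged_at is None:
            diverged_at = i  # start of an inconsistent period
        elif ok and diverged_at is not None:
            episodes.append((diverged_at, i))  # re-synchronizing change
            diverged_at = None
    return episodes
```

The difference `resynced_at - diverged_at` gives the propagation delay in revisions, which is the kind of quantity the delay analysis mentioned in the introduction relates to faults.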
  • 101.
  • 102.
  • 103.
    More Detailed Genealogy: propagation always occurs; propagation may not occur
  • 104.
    More Detailed Genealogy: propagation always occurs; propagation may not occur; propagation never occurs
  • 105.
    Breakdown [Bar chart: percentage of all LP occurrences (0% to 80%) per LP type (LP1 to LP8); series: ArgoUML - Simian, ArgoUML - CCFinder, Ant - Simian, Ant - CCFinder]
  • 106.
    Breakdown [Bar chart: percentage of all LP occurrences (0% to 80%) per LP type (LP1 to LP8); series: ArgoUML - Simian, ArgoUML - CCFinder, Ant - Simian, Ant - CCFinder; groups annotated "May not occur" and "Never occurs"]
  • 107.
    Faults by LP Type [Bar chart: percentage of fault occurrences (0% to 80%) per LP type (LP1 to LP8); series: Ant - Simian, ArgoUML - CCFinder, Ant - CCFinder]
  • 108.
    LP In Type-3 Clones
Late propagation of Type-3 Clones. Saman Bazrafshan, Universität Bremen, saman.bazrafshan@informatik.uni-bremen.de
Abstract: Type-3 clones are duplicated source code fragments that span two or more identical sequences of tokens (whitespace and comments are ignored) that form a contiguous source code fragment interrupted by non-identical token sequences. Several studies on the evolution of code clones have been conducted to detect patterns that can help to manage clones [3,6]. One of those patterns that is assumed to be of special interest is late propagation [1,2,4]. In this paper, ways of detecting late propagation in the evolution of type-3 clones are proposed and discussed.
1 Introduction During the last years, different studies focused on detecting clone patterns that are considered to have a negative impact on code quality and therefore on maintainability of software. Missing or inconsistent propagation of changes to clones is identified as one pattern that may introduce new defects or prevent the removal of existing ones. To find these clone patterns and enable clone management, a series of tools have been introduced, including clone detectors and clone genealogy extractors. Clones reported by a clone detector are generally distinguished according to their level of similarity. Clones that are identical except for comments and whitespaces are called type-1 clones. Type-2 clones extend type-1 clones by tolerating dif- … intentionally changed inconsistently [1,2,4].
2 Late Propagation of Near-Miss Clones The definition of a late propagation regarding identical clones is straightforward: an inconsistent modification of an identical clone causing the fragments to be non-identical until another inconsistent change to the fragments makes them identical again. However, the definition is not suitable for near-miss clones because they are not completely identical; changes between the identical and the non-identical parts have to be differentiated. The challenging question that arises from this fact is: What are the essential characteristics of a change that makes an inconsistent change to a near-miss clone consistent at a later point of time? One way to define the late propagation pattern for near-miss clones is to focus exclusively on the identical parts of a clone disregarding the gaps as the gaps are already not common between the cloned fragments. In this case, we would regard a near-miss clone to be changed consistently if the identical parts undergo the same modifications and continue to be identical, analogously to the definition of a late propagation of identical clones. Hence, to recognize an inconsistent …
ECEASST. Late Propagation in Near-Miss Clones: An Empirical Study. Manishankar Mondal1, Chanchal K. Roy2, Kevin A. Schneider3. 1 mshankar.mondal@usask.ca, https://homepage.usask.ca/~mam815/ 2 croy@cs.usask.ca, http://www.cs.usask.ca/~croy/ 3 kevin.schneider@usask.ca, http://www.cs.usask.ca/~kas/ University of Saskatchewan, Canada
Abstract: If two or more code fragments in the code-base of a software system are exactly or nearly similar to one another, we call them code clones. It is often important that updates (i.e., changes) in one clone fragment should be propagated to the other similar clone fragments to ensure consistency. However, if there is a delay in this propagation because of unawareness, the system might behave inconsistently. This delay in propagation, also known as late propagation, has been investigated by a number of existing studies. However, the existing studies did not investigate the intensity as well as the effect of late propagation in different types of clones separately. Also, late propagation in Type 3 clones is yet to investigate. In this research work we investigate late propagation in three types of clones (Type 1, Type 2, and
  • 109.
    LP In Type-3 Clones (Bazrafshan, Late propagation of Type-3 Clones; Mondal, Roy, Schneider, Late Propagation in Near-Miss Clones: An Empirical Study). More late propagations in type-3 clones than in others
  • 110.
    LP In Type-3 Clones (Bazrafshan, Late propagation of Type-3 Clones; Mondal, Roy, Schneider, Late Propagation in Near-Miss Clones: An Empirical Study). More late propagations in type-3 clones than in others. Late propagations occur in small (block-size) clones
  • 111.
    A Study of Consistent and Inconsistent Changes to Code Clones. Jens Krinke, FernUniversität in Hagen, Germany, krinke@acm.org
Abstract: Code Cloning is regarded as a threat to software maintenance, because it is generally assumed that a change to a code clone usually has to be applied to the other clones of the clone group as well. However, there exists little empirical data that supports this assumption. This paper presents a study on the changes applied to code clones in open source software systems based on the changes between versions of the system. It is analyzed if changes to code clones are consistent to all code clones of a clone group or not. The results show that usually half of the changes to code clone groups are inconsistent changes. Moreover, the study observes that when there are inconsistent changes to a code clone group in a near version, it is rarely the case that there are additional changes in later versions such that the code clone group then has only consistent changes.
1 Introduction Duplicated code is common in all kind of software systems. Although cut-copy-paste (-and-adapt) techniques are considered bad practice, every programmer uses them. Since these practices involve both duplication and modification, they are collectively called code cloning. While the duplicated code is called a code clone. A clone group … whether or not the above mentioned problems are relevant in practice. Kim et al. [15] investigated the evolution of code clones and provided a classification for evolving code clones. Their work already showed that during the evolution of the code clones, consistent changes to the code clones of a group are fewer than anticipated. Aversano et al. [4] did a similar study and they state “that the majority of clone classes is always maintained consistently.” Geiger et al. [10] studied the relation of code clone groups and change couplings (files which are committed at the same time, by the same author, and with the same modification description), but could not find a (strong) relation. Therefore, this work will present an empirical study that verifies the following hypothesis: During the evolution of a system, code clones of a clone group are changed consistently. Of course, a system may contain bugs where a change has been applied to some code clones, but has been forgotten for other code clones of the clone group. For stable systems it can be assumed that such bugs will be resolved at a later time. This results in a second hypothesis: During the evolution of a system, if code clones of a clone group are not changed consistently, the missing changes will appear in a later version.
ECEASST. Studying Late Propagations in Code Clone Evolution Using Software Repository Mining. Hsiao Hui Mui1, Andy Zaidman1 and Martin Pinzger1. 1 hsiaomui@gmail.com, a.e.zaidman@tudelft.nl, Software Engineering Research Group, Delft University of Technology, the Netherlands. 2 martin.pinzger@aau.at, Software Engineering Research Group, University of Klagenfurt, Austria
Abstract: In the code clone evolution community, the Late Propagation (LP) has been identified as one of the clone evolution patterns that can potentially lead to software defects. An LP occurs when instances of a clone pair are changed consistently, but not at the same time. The clone instance, which receives the update at a later time, might exhibit unintended behavior if the modification was a bugfix. In this paper, we present an approach to extract LPs from software repositories. Sub- …
Inconsistent? LP?
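The consistent versus inconsistent distinction behind the first hypothesis can be sketched as follows. This is a deliberately coarse approximation assumed for illustration: it only checks which clones of a group were touched by a change, whereas the study quoted above compares the changes themselves. The function name `classify_change` and the string identifiers for clones are hypothetical.

```python
from typing import AbstractSet


def classify_change(group: AbstractSet[str], changed: AbstractSet[str]) -> str:
    """Classify one change with respect to a clone group.

    group   -- identifiers of the clones that form the clone group
    changed -- identifiers of all code fragments touched by the change
    Returns "consistent" if every clone of the group was touched,
    "inconsistent" if only some were, and "unchanged" if none were.
    """
    touched = group & changed
    if not touched:
        return "unchanged"
    if touched == set(group):
        return "consistent"
    return "inconsistent"
```

Under this encoding, the second hypothesis would be checked by looking for a later change that touches exactly the clones missed by an earlier inconsistent one.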
  • 112.
    A Study of Consistent and Inconsistent Changes to Code Clones (Krinke); Studying Late Propagations in Code Clone Evolution Using Software Repository Mining (Mui, Zaidman, Pinzger). Consistent changes occur half of the time. Inconsistent? LP?
  • 113.
    A Study of Consistent and Inconsistent Changes to Code Clones (Krinke); Studying Late Propagations in Code Clone Evolution Using Software Repository Mining (Mui, Zaidman, Pinzger). LP seldom occurs, and most of them re-synchronize within one day. Consistent changes occur half of the time. Inconsistent? LP?
  • 114.
    Clones and bugs, Tracking Entities, Late Propagation, Clone changes
  • 116.
    Release Level Analysis
    Science of Computer Programming, journal homepage: www.elsevier.com/locate/scico
    An empirical study on inconsistent changes to code clones at the release level
    Nicolas Bettenburg, Weiyi Shang, Walid M. Ibrahim, Bram Adams, Ying Zou, Ahmed E. Hassan. Queen’s University, Kingston, Ontario, Canada
    Article history: Available online xxxx. Keywords: Software engineering, Maintenance management, Reuse models, Clone detection, Maintainability, Software evolution
    Abstract: To study the impact of code clones on software quality, researchers typically carry out their studies based on fine-grained analysis of inconsistent changes at the revision level. As a result, they capture much of the chaotic and experimental nature inherent in any ongoing software development process. Analyzing highly fluctuating and short-lived clones is likely to exaggerate the ill effects of inconsistent changes on the quality of the released software product, as perceived by the end user. To gain a broader perspective, we perform an empirical study on the effect of inconsistent changes on software quality at the release level. Based on a case study on three open source software systems, we observe that only 1.02%–4.00% of all clone genealogies introduce software defects at the release level, as opposed to the substantially higher percentages reported by previous studies at the revision level. Our findings suggest that clones do not have a significant impact on the post-release quality of the studied systems, and that the developers are able to effectively manage the evolution of cloned code. © 2010 Elsevier B.V. All rights reserved.
    1. Introduction: Code clones are the source of heated debates among software maintenance researchers. Developers typically clone (copy) existing pieces of code in order to jumpstart the development of a new feature, or to reuse robust parts of the source code for new development. However, unless a clone is reused as is, developers quickly lose track of the link between the clone and the cloned piece of code, especially after some local modifications. Losing the links between clones increases the risk of inconsistent changes. These are code changes that are applied to only one clone, whereas they should propagate to all clones, such as defect fixing changes. There is no consensus on whether the positive traits of cloning, such as effective reuse, outweigh its drawbacks, such as increased risk of deteriorated software quality. Many researchers consider clones to be harmful [3,6,14,21,22,27,36], due to the belief that inconsistent changes increase both maintenance effort and the likelihood of introducing defects. Yet, other researchers do not find empirical evidence of harm [39,47], or even establish cloning as a valuable software engineering method to overcome language limitations or to specialize common parts of the code [10,24–26]. It is not yet clear which of these two visions prevails, or whether the right vision depends on the software system at hand [15,43,47]. Empirical studies on code clones almost exclusively focus on the impact of cloning on developers, such as the developers’ ability to keep track of all related clones in a clone group and their ability to consistently propagate changes to all clones. Many studies analyze inconsistent changes to clones and the general evolution (genealogy) of clone groups across very small …
    Evaluating Code Clone Genealogies at Release Level: An Empirical Study
    Ripon K. Saha, Muhammad Asaduzzaman, Minhaz F. Zibran, Chanchal K. Roy, and Kevin A. Schneider. Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada S7N 5C9. {ripon.saha, md.asad, minhaz.zibran, chanchal.roy, kevin.schneider}@usask.ca
    Abstract: Code clone genealogies show how clone groups evolve with the evolution of the associated software system, and thus could provide important insights on the maintenance implications of clones. In this paper, we provide an in-depth empirical study for evaluating clone genealogies in evolving open source systems at the release level. We develop a clone genealogy extractor, examine 17 open source C, Java, C++ and C# systems of diverse varieties and study different dimensions of how clone groups evolve with the evolution of the software systems. Our study shows that majority of the clone groups of the clone genealogies either propagate without any syntactic changes or change consistently in the subsequent releases, and that many of the genealogies remain alive during the evolution. These findings seem to be consistent with the findings of a previous study that clones may not be as detrimental in software maintenance as believed to be (at least by many of us), and that instead of aggressively refactoring clones, we should possibly focus on tracking and managing clones during the evolution of software systems.
    … an essential part of software maintenance. However, due to the intense use of template-based programming [12], a certain amount of clones are likely acceptable. Previous studies were highly influenced by the idea that clones are harmful and can be removed through refactoring [15]. This notion has been challenged by the work of Kim et al. [15]. They provided a clone genealogy model and analyzed the clone genealogies of two open source software systems. While a clone group consists of a set of code fragments in a particular version of a software that are clones to each other, a genealogy of a clone group describes how the code fragments of that clone group propagate during the evolution of the subject system. Each clone genealogy consists of a set of clone lineages that originate from the same clone group (source). A clone lineage is a directed acyclic graph that describes the evolution history of a clone group from the beginning to the final release of the software system. The empirical study described by Kim et al. on code clone genealogy reveals that clones are not always harmful. Programmers intentionally practice code cloning to achieve certain benefits [12, 13]. During the development of a software system, many clones are short lived. Refactoring them aggressively can …
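    A minimal sketch of the clone group and clone lineage vocabulary quoted above, under loud assumptions: the lineage is simplified from a directed acyclic graph to a linear chain of releases, fragments count as clones only when their normalized text is identical, and the pattern names (SAME, CONSISTENT CHANGE, INCONSISTENT CHANGE) are borrowed from the clone genealogy figure shown earlier in the deck. The class names, fragment ids and function names are hypothetical and do not come from any of the cited tools.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    SAME = "same"
    CONSISTENT_CHANGE = "consistent change"
    INCONSISTENT_CHANGE = "inconsistent change"


@dataclass
class CloneGroup:
    """One clone group in one release (hypothetical model, not a tool's API)."""
    release: str
    fragments: dict  # hypothetical fragment id -> normalized source text


def classify_step(old: CloneGroup, new: CloneGroup) -> Pattern:
    """Classify how a clone group evolved between two consecutive releases."""
    shared = old.fragments.keys() & new.fragments.keys()
    changed = {fid for fid in shared if old.fragments[fid] != new.fragments[fid]}
    if not changed:
        return Pattern.SAME
    # Assumption: a change is consistent when every surviving fragment was
    # edited and the fragments are still textually identical to each other.
    still_identical = len({new.fragments[fid] for fid in shared}) == 1
    if changed == shared and still_identical:
        return Pattern.CONSISTENT_CHANGE
    return Pattern.INCONSISTENT_CHANGE


def classify_lineage(lineage: list) -> list:
    """Label each release-to-release step of a (linearized) clone lineage."""
    return [classify_step(a, b) for a, b in zip(lineage, lineage[1:])]
```

    Working at release granularity, as the two studies on this slide do, simply means the list passed to classify_lineage holds one CloneGroup per release rather than one per revision.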
  • 117.
    Release Level Analysis
    Most of the clone inconsistent changes are not visible at release level
  • 118.
    Risks for Clone Changes
    Frequency and Risks of Changes to Clones
    Nils Göde, University of Bremen, Bremen, Germany, nils@informatik.uni-bremen.de. Rainer Koschke, University of Bremen, Bremen, Germany, koschke@informatik.uni-bremen.de
    ABSTRACT: Code Clones—duplicated source fragments—are said to increase maintenance effort and to facilitate problems caused by inconsistent changes to identical parts. While this is certainly true for some clones and certainly not true for others, it is unclear how many clones are real threats to the system’s quality and need to be taken care of. Our analysis of clone evolution in mature software projects shows that most clones are rarely changed and the number of unintentional inconsistent changes to clones is small. We thus have to carefully select the clones to be managed to avoid unnecessary effort managing clones with no risk potential.
    Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—restructuring, reverse engineering, and reengineering. General Terms: Experimentation, Measurement. Keywords: Software maintenance, clone detection, clone evolution
    1. INTRODUCTION: Code clones are similar fragments of source code. There are many problems caused by the presences of clones. Among others, the source code becomes larger, change effort in- …
    There certainly exist clones that are true threats to software maintenance. Nevertheless, recent research [19, 20] doubts the harmfulness of clones in general and lists numerous situations in which clones are a reasonable design decision. From the clone management perspective, it is desirable to detect and manage only the harmful clones, because managing clones that have no negative effects creates only additional effort. Unfortunately, state-of-the-art clone tools detect and classify clones based only on similar structures in the source code or one of its various representations. When it comes to clone-related problems, however, the most important characteristic of a clone is its change behavior and not its structure. Only if a clone changes, it causes additional change effort. Only if a clone changes, unintentional inconsistencies can arise. If, on the other hand, a clone never changes, there are no additional costs induced by propagating changes and there is no risk of unwanted inconsistencies. Our hypothesis is that many clones detected by state-of-the-art tools are “structurally interesting” but irrelevant to software maintenance because they never change during their lifetime. Up-to-date clone detectors can efficiently process and detect clones within huge amounts of source code, consequently delivering huge numbers of clones. In contrast, clone assessment and deciding how to proceed can be very costly even for individual clones as we have experienced with clones in our own code [11]. Hence, having many unproblematic clones in the detection results creates enormous overhead for assessing and managing clones that do not threaten maintenance because they never change.
  • 119.
    Risks for Clone Changes
    Inconsistent changes are often intentional
  • 120.
    Risks for Clone Changes
    Inconsistent changes are often intentional
    Worthless to plan clone maintenance where not needed
  • 121.
  • 122.
  • 123.
    Empir Software Eng (2012) 17:503–530, DOI 10.1007/s10664-011-9195-3
    Clones: what is that smell?
    Foyzur Rahman · Christian Bird · Premkumar Devanbu. Published online: 24 December 2011. © Springer Science+Business Media, LLC 2011. Editors: Jim Whitehead and Tom Zimmermann
    Abstract: Clones are generally considered bad programming practice in software engineering folklore. They are identified as a bad smell (Fowler et al. 1999) and a major contributor to project maintenance difficulties. Clones inherently cause code bloat, thus increasing project size and maintenance costs. In this work, we try to validate the conventional wisdom empirically to see whether cloning makes code more defect prone. This paper analyses the relationship between cloning and defect proneness. For the four medium to large open source projects that we studied, we find that, first, the great majority of bugs are not significantly associated with clones. Second, we find that clones may be less defect prone than non-cloned code. Third, we find little evidence that clones with more copies are actually more error prone. Fourth, we find little evidence to support the claim that clone groups that span more than one file or directory are more defect prone than collocated clones. Finally, we find that developers do not need to put a disproportionately higher effort to fix clone dense bugs. Our findings do not support the claim that clones are really a “bad smell” (Fowler et al. 1999). Perhaps we can clone, and breathe easily, at the same time.
    Keywords: Empirical software engineering · Software maintenance · Software clone · Software quality · Software evolution
  • 124.
    Clones: what is that smell?
    Most of defect-prone code (>80%) does not contain clones
  • 125.
    Clones: what is that smell?
    Most of defect-prone code (>80%) does not contain clones
    Large clones have lower defect density
  • 126.
    Clones: what is that smell?
    Most of defect-prone code (>80%) does not contain clones
    Large clones have lower defect density
    Amount of changes to fix bugs is smaller for clones
  • 127.
    Duplicate bugs in clones
    Bug Replication in Code Clones: An Empirical Study
    Judith F. Islam, Manishankar Mondal, Chanchal K. Roy. Department of Computer Science, University of Saskatchewan, Canada. {judith.islam, mshankar.mondal, chanchal.roy}@usask.ca
    2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering
    Abstract: Code clones are exactly or nearly similar code fragments in the code-base of a software system. Existing studies show that clones are directly related to bugs and inconsistencies in the code-base. Code cloning (making code clones) is suspected to be responsible for replicating bugs in the code fragments. However, there is no study on the possibilities of bug-replication through cloning process. Such a study can help us discover ways of minimizing bug-replication. Focusing on this we conduct an empirical study on the intensities of bug-replication in the code clones of the major clone-types: Type 1, Type 2, and Type 3. According to our investigation on thousands of revisions of six diverse subject systems written in two different programming languages, C and Java, a considerable proportion (i.e., up to 10%) of the code clones can contain replicated bugs. Both Type 2 and Type 3 clones have higher tendencies of having replicated bugs compared to Type 1 clones. Thus, Type 2 and Type 3 clones are more important from clone management perspectives. The extent of bug-replication in the buggy clone classes is generally very high (i.e., 100% in most of the cases). We also find that overall 55% of all the bugs experienced by the code clones can be replicated bugs. Our study shows that replication of bugs through cloning is a common phenomenon. Clone fragments having method-calls and if-conditions should be considered for refactoring with high priorities, because such clone fragments have high possibilities of containing replicated bugs. We believe that our findings are important for better maintenance of software systems, in particular, systems with code clones.
    I. INTRODUCTION: If two or more code fragments in a software system’s code-base are exactly or nearly similar to one another we call them code clones [44], [45]. A group of similar code fragments forms a clone class. Code clones are mainly created because of the frequent copy/paste activities of the programmers during software development and maintenance. Whatever may be the reasons behind cloning, code clones are of great importance from the perspectives of software maintenance and evolution [44].
    … fragment contains a bug and a programmer copies that code fragment to several other places in the code-base without the knowledge of the existing bug, the bug in the original fragment gets replicated. Fixing of such replicated bugs may require increased maintenance effort and cost for software systems. However, although cloning is suspected to be responsible for replicating bugs, there is no study on the possibilities of bug-replication through cloning. Such a study can provide us helpful insights for minimizing bug-replication as well as for prioritizing code clones for refactoring or tracking. Focusing on this we conduct an in-depth empirical study regarding bug-replication in the code clones of the major clone-types: Type 1, Type 2, Type 3. We conduct our empirical study on thousands of revisions of six diverse subject systems written in two different programming languages (Java and C). We detect code clones from each of the revisions of a subject system using the NiCad [6] clone detector, analyze the evolution history of these code clones, and investigate whether and to what extent they contain replicated bugs. We answer four important research questions (Table I) regarding the intensity and cause of bug-replication through our investigation. According to our investigation involving rigorous manual analysis we can state that: (1) A considerable percentage of the code clones can be related to bug-replication. According to our observation up to 10% of the code clones in a software system can contain replicated bugs. (2) Both Type 2 and Type 3 clones have higher possibilities of containing replicated bugs compared to Type 1 clones. Thus, Type 2 and Type 3 clones should be given higher priorities for management. (3) A considerable proportion (around 55%) of the bugs occurred in code clones can be replicated bugs. (4) Most of the replicated bugs are related to the method- …
  • 128.
    Duplicate bugs in clones
    Over half of bugs occurring in clones are duplicated bugs
  • 129.
  • 131.
  • 132.
    Actually, it does not happen so often Late propagations for type-3 clones
  • 133.
    Actually, it does not happen so often Many clone genealogies Late propagations for type-3 clones
  • 134.
    Actually, it does not happen so often Many clone genealogies Consistent if we look at release level Late propagations for type-3 clones
  • 135.
    Late propagation is highly correlated with defects Actually, it does not happen so often Many clone genealogies Consistent if we look at release level Late propagations for type-3 clones
  • 136.
    Late propagation is highly correlated with defects Actually, it does not happen so often Many clone genealogies Consistent if we look at release level Late propagations for type-3 clones But no more than defects in non-cloned code
  • 138.
    We now have data, infrastructure and computational power for larger, better studies
  • 139.
    Comparing Approaches
    Comparative Stability of Cloned and Non-cloned Code: An Empirical Study. Manishankar Mondal1, Chanchal K. Roy1, Md. Saidur Rahman1, Ripon K. Saha1, Jens Krinke2, Kevin A. Schneider1. 1 Department of Computer Science, University of Saskatchewan, Canada. 2 University College London, UK. 1 {mshankar.mondal, chanchal.roy, saeed.cs, ripon.saha, kevin.schneider}@usask.ca 2 j.krinke@ucl.ac.uk
    ABSTRACT Code cloning is a controversial software engineering practice due to contradictory claims regarding its effect on software maintenance. Code stability is a recently introduced measurement technique that has been used to determine the impact of code cloning by quantifying the changeability of a code region. Although most of the existing stability analysis studies agree that cloned code is more stable than non-cloned code, the studies have two major flaws: (i) each study only considered a single stability measurement (e.g., lines of code changed, frequency of change, age of change); and, (ii) only a small number of subject systems were analyzed and these were of limited variety. In this paper, we present a comprehensive empirical study on code stability using three different stability measuring methods. We use a recently introduced hybrid clone detection tool, NiCAD, to detect the clones and analyze their stability in four dimensions: by clone type, by measuring method, by programming language, and by system size and age. Our four-dimensional investigation on 12 diverse subject systems written in three programming languages considering three clone types reveals that: (i) Type-1 and Type-2 clones are unstable, but Type-3 clones are not; (ii) clones in Java and C systems are not as stable as clones in C# systems; (iii) a system’s development strategy might play a key role in defining its comparative code stability scenario; and, (iv) cloned and non-cloned regions of a subject system do not follow a consistent change pattern.
    Categories and Subject Descriptors D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—Restructuring, Reverse Engineering and
    Keywords Code Stability; Modification Frequency; Average Last Change Date; Average Age; Clone Types
    1. INTRODUCTION Frequent copy-paste activity by programmers during software development is common. Copying a code fragment from one location and pasting it to another location with or without modifications cause multiple copies of exact or closely similar code fragments to co-exist in software systems. These code fragments are known as clones. Whatever may be the reasons behind cloning, the impact of clones on software maintenance and evolution is of great concern. The common belief is that, the presence of duplicate code poses additional challenges to software maintenance by making inconsistent changes more difficult, introducing bugs and as a result increasing maintenance efforts. From this point of view, some researchers have identified clones as “bad smells” and their studies showed that clones have negative impact on software quality and maintenance [7, 14, 15]. On the other hand, there has been a good number of empirical evidence in favour of clones concluding that clones are not harmful [1, 6, 9, 10, 18]. Instead, clones can be useful from different points of views [8]. A widely used term to assess the impact of clones on software maintenance is stability [6, 11, 12, 14]. Because if cloned code is more stable (changes less frequently) as compared to non-cloned code during software evolution, it can be concluded that cloned code does not significantly increase maintenance efforts. Different researchers have defined and evaluated stability from different viewpoints which can be broadly divided into two categories: (1) Stability measurement in terms of changes:
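The idea that cloned code is "more stable" when it changes less frequently than non-cloned code can be illustrated with a small sketch. This is not the metric defined in the paper: the `RevisionStats` record, its field names, and the simple "changed lines per line of code" ratio are assumptions chosen only to make the comparison concrete.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RevisionStats:
    """Hypothetical per-revision counts; field names are illustrative only."""
    cloned_loc: int          # lines of cloned code present in the revision
    noncloned_loc: int       # lines of non-cloned code present in the revision
    cloned_changed: int      # cloned lines modified in the revision
    noncloned_changed: int   # non-cloned lines modified in the revision


def change_rates(revisions: Iterable[RevisionStats]) -> Tuple[float, float]:
    """Return (cloned, non-cloned) changed lines per line of code over all revisions."""
    cloned_loc = noncloned_loc = 0
    cloned_changed = noncloned_changed = 0
    for rev in revisions:
        cloned_loc += rev.cloned_loc
        noncloned_loc += rev.noncloned_loc
        cloned_changed += rev.cloned_changed
        noncloned_changed += rev.noncloned_changed
    # Guard against empty histories or systems without one of the regions.
    cloned_rate = cloned_changed / cloned_loc if cloned_loc else 0.0
    noncloned_rate = noncloned_changed / noncloned_loc if noncloned_loc else 0.0
    return cloned_rate, noncloned_rate


if __name__ == "__main__":
    history = [RevisionStats(100, 900, 5, 9), RevisionStats(100, 900, 5, 9)]
    cloned, noncloned = change_rates(history)
    verdict = "less" if cloned < noncloned else "not less"
    print(f"cloned={cloned:.3f} non-cloned={noncloned:.3f}: cloned code is {verdict} frequently changed")
```

Normalizing by region size matters here, because cloned code is usually a small fraction of a system and raw change counts would make it look stable by default.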
  • 140.
    Genealogy Extractors
    An Automatic Framework for Extracting and Classifying Near-Miss Clone Genealogies. Ripon K. Saha, Chanchal K. Roy, Kevin A. Schneider. Department of Computer Science, University of Saskatchewan, Canada. {ripon.saha, chanchal.roy, kevin.schneider}@usask.ca
    Abstract—Extracting code clone genealogies across multiple versions of a program and classifying them according to their change patterns underlies the study of code clone evolution. While there are a few studies in the area, the approaches do not handle near-miss clones well and the associated tools are often computationally expensive. To address these limitations, we present a framework for automatically extracting both exact and near-miss clone genealogies across multiple versions of a program and for identifying their change patterns using a few key similarity factors. We have developed a prototype clone genealogy extractor, applied it to three open source projects including the Linux Kernel, and evaluated its accuracy in terms of precision and recall. Our experience shows that the prototype is scalable, adaptable to different clone detection tools, and can automatically identify evolution patterns of both exact and near-miss clones by constructing their genealogies. Index Terms—clone genealogy extractor; mapping; clone evolution.
    I. INTRODUCTION The investigation and analysis of code clones has attracted considerable attention from the software engineering research community in recent years. Researchers have presented evidence that code clones have both positive [10], [22] and negative [16] consequences for maintenance activities and thus, in general, code clones are neither good nor bad. It is also not possible or practical to eliminate certain clone classes from a software system [10]. Consequently, the identification […] logs provided by source code repositories such as svn. In the third approach [15], [6], clones are mapped during clone detection based on source code changes between revisions. A combination of the first and second approaches has also been used in some studies [3]. Although intuitive, each of these approaches has some limitations. In the first approach, a number of the similarity metrics used to map clones have quadratic time complexities [9]. In addition, if a clone fragment changes significantly in the next version and goes beyond the given similarity threshold of the clone genealogy extractor, a mapping may not be identified. In the second approach, only clones identified in the first version are mapped. Therefore, we do not know what happens to clones introduced in later versions. The third approach (“incremental approach”) avoids some of the limitations of the previous two approaches by combining detection and mapping, and works well for mapping clones in many versions. By integrating clone detection and clone mapping this approach can be faster than the approaches that require clone detection to be conducted separately for each version. Although this incremental approach is fast enough both for detection and mapping for a given set of revisions, it might not be as beneficial at the release level [6] because there might be a significant difference between the releases. Furthermore, in the sole available incremental tool, iClones
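The limitations of similarity-based mapping mentioned above (quadratic comparisons, and lost mappings once a fragment drifts past the threshold) can be seen in a minimal sketch. It is not the extractor from the paper: the function name `map_clones`, the dictionary layout, the 0.7 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity metric are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def map_clones(
    old: Dict[str, str],
    new: Dict[str, str],
    threshold: float = 0.7,   # assumed cut-off, not taken from the paper
) -> Dict[str, Optional[str]]:
    """Map each clone fragment of the old version to its most similar
    fragment in the new version, or to None if nothing reaches the threshold.

    Every old fragment is compared with every new fragment, so the number
    of comparisons grows as len(old) * len(new).
    """
    mapping: Dict[str, Optional[str]] = {}
    for old_id, old_text in old.items():
        best_id: Optional[str] = None
        best_score = threshold
        for new_id, new_text in new.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= best_score:
                best_id, best_score = new_id, score
        mapping[old_id] = best_id
    return mapping


if __name__ == "__main__":
    version_1 = {"f1": "int a = b + c;\nreturn a;", "f2": "ZZZZQQQQ"}
    version_2 = {"g1": "int a = b + d;\nreturn a;", "g2": "completely different text here"}
    # f1 maps to g1 (one token changed); f2 has no counterpart and maps to None.
    print(map_clones(version_1, version_2))
```

A fragment that maps to `None` is indistinguishable, in this scheme, from a fragment that was deleted, which is exactly the ambiguity the slide's source text points out.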
  • 141.
  • 142.
    Clone Detection In Modern IDEs
    https://blogs.msdn.microsoft.com/zainnab/2012/06/28/visual-studio-2012-new-features-code-clone-analysis/
  • 143.
    Clone Tracking Should be also Put In The Practice
    Clone Region Descriptors: Representing and Tracking Duplication in Source Code. EKWA DUALA-EKOKO and MARTIN P. ROBILLARD, McGill University. Source code duplication, commonly known as code cloning, is considered an obstacle to software maintenance because changes to a cloned region often require consistent changes to other regions of the source code. Research has provided evidence that the elimination of clones may not always be practical, feasible, or cost-effective. We present a clone management approach that describes clone regions in a robust way that is independent from the exact text of clone regions or their location in a file, and that provides support for tracking clones in evolving software. Our technique relies on the concept of abstract clone region descriptors (CRDs), which describe clone regions using a combination of their syntactic, structural, and lexical information. We present our definition of CRDs, and describe a clone tracking system capable of producing CRDs from the output of different clone detection tools, notifying developers of modifications to clone regions, and supporting updates to the documented clone relationships. We evaluated the performance and usefulness of our approach across three clone detection tools and five subject systems, and the results indicate that CRDs are a practical and robust representation for tracking code clones in evolving software. Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement General Terms: Design, Experimentation Additional Key Words and Phrases: Source code duplication, code clones, clone detection, refactoring, clone management ACM Reference Format: Duala-Ekoko, E. and Robillard, M. P. 2010. Clone region descriptors: Representing and tracking duplication in source code. ACM Trans. Softw. Eng. Methodol. 20, 1, Article 3 (June 2010), 31 pages. DOI = 10.1145/1767751.1767754 http://doi.acm.org/10.1145/1767751.1767754
    Applying Clone Change Notification System into an Industrial Development Process. Yuki Yamanaka∗, Eunjong Choi∗, Norihiro Yoshida†, Katsuro Inoue∗, Tateki Sano‡. ∗ Graduate School of Information Science and Technology, Osaka University, Japan {y-yuuki, ejchoi, inoue}@ist.osaka-u.ac.jp † Graduate School of Information Science, Nara Institute of Science and Technology, Japan yoshida@is.naist.jp ‡ Software Process Innovation and Standardization Division, NEC Corporation, Japan t-sano@cp.jp.nec.com
    Abstract—Programmers tend to write code clones unintentionally even in the case that they can easily avoid them. Clone change management is one of crucial issues in open source software (OSS) development as well as in industrial software development (e.g., development of social infrastructure, financial system, and medical equipment). When an industrial developer fixes a defect, he/she has to find the code clones corresponding to the code fragment including it. So far, several studies performed on the analysis of clone evolution in OSS. However, to our knowledge, a few researches have been reported on an application of a clone change notification system to industrial development process. In this paper, we introduce a system for notifying creation and change of code clones, and then report on the experience with 40-days application of it into a development process in NEC Corporation. In the industrial application, a developer successfully identified ten unintentionally-developed clones that should be refactored. Index Terms—Code Clone, Software Maintenance, Refactoring
    I. INTRODUCTION A code clone is a code fragment that has similar or identical code fragments in source code. Many code clone detection […] Because the team plans long-time maintenance as well as reuse for other system developments, the developers are highly motivated to merge code clones into a single module. However, the cost of refactoring cannot be ignored especially in industry. Regression test after refactoring takes much cost to preserve behavior after refactoring. The development team at NEC also considers the cost of refactoring. Basically, they do not touch source code after large-scale system test for releasing major version of the software because refactoring after large-scale test leads the re-performance of such costly test. Therefore, they need to know newly-appeared clones regularly, especially before large-scale system test. In this paper, we present clone change notification system Clone Notifier (see Figure 3) for the promotion of efficient clone management (e.g., refactoring, simultaneous editing). Clone Notifier notifies newly-appeared and changed clones regularly to developers. As an industrial application, we applied Clone Notifier into the process of the web application software development at NEC. The result shows 119 newly-
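The notion of regularly reporting newly-appeared and changed clones can be sketched as a comparison of two clone reports. This is not how Clone Notifier is implemented: the function `diff_clone_reports`, the report layout (clone identifier mapped to fragment text), and the assumption that identifiers stay stable between snapshots are all hypothetical.

```python
from typing import Dict, List


def diff_clone_reports(
    previous: Dict[str, str],
    current: Dict[str, str],
) -> Dict[str, List[str]]:
    """Compare two clone reports (clone id -> fragment text).

    Assumes clone ids are stable across snapshots, which a real tracker
    has to establish itself (for example via clone region descriptors).
    """
    new = sorted(cid for cid in current if cid not in previous)
    changed = sorted(
        cid for cid in current
        if cid in previous and current[cid] != previous[cid]
    )
    removed = sorted(cid for cid in previous if cid not in current)
    return {"new": new, "changed": changed, "removed": removed}


if __name__ == "__main__":
    monday = {"Order.total": "sum(items)", "Cart.total": "sum(items)"}
    tuesday = {
        "Order.total": "sum(items) * tax",   # edited, its sibling was not
        "Cart.total": "sum(items)",
        "Invoice.total": "sum(items)",       # freshly pasted copy
    }
    print(diff_clone_reports(monday, tuesday))
```

Running such a comparison on a schedule, or as a continuous integration step, would surface new copies before a costly system test, which is the timing concern raised in the NEC report.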
  • 144.
    AMIC (Automatic Mining of Important Clones) 41 http://sr-p2irc-big2.usask.ca/amic/
  • 145.
    Above all, in Continuous Integration [Duvall et al., 2007]
    [Figure: continuous integration cycle, after Duvall et al., 2007; labels: Developers, SCM Server, CI Server, Push changes, Poll, Compile, Test, Integrate, Check, Deploy, …, Feedback]
  • 146.
    Survey in ING NL
    [Figure: bar chart "Metrics Collected to Monitor Source Code Quality"; axis: % of respondents, 0% to 100%; bars: Amount of duplicated codes, Cyclomatic complexity, Number of function parameters, Lines of Code (LOC), Comment words, Number of source files, Other; bar labels in extraction order: 15%, 16%, 18%, 44%, 51%, 69%, 78%]
  • 147.
    Cloning From Forums… Tomorrow 9:00 AM: Stack Overflow: A Code Laundering Platform? Le An, Ons Mlouki, Foutse Khomh and Giuliano Antoniol
  • 148.
  • 150.
    Tracking clone changes
    [Figure: clone class A and clone class B across snapshots Snap 1 to Snap 6; patterns shown: Consistent change, Late propagation, Independent evolution]
  • 151.
    Tracking clone changes
    [Figure: clone class A and clone class B across snapshots Snap 1 to Snap 6; patterns shown: Consistent change, Late propagation, Independent evolution]
    Evolution Patterns
    [Figure: bar chart for ArgoUML, JBoss, OpenSSH and PostgreSQL; axis: 0% to 80%; series: Consistent, Indep. Evolution, Late Propagation, Unknown; bar labels in extraction order: 0%, 0%, 3%, 4%, 16%, 4%, 5%, 7%, 39%, 24%, 52%, 34%, 38%, 71%, 40%, 55%]
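The three evolution patterns named on this slide can be told apart with a few lines of code once a clone pair has been traced across snapshots. The sketch below is a simplification under stated assumptions: the function `classify_evolution` is hypothetical, fragments are compared as already-normalized strings, the pair is assumed identical in the first snapshot, and the extra label "unchanged" is introduced here and is not one of the slide's categories.

```python
from typing import List, Tuple


def classify_evolution(snapshots: List[Tuple[str, str]]) -> str:
    """Classify how a pair of clone fragments evolves.

    Each snapshot is (fragment_a, fragment_b) after normalization.
    Assumes the two fragments are identical in the first snapshot.
    """
    changed = False    # did either fragment change between two snapshots?
    diverged = False   # were the fragments ever different from each other?
    for previous, current in zip(snapshots, snapshots[1:]):
        if current != previous:
            changed = True
        if current[0] != current[1]:
            diverged = True
    if not changed:
        return "unchanged"
    if not diverged:
        # Every change reached both fragments in the same snapshot.
        return "consistent change"
    last_a, last_b = snapshots[-1]
    # Diverged and re-synchronized later, or diverged for good.
    return "late propagation" if last_a == last_b else "independent evolution"


if __name__ == "__main__":
    print(classify_evolution([("x", "x"), ("y", "y")]))               # consistent change
    print(classify_evolution([("x", "x"), ("y", "x"), ("y", "y")]))   # late propagation
    print(classify_evolution([("x", "x"), ("y", "x"), ("y", "z")]))   # independent evolution
```

Note that the result depends on snapshot granularity: a late propagation that starts and ends between two releases looks like a consistent change when only release snapshots are compared.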
  • 152.
    Tracking Entities; Late Propagation; Clone changes; Clones and bugs
    Tracking clone changes
    [Figure: clone class A and clone class B across snapshots Snap 1 to Snap 6; patterns shown: Consistent change, Late propagation, Independent evolution]
    Evolution Patterns
    [Figure: bar chart for ArgoUML, JBoss, OpenSSH and PostgreSQL; axis: 0% to 80%; series: Consistent, Indep. Evolution, Late Propagation, Unknown; bar labels in extraction order: 0%, 0%, 3%, 4%, 16%, 4%, 5%, 7%, 39%, 24%, 52%, 34%, 38%, 71%, 40%, 55%]
  • 153.
    Tracking Entities; Late Propagation; Clone changes; Clones and bugs
    Tracking clone changes
    [Figure: clone class A and clone class B across snapshots Snap 1 to Snap 6; patterns shown: Consistent change, Late propagation, Independent evolution]
    Evolution Patterns
    [Figure: bar chart for ArgoUML, JBoss, OpenSSH and PostgreSQL; axis: 0% to 80%; series: Consistent, Indep. Evolution, Late Propagation, Unknown; bar labels in extraction order: 0%, 0%, 3%, 4%, 16%, 4%, 5%, 7%, 39%, 24%, 52%, 34%, 38%, 71%, 40%, 55%]
    Survey in ING NL
    [Figure: bar chart "Metrics Collected to Monitor Source Code Quality"; axis: % of respondents, 0% to 100%; bars: Amount of duplicated codes, Cyclomatic complexity, Number of function parameters, Lines of Code (LOC), Comment words, Number of source files, Other; bar labels in extraction order: 15%, 16%, 18%, 44%, 51%, 69%, 78%]