Presentation by Maëlick Claes of joint research at the IWSC 2015 workshop of SANER 2015 (Montreal, Canada). It reports an empirical analysis of the presence of code clones across R packages in CRAN, the reasons for these clones, and whether and how they could be removed.
An Empirical Study of Identical Function Clones in CRAN
1. An Empirical Study of Identical Function Clones in CRAN
Maëlick Claes
Tom Mens, Narjisse Tabout & Philippe Grosjean
6th February 2014, IWSC 2015
Software Engineering Lab & Numerical Ecology of Aquatic Systems Lab
3. R: a statistical environment based on the S language
Packages with code, doc, examples, tests, datasets
CRAN (Comprehensive R Archive Network)
Official R package repository
Strict policy for package acceptance
Package quality regularly checked & archive process
Complaints in the community (Hornik 2012, "Are there too many R packages?")
Empirical study of Inter-project clones in CRAN
http://www.r-project.org
4. Previous work
Preliminary empirical study using CRAN meta-data
On the maintainability of CRAN packages (CSMR-WCRE 2014)
R CMD check results from CRAN:
Most errors resolved quickly without developer intervention
Maintenance effort needs to focus on fixing errors caused by others
Need for a more specific tool to detect problems related to dependency changes
Web-dashboard for CRAN maintainers
maintaineR, a web-based dashboard for maintainers of CRAN packages (ICSME 2014)
Type-1 function clone identification
http://cran.r-project.org/web/checks/
5. Identifying cloned R functions
Parsing R code with R itself
Assigning a SHA-1 hash to each function's AST
Ignoring functions with fewer than 6 lines of code
Identifying Type-1 clones = identifying identical hashes across packages
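The four steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the tool used in the study: the helper names (function_keys, find_clones) and the two toy package sources are hypothetical, only top-level "name <- function(...)" definitions are considered, and the deparsed AST text is compared directly where the study assigns a SHA-1 hash (base R ships no SHA-1 function, so a hashing package would be needed for that step).

```r
# Toy sources standing in for two packages (hypothetical content).
pkg_sources <- list(
  pkgA = "
    permn_like <- function(x) {
      n <- length(x)
      out <- vector('list', n)
      for (i in seq_len(n)) {
        out[[i]] <- rev(x[seq_len(i)])
      }
      out
    }
    short <- function(x) x + 1
  ",
  pkgB = "
    # same function, different comments and spacing
    permn_like <- function(x) {
      n   <- length(x)
      out <- vector('list', n)
      for (i in seq_len(n)) {
        out[[i]] <- rev(x[seq_len(i)])
      }
      out
    }
  "
)

# Parse R code with R itself and return one comparison key per
# top-level function definition of at least `min_lines` deparsed lines.
function_keys <- function(src, min_lines = 6) {
  exprs <- parse(text = src, keep.source = FALSE)
  keys <- character(0)
  for (e in exprs) {
    is_assign <- is.call(e) && length(e) == 3 &&
      (identical(e[[1]], as.name("<-")) || identical(e[[1]], as.name("="))) &&
      is.name(e[[2]])
    if (!is_assign) next
    rhs <- e[[3]]
    if (!(is.call(rhs) && identical(rhs[[1]], as.name("function")))) next
    text <- deparse(rhs)               # normalised text of the AST
    if (length(text) < min_lines) next # skip short functions
    # Assumption: the text itself is the key; the study hashes it with SHA-1.
    keys[as.character(e[[2]])] <- paste(text, collapse = "\n")
  }
  keys
}

# Report functions whose key occurs in more than one package.
find_clones <- function(pkg_sources) {
  rows <- do.call(rbind, lapply(names(pkg_sources), function(pkg) {
    keys <- function_keys(pkg_sources[[pkg]])
    data.frame(package = rep(pkg, length(keys)),
               fun = as.character(names(keys)),
               key = unname(keys),
               stringsAsFactors = FALSE)
  }))
  n_pkgs <- vapply(split(rows$package, rows$key),
                   function(p) length(unique(p)), integer(1))
  rows[rows$key %in% names(n_pkgs)[n_pkgs > 1], c("package", "fun")]
}

clones <- find_clones(pkg_sources)
print(clones)
```

Because comments and whitespace are discarded by the parser, the two permn_like definitions produce the same key, while the one-line function is ignored by the size threshold.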
6. Observed clone cases
Coexisting package versions: plyr and dplyr, lme and nlme, np and npRmpi
Fork package: Rcmdr and QCAGUI
Frequently cloned package: distr
Utility package: DescTools
Popular package: MASS
Popular function: permn() from combinat
7. Research Questions
How prevalent are (Type-1) function clones in CRAN?
Why did these clones appear?
Is it possible to remove them and how?
13. Categorizing clones
All clones on 1st December 2014
7366 clones
162k LOC
1409 packages
3184 clone sets
Identifying the origin of each clone set
Each clone set origin is either
An anonymous and/or local function
An archived global function
A private global function
A public global function
14. Anonymous, local and global functions
From DescTools 0.99.8.1 package...
qbinom.abscont <- function(p, size, x) {
  fun <- function(prob, size, x, p) {
    pbinom.abscont(x, size, prob) - p
  }
  uniroot(fun, interval = c(0, 1), size = size, x = x, p = p)$root
}
... which could be rewritten as
qbinom.abscont <- function(p, size, x) {
  uniroot(function(prob, size, x, p) {
    pbinom.abscont(x, size, prob) - p
  }, interval = c(0, 1), size = size, x = x, p = p)$root
}
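One way to see why local and anonymous functions can be treated alike is to extract the function definitions nested inside each variant and compare them. The sketch below is illustrative only: nested_functions is a hypothetical helper, it recurses into function bodies but not into formal arguments, and it does not handle calls with empty arguments such as x[1, ].

```r
# The two DescTools variants shown above, parsed without source references.
original <- parse(keep.source = FALSE, text = "
  qbinom.abscont <- function(p, size, x) {
    fun <- function(prob, size, x, p) {
      pbinom.abscont(x, size, prob) - p
    }
    uniroot(fun, interval = c(0, 1), size = size, x = x, p = p)$root
  }")[[1]]

rewritten <- parse(keep.source = FALSE, text = "
  qbinom.abscont <- function(p, size, x) {
    uniroot(function(prob, size, x, p) {
      pbinom.abscont(x, size, prob) - p
    }, interval = c(0, 1), size = size, x = x, p = p)$root
  }")[[1]]

# Walk the AST and collect the deparsed text of every `function` expression,
# outermost first (hypothetical helper, not part of the study's tooling).
nested_functions <- function(expr) {
  found <- character(0)
  walk <- function(e) {
    if (!is.call(e)) return(invisible(NULL))
    if (identical(e[[1]], as.name("function"))) {
      found <<- c(found, paste(deparse(e), collapse = "\n"))
      walk(e[[3]])  # recurse into the body only, skipping the formals
    } else {
      for (i in seq_along(e)) {
        if (is.call(e[[i]])) walk(e[[i]])
      }
    }
  }
  walk(expr)
  found
}

# Drop the first entry (the global function itself) to keep only inner functions.
inner_original  <- nested_functions(original)[-1]
inner_rewritten <- nested_functions(rewritten)[-1]
print(identical(inner_original, inner_rewritten))
```

Whether the inner function is bound to the local name fun or passed inline, its AST is the same, so a detector working at function granularity sees a single candidate in both cases.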
15. NAMESPACE file
Also from DescTools 0.99.8.1
exportPattern("^[^.]")
importFrom("boot","boot","boot.ci","corr")
import(tcltk)
useDynLib(DescTools)
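Whether a global function is public or private depends on these directives. As a rough sketch, the exportPattern regular expressions can be read from the NAMESPACE text and applied to function names. The helper is_public and the name .hidden_helper are hypothetical, and explicit export() directives are not handled here.

```r
# NAMESPACE content from the slide, held as text for the example.
namespace_text <- '
exportPattern("^[^.]")
importFrom("boot","boot","boot.ci","corr")
import(tcltk)
useDynLib(DescTools)
'

# NAMESPACE directives use R call syntax, so R's parser can read them.
directives <- parse(text = namespace_text, keep.source = FALSE)

# Collect the regular expressions given to exportPattern().
export_patterns <- character(0)
for (d in directives) {
  if (is.call(d) && identical(d[[1]], as.name("exportPattern"))) {
    export_patterns <- c(export_patterns,
                         vapply(as.list(d)[-1], as.character, character(1)))
  }
}

# A name is treated as public if it matches at least one export pattern.
# Assumption: only exportPattern() is considered, not export().
is_public <- function(name) {
  any(vapply(export_patterns, function(p) grepl(p, name), logical(1)))
}

print(is_public("qbinom.abscont"))
print(is_public(".hidden_helper"))
```

With the pattern "^[^.]", every object whose name does not start with a dot is exported, which is why functions such as qbinom.abscont count as public.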
16. Classification of clone origins
Most clones were created because it was not possible to re-use the original function
18. Adding a dependency to
The origin package
673 out of the 1899 global clone set origins are public functions
782 functions that could potentially be removed in 332 packages
48 functions in a package where there is already a direct dependency
20 functions in a package where a dependency cannot be added without creating cycles
A non-original clone copy
Of the 2511 clone sets with a non-public origin function, only 250 have another public copy
Only 299 functions could be removed by depending on another copy
=> Removing clones in CRAN packages cannot be reduced to code refactoring. Most of the time it would require communication between maintainers of different packages
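The cycle constraint mentioned above can be checked with a reachability test on the dependency graph: a package can only start depending on the origin package if the origin does not already depend on it, directly or transitively. The sketch below uses a made-up graph and hypothetical helper names (depends_on, creates_cycle); it is not the analysis code behind the figures on this slide.

```r
# Hypothetical dependency graph: each package maps to the packages it depends on.
deps <- list(
  pkgA = c("pkgB"),
  pkgB = c("pkgC"),
  pkgC = character(0),
  pkgD = character(0)
)

# TRUE if `to` can be reached from `from` by following dependencies
# (depth-first search with a visited set).
depends_on <- function(deps, from, to) {
  seen <- character(0)
  stack <- from
  while (length(stack) > 0) {
    current <- stack[1]
    stack <- stack[-1]
    if (current == to) return(TRUE)
    if (current %in% seen) next
    seen <- c(seen, current)
    stack <- c(deps[[current]], stack)
  }
  FALSE
}

# Making `pkg` depend on `origin` closes a cycle when `origin`
# already depends on `pkg`.
creates_cycle <- function(deps, pkg, origin) {
  depends_on(deps, origin, pkg)
}

print(creates_cycle(deps, pkg = "pkgC", origin = "pkgA"))
print(creates_cycle(deps, pkg = "pkgD", origin = "pkgA"))
```

In this toy graph pkgA already reaches pkgC through pkgB, so pkgC could not replace a clone by depending on pkgA, whereas pkgD could.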
19. Conclusion
Cloned code represents a small fraction of all CRAN code but still more than 100K LOC across the biggest CRAN packages
Most clones cannot be removed by adding dependencies without enforcing CRAN policy
Still, a substantial number of clones could in theory be removed easily
Further work is needed to understand whether the refactorable clones are justified
20. Future Work
Asking developers (survey) about their cloning behavior
Type-2 and Type-3 clones
Clone patterns
Inter-project cloning behavior in other languages / ecosystems
21. Thanks for your attention
Questions?
Slides: http://maelick.net/presentations/iwsc2015/