An Empirical Study of Identical Function Clones in CRAN
Maëlick Claes
Tom Mens, Narjisse Tabout & Philippe Grosjean
6th February 2014, IWSC 2015
Software Engineering Lab
Numerical Ecology of Aquatic Systems Lab
Introduction
R: a statistical environment based on the S language
Packages with code, doc, examples, tests, datasets
CRAN (Comprehensive R Archive Network)
Official R package repository
Strict policy for package acceptance
Package quality regularly checked & archive process
Complaints in the community (Hornik 2012, "Are there too many R packages?")
Empirical study of Inter-project clones in CRAN
http://www.r-project.org
Previous work
Preliminary empirical study using CRAN meta-data
On the maintainability of CRAN packages (CSMR-WCRE 2014)
R CMD check results from CRAN:
Most errors resolved quickly without developer intervention
Maintenance effort needs to focus on fixing errors caused by others
Need for a more specific tool to detect problems related to dependency changes
Web-dashboard for CRAN maintainers
maintaineR, a web-based dashboard for maintainers of CRAN packages (ICSME 2014)
Type-1 function clone identification
http://cran.r-project.org/web/checks/
Identifying cloned R functions
Parsing R code with R itself
Assigning a SHA-1 hash to each function's AST
Ignoring functions with fewer than 6 lines of code
Identifying Type-1 clones = identifying identical hashes across packages
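The steps above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the authors' tool: the helper names (extract_functions, fingerprint, find_clones) and the toy packages are hypothetical, only top-level `name <- function(...)` definitions are considered, and the line count is taken from the deparsed AST rather than from the source file. SHA-1 hashing is not part of base R, so the sketch uses the digest package when it is installed and otherwise falls back to the normalised text itself as the key.

```r
# Collect top-level definitions of the form `name <- function(...) ...`.
extract_functions <- function(code) {
  exprs <- parse(text = code, keep.source = FALSE)
  funs <- list()
  for (e in exprs) {
    is_assign <- is.call(e) && length(e) == 3 &&
      (identical(e[[1]], as.name("<-")) || identical(e[[1]], as.name("=")))
    if (is_assign && is.name(e[[2]]) && is.call(e[[3]]) &&
        identical(e[[3]][[1]], as.name("function"))) {
      funs[[as.character(e[[2]])]] <- e[[3]]
    }
  }
  funs
}

# Deparsing the AST drops comments and layout, so identical code
# yields an identical key regardless of formatting.
fingerprint <- function(fun_expr) {
  txt <- paste(deparse(fun_expr), collapse = "\n")
  if (requireNamespace("digest", quietly = TRUE)) {
    digest::digest(txt, algo = "sha1", serialize = FALSE)
  } else {
    txt  # assumption: fallback key when digest is not installed
  }
}

# `packages` is a named list: package name -> source code as one string.
find_clones <- function(packages, min_lines = 6) {
  rows <- list()
  for (pkg in names(packages)) {
    funs <- extract_functions(packages[[pkg]])
    for (fname in names(funs)) {
      if (length(deparse(funs[[fname]])) < min_lines) next
      rows[[length(rows) + 1]] <- data.frame(
        package = pkg, fun = fname, key = fingerprint(funs[[fname]]),
        stringsAsFactors = FALSE
      )
    }
  }
  if (length(rows) == 0) return(data.frame())
  all_funs <- do.call(rbind, rows)
  # A clone set is a key that occurs in at least two different packages.
  n_pkgs <- tapply(all_funs$package, all_funs$key,
                   function(p) length(unique(p)))
  all_funs[all_funs$key %in% names(n_pkgs)[n_pkgs > 1], c("package", "fun")]
}

# Toy "packages": pkgA and pkgB contain the same function, laid out differently.
pkgA <- "
clamp <- function(x, lo, hi) {
  # keep x inside [lo, hi]
  if (x < lo) {
    return(lo)
  }
  if (x > hi) {
    return(hi)
  }
  x
}"
pkgB <- "
add1 <- function(x) x + 1
clamp<-function(x,lo,hi){
if(x<lo){return(lo)}
if(x>hi){return(hi)}
x}"
pkgC <- "
span <- function(x) {
  lo <- min(x)
  hi <- max(x)
  if (hi == lo) {
    return(0)
  }
  hi - lo
}"

clones <- find_clones(list(pkgA = pkgA, pkgB = pkgB, pkgC = pkgC))
print(clones)
```

Comparing deparsed ASTs rather than raw text is what makes this a Type-1 check: whitespace and comments are ignored, but any renamed variable or changed literal produces a different key.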
Observed clone cases
Coexisting package versions: plyr and dplyr, lme and nlme, np and npRmpi
Fork package: Rcmdr and QCAGUI
Frequently cloned package: distr
Utility package: DescTools
Popular package: MASS
Popular function: permn() from combinat
Research Questions
How prevalent are (Type-1) function clones in CRAN?
Why did these clones appear?
Is it possible to remove them and how?
How prevalent are (Type-1) function clones in CRAN?
Evolution of the number of packages
Evolution of the number of LOC
Evolution of the relative size
Why did clones appear?
Categorizing clones
All clones on 1st December 2014
7366 clones
162k LOC
1409 packages
3184 clone sets
Identifying the origin of each clone set
Each clone set origin is either
An anonymous and/or local function
An archived global function
A private global function
A public global function
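A quick back-of-the-envelope reading of the totals above, as a small R calculation. It assumes the 1409 packages are those containing at least one clone, and it treats "162k" as 162000, so the LOC figure is only approximate.

```r
clones     <- 7366
clone_sets <- 3184
packages   <- 1409     # assumed: packages containing at least one clone
loc        <- 162000   # "162k" on the slide, a rounded figure

clones_per_set     <- clones / clone_sets   # about 2.3
clones_per_package <- clones / packages     # about 5.2
loc_per_clone      <- loc / clones          # about 22

print(round(c(per_set = clones_per_set,
              per_package = clones_per_package,
              loc_per_clone = loc_per_clone), 1))
```

Under these assumptions a typical clone set has only two or three members, and a cloned function is on average around 22 lines long.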
Anonymous, local and global functions
From DescTools 0.99.8.1 package...
qbinom.abscont <- function(p, size, x) {
  fun <- function(prob, size, x, p) {
    pbinom.abscont(x, size, prob) - p
  }
  uniroot(fun, interval = c(0, 1), size = size, x = x, p = p)$root
}
... which could be rewritten as
qbinom.abscont <- function(p, size, x) {
  uniroot(function(prob, size, x, p) {
    pbinom.abscont(x, size, prob) - p
  }, interval = c(0, 1), size = size, x = x, p = p)$root
}
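The distinction matters for reuse: a local function lives only inside its enclosing function, so other code cannot call it by name. The following sketch illustrates this with hypothetical names (solve_for_root, objective) that do not come from any CRAN package.

```r
# Hypothetical example; not taken from any CRAN package.
solve_for_root <- function(target) {
  # `objective` is local: it is created anew on each call to solve_for_root()
  objective <- function(v) v^2 - target
  uniroot(objective, interval = c(0, 10))$root
}

root <- solve_for_root(9)   # numerically close to 3

# The local function is not reachable from the global environment.
visible_outside <- exists("objective", envir = globalenv(), inherits = FALSE)
print(visible_outside)
```

A developer who wants the behaviour of such a local or anonymous function in another package therefore has little choice but to copy it.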
NAMESPACE file
Also from DescTools 0.99.8.1
exportPattern("^[^.]")
importFrom("boot","boot","boot.ci","corr")
import(tcltk)
useDynLib(DescTools)
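The exportPattern directive takes a regular expression, and "^[^.]" matches every name that does not start with a dot. A small sketch of its effect, using a hypothetical list of object names and grepl() to mimic the matching:

```r
# Hypothetical object names defined inside a package
object_names <- c("qbinom.abscont", "Mean", ".onLoad", ".helper")

export_pattern <- "^[^.]"   # first character is anything except a dot
exported <- object_names[grepl(export_pattern, object_names)]
private  <- setdiff(object_names, exported)

print(exported)
print(private)
```

With this pattern, only functions whose names begin with a dot stay private; everything else becomes a public global function that other packages can import.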
Classification of clone origins
Most clones were created because it was not possible to re-use the original function
Is it possible to remove clones and how?
Adding a dependency to
The origin package
673 out of the 1899 global clone set origins (about 35%) are public functions
782 functions that could potentially be removed in 332 packages
48 functions in a package where there is already a direct dependency
20 functions in a package where a dependency cannot be added without creating cycles
A non-original clone copy
Of the 2511 clone sets with a non-public origin function, only 250 (about 10%) have another public copy
Only 299 functions could be removed by depending on another copy
=> Removing clones in CRAN packages cannot be reduced to code refactoring. Most of the time it would require communication between maintainers of different packages
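The cycle constraint mentioned above can be checked with a simple reachability test: adding a dependency from one package to another creates a cycle exactly when the second package already depends, directly or transitively, on the first. The sketch below is one possible way to do this, not the method used in the study; the function names and the package names A to D are hypothetical.

```r
# deps: named list, package -> character vector of packages it depends on
reaches <- function(deps, from, to) {
  seen <- character(0)
  stack <- from
  while (length(stack) > 0) {
    node <- stack[[1]]
    stack <- stack[-1]
    if (node == to) return(TRUE)
    if (node %in% seen) next
    seen <- c(seen, node)
    stack <- c(deps[[node]], stack)  # unknown packages contribute nothing
  }
  FALSE
}

# Adding pkg -> new_dep closes a cycle iff new_dep already reaches pkg.
creates_cycle <- function(deps, pkg, new_dep) {
  reaches(deps, new_dep, pkg)
}

# Hypothetical dependency graph: D -> A -> B -> C
deps <- list(A = "B", B = "C", C = character(0), D = "A")

print(creates_cycle(deps, "C", "A"))  # TRUE: A already depends on C via B
print(creates_cycle(deps, "D", "C"))  # FALSE: C depends on nothing
```

In the first call the clone in C could not be replaced by a dependency on A without restructuring the packages, which is the situation the 20 functions above are in.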
Conclusion
Cloned code represents a small fraction of all CRAN code, but still more than 100K LOC across the biggest CRAN packages
Most clones cannot be removed by adding dependencies without enforcing CRAN policy
Still, a significant number of clones could in theory be removed easily
Further work is needed to understand whether the refactorable clones are justified
Future Work
Asking developers (survey) about their cloning behavior
Type-2 and Type-3 clones
Clone patterns
Inter-project cloning behavior in other languages / ecosystems
Thanks for your attention
Questions?
Slides: http://maelick.net/presentations/iwsc2015/
