In this presentation at IWSECO-WEA 2015 (Dubrovnik, Croatia, 8 September 2015) we present the ecosystem of software packages for R, one of the most popular environments for statistical computing today. We empirically study how R packages are developed and distributed on different repositories: CRAN, Bioconductor, R-Forge and GitHub. We also explore the role and size of each repository, the inter-repository dependencies, and how these repositories grow over time. With this analysis, we provide deeper insight into the extent and the evolution of the R package ecosystem.
Analysis of R Package Development and Distribution Across Repositories
1. On the Development and Distribution of R Packages
An Empirical Analysis of the R Ecosystem
Alexandre Decan, Tom Mens,
Maëlick Claes & Philippe Grosjean
COMPLEXYS Research Institute
8th September 2015, IWSECO-WEA 2015
2. Statistical environment
Packages with code, doc, examples, tests, datasets:
http://www.r-project.org
install.packages("MyPackage")
3. R package repositories (in March 2015)
Repository name   Number of packages   Since   Role
CRAN              6411                 1997    Distribution
Bioconductor      997                  2001    Distribution
R-Forge           1883                 2006    SVN development, distribution
GitHub            5150                 2008    Git development, distribution using devtools
But there are more: RForge, Omegahat, Bitbucket, SourceForge, Google Code, ...
4. How to install packages
install.packages function:
automatically installs a package and its dependencies if needed
only uses CRAN by default
can be configured to use other repositories like Bioconductor and R-Forge
Package devtools provides various functions to install packages from other sources:
SVN
Git
GitHub
Bitbucket
Gitorious
devtools retrieves the package content and installs it using install.packages
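The two installation routes above can be sketched as follows. This is a minimal illustration, not the presenters' own code: the repository URLs, the helper names install_from_repos and install_from_github, and the example arguments are assumptions made for this sketch. Nothing is installed until one of the helpers is called.

```r
# Repositories searched by install.packages, in addition to the CRAN default.
# Both URLs are assumptions chosen for illustration.
repos <- c(
  CRAN   = "https://cloud.r-project.org",
  RForge = "http://R-Forge.R-project.org"
)

# Hypothetical helper: install a package (and its dependencies)
# from the configured repositories.
install_from_repos <- function(pkg, repos) {
  install.packages(pkg, repos = repos)
}

# Hypothetical helper: install a package hosted only on GitHub,
# using devtools. `repo` has the form "user/repository".
install_from_github <- function(repo) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("Package 'devtools' is required; install it from CRAN first.")
  }
  devtools::install_github(repo)
}

# Example calls (not run here, since they need network access):
# install_from_repos("MyPackage", repos)
# install_from_github("someuser/MyPackage")
```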
5. Previous work
Preliminary empirical study using CRAN meta-data
On the maintainability of CRAN packages (CSMR-WCRE 2014)
Inter-project (Type1) clone study of CRAN packages:
An Empirical Study of Identical Function Clones in CRAN (IWSC 2015)
Web-dashboard for CRAN maintainers
maintaineR, a web-based dashboard for maintainers of CRAN packages (ICSME 2014)
9. Number of newly created packages on GitHub
More and more packages are developed on GitHub without being distributed anywhere else.
10. Evolution of the number of packages in CRAN and GitHub
The number of packages only on GitHub grows faster than the number of packages on CRAN!
But it does not seem to impact the growth of CRAN.
12. Dependencies
Defined in the DESCRIPTION file
Using the fields Depends and Imports
These fields do not specify from which repository the dependency must come!
Package: SciViews
Type: Package
Title: SciViews GUI API - Main package
Imports: ellipse
Depends: R (>= 2.6.0), stats, grDevices, graphics, MASS
Enhances: base
Description: Functions to install SciViews additions to R, and more (various) tools
Version: 0.9-5
Date: 2013-03-01
Author: Philippe Grosjean
Maintainer: Philippe Grosjean <phgrosjean@sciviews.org>
License: GPL-2
LazyLoad: yes
URL: http://www.sciviews.org/SciViews-R
BugReports: https://r-forge.r-project.org/tracker/?group_id=194
Packaged: 2014-03-01 20:34:11 UTC; phgrosjean
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2014-03-02 12:40:42
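As an illustration of how the Depends and Imports fields can be read, here is a minimal sketch using base R's read.dcf. The helper name parse_dependencies is hypothetical, and dropping version constraints and the R entry itself is a simplifying assumption made here, not necessarily what the study did.

```r
# A few lines of the SciViews DESCRIPTION file shown above.
description <- c(
  "Package: SciViews",
  "Imports: ellipse",
  "Depends: R (>= 2.6.0), stats, grDevices, graphics, MASS"
)

# Hypothetical helper: return the package names listed in Depends and Imports.
parse_dependencies <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  # read.dcf returns a character matrix with one column per requested field
  dcf <- read.dcf(con, fields = c("Depends", "Imports"))
  fields <- dcf[1, ]
  fields <- fields[!is.na(fields)]
  deps <- unlist(strsplit(fields, ","), use.names = FALSE)
  # Drop version constraints such as "(>= 2.6.0)" and surrounding whitespace
  deps <- trimws(sub("\\(.*\\)", "", deps))
  # "R" is a constraint on the interpreter, not a package dependency
  setdiff(deps[nzchar(deps)], "R")
}

deps <- parse_dependencies(description)
print(deps)
```

Note that the result contains only package names: nothing in these fields says whether ellipse or MASS should come from CRAN, GitHub, or another repository.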
13. Package repository priority
For each declared dependency, we consider the first package matching the dependency,
giving priority to repositories in this order:
CRAN > Bioconductor > GitHub > R-Forge
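The priority rule can be sketched as a simple lookup. The package index below is a toy example invented for illustration (the names devpkg and the repository contents are hypothetical), and resolve_repository is a hypothetical helper, not the tooling used in the study.

```r
# Repository priority, highest first, as stated on the slide.
priority <- c("CRAN", "Bioconductor", "GitHub", "R-Forge")

# Toy index of which packages each repository hosts (invented data).
available <- list(
  "CRAN"         = c("MASS", "ellipse"),
  "Bioconductor" = c("Biobase"),
  "GitHub"       = c("ellipse", "devpkg"),
  "R-Forge"      = c("devpkg", "SciViews")
)

# Hypothetical helper: return the first repository, in priority order,
# that hosts a package with the requested name, or NA if none does.
resolve_repository <- function(pkg, available, priority) {
  for (repo in priority) {
    if (pkg %in% available[[repo]]) {
      return(repo)
    }
  }
  NA_character_
}

resolve_repository("ellipse", available, priority)  # hosted on CRAN and GitHub: CRAN wins
resolve_repository("devpkg", available, priority)   # hosted on GitHub and R-Forge: GitHub wins
```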
15. Conclusion
We looked at where R packages are developed and distributed, taking into account CRAN,
Bioconductor, GitHub and R-Forge
GitHub is growing at a faster pace than the other repositories
More and more packages are developed on GitHub but not distributed anywhere else
However, this does not impact the other repositories:
CRAN is (still) at the center of the ecosystem
Most Bioconductor, R-Forge and GitHub packages require CRAN packages in order to work
16. Current and future work
Take into account more R package repositories (e.g. Bitbucket)
Investigate why there are so many packages only on GitHub
Survey developers about their usage of CRAN and GitHub
Eventually provide support to R package users and developers
by improving package dependency management
Socio-technical analysis of R package developer communities
Similar study of an ecosystem based on another programming language
17. Thanks for your attention
Questions?
Slides: http://maelick.net/presentations/iwseco-wea2015/