Big Data Analytics of Software Ecosystem Health: Presentation during INFORTECH Scientific Day (23 May 2018) by Professor Tom Mens. The talk reports on ongoing research of the Software Engineering Lab of the University of Mons (UMONS) on health aspects of evolving software ecosystems. This research was conducted in collaboration with postdoctoral researchers Alexandre Decan and Eleni Constantinou, as well as the external partners of two ongoing research projects: SECOHealth (https://secohealth.github.io) and the Excellence of Science research project SECO-ASSIST (https://secoassist.github.io).
2. Prof. Dr. Tom Mens
Software Engineering Lab
tom.mens@umons.ac.be
Big Data Analytics of
Software Ecosystem Health
3. A software ecosystem is a
collection of [interdependent]
software projects that are
developed and evolve together in
the same environment.
Mircea Lungu
(PhD, 2008)
5. Software Ecosystems = Big Data
Volume: software ecosystems involve huge
quantities of data
Debian (Linux distribution)
Archive http://snapshot.debian.org containing daily
snapshots of packages, maintainers, dependencies, ...
"Snapshot keeps growing. We are now at approximate 60TB of files.
This made it necessary to break up the RAID-1 mirror across two
external storage arrays ..., and it also meant we needed more machines
(now six) at our mirrorsite ..."
Debian bug tracker http://methyer.ethz.ch/bts/
• 122 thousand active bugs; 779 thousand archived bugs
Debian security tracker
• 29 thousand security vulnerabilities
6. Software Ecosystems = Big Data
Volume: software ecosystems involve huge
quantities of data
Example: software package manager for JavaScript
Created in 2010.
In 2017:
• 3.5TB of storage required for hosting 500K packages
• 2.3 million opened GitHub pull requests for JavaScript
repositories
March 2018:
• ~0.7 million packages
• ~4.4 million package releases
• ~19.8 million (runtime) package dependencies
7. Software Ecosystems = Big Data
Volume: software ecosystems involve huge
quantities of data
GHTorrent: (partial) data dump of GitHub from April 2018
• 24.1 million users
• 83.6 million projects
• 67.4 million issues
• 34 million pull requests
• 930 million commits
8. Software Ecosystems = Big Data
Volume: software ecosystems involve huge
quantities of data
RubyGems software package manager for Ruby (since 2004)
Ecosystem size in March 2018:
• ~144 thousand packages
• ~825 thousand package releases
• ~2 million (runtime) package dependencies
Some data for the Ruby on Rails project:
68,980 commits
346 releases
3,570 contributors
> 11k issues; > 21k pull requests; > 16k forks on GitHub
> 11k dependent packages; > 458k dependent repositories
9. Software Ecosystems = Big Data
Variety: software ecosystems involve very
heterogeneous data sources
• Structured data: source code, dependency graphs,
version control systems, ...
• Semi-structured data: e.g. mailing lists, online surveys,
social media, Q&A websites, ...
• Unstructured data: unformatted text, video and voice
recordings of interviews, field notes
Heterogeneity
• Source code, packaging metadata, models,
documentation, tests, databases, bug and issue reports, ...
• Multiple programming/natural languages
• Cultural differences
10. Software Ecosystems = Big Data
Veracity: software ecosystem analysis
requires dealing with uncertain, inconsistent,
invalid and missing data
Examples:
• Missing: Corrupted or lost historical data (voluntarily or
not).
E.g., removed projects/user profiles; "rebasing" the
version history
• Uncertain: Different data sources may disagree → which
one is correct?
• Invalid/inconsistent: Especially data produced by humans
11. Software Ecosystems = Big Data
Velocity: software ecosystems are growing rapidly
• New commits are made to GitHub several times every second
• For web-based development analytics dashboards, or
automated recommendation systems, using the most recent
data is important to make informed decisions
[Figure: two plots covering 2012 to 2017, showing the number of packages (log scale) and the number of dependencies (log scale) for cargo, cpan, cran, npm, nuget, packagist and rubygems]
A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of Dependency Network
Evolution in Seven Software Packaging Ecosystems. Empirical Software Engineering, 2018
12. Software Ecosystems = Big Data
Many Challenges
• How to retrieve data?
• Data provider services (e.g. libraries.io, GHTorrent, GitHubArchive, ...)
• How to avoid "abusing" their APIs?
• How to deal with changes in APIs?
• How to deal with data veracity?
• How to analyse data?
• Storing and sharing such amounts of data requires specific hardware
• Processing data can take a lot of time; several weeks is not uncommon
• How to report such huge amounts of data?
Aggregating results
Adapted visualisation techniques
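One way to avoid "abusing" a data provider's API is to space out requests on the client side. The sketch below is a minimal illustration in Python. The class name Throttle, the interval value and the injectable clock are assumptions made for this example; they are not part of any provider's client library, and real services may publish their own rate limits that should take precedence.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous call, None at start

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() before each request keeps the request rate below one per min_interval seconds, regardless of how fast the surrounding loop runs.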
13. Software Ecosystems = Big Data
Challenges continued
• How to combine data originating from different sources?
O(n*m) time complexity not affordable because n and m
too large
• How to identify "interesting" and "relevant" data?
• Need new data cleaning and data mining techniques
capable of dealing with this amount of data
• Providing "incremental" solutions that keep the extracted
data up-to-date
• Dealing with identities of individuals
• Identity merging
• Preserving privacy and anonymity
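For the problem of combining data from different sources, a common way to avoid the O(n*m) pairwise comparison is to index one source by a shared key and probe that index with the other source. The sketch below assumes that both sources expose such a key (here a hypothetical "name" field); the record layouts are invented for illustration, and real data would usually need cleaning before keys match.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Join two lists of dicts on `key`.

    Building the index costs O(m) and probing costs O(n) on average,
    plus the size of the output, instead of O(n*m) comparisons.
    """
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # On a field name clash, values from `left` win.
            joined.append({**match, **row})
    return joined
```

For example, joining package metadata with repository metadata on the package name returns only the packages present in both sources.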
14. Research Context
• Today over 80 percent of all software in any technology product
or service is open source software (OSS).
• CHAOSS focuses on creating analytics and metrics to help
define OSS community health.
https://chaoss.community
"The CHAOSS community is developing metrics, methodologies, and
software for expressing open source project health and sustainability. By
doing so, CHAOSS seeks to improve the transparency of open source
project health and sustainability so that relevant stakeholders can make
more informed decisions about open source project engagement."
21. seco-assist.github.io
@seco-assist
2018-2021
SECO-ASSIST aims to provide novel software recommendation
techniques to address the software ecosystem challenges of
longevity, scale, heterogeneity, and community. This will be
achieved by combining socio-technical analysis, database usage
analysis, library evolution and software test automation.
24. SECO-ASSIST Goals
Improve social health
• Retain key contributors and attract new ones
• Predict abandoners and find replacements
• Identify toxic contributors
• Ensure sufficient diversity
25. SECO-ASSIST Goals
Improve technical health
• Better software tests, taking into account the software
dependencies
→ fewer bugs and security issues
• Higher productivity and quality by using reusable
software libraries
• Increased maintainability by supporting upgrades and
migrations (to new libraries, other technologies, ...)
26. Current Research
Empirical studies on historical software ecosystem data
to analyse and understand
• Software contributor retention and abandonment
• The propagation of health problems through technical
dependencies in a software ecosystem
• The impact of "technical lag" caused by outdated
dependencies
• The impact of security vulnerabilities
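"Technical lag" can be made concrete as the gap between the release a project depends on and the newest release available. The sketch below measures that gap in days and in number of missed releases. This is one simplified reading of the concept, not the definition used in the cited studies; the function name, the input format and the release dates in the usage example are hypothetical.

```python
from datetime import date


def technical_lag(releases, used_version):
    """Return (days behind, releases behind) for `used_version`.

    `releases` maps version strings to release dates. Releases are
    ordered by date, which is a simplification: it ignores backports.
    """
    ordered = sorted(releases.items(), key=lambda item: item[1])
    versions = [version for version, _ in ordered]
    position = versions.index(used_version)  # raises ValueError if unknown
    _, latest_date = ordered[-1]
    days_behind = (latest_date - releases[used_version]).days
    releases_behind = len(versions) - 1 - position
    return days_behind, releases_behind


# Hypothetical release history, for illustration only.
history = {
    "1.0.0": date(2017, 1, 1),
    "1.1.0": date(2017, 3, 1),
    "2.0.0": date(2018, 1, 1),
}
print(technical_lag(history, "1.1.0"))  # (306, 1)
```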
28. How long do packages
remain vulnerable?
It takes a long time before vulnerabilities are
removed from a package.
29. When are vulnerabilities fixed?
+ Most vulnerabilities are quickly fixed after their discovery.
- ~20% of vulnerabilities take more than 1 year to be fixed.
30. When are vulnerabilities fixed
in dependent packages?
Dependent packages are vulnerable much
longer! Package maintainers must use security
monitoring tools, and adapt their dependency
constraints to quickly benefit from security fixes
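To illustrate how a dependency constraint determines whether a security fix is picked up automatically, the sketch below compares an exact pin with a caret range, following the npm convention that ^1.2.3 accepts any version from 1.2.3 up to, but excluding, 2.0.0. It is a simplified check that handles only plain MAJOR.MINOR.PATCH versions (no pre-release tags, no other range operators); the function names and version numbers are hypothetical.

```python
def parse(version):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies(version, constraint):
    """Check `version` against an exact pin ('1.2.3') or a caret range ('^1.2.3')."""
    v = parse(version)
    if not constraint.startswith("^"):
        return v == parse(constraint)

    low = parse(constraint[1:])
    # The caret keeps the left-most non-zero component fixed.
    if low[0] > 0:
        high = (low[0] + 1, 0, 0)
    elif low[1] > 0:
        high = (0, low[1] + 1, 0)
    else:
        high = (0, 0, low[2] + 1)
    return low <= v < high


# Suppose a vulnerability in 1.2.3 is fixed in 1.2.4 (hypothetical versions).
print(satisfies("1.2.4", "^1.2.3"))  # True: the fix is within the range
print(satisfies("1.2.4", "1.2.3"))   # False: the pin excludes the fix
```

Under these assumptions, a dependent declaring ^1.2.3 can receive the fix on its next fresh install, whereas one pinned to 1.2.3 stays on the vulnerable release until its maintainer changes the constraint.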
31. References
• E. Constantinou, T. Mens. An Empirical Comparison of Developer
Retention in the RubyGems and npm Software Ecosystems.
Innovations in Systems and Software Engineering, 2017
• A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of
Dependency Network Evolution in Seven Software Packaging
Ecosystems. Empirical Software Engineering, 2018
• A. Zerouali, E. Constantinou, T. Mens, G. Robles, J. Gonzalez-Barahona.
An empirical analysis of technical lag in npm package
dependencies. ICSR 2018
• A. Decan, T. Mens, E. Constantinou. On the impact of security
vulnerabilities in the npm package dependency network. MSR
2018