Software Ecosystems = Big Data

Prof. Dr. Tom Mens
Software Engineering Lab
tom.mens@umons.ac.be
Big Data Analytics of
Software Ecosystem Health

A software ecosystem is a
collection of [interdependent]
software projects that are
developed and evolve together in
the same environment.
Mircea Lungu
(PhD, 2008)

Software Ecosystems = Big Data
The 4 Vs: Volume, Velocity, Variety, Veracity

Volume: software ecosystems involve huge
quantities of data
Debian (Linux distribution)
Archive http://snapshot.debian.org containing daily
snapshots of packages, maintainers, dependencies, ...
"Snapshot keeps growing. We are now at approximate 60TB of files.
This made it necessary to break up the RAID-1 mirror across two
external storage arrays ..., and it also meant we needed more machines
(now six) at our mirrorsite ..."
Debian bug tracker http://methyer.ethz.ch/bts/
• 122 thousand active bugs; 779 thousand archived bugs
Debian security tracker
• 29 thousand security vulnerabilities

Volume continued
Example: npm, the software package manager for JavaScript
Created in 2010.
In 2017:
• 3.5TB of storage required for hosting 500K packages
• 2.3 million opened GitHub pull requests for JavaScript
repositories
March 2018:
• ~0.7 million packages
• ~4.4 million package releases
• ~19.8 million (runtime) package dependencies

Volume continued
GHTorrent: (partial) data dump of GitHub from April 2018
• 24.1 million users
• 83.6 million projects
• 67.4 million issues
• 34 million pull requests
• 930 million commits

Volume continued
RubyGems software package manager for Ruby (since 2004)
Ecosystem size in March 2018:
• ~144 thousand packages
• ~825 thousand package releases
• ~2 million (runtime) package dependencies
Some data for the Ruby on Rails project:
68,980 commits
346 releases
3,570 contributors
> 11k issues; > 21k pull requests; >16k forks on GitHub
> 11k dependent packages; > 458k dependent repositories

Variety: software ecosystems involve very
heterogeneous data sources
• Structured data: source code, dependency graphs,
version control systems, ...
• Semi-structured data: e.g. mailing lists, online surveys,
social media, Q&A websites, ...
• Unstructured data: unformatted text, video and voice
recordings of interviews, field notes
Heterogeneity
• Source code, packaging metadata, models,
documentation, tests, databases, bug and issue reports, ...
• Multiple programming/natural languages
• Cultural differences

Veracity: software ecosystem analysis
requires dealing with uncertain, inconsistent,
invalid and missing data
Examples:
• Missing: Corrupted or lost historical data (voluntarily or
not).
E.g., removed projects/user profiles; "rebasing" the
version history
• Uncertain: Different data sources may disagree → which
one is correct?
• Invalid/inconsistent: Especially data produced by humans

Velocity: software ecosystems are growing rapidly
• New commits are made to GitHub several times every second
• For web-based development analytics dashboards, or
automated recommendation systems, using the most recent
data is important to make informed decisions
[Figure: number of packages (log scale) and number of dependencies (log scale) per year, 2012 to 2017, for cargo, cpan, cran, npm, nuget, packagist and rubygems]
A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of Dependency Network
Evolution in Seven Software Packaging Ecosystems. Empirical Software Engineering, 2018

Many Challenges
• How to retrieve data?
• Data provider services (e.g. libraries.io, GHTorrent, GitHubArchive, ...)
• How to avoid "abusing" their APIs?
• How to deal with changes in APIs?
• How to deal with data veracity?
• How to analyse data?
• Storing and sharing such amounts of data requires specific hardware
• Processing data can take a lot of time; several weeks not uncommon
• How to report such huge amounts of data?
• Aggregating results
• Adapted visualisation techniques
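
One possible way to approach the question of not "abusing" a data provider's API is client-side throttling combined with exponential backoff. The sketch below is illustrative only: `fetch_page` is a hypothetical callable standing in for a real request, `RateLimitError` is a hypothetical exception, and the interval and retry values are assumptions, not limits documented by any provider.

```python
import time


class RateLimitError(Exception):
    """Raised by the (hypothetical) fetch function when the provider refuses a request."""


class PoliteClient:
    """Spaces out requests and backs off exponentially when a request is refused."""

    def __init__(self, fetch_page, min_interval=1.0, max_retries=5,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch_page = fetch_page      # hypothetical: page number -> data
        self.min_interval = min_interval  # assumed minimum seconds between requests
        self.max_retries = max_retries
        self.clock = clock                # injectable so the class can be tested
        self.sleep = sleep
        self._last_request = None

    def _wait_for_slot(self):
        # Sleep just long enough to keep min_interval between two requests.
        if self._last_request is not None:
            elapsed = self.clock() - self._last_request
            if elapsed < self.min_interval:
                self.sleep(self.min_interval - elapsed)
        self._last_request = self.clock()

    def get(self, page):
        delay = self.min_interval
        for _ in range(self.max_retries):
            self._wait_for_slot()
            try:
                return self.fetch_page(page)
            except RateLimitError:
                # The provider asked us to slow down: wait, then double the delay.
                self.sleep(delay)
                delay *= 2
        raise RuntimeError(f"giving up on page {page} after {self.max_retries} attempts")
```

The clock and sleep functions are passed in as parameters so that the waiting logic can be exercised without real delays; a real crawler would also want to honour whatever quota information the provider returns.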

Challenges continued
• How to combine data originating from different sources?
O(n*m) time complexity not affordable because n and m
too large
• How to identify "interesting" and "relevant" data?
• Need new data cleaning and data mining techniques
capable of dealing with this amount of data
• Providing "incremental" solutions that keep the extracted
data up-to-date
• Dealing with identities of individuals
• Identity merging
• Preserving privacy and anonymity
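
The O(n*m) concern above can often be sidestepped by indexing one source on a shared key, so that combining two sources costs roughly O(n + m). A minimal sketch, assuming both sources expose an exactly matching key; the field names (`name`, `releases`, `open_issues`) and the records are hypothetical and not taken from any real dataset.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Combine two lists of dict records on `key` in roughly O(n + m) time.

    Builds an index over `right` once, then probes it for each record of
    `left`, instead of comparing every pair of records (O(n*m)).
    """
    index = defaultdict(list)
    for record in right:
        index[record[key]].append(record)

    joined = []
    for record in left:
        for match in index.get(record[key], []):
            joined.append({**record, **match})
    return joined


# Hypothetical records coming from two different data sources.
packages = [
    {"name": "alpha", "releases": 12},
    {"name": "beta", "releases": 3},
]
repositories = [
    {"name": "alpha", "open_issues": 40},
    {"name": "gamma", "open_issues": 7},
]

combined = hash_join(packages, repositories, key="name")
```

In practice the keys rarely match exactly across sources (the same person or project may appear under different names), which is where the identity merging challenge listed above comes in.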

Research Context
• Today over 80 percent of all software in any technology product
or service is open source software (OSS).
• CHAOSS focuses on creating analytics and metrics to help
define OSS community health.
https://chaoss.community
"The CHAOSS community is developing metrics, methodologies, and
software for expressing open source project health and sustainability. By
doing so, CHAOSS seeks to improve the transparency of open source
project health and sustainability so that relevant stakeholders can make
more informed decisions about open source project engagement."

University of Mons, Laval University, Polytechnique Montréal
www.secohealth.org
@secohealth
2017-2019

Best Practices
1. Determine indicators of software health issues
2. Predict the impact and propagation of health issues
3. Provide recommendations and guidelines to avoid future software health problems

Ecosystem Health Issues
Technical
• Bugs
• Security vulnerabilities
• Dependency problems
• Abandoned or outdated software
• Redundant or duplicated code
• Incompatible software licences
• ...
Social
• Lack of communication / interaction
• Social conflicts
• Contributor abandonment
• Insufficient diversity
• Cultural differences
• ...

seco-assist.github.io
@seco-assist
2018-2021

SECO-ASSIST aims to provide novel software recommendation
techniques to address the software ecosystem challenges of
longevity, scale, heterogeneity, and community. This will be
achieved by combining socio-technical analysis, database usage
analysis, library evolution and software test automation.

UMONS, UNamur, UAntwerpen, VUB
Tom Mens, Anthony Cleve, Coen De Roover, Serge Demeyer

• Improve software testing and prevent bugs
• Improve software library reuse
• Optimise database usage
• Improve developer team interaction

SECO-ASSIST Goals
Improve social health
• Retain key contributors and attract new ones
• Predict abandoners and find replacements
• Identify toxic contributors
• Ensure sufficient diversity

SECO-ASSIST Goals
Improve technical health
• Better software tests, taking into account the software
dependencies
→ fewer bugs and security issues
• Higher productivity and quality by using reusable
software libraries
• Increased maintainability by supporting upgrades and
migrations (to new libraries, other technologies, ...)

Current Research
Empirical studies on historical software ecosystem data
to analyse and understand
• Software contributor retention and abandonment
• The propagation of health problems through technical
dependencies in a software ecosystem
• The impact of "technical lag" caused by outdated
dependencies
• The impact of security vulnerabilities
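
To make the notion of "technical lag" more concrete, the sketch below measures how far a used dependency version is behind the most recent release, both in days and in number of missed releases. This is only one simple reading of the idea and not necessarily the exact definition used in the cited study; the function name and the release history are hypothetical.

```python
from datetime import date


def technical_lag(releases, used_version):
    """Return (time lag in days, number of missed releases) for a dependency.

    `releases` is a list of (version, release_date) tuples. The lag is
    measured between the used version and the most recently released one.
    """
    ordered = sorted(releases, key=lambda release: release[1])
    versions = [version for version, _ in ordered]
    position = versions.index(used_version)  # raises ValueError if unknown
    used_date = ordered[position][1]
    latest_date = ordered[-1][1]
    return (latest_date - used_date).days, len(ordered) - 1 - position


# Hypothetical release history of a dependency.
history = [
    ("1.0.0", date(2017, 1, 10)),
    ("1.1.0", date(2017, 6, 1)),
    ("2.0.0", date(2018, 3, 15)),
]

lag_days, missed_releases = technical_lag(history, "1.1.0")
```

A project depending on the latest release has a lag of zero on both measures; the older the version it is tied to, the larger both numbers become.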

Analysing Security Vulnerabilities in npm

How long do packages
remain vulnerable?
It takes a long time before vulnerabilities are
removed from a package.

When are vulnerabilities fixed?
+ Most vulnerabilities are quickly fixed after their discovery.
- ~20% of vulnerabilities take more than 1 year to be fixed.

When are vulnerabilities fixed
in dependent packages?
Dependent packages are vulnerable much
longer! Package maintainers must use security
monitoring tools, and adapt their dependency
constraints to quickly benefit from security fixes
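
The effect of dependency constraints on receiving security fixes can be illustrated with a deliberately simplified version matcher. The sketch assumes `MAJOR.MINOR.PATCH` versions and three constraint styles (an exact pin, `~` for patch updates, `^` for minor and patch updates); it ignores pre-release tags and the special treatment of 0.x versions that real package managers apply, and the release history is hypothetical.

```python
def parse(version):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraint):
    """Simplified check of a version against a pinned, '~' or '^' constraint.

    Assumption: pre-release tags and 0.x special cases are not handled.
    """
    v = parse(version)
    if constraint.startswith("^"):
        base = parse(constraint[1:])
        return v >= base and v[0] == base[0]    # same major version
    if constraint.startswith("~"):
        base = parse(constraint[1:])
        return v >= base and v[:2] == base[:2]  # same major and minor version
    return v == parse(constraint)               # exact pin


def resolve(available, constraint):
    """Return the highest available version allowed by the constraint, or None."""
    allowed = [v for v in available if satisfies(v, constraint)]
    return max(allowed, key=parse) if allowed else None


# Hypothetical release history in which 1.2.4 contains the security fix.
available = ["1.2.3", "1.2.4", "1.3.0", "2.0.0"]

pinned = resolve(available, "1.2.3")    # stays on the vulnerable release
tilde = resolve(available, "~1.2.3")    # picks up the patch release
caret = resolve(available, "^1.2.3")    # picks up minor and patch releases
```

Under these assumptions a strictly pinned dependency keeps resolving to the vulnerable release until its maintainer edits the constraint, whereas the more permissive constraints pick up the fix automatically.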

References
• E. Constantinou, T. Mens. An Empirical Comparison of Developer
Retention in the RubyGems and npm Software Ecosystems.
Innovations in Systems and Software Engineering, 2017
• A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of
Dependency Network Evolution in Seven Software Packaging
Ecosystems. Empirical Software Engineering, 2018
• A. Zerouali, E. Constantinou, T. Mens, G. Robles, J. Gonzalez-
Barahona. An empirical analysis of technical lag in npm package
dependencies. ICSR 2018
• A. Decan, T. Mens, E. Constantinou. On the impact of security
vulnerabilities in the npm package dependency network. MSR
2018
