Software Ecosystems = Big Data!
Prof. Dr. Tom Mens
Software Engineering Lab
tom.mens@umons.ac.be
Big Data Analytics of
Software Ecosystem Health
A software ecosystem is a
collection of [interdependent]
software projects that are
developed and evolve together in
the same environment.
Mircea Lungu
(PhD, 2008)
Software Ecosystems = Big Data
The 4 Vs: Volume, Velocity, Variety, Veracity
Volume: software ecosystems involve huge
quantities of data
Debian (Linux distribution)
Archive http://snapshot.debian.org containing daily
snapshots of packages, maintainers, dependencies, ...
"Snapshot keeps growing. We are now at approximate 60TB of files.
This made it necessary to break up the RAID-1 mirror across two
external storage arrays ..., and it also meant we needed more machines
(now six) at our mirrorsite ..."
Debian bug tracker http://methyer.ethz.ch/bts/
• 122 thousand active bugs; 779 thousand archived bugs
Debian security tracker
• 29 thousand security vulnerabilities
Example: npm, the software package manager for JavaScript
Created in 2010.
In 2017:
• 3.5TB of storage required for hosting 500K packages
• 2.3 million opened GitHub pull requests for JavaScript
repositories
March 2018:
• ~0.7 million packages
• ~4.4 million package releases
• ~19.8 million (runtime) package dependencies
GHTorrent: (partial) data dump of GitHub, as of April 2018
• 24.1 million users
• 83.6 million projects
• 67.4 million issues
• 34 million pull requests
• 930 million commits
RubyGems software package manager for Ruby (since 2004)
Ecosystem size in March 2018:
• ~144 thousand packages
• ~825 thousand package releases
• ~2 million (runtime) package dependencies
Some data for the Ruby on Rails project:
68,980 commits
346 releases
3,570 contributors
> 11k issues; > 21k pull requests; >16k forks on GitHub
> 11k dependent packages; > 458k dependent repositories
Variety: software ecosystems involve very
heterogeneous data sources
• Structured data: source code, dependency graphs,
version control systems, ...
• Semi-structured data: e.g. mailing lists, online surveys,
social media, Q&A websites, ...
• Unstructured data: unformatted text, video and voice
recordings of interviews, field notes
Heterogeneity
• Source code, packaging metadata, models,
documentation, tests, databases, bug and issue reports, ...
• Multiple programming/natural languages
• Cultural differences
Veracity: software ecosystem analysis
requires dealing with uncertain, inconsistent,
invalid and missing data
Examples:
• Missing: Corrupted or lost historical data (voluntarily or
not).
E.g., removed projects/user profiles; "rebasing" the
version history
• Uncertain: Different data sources may disagree → which
one is correct?
• Invalid/inconsistent: Especially data produced by humans
Velocity: software ecosystems are growing rapidly
• New commits are made to GitHub several times every second
• For web-based development analytics dashboards, or
automated recommendation systems, using the most recent
data is important to make informed decisions
[Figure: number of packages (log scale) and number of dependencies (log scale) per year, 2012 to 2017, for cargo, cpan, cran, npm, nuget, packagist and rubygems]
A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of Dependency Network
Evolution in Seven Software Packaging Ecosystems. Empirical Software Engineering, 2018
Many Challenges
• How to retrieve data?
• Data provider services (e.g. libraries.io, GHTorrent, GitHubArchive, ...)
• How to avoid "abusing" their APIs?
• How to deal with changes in APIs?
• How to deal with data veracity?
• How to analyse data?
• Storing and sharing such amounts of data requires specific hardware
• Processing data can take a lot of time; several weeks not uncommon
• How to report such huge amounts of data?
• Aggregating results
• Adapted visualisation techniques
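One common way to avoid "abusing" a data provider's API is to back off when the provider signals that requests are arriving too fast. The following is a minimal sketch, not the approach of any specific service: the `RateLimited` exception and the `fetch` callable are hypothetical stand-ins for whatever client a given provider offers.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error raised by `fetch` when the provider asks us to slow down."""


def fetch_with_backoff(
    fetch: Callable[[], dict],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(), doubling the waiting time after every rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Wait 1x, 2x, 4x, ... the base delay before retrying.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice that lets the retry logic be exercised without actually waiting.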
Challenges continued
• How to combine data originating from different sources?
O(n*m) time complexity is not affordable because n and m
are too large
• How to identify "interesting" and "relevant" data?
• Need new data cleaning and data mining techniques
capable of dealing with this amount of data
• Providing "incremental" solutions that keep the extracted
data up-to-date
• Dealing with identities of individuals
• Identity merging
• Preserving privacy and anonymity
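As an illustration of the O(n*m) concern when combining sources, one standard alternative to comparing every pair of records is a hash join, which indexes one source by a shared key and then scans the other once. This is a sketch under the assumption that both sources share an exact key; the record fields and values below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def hash_join(left: Iterable[Dict], right: Iterable[Dict], key: str) -> List[Dict]:
    """Join two record collections on `key` with one pass over each collection."""
    index = defaultdict(list)
    for row in right:  # one pass over the m records of the second source
        index[row[key]].append(row)
    joined = []
    for row in left:  # one pass over the n records of the first source
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical records: package metadata from a registry and from a code host.
registry = [{"package": "a", "releases": 12}, {"package": "b", "releases": 3}]
repos = [{"package": "a", "stars": 40}, {"package": "c", "stars": 7}]
print(hash_join(registry, repos, "package"))
```

In practice the keys from different sources rarely match exactly, which is where the data cleaning and identity merging challenges listed above come in.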
Research Context
• Today over 80 percent of all software in any technology product
or service is open source software (OSS).
• CHAOSS focuses on creating analytics and metrics to help
define OSS community health.
https://chaoss.community
"The CHAOSS community is developing metrics, methodologies, and
software for expressing open source project health and sustainability. By
doing so, CHAOSS seeks to improve the transparency of open source
project health and sustainability so that relevant stakeholders can make
more informed decisions about open source project engagement."
University of Mons, Laval University, Polytechnique Montréal
www.secohealth.org
@secohealth
2017-2019
Best practices
1. Determine indicators of software health issues
2. Predict the impact and propagation of health issues
3. Provide recommendations and guidelines to avoid future
software health problems
Ecosystem Health Issues
Technical
• Bugs
• Security vulnerabilities
• Dependency problems
• Abandoned or outdated software
• Redundant or duplicated code
• Incompatible software licences
• ...
Social
• Lack of communication / interaction
• Social conflicts
• Contributor abandonment
• Insufficient diversity
• Cultural differences
• ...
Example: leftpad
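A problem in one package can reach packages that depend on it only indirectly. A minimal sketch of how the set of affected packages could be computed is a breadth-first traversal of the reversed dependency graph. The graph and package names below are hypothetical and are not taken from any real ecosystem.

```python
from collections import deque
from typing import Dict, List, Set


def transitive_dependents(dependencies: Dict[str, List[str]], package: str) -> Set[str]:
    """Return every package that depends on `package`, directly or indirectly."""
    # Invert the graph: for each package, record who depends on it.
    dependents: Dict[str, List[str]] = {}
    for pkg, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(pkg)

    affected: Set[str] = set()
    queue = deque([package])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical dependency graph: package -> packages it depends on.
graph = {
    "app": ["framework"],
    "framework": ["string-utils"],
    "cli": ["string-utils"],
    "string-utils": [],
    "unrelated": [],
}
print(sorted(transitive_dependents(graph, "string-utils")))
```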
seco-assist.github.io
@seco-assist
2018-2021
SECO-ASSIST aims to provide novel software recommendation
techniques to address the software ecosystem challenges of
longevity, scale, heterogeneity, and community. This will be
achieved by combining socio-technical analysis, database usage
analysis, library evolution and software test automation.
UMONS, UNamur, UAntwerpen, VUB
Tom Mens, Anthony Cleve, Coen De Roover, Serge Demeyer
• Improve software testing and prevent bugs
• Improve software library reuse
• Optimise database usage
• Improve developer team interaction
SECO-ASSIST Goals
Improve social health
• Retain key contributors and attract new ones
• Predict abandoners and find replacements
• Identify toxic contributors
• Ensure sufficient diversity
SECO-ASSIST Goals
Improve technical health
• Better software tests, taking into account the software
dependencies
→ fewer bugs and security issues
• Higher productivity and quality by using reusable
software libraries
• Increased maintainability by supporting upgrades and
migrations (to new libraries, other technologies, ...)
Current Research
Empirical studies on historical software ecosystem data
to analyse and understand
• Software contributor retention and abandonment
• The propagation of health problems through technical
dependencies in a software ecosystem
• The impact of "technical lag" caused by outdated
dependencies
• The impact of security vulnerabilities
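To make "technical lag" concrete, here is a minimal sketch that assumes lag is measured as the time between the release a dependent package actually uses and the most recent available release of that dependency. This is one possible reading only; the cited study may define and measure lag differently. The release history below is hypothetical.

```python
from datetime import date
from typing import Dict


def time_lag_days(release_dates: Dict[str, date], used_version: str) -> int:
    """Days between the used release and the most recent release of a dependency."""
    latest = max(release_dates.values())
    return (latest - release_dates[used_version]).days


# Hypothetical release history of one dependency.
releases = {
    "1.0.0": date(2017, 1, 10),
    "1.1.0": date(2017, 6, 1),
    "2.0.0": date(2018, 3, 1),
}
print(time_lag_days(releases, "1.1.0"))
```

A package that uses the newest release has a lag of zero under this definition.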
Analysing Security Vulnerabilities
How long do packages
remain vulnerable?
It takes a long time before vulnerabilities are
removed from a package.
When are vulnerabilities fixed?
+ Most vulnerabilities are quickly fixed after their discovery.
- ~20% of vulnerabilities take more than 1 year to be fixed.
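A figure such as "the share of vulnerabilities that take more than one year to fix" can be derived from pairs of discovery and fix dates. The sketch below shows the arithmetic only; the dates are hypothetical and are not data from the study.

```python
from datetime import date
from typing import List, Tuple


def share_fixed_after(vulns: List[Tuple[date, date]], threshold_days: int = 365) -> float:
    """Fraction of (discovered, fixed) pairs whose fix took longer than threshold_days."""
    if not vulns:
        return 0.0
    slow = sum(1 for discovered, fixed in vulns if (fixed - discovered).days > threshold_days)
    return slow / len(vulns)


# Hypothetical (discovery date, fix date) pairs; not data from the study.
sample = [
    (date(2016, 1, 1), date(2016, 1, 15)),
    (date(2016, 3, 1), date(2016, 3, 2)),
    (date(2016, 5, 1), date(2017, 9, 1)),
    (date(2017, 2, 1), date(2017, 2, 20)),
    (date(2017, 4, 1), date(2017, 4, 3)),
]
print(share_fixed_after(sample))
```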
When are vulnerabilities fixed
in dependent packages?
Dependent packages remain vulnerable much
longer! Package maintainers must use security
monitoring tools, and adapt their dependency
constraints to quickly benefit from security fixes.
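The effect of a dependency constraint on receiving security fixes can be illustrated with a simplified version check. The sketch below assumes npm-style caret ranges over plain MAJOR.MINOR.PATCH versions; it ignores pre-release tags and partial versions, and the function names and version numbers are hypothetical.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH' (no pre-release tags) into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def satisfies_caret(version: str, base: str) -> bool:
    """True if `version` is at least `base` and keeps its leftmost non-zero component."""
    v, b = parse(version), parse(base)
    if v < b:
        return False
    if b[0] > 0:
        return v[0] == b[0]      # ^1.4.2 allows 1.x.y at or above 1.4.2
    if b[1] > 0:
        return v[:2] == b[:2]    # ^0.4.2 allows 0.4.y at or above 0.4.2
    return v == b                # ^0.0.2 allows only 0.0.2


def satisfies_pinned(version: str, base: str) -> bool:
    """True only for exactly the pinned version."""
    return parse(version) == parse(base)


# Hypothetical scenario: 1.4.2 is vulnerable and the fix ships in 1.4.3.
print(satisfies_caret("1.4.3", "1.4.2"))   # the caret range picks up the fix
print(satisfies_pinned("1.4.3", "1.4.2"))  # the pinned constraint does not
```

Under these assumptions, a dependent with a pinned constraint stays on the vulnerable release until its maintainer edits the constraint, whereas a caret range accepts the patched release on the next install.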
References
• E. Constantinou, T. Mens. An Empirical Comparison of Developer
Retention in the RubyGems and npm Software Ecosystems.
Innovations in Systems and Software Engineering, 2017
• A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of
Dependency Network Evolution in Seven Software Packaging
Ecosystems. Empirical Software Engineering, 2018
• A. Zerouali, E. Constantinou, T. Mens, G. Robles, J. Gonzalez-Barahona.
An Empirical Analysis of Technical Lag in npm Package
Dependencies. ICSR 2018
• A. Decan, T. Mens, E. Constantinou. On the impact of security
vulnerabilities in the npm package dependency network. MSR
2018
Questions?
