Software Ecosystems = Big Data

Big Data Analytics of Software Ecosystem Health: Presentation during INFORTECH Scientific Day (23 May 2018) by Professor Tom Mens. The talk reports on ongoing research of the Software Engineering Lab of the University of Mons (UMONS) on health aspects of evolving software ecosystems. This research was conducted in collaboration with postdoctoral researchers Alexandre Decan and Eleni Constantinou, as well as the external partners of two ongoing research projects: SECOHealth (https://secohealth.github.io) and the Excellence of Science research project SECO-ASSIST (https://secoassist.github.io).

Published in: Science

  1. 1. Software Ecosystems = Big Data!
  2. 2. Prof. Dr. Tom Mens Software Engineering Lab tom.mens@umons.ac.be Big Data Analytics of Software Ecosystem Health
  3. 3. A software ecosystem is a collection of [interdependent] software projects that are developed and evolve together in the same environment. Mircea Lungu (PhD, 2008)
  4. 4. Software Ecosystems = Big Data The 4 Vs: Volume, Velocity, Variety, Veracity
  5. 5. Software Ecosystems = Big Data Volume: software ecosystems involve huge quantities of data Debian (Linux distribution) Archive http://snapshot.debian.org containing daily snapshots of packages, maintainers, dependencies, ... "Snapshot keeps growing. We are now at approximate 60TB of files. This made it necessary to break up the RAID-1 mirror across two external storage arrays ..., and it also meant we needed more machines (now six) at our mirrorsite ..." Debian bug tracker http://methyer.ethz.ch/bts/ • 122 thousand active bugs; 779 thousand archived bugs Debian security tracker • 29 thousand security vulnerabilities
  6. 6. Software Ecosystems = Big Data Volume: software ecosystems involve huge quantities of data Example: npm, the software package manager for JavaScript Created in 2010. In 2017: • 3.5TB of storage required for hosting 500K packages • 2.3 million opened GitHub pull requests for JavaScript repositories March 2018: • ~0.7 million packages • ~4.4 million package releases • ~19.8 million (runtime) package dependencies
  7. 7. Software Ecosystems = Big Data Volume: software ecosystems involve huge quantities of data GHTorrent: (partial) data dump of GitHub in April 2018 • 24.1 million users • 83.6 million projects • 67.4 million issues • 34 million pull requests • 930 million commits
  8. 8. Software Ecosystems = Big Data Volume: software ecosystems involve huge quantities of data RubyGems software package manager for Ruby (since 2004) Ecosystem size in March 2018: • ~144 thousand packages • ~825 thousand package releases • ~2 million (runtime) package dependencies Some data for the Ruby on Rails project: 68,980 commits 346 releases 3,570 contributors > 11k issues; > 21k pull requests; >16k forks on GitHub > 11k dependent packages; > 458k dependent repositories
  9. 9. Software Ecosystems = Big Data Variety: software ecosystems involve very heterogeneous data sources • Structured data: source code, dependency graphs, version control systems, ... • Semi-structured data: e.g. mailing lists, online surveys, social media, Q&A websites, ... • Unstructured data: unformatted text, video and voice recordings of interviews, field notes Heterogeneity • Source code, packaging metadata, models, documentation, tests, databases, bug and issue reports, ... • Multiple programming/natural languages • Cultural differences
  10. 10. Software Ecosystems = Big Data Veracity: software ecosystem analysis requires dealing with uncertain, inconsistent, invalid and missing data Examples: • Missing: Corrupted or lost historical data (voluntarily or not). E.g., removed projects/user profiles; "rebasing" the version history • Uncertain: Different data sources may disagree → which one is correct? • Invalid/inconsistent: Especially data produced by humans
  11. 11. Software Ecosystems = Big Data Velocity: software ecosystems are growing rapidly • New commits are made to GitHub several times every second • For web-based development analytics dashboards, or automated recommendation systems, using the most recent data is important to make informed decisions [Figure: number of packages (log scale) and number of dependencies (log scale) per year, 2012-2017, for cargo, cpan, cran, npm, nuget, packagist and rubygems] Source: A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of Dependency Network Evolution in Seven Software Packaging Ecosystems. Empirical Software Engineering, 2018
  12. 12. Software Ecosystems = Big Data Many Challenges • How to retrieve data? • Data provider services (e.g. libraries.io, GHTorrent, GitHubArchive, ...) • How to avoid "abusing" their APIs? • How to deal with changes in APIs? • How to deal with data veracity? • How to analyse data? • Storing and sharing such amounts of data requires specific hardware • Processing data can take a lot of time; several weeks is not uncommon • How to report such huge amounts of data? • Aggregating results • Adapted visualisation techniques
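
One common way to avoid "abusing" a data provider's API is to retry with exponential backoff whenever the provider signals that requests are arriving too fast. The sketch below is a minimal illustration under stated assumptions: `RateLimited` and `fetch_with_backoff` are hypothetical names, and `fetch` stands for any client call that is not specified in the talk.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a client raises when the provider asks us to slow down."""


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), waiting base_delay * 2**attempt seconds after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the last allowed attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that the waiting policy can be checked without actually pausing.
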
  13. 13. Software Ecosystems = Big Data Challenges continued • How to combine data originating from different sources? O(n*m) time complexity not affordable because n and m too large • How to identify "interesting" and "relevant" data? • Need new data cleaning and data mining techniques capable of dealing with this amount of data • Providing "incremental" solutions that keep the extracted data up-to-date • Dealing with identities of individuals • Identity merging • Preserving privacy and anonymity
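
The O(n*m) cost mentioned above arises when every record of one source is compared with every record of the other. When both sources share a key, a hash join brings the expected cost down to roughly O(n + m) plus the size of the result. The sketch below assumes, purely for illustration, that package records and issue records share a repository name; the function name and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hash_join(
    packages: Iterable[Tuple[str, str]],
    issues: Iterable[Tuple[str, int]],
) -> List[Tuple[str, str, int]]:
    """Join (repo, package) rows with (repo, issue_id) rows on the repo key."""
    # Build phase: index one source by the join key, O(n).
    by_repo: Dict[str, List[str]] = defaultdict(list)
    for repo, package in packages:
        by_repo[repo].append(package)
    # Probe phase: one dictionary lookup per row of the other source, O(m).
    joined: List[Tuple[str, str, int]] = []
    for repo, issue_id in issues:
        for package in by_repo.get(repo, []):
            joined.append((repo, package, issue_id))
    return joined
```

Rows whose key appears in only one source are dropped, which corresponds to an inner join. Whether that is the right behaviour depends on the analysis.
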
  14. 14. Research Context • Today over 80 percent of all software in any technology product or service is open source software (OSS). • CHAOSS focuses on creating analytics and metrics to help define OSS community health. https://chaoss.community "The CHAOSS community is developing metrics, methodologies, and software for expressing open source project health and sustainability. By doing so, CHAOSS seeks to improve the transparency of open source project health and sustainability so that relevant stakeholders can make more informed decisions about open source project engagement."
  15. 15. University of Mons Laval University Polytechnique Montréal Université de Mons www.secohealth.org @secohealth 2017-2019
  16. 16. University of Mons Laval University Polytechnique Montréal Université de Mons www.secohealth.org @secohealth
  17. 17. Best Practices 1. Determine indicators of software health issues 2. Predict the impact and propagation of health issues 3. Provide recommendations and guidelines to avoid future software health problems
  18. 18. Ecosystem Health Issues Technical: • Bugs • Security vulnerabilities • Dependency problems • Abandoned or outdated software • Redundant or duplicated code • Incompatible software licences • ... Social: • Lack of communication / interaction • Social conflicts • Contributor abandonment • Insufficient diversity • Cultural differences • ...
  19. 19. Example: leftpad
  20. 20. seco-assist.github.io @seco-assist 2018-2021
  21. 21. seco-assist.github.io @seco-assist 2018-2021 SECO-ASSIST aims to provide novel software recommendation techniques to address the software ecosystem challenges of longevity, scale, heterogeneity, and community. This will be achieved by combining socio-technical analysis, database usage analysis, library evolution and software test automation.
  22. 22. UMONS UNamur UAntwerpen VUB Tom Mens Anthony Cleve Coen De Roover Serge Demeyer
  23. 23. • Improve software testing and prevent bugs • Improve software library reuse • Optimise database usage • Improve developer team interaction UMONS UNamur UAntwerpen VUB
  24. 24. SECO-ASSIST Goals Improve social health • Retain key contributors and attract new ones • Predict abandoners and find replacements • Identify toxic contributors • Ensure sufficient diversity
  25. 25. SECO-ASSIST Goals Improve technical health • Better software tests, taking into account the software dependencies → fewer bugs and security issues • Higher productivity and quality by using reusable software libraries • Increased maintainability by supporting upgrades and migrations (to new libraries, other technologies, ...)
  26. 26. Current Research Empirical studies on historical software ecosystem data to analyse and understand • Software contributor retention and abandonment • The propagation of health problems through technical dependencies in a software ecosystem • The impact of "technical lag" caused by outdated dependencies • The impact of security vulnerabilities
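
To make "technical lag" concrete, the sketch below measures how far a used dependency version is behind the latest available release, both as a count of newer releases and as a number of days. This is one simple reading of the concept, not necessarily the definition used in the cited study; the function name and the release data layout are assumptions made for illustration.

```python
from datetime import date
from typing import List, Tuple

Version = Tuple[int, int, int]
Release = Tuple[Version, date]  # (version, release date)


def technical_lag(releases: List[Release], used: Version) -> Tuple[int, int]:
    """Return (number of newer releases, days between the used and the latest release)."""
    ordered = sorted(releases)  # tuples sort by version first
    versions = [version for version, _ in ordered]
    index = versions.index(used)  # raises ValueError if the used version is unknown
    newer = len(ordered) - 1 - index
    days = (ordered[-1][1] - ordered[index][1]).days
    return newer, days
```

A lag of `(0, 0)` means the dependency is up to date. Larger values indicate that newer releases, possibly including fixes, are being missed.
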
  27. 27. Analysing Security Vulnerabilities in time
  28. 28. How long do packages remain vulnerable? It takes a long time before vulnerabilities are removed from a package.
  29. 29. When are vulnerabilities fixed? + Most vulnerabilities are quickly fixed after their discovery. - ~20% of vulnerabilities take more than 1 year to be fixed.
  30. 30. When are vulnerabilities fixed in dependent packages? Dependent packages are vulnerable much longer! Package maintainers must use security monitoring tools, and adapt their dependency constraints to quickly benefit from security fixes.
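
The advice on dependency constraints can be illustrated with a small check of which releases a constraint accepts. The sketch below handles only two constraint forms, an exact pin such as `1.2.3` and a caret range such as `^1.2.3`, following npm-style caret semantics. It is a simplified assumption-laden subset: prerelease tags and all other operators are ignored, and `parse` and `satisfies` are hypothetical helper names.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(text: str) -> Version:
    """Parse a plain "major.minor.patch" string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def satisfies(version: str, constraint: str) -> bool:
    """Check a version against an exact pin ("1.2.3") or a caret range ("^1.2.3")."""
    candidate = parse(version)
    if not constraint.startswith("^"):
        return candidate == parse(constraint)  # exact pin: only that release
    lower = parse(constraint[1:])
    major, minor, patch = lower
    # A caret range allows changes that keep the leftmost non-zero component fixed.
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return lower <= candidate < upper
```

With these rules, a security fix released as `1.2.4` is rejected by the pin `1.2.3` but accepted by `^1.2.3`, so a package using the caret range can pick up the fix without changing its own metadata.
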
  31. 31. References • E. Constantinou, T. Mens. An Empirical Comparison of Developer Retention in the RubyGems and npm Software Ecosystems. Innovations in Systems and Software Engineering, 2017 • A. Decan, T. Mens, Ph. Grosjean. An Empirical Comparison of Dependency Network Evolution in Seven Software Packaging Ecosystems. Empirical Software Engineering, 2018 • A. Zerouali, E. Constantinou, T. Mens, G. Robles, J. Gonzalez-Barahona. An empirical analysis of technical lag in npm package dependencies. ICSR 2018 • A. Decan, T. Mens, E. Constantinou. On the impact of security vulnerabilities in the npm package dependency network. MSR 2018
  32. 32. Questions?