
Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI

This work was presented at the 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2018), held at Le Corum in Montpellier, France, from September 3 to 7, 2018.

Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a. broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations.
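To make these two assumptions concrete, here is a minimal hypothetical sketch in Python. The build record and its field names (state, jobs, allow_failure) are invented for illustration; Travis CI exposes similar attributes, but this is not the study's tooling.

```python
# Hypothetical build record: all field names are invented for illustration.
build = {
    "state": "passed",  # the outcome that most analyses consume as-is
    "jobs": [
        {"env": "jdk8",     "state": "passed", "allow_failure": False},
        {"env": "jdk11",    "state": "passed", "allow_failure": False},
        {"env": "jdk-head", "state": "failed", "allow_failure": True},  # ignored
    ],
}

# Assumption (1) breaks: the build "passed" even though one of its jobs failed.
hidden = [j for j in build["jobs"] if j["state"] == "failed" and j["allow_failure"]]
print(f"passing build with {len(hidden)} actively ignored failure(s)")

# Assumption (2) breaks: another build may run a different number of jobs
# and configurations, so two build outcomes are not directly comparable.
```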

To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance.


Transcript

  1. Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI. Keheliya Gallaba (@keheliya, keheliya.github.io), Shane McIntosh (@shane_mcintosh, shanemcintosh.org), Christian Macho (@Mitschiiii, mitschi.github.io), Martin Pinzger (@pinzger, pinzger.github.io)
  2. Automated builds check the impact of changes on the software product: Source Code → Build System → Deliverables.
  3. Build outcome data is used to solve software engineering research problems: understanding and predicting build breakage, measuring the build breakage rate, and communicating the current build status.
  4. Build outcome data is nuanced! allow_failure enables experimentation with support for a new platform.
  5. Can off-the-shelf historical CI build data be trusted? The zdavatz/spreadsheet project has had the allow_failure feature enabled for its entire lifetime!
  6. Are build outcomes free of noise? Are build outcomes homogeneous?
  7. We study 680,209 Travis CI builds spanning 1,276 open source projects, following Mockus' four-step procedure.
  8. Are build outcomes free of noise?
  9. We look for passing builds with actively ignored failures: from 680,209 builds, we select the passing builds (496,204), then the builds among them that contain failing jobs (59,904), and finally check whether the allow_failure property is enabled for the failing jobs in .travis.yml (see the sketch after the transcript).
  10. Passing build outcomes do not always indicate that the build was entirely clean: 12% of passing builds have an actively ignored failure, and in such builds up to 87% of the jobs are actively ignored.
  11. Passively ignored breakages may introduce noise when all breakages are assumed to be distracting. Starting from 680,209 builds, we filter down to 610,550 builds, construct graphs using version control data, and analyze those graphs. Long breakage sequences may mean that developers passively ignored failures by not fixing them immediately.
  12. In some cases, builds remain broken for as long as 423 days; the overall median length of a failure sequence is five commits.
  13. One reason for ignoring a build breakage is staleness: developers may become desensitized to stale breakages. (A breakage is stale if the project has encountered the same breakage in the past.)
  14. We measure staleness in Maven build breakages with a Maven Log Analyzer: if a build fails for the same reason as a prior failure, and the failure details are equal to those of a prior failure, we label it a stale breakage; otherwise, it is not stale (see the sketch after the transcript).
  15. Two of every three build breakages (67%) that we analyze are stale.
  16. We propose a Signal-to-Noise Ratio to quantify the proportion of noise. Broken builds with ignored breakages are false build breakages and passing builds with ignored breakages are false build successes (noise); broken builds with no ignored breakages are true build breakages and passing builds with no ignored breakages are true build successes (signal). A sketch of this bookkeeping appears after the transcript.
  17. One in every 7 to 11 builds (9%-14%) is incorrectly labelled.
  18. Noise may influence analyses based on build outcome data: passing build outcomes do not always indicate that the build was entirely clean; build breakages can persist for up to 485 commits (423 days); 67% of the build breakages we analyze are stale; and 9%-14% of builds are incorrectly labelled.
  19. Are build outcomes homogeneous?
  20. Computing the Matrix Breakage Purity (MBP): MBP = 1 indicates an environment-agnostic breakage, while MBP < 1 indicates an environment-specific breakage (see the sketch after the transcript).
  21. Environment-specific breakage is commonplace.
  22. Builds can break for various reasons: compilation failures, test failures, dependency resolution failures, and deployment failures. We extend the Maven Log Analyzer to parse and classify broken Maven build logs by type (see the sketch after the transcript).
  23. The Maven Log Analyzer supports new build breakage categories: Ant inside Maven, Goal failed, Broken outside Maven, Run System/Java program, Run Jetty server, Manage Ruby gems, Polyglot for Maven, No log available, Failed before Maven, Travis aborted, Failed after Maven, and Travis cancelled.
  24. Tool-specific breakage is rare: 41% of the broken builds failed due to problems outside of Maven.
  25. Build outcomes are heterogeneous: environment-specific breakage is commonplace, tool-specific breakage is rare, and future automatic breakage recovery techniques should tackle issues in the CI scripts.
  26. Our observations have broader implications. For the research community: build outcome noise should be filtered out before analyses, and heterogeneity should be considered when training build outcome prediction models. For tool builders: automatic breakage recovery should look beyond tool-specific insight, and richer information should be included in build outcome reports and dashboards.
  27. github.com/software-rebels/bbchch (@keheliya)
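The job-level filtering on slide 9 reads as a three-stage funnel. Below is a minimal sketch of that funnel, assuming each build record carries its outcome, its jobs, and the allow_failure flags parsed from .travis.yml; the field names are our assumptions, not the authors' tooling.

```python
def passing_builds_with_ignored_failures(builds):
    """Funnel from slide 9: 680,209 builds -> 496,204 passing builds ->
    59,904 passing builds that contain failing jobs -> those whose failing
    jobs have allow_failure enabled in .travis.yml."""
    passing = [b for b in builds if b["state"] == "passed"]
    with_failing_jobs = [b for b in passing
                         if any(j["state"] == "failed" for j in b["jobs"])]
    # A passing build can only contain failing jobs when those jobs were
    # allowed to fail, so this last step flags actively ignored failures.
    return [b for b in with_failing_jobs
            if any(j["state"] == "failed" and j["allow_failure"]
                   for j in b["jobs"])]
```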
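The staleness check on slide 14 can be sketched as follows, assuming the Maven Log Analyzer yields a breakage reason and the failure details for each broken build. The field names, and the assumption that breakages arrive in chronological order, are ours.

```python
def label_staleness(breakages):
    """Label a breakage stale when an earlier breakage in the same project
    failed for the same reason AND with equal failure details (slide 14).
    Assumes chronological order and hashable details (e.g., a tuple of
    failing test names); both are illustrative assumptions."""
    seen = set()
    labels = []
    for b in breakages:
        key = (b["reason"], b["details"])
        labels.append("stale" if key in seen else "not stale")
        seen.add(key)
    return labels
```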
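Slide 16's matrix can be tallied as below. Treating every build whose outcome is contradicted by an ignored breakage as noise is our reading of the slide; the paper's exact Signal-to-Noise formulation may differ.

```python
def noise_proportion(builds):
    """Per slide 16: builds with ignored breakages (false build breakages,
    false build successes) are noise; the rest (true build breakages, true
    build successes) are signal. Slide 17 reports 9%-14% noise overall."""
    noise = sum(1 for b in builds if b["has_ignored_breakage"])
    signal = len(builds) - noise
    return noise / (noise + signal)
```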
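Matrix Breakage Purity (slide 20) can be sketched as the fraction of a broken build's job matrix that failed. This follows the slide's reading that MBP = 1 means every variant broke; the paper's formal definition may differ.

```python
def matrix_breakage_purity(broken_build):
    """MBP = failed jobs / all jobs in the build matrix (our reading).
    MBP == 1: environment-agnostic breakage (every variant broke).
    MBP  < 1: environment-specific breakage (only some variants broke)."""
    jobs = broken_build["jobs"]
    failed = sum(1 for j in jobs if j["state"] == "failed")
    return failed / len(jobs)
```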
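Slides 22-23 classify broken Maven logs by type. A toy classifier in that spirit is sketched below; the patterns are illustrative strings that commonly appear in Maven output, not the rules of the authors' Maven Log Analyzer.

```python
import re

# Illustrative patterns only; the real Maven Log Analyzer has richer rules.
RULES = [
    (re.compile(r"COMPILATION ERROR"),              "compilation failure"),
    (re.compile(r"There are test failures"),        "test failure"),
    (re.compile(r"Could not resolve dependencies"), "dependency resolution failure"),
    (re.compile(r"Failed to deploy"),               "deployment failure"),
]

def classify(log_text):
    for pattern, label in RULES:
        if pattern.search(log_text):
            return label
    return "outside Maven / unknown"  # e.g., failed before or after Maven ran

print(classify("[ERROR] COMPILATION ERROR : ..."))  # -> compilation failure
```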
