1. Continuous Test Suite Failure Prediction
Cong Pan¹*, Michael Pradel²
¹School of Reliability and Systems Engineering, Beihang University, China
²Department of Computer Science, University of Stuttgart, Germany
*Parts of this work were done while visiting the University of Stuttgart
cong_pan@buaa.edu.cn, michael@binaervarianz.de
ISSTA 2021, July 17, 2021
2. Do We Really Need to Execute Test Suites for Every Code Change?
The Dynamics AX project has nearly 65,000 regression test cases and takes 3 days to execute [1].
The Facebook mobile code base receives over 10,000 code changes per week, each of which potentially triggers over 10,000 test cases [2].
There are over 205 million projects on GitHub, many of which use GitHub Actions for continuous integration [3].
Around 4.21% of test suite invocations turn a previously passing test suite into a failing one.*
* We collect a dataset from Travis CI and GitHub, which includes 15,000 test suite runs from 242 open-source projects.
[1] Jeff Anderson, Saeed Salem, and Hyunsook Do. Striving for failure: an industrial case study about test failure prediction.
[2] Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra. Predictive test selection.
[3] https://github.com/search, January 2021.
3. Terminology and Problem Statement
[Diagram] Continuous test suite failure prediction: a classification model takes a code change and a test suite as input and predicts whether the test suite will fail or pass.
[Diagram] Test case failure prediction: a classification model takes a code change and a test case as input and predicts whether the test case will fail or pass.
Continuous test suite failure prediction: Given a code change 𝑐 and a test suite 𝑠, the problem is to predict whether triggering 𝑠 as part of continuous integration upon 𝑐 will make 𝑠 pass or fail.
[Diagram] Test suite selection: given a code change and test suites 1…N, a classification model predicts for each test suite 𝑖 whether it will fail or pass.
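Read as an interface, the definition could be sketched as follows; the function and variable names are mine, not the paper's, and `model` stands for any trained classifier with a scikit-learn-style `predict`:

```python
# Hypothetical interface: given features of a code change c and a test
# suite s, predict whether triggering s upon c will fail or pass.
def predict_suite_outcome(model, change_features, suite_features):
    x = change_features + suite_features  # one feature vector per (c, s) pair
    return "fail" if model.predict([x])[0] == 1 else "pass"
```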
5. Feature Extraction
• 44 features from 9 categories
• Features are adapted from just-in-time defect prediction and test case failure prediction, complemented by 19 newly added features
• Code change features
• Development history features
• Test features (see the extraction sketch below)
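To make the three feature groups concrete, here is a minimal extraction sketch; the input structures and the specific feature names are hypothetical simplifications, not the paper's exact implementation:

```python
# Hypothetical sketch: extract a few features of each group from one
# commit record; the input dictionaries are illustrative, not the
# paper's actual data format.

def extract_features(commit, author_history, test_history):
    """Return a flat feature dict for one (code change, test suite) pair."""
    features = {}

    # Code change features: size and spread of the change
    features["lines_added"] = commit["lines_added"]
    features["lines_deleted"] = commit["lines_deleted"]
    features["num_files"] = len(commit["changed_files"])

    # Development history features: developer experience
    features["author_prior_commits"] = len(author_history)

    # Test features: recent outcomes of the test suite
    last10 = test_history[-10:]
    features["failures_last_10_runs"] = sum(1 for r in last10 if r == "fail")

    return features

# Example usage with made-up data
commit = {"lines_added": 120, "lines_deleted": 30,
          "changed_files": ["src/a.py", "src/b.py"]}
print(extract_features(commit, author_history=["c1", "c2", "c3"],
                       test_history=["pass", "pass", "fail", "pass"]))
```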
6. Code Change Features
28 features in 6 categories
[Table of code change features; DP = Defect Prediction, TC = Test Case Failure Prediction]
7. Development History Features
7 features in 2 categories
[Table of development history features; DP = Defect Prediction, TC = Test Case Failure Prediction]
8. Test Features
9 features in the Test category
[Table of test features; DP = Defect Prediction, TC = Test Case Failure Prediction]
9. Label Extraction
Aim: label test suites as pass or failure based on historical test results
• Two principles:
• Predict test suite results instead of build results
• Count only first test failures, to identify the exact bug-inducing code change (see the sketch below)
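A minimal sketch of the "first failure only" principle, assuming a chronological list of test suite outcomes per project; the data layout is my own simplification:

```python
# Hypothetical sketch: label each test suite run, counting only first
# failures (a failure directly after a pass) as positive examples, so
# the label points at the exact bug-inducing code change.

def label_runs(outcomes):
    """outcomes: chronological list of "pass"/"fail" per test suite run."""
    labels = []
    prev = "pass"
    for outcome in outcomes:
        if outcome == "fail" and prev == "pass":
            labels.append(1)      # first failure: likely induced by this change
        elif outcome == "fail":
            labels.append(None)   # repeated failure: caused by an earlier change
        else:
            labels.append(0)      # passing run
        prev = outcome
    return labels

print(label_runs(["pass", "fail", "fail", "pass", "fail"]))
# -> [0, 1, None, 0, 1]
```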
12. RQ1: How Effective are the Prediction Models?
The test suite failure prediction model is effective. The best studied model, LightGBM, achieves an AUC of 0.836.
[Figure: effectiveness of different classification models]
[Figure: the effect of different classification thresholds on prediction performance]
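For illustration, training such a classifier and measuring AUC could look like the following; the feature matrix `X` and labels `y` are random placeholders for the extracted features and labels, and the hyperparameters are defaults, not the paper's tuned settings:

```python
# Illustrative sketch: train a LightGBM classifier on extracted features
# and evaluate it with ROC AUC, as done for the models compared in RQ1.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 44))   # placeholder: 44 features per run
y = rng.random(1000) < 0.0421     # placeholder: ~4.21% failure rate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = LGBMClassifier()          # default hyperparameters
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```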
13. RQ1: How Effective are the Prediction Models?
With time-based data splitting, the model is still effective, but provides slightly worse predictions due to the smaller size of the training data set.
[Figure] Two time-based splitting schemes:
• Adjacent time steps for training and testing: samples from one time step are used for training and samples from the next step for testing
• Same time step for training and testing: the first 80% of samples within a time step are used for training and the last 20% for testing
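A minimal sketch of the two splitting schemes, assuming each sample carries a time step and a timestamp; the pandas layout and column names are my own simplification:

```python
# Hypothetical sketch of the two time-based splitting schemes, assuming
# a DataFrame with "time_step" and "timestamp" columns.
import pandas as pd

def split_adjacent_steps(df, step):
    """Scheme 1: train on time step t, test on time step t + 1."""
    return df[df["time_step"] == step], df[df["time_step"] == step + 1]

def split_within_step(df, step):
    """Scheme 2: first 80% of a time step for training, last 20% for testing."""
    chunk = df[df["time_step"] == step].sort_values("timestamp")
    cut = int(len(chunk) * 0.8)
    return chunk.iloc[:cut], chunk.iloc[cut:]
```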
14. RQ2: How Effective are the Features?
Simply reusing features from related domains yields a less effective model than our full feature set.
[Figure: features known from prior work vs. the full feature set]
15. RQ2: How Effective are the Features?
Information about the developer experience, previous test results, and abundance of test cases is most important for an effective prediction model.
Most important features: test features (TF10, TC, TP10, TF) and experience features (REXP, SEXP, EXP)
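Rankings like this are typically read directly off the trained model; continuing the hypothetical RQ1 sketch above, where the invented names f0…f43 stand in for the paper's 44 features:

```python
# Illustrative sketch: rank features by LightGBM's importance scores.
# Reuses X and model from the RQ1 sketch; names are invented stand-ins.
feature_names = [f"f{i}" for i in range(X.shape[1])]
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:7])  # the paper's top features: TF10, TC, TP10, TF, REXP, SEXP, EXP
```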
16. Cost Model
The cost model shows when continuous test suite failure prediction is effective in practice.
The input parameters include:
• Cost 𝑟 of running a test suite
• Computational cost, maintenance cost, developer & development process cost
• Cost 𝑑 of delayed detection of a failure-inducing code change
• Failure localization, fading developer memory
• Test suite failure rate 𝑓 at which code changes cause test suite failures
• 𝑓 = 4.21% in our study
A sketch of the resulting strategy costs follows below.
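To make the trade-off concrete, here is a sketch of the expected cost per code change under three strategies (ALL always runs the suite, NEVER never does, MODEL runs it only when a failure is predicted); the formulas are my reading of the setup, not the paper's exact model, and `tpr`/`fpr` are assumed model characteristics:

```python
# Hypothetical sketch: expected cost per code change, given cost r of a
# test suite run, cost d of delayed failure detection, failure rate f,
# and a model with true/false positive rates tpr/fpr.

def cost_all(r, d, f):
    return r                         # ALL: pay for a run on every change

def cost_never(r, d, f):
    return f * d                     # NEVER: every failure is detected late

def cost_model(r, d, f, tpr, fpr):
    p_run = f * tpr + (1 - f) * fpr  # suite runs when a failure is predicted
    p_missed = f * (1 - tpr)         # failures the model lets slip through
    return p_run * r + p_missed * d

# Example with the values from the RQ3 slide: r = 2, d = 20 person-hours
r, d, f = 2.0, 20.0, 0.0421
print(cost_all(r, d, f), cost_never(r, d, f),
      cost_model(r, d, f, tpr=0.75, fpr=0.10))  # tpr/fpr are made up
```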
18. Strategy Comparison
[Figure: MODEL strategy vs. PERIOD strategy, with the boundary conditions and the PERIOD interval marked]
[Figure: MODEL strategy vs. the other strategies (ALL, PERIOD, NEVER, RANDOM), with boundary conditions marked]
19. RQ3: (When) Is the Model Cost-Saving?
Instantiating the theoretical cost model with real-world data shows that the predictive model is cost-effective if 2.47 < 𝑑/𝑟 < 37.58.
Example: suppose 𝑟 = 2 person-hours and 𝑑 = 20 person-hours; then 𝑑/𝑟 = 10, which lies within this range, so the model is the best strategy.
20. Threats to Validity
• Flaky tests
• Tests marked as flaky in test reports: 0.067%
• Executing a code change multiple times yields different test results: <1%
• Dataset size and programming language
• Focus on open-source projects
• Theoretical abstraction of real costs
21. Conclusion
• We define the problem of continuous test suite failure prediction
• We share a large-scale dataset gathered from 242 real-world projects
• Based on the proposed features, our approach improves over baselines that use features for just-in-time defect prediction and test case failure prediction by 13.9% and 2.9%, respectively
• We present a cost model showing that our results could be useful in real scenarios