FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair

nanda-lab.ca
Sakina Fatima, Hadi Hemmati, Lionel Briand
02-04-2025

Background: Software Testing
Software testing is an essential activity that ensures software dependability. Test cases are executed to detect bugs in the code under test.

Problem: Flaky Tests
Flaky tests intermittently pass and fail even for the same version of the source code (i.e., their results are non-deterministic).
Why Detect and Repair Flaky Tests?
❖ Failures caused by flaky tests are hard to reproduce, because the tests must be re-run repeatedly (computationally expensive)
❖ Flaky tests might hide real bugs in the source code
❖ Tests become unreliable
❖ Software releases might be delayed
❖ Flaky tests are hard to detect and fix manually, so developers often ignore them

Proposed Solution
FlakyFix: Black Box Automated Repair of Flaky Tests
Flaky Test → Language Model → A Fix for Flaky Test
*Black Box: uses the test case code only, with no access to the code under test. This research focuses on flaky tests where part of the flakiness lies in the test code (10% of the overall flaky tests dataset).
1. Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering, 2022.

Proposed Approach
→ Definition of flaky test fixes and labeling of flaky tests accordingly
• A set of heuristics
• First labeled dataset categorizing flaky tests by the type of fix needed
• Open-source script* to automatically label flaky tests based on their fixes
→ Prediction of the flaky test fix category using pre-trained code language models
• Suggest to developers a type of fix they can implement to repair flaky tests
• Aid conventional large language models (e.g., GPT) in automatically generating the fix
→ Generation of a fully repaired flaky test using the predicted fix category and LLMs
• Attempt to generate a fully or semi-automated repair of flaky tests

Prediction of Flaky Test Fix Category
[Figure: the model predicts the fix category "Change Data Structure"; the cause of flakiness is annotated.]
HashMap should be replaced with LinkedHashMap, which preserves the insertion order of its elements, so the iteration order is the same regardless of how many times the code is executed.

Generation of a Repaired Flaky Test using the Predicted Fix Category

Dataset
We used the International Dataset of Flaky Tests (IDoFT)
→ Largest available dataset of flaky tests where the cause of flakiness is in the test code
→ 562 flaky tests in Java and their developer-repaired fixes
→ The flaky tests belong to 123 different projects, which helps the generalizability of the prediction models

Defining Flaky Test Fix Categories and Labeling the Dataset
→ Detected 13 Different Categories of the Fix
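The slides do not list the labeling heuristics themselves, so the following is only a minimal sketch of how one such rule might work. It diffs a flaky test against its developer fix and assigns "Change Data Structure" when an unordered Java collection is swapped for its ordered counterpart. The function names, the rule table, and the fallback label "Unlabeled" are hypothetical; only the category name comes from the slides.

```python
import difflib
import re

# Hypothetical rule table: unordered Java collections and the ordered
# replacements that would indicate a "Change Data Structure" fix.
UNORDERED_TO_ORDERED = {
    "HashMap": "LinkedHashMap",
    "HashSet": "LinkedHashSet",
}


def changed_lines(flaky_code, fixed_code):
    """Return (removed, added) lines between the flaky and the fixed test."""
    removed, added = [], []
    diff = difflib.ndiff(flaky_code.splitlines(), fixed_code.splitlines())
    for line in diff:
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


def label_fix(flaky_code, fixed_code):
    """Assign a fix category from the diff; "Unlabeled" is a placeholder."""
    removed, added = changed_lines(flaky_code, fixed_code)
    removed_text = "\n".join(removed)
    added_text = "\n".join(added)
    for unordered, ordered in UNORDERED_TO_ORDERED.items():
        # \b keeps "HashMap" from matching inside "LinkedHashMap".
        if re.search(rf"\b{unordered}\b", removed_text) and re.search(
            rf"\b{ordered}\b", added_text
        ):
            return "Change Data Structure"
    return "Unlabeled"


flaky = "Map<String, Integer> m = new HashMap<>();\nassertEquals(expected, m.toString());"
fixed = "Map<String, Integer> m = new LinkedHashMap<>();\nassertEquals(expected, m.toString());"
print(label_fix(flaky, fixed))  # prints "Change Data Structure"
```

A real labeling script would need one rule per category and a policy for fixes that match several rules; this sketch covers a single category only.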

Fix Category Prediction
Fine-tune pre-trained code models (CodeBERT and UniXcoder) for flaky test fix classification with two different techniques:
✓ Feed-Forward Neural Network (FNN)
✓ Few-Shot Learning (recommended for smaller datasets)

Fix Category Prediction: Technique #1
Fine-tune pre-trained code models (CodeBERT and UniXcoder) for flaky test fix classification using a Feed-Forward Neural Network (FNN).

Because the dataset is small (562 tests), we used Few-Shot Learning (FSL), which is popular for smaller datasets.
Step 1: Fine-tune the code models (UniXcoder and CodeBERT) using a Siamese network for flaky test fix category classification

Step 2: Evaluate the trained Model through Few-Shot Learning:
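The figure for this step was not preserved, so the exact evaluation procedure is not recoverable from the slides. One common way to do few-shot classification with embeddings, assumed here, is to average a few labeled support embeddings per category and assign a query test to the category whose centroid is most similar under cosine similarity. The toy 3-dimensional vectors stand in for embeddings from a fine-tuned code model, and the label "Other" is a hypothetical placeholder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict(query, support):
    """Return the category whose support centroid is closest to the query.

    `support` maps each fix category to a few labeled embedding vectors.
    """
    centroids = {label: centroid(vecs) for label, vecs in support.items()}
    return max(centroids, key=lambda label: cosine(query, centroids[label]))


# Toy vectors standing in for model embeddings (assumption, not real data).
support = {
    "Change Data Structure": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Other": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
print(predict([0.85, 0.15, 0.05], support))  # prints "Change Data Structure"
```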

Prediction Results for Fix Categories Using Code Models

Example of Flaky Test Fix Category Prediction
[Figure: Flaky Test → Model → Change Data Structure, shown with the Original Fixed Flaky Test.]

Generating Fully Repaired Flaky Tests from GPT using Predicted Fix Categories
Prompt 1 Prompt 2 (In-Context Learning)
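The prompt texts themselves appear only as images in the original slides, so the wording below is entirely hypothetical. This sketch only illustrates the structural difference between the two prompts: Prompt 1 adds the predicted fix category to the flaky test, and Prompt 2 additionally prepends solved examples (in-context learning). The function name `build_prompt` and the field labels are assumptions.

```python
def build_prompt(flaky_test, fix_category=None, examples=()):
    """Assemble a repair prompt for a flaky test.

    `examples` is an iterable of (flaky_test, fix_category, fixed_test)
    triples used as in-context demonstrations; leave it empty for Prompt 1.
    """
    parts = []
    for ex_flaky, ex_category, ex_fixed in examples:
        parts.append(
            f"Flaky test:\n{ex_flaky}\n"
            f"Fix category: {ex_category}\n"
            f"Fixed test:\n{ex_fixed}\n"
        )
    parts.append(f"Flaky test:\n{flaky_test}")
    if fix_category is not None:
        parts.append(f"Fix category: {fix_category}")
    parts.append("Fixed test:")
    return "\n".join(parts)


flaky = "Map<String, Integer> m = new HashMap<>();"
demo = (
    "Set<String> s = new HashSet<>();",
    "Change Data Structure",
    "Set<String> s = new LinkedHashSet<>();",
)

# Prompt 1: flaky test plus the predicted fix category.
prompt_1 = build_prompt(flaky, fix_category="Change Data Structure")
# Prompt 2: the same, preceded by one solved example.
prompt_2 = build_prompt(flaky, fix_category="Change Data Structure", examples=[demo])
print(prompt_2)
```

Passing `fix_category=None` gives the baseline prompt without a category, which is the configuration the later slide on an incorrect repair compares against.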

Example of Repaired Flaky Test from GPT
[Figure: a flaky test, with its cause of flakiness marked, next to the fixed test generated without a fix category; the suggested repair is incorrect.]

Example of Repaired Flaky Test from GPT (2)
[Figure: the fix generated with the fix category label (repaired by GPT) next to the original fix for the flaky test (repaired by the developer).]

Generation Results from GPT using Predicted Fix Category Labels

Generation Results from GPT using In-Context Learning

GPT-Generated Tests: Execution Results
• We ran a sample of 35 generated tests: 24 passed, 11 failed
• We conducted a series of analyses:
→ Among passing tests, the average CodeBLEU score is 94%; tests with higher CodeBLEU scores are more likely to pass.
→ 16 GPT-fixed tests have a 100% CodeBLEU score, indicating an exact match with the developer-repaired versions.
→ Bootstrapping: with a 95% confidence interval, 51% to 83% of GPT-fixed tests are estimated to pass (helpful for testers).
→ Logistic regression: trained on the 35 executable tests to estimate the pass rate among the non-executable tests; 80% accuracy.
→ Edit distance was calculated to assess the manual fixing effort for the 11 failed tests; on average, 16% of the tokens need to be replaced.
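The slides do not say which bootstrap variant was used. As a minimal sketch, assuming a percentile bootstrap over the 35 executed tests (24 passed, 11 failed), the interval for the pass rate could be estimated as follows. The function name, the number of resamples, and the fixed seed are assumptions made for illustration.

```python
import random


def bootstrap_pass_rate_ci(passed, total, resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a pass rate."""
    rng = random.Random(seed)
    # 1 = test passed, 0 = test failed.
    outcomes = [1] * passed + [0] * (total - passed)
    # Resample the outcomes with replacement and record each pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=total)) / total for _ in range(resamples)
    )
    alpha = (1 - level) / 2
    lower = rates[int(alpha * resamples)]
    upper = rates[int((1 - alpha) * resamples) - 1]
    return lower, upper


lower, upper = bootstrap_pass_rate_ci(passed=24, total=35)
print(f"95% CI for the pass rate: {lower:.0%} to {upper:.0%}")
```

With only 35 executed tests the interval is necessarily wide, which is consistent with the 51% to 83% range reported on the slide.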

GPT-Generated Tests: Execution Results
Pass-rate estimates for the non-executable tests in both the 181-test and 131-test datasets, based on the trained logistic regression model:

Practical Implications
How do we envision our approach being used in practice?
• Deploy in Continuous Integration (CI) environments to repair a flaky test without the developer's explicit command.
• Guide developers toward possible causes of flakiness that need to be addressed, through test smells and fix label information.
• Reduce the manual effort to fix the tests even when the GPT repair is not fully correct (a semi-automated repair approach).
