nanda-lab.ca
nanda-lab.ca
FlakyFix: Using Large Language
Models for Predicting Flaky Test
Fix Categories and Test Code
Repair
Sakina Fatima, Hadi Hemmati, Lionel Briand
02-04-2025
nanda-lab.ca
Background: Software Testing
Software Testing is an essential activity that ensures software dependability. Test cases are
executed to detect bugs in the code under test.
1
nanda-lab.ca
Flaky Tests intermittently pass and fail even for the same version of the source code
(i.e., non-deterministic testing results)
Why Detect and Repair Flaky Tests?
❖ Test failures caused by flaky tests can be hard to reproduce as re-running is required (computationally expensive)
❖ Flaky tests might hide real bugs in the source code
❖ Tests become unreliable
❖ Software releases might be delayed
❖ Hard to manually detect and fix so developers ignore these tests
Problem: Flaky Tests
2
nanda-lab.ca
FlakyFix: Black Box Automated Repair of Flaky Tests
Language
Model
Flaky Test A Fix for Flaky Test
Proposed Solution
*Black Box: Using test case code only, No access to code under test. This research focuses on those flaky tests where part of the flakiness lies in the test code. (10% of overall flaky tests dataset)
1. Sakina Fatima, Taher A Ghaleb, and Lionel Briand. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering, 2022. 3
nanda-lab.ca
Proposed Approach
→ Definition of flaky tests fixes and labeling of flaky tests accordingly
• A set of heuristics
• First labeled dataset categorizing flaky tests by the type of fix needed
• Open-source script* to automatically label flaky tests based on their fixes
→ Prediction of flaky test fix category using pre-trained code language models
• Suggest to developers a type of fix they can implement to repair flaky tests
• Aid Conventional Large Language Models i.e. GPT to automatically generate the fix
→ Generation of a fully repaired flaky test using the predicted fix category and LLMS
• Attempt to generate a fully or semi-automated repair of Flaky tests
4
nanda-lab.ca
Prediction of Flaky Test Fix Category
Model
Change Data Structure
Cause of
Flakiness
Predicts
HashMap Should be replaced with LinkedHashMap to
maintain the order in which their elements are stored,
regardless of how many times a code is executed.
5
nanda-lab.ca
Generation of a Repaired Flaky Test using the Predicted Fix Category
6
nanda-lab.ca
Dataset
Used International Dataset of Flaky Tests (IDoFT)
→Largest available dataset of flaky tests where cause of flakiness is in the test code
→562 Flaky Tests in Java and their developer repaired fixes
→Flaky Tests belong to 123 different projects, helpful for the generalizability of prediction
models
7
nanda-lab.ca
Defining Flaky Test Fix Categories and Labeling the Dataset
→ Detected 13 Different Categories of the Fix
8
nanda-lab.ca
Fix Category Prediction
Fine-tune Pre-Trained Code Models i.e. CodeBERT and UniXcoder for the task of Flaky Tests
Fix Classification with two different techniques:
✓ Feed Forward Neural Network (FNN)
✓ Few Shot Learning (recommended for smaller datasets)
9
nanda-lab.ca
Fix Category Prediction: Technique #1
Fine-tune Pre-Trained Code Models i.e. CodeBERT and UniXcoder for the task of
Flaky Tests Fix Classification using a Feed Forward Neural Network (FNN).
10
nanda-lab.ca
Fix Category Prediction: Technique #2
Since a smaller dataset of 562 tests, we used Few Shot Learning (FSL), popular with
smaller datasets.
Step 1: Fine-tune code models (UnixCoder and CodeBert) using Siamese Network for
Flaky Tests fix category classification
11
nanda-lab.ca
Fix Category Prediction: Technique #2
Step 2: Evaluate the trained Model through Few-Shot Learning:
12
nanda-lab.ca
Prediction Results for Fix Categories Using Code Models
13
nanda-lab.ca
Example of Flaky Test Fix Category Prediction
Change Data Structure
Model
Flaky Test Original Fixed Flaky Test
14
nanda-lab.ca
Generating Fully Repaired Flaky Tests from GPT using Predicted Fix
Categories
Prompt 1 Prompt 2 (In-Context Learning)
16
nanda-lab.ca
Example of Repaired Flaky Test from GPT
Flaky Test Generated Fixed Flaky Test without Fix Category
Cause of Flakiness
Incorrect Repair Suggested
17
nanda-lab.ca
Example of Repaired Flaky Test from GPT (2)
Fix Generation with Fix Category Label Original Fix for Flaky Test
Repaired by GPT
Repaired by Developer
18
nanda-lab.ca
Generation Results from GPT using Predicted Fix Category Labels
19
nanda-lab.ca
Generation Results from GPT using In-Context Learning
20
nanda-lab.ca
GPT Generated Tests-Execution Results
• We ran a sample of 35 generated tests: 24 Passed, 11 Failed
• We conducted a series of analysis:
→Overall, among passing tests average CodeBLEU Score is 94%, Higher
Code BLEU scores have a higher likelihood to pass.
→16 GPT- fixed tests have 100% CodeBLEU score, indicating an exact match
with the developer-repaired versions.
→Bootstrapping: With 95% Confidence Interval, 51% to 83% GPT-fixed tests are
estimated to pass. (Helpful for Testers)
→Logistic Regression: Trained on the executable 35 tests to estimate the
passing test rate among the non-executable tests. 80 % Accuracy.
→Edit Distance is calculated to assess the manual fixing effort for 11 failed
tests. 16% average token replacement is needed.
21
nanda-lab.ca
GPT Generated Tests-Execution Results
Based on the trained Logistic Regression Model, Passing Estimates for non-executable tests for
both 181 and 131 test dataset:
22
nanda-lab.ca
Practical Implications
How do we envision our approach to be used in practice?
• Deploy in Continuous Integration (CI) environments to repair a flaky test without developer’s
explicit command.
• Guide developers about possible causes of flakiness that need to be addressed though Test
Smells and Fix Labels Information.
• Reduce the manual effort to fix the tests even when the GPT repair is not fully correct, semi-
automated repair approach.
23
nanda-lab.ca
Summary
24
nanda-lab.ca
nanda-lab.ca
FlakyFix: Using Large Language
Models for Predicting Flaky Test
Fix Categories and Test Code
Repair
Sakina Fatima, Hadi Hemmati, Lionel Briand
02-04-2025

FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair

  • 1.
    nanda-lab.ca nanda-lab.ca FlakyFix: Using LargeLanguage Models for Predicting Flaky Test Fix Categories and Test Code Repair Sakina Fatima, Hadi Hemmati, Lionel Briand 02-04-2025
  • 2.
    nanda-lab.ca Background: Software Testing SoftwareTesting is an essential activity that ensures software dependability. Test cases are executed to detect bugs in the code under test. 1
  • 3.
    nanda-lab.ca Flaky Tests intermittentlypass and fail even for the same version of the source code (i.e., non-deterministic testing results) Why Detect and Repair Flaky Tests? ❖ Test failures caused by flaky tests can be hard to reproduce as re-running is required (computationally expensive) ❖ Flaky tests might hide real bugs in the source code ❖ Tests become unreliable ❖ Software releases might be delayed ❖ Hard to manually detect and fix so developers ignore these tests Problem: Flaky Tests 2
  • 4.
    nanda-lab.ca FlakyFix: Black BoxAutomated Repair of Flaky Tests Language Model Flaky Test A Fix for Flaky Test Proposed Solution *Black Box: Using test case code only, No access to code under test. This research focuses on those flaky tests where part of the flakiness lies in the test code. (10% of overall flaky tests dataset) 1. Sakina Fatima, Taher A Ghaleb, and Lionel Briand. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering, 2022. 3
  • 5.
    nanda-lab.ca Proposed Approach → Definitionof flaky tests fixes and labeling of flaky tests accordingly • A set of heuristics • First labeled dataset categorizing flaky tests by the type of fix needed • Open-source script* to automatically label flaky tests based on their fixes → Prediction of flaky test fix category using pre-trained code language models • Suggest to developers a type of fix they can implement to repair flaky tests • Aid Conventional Large Language Models i.e. GPT to automatically generate the fix → Generation of a fully repaired flaky test using the predicted fix category and LLMS • Attempt to generate a fully or semi-automated repair of Flaky tests 4
  • 6.
    nanda-lab.ca Prediction of FlakyTest Fix Category Model Change Data Structure Cause of Flakiness Predicts HashMap Should be replaced with LinkedHashMap to maintain the order in which their elements are stored, regardless of how many times a code is executed. 5
  • 7.
    nanda-lab.ca Generation of aRepaired Flaky Test using the Predicted Fix Category 6
  • 8.
    nanda-lab.ca Dataset Used International Datasetof Flaky Tests (IDoFT) →Largest available dataset of flaky tests where cause of flakiness is in the test code →562 Flaky Tests in Java and their developer repaired fixes →Flaky Tests belong to 123 different projects, helpful for the generalizability of prediction models 7
  • 9.
    nanda-lab.ca Defining Flaky TestFix Categories and Labeling the Dataset → Detected 13 Different Categories of the Fix 8
  • 10.
    nanda-lab.ca Fix Category Prediction Fine-tunePre-Trained Code Models i.e. CodeBERT and UniXcoder for the task of Flaky Tests Fix Classification with two different techniques: ✓ Feed Forward Neural Network (FNN) ✓ Few Shot Learning (recommended for smaller datasets) 9
  • 11.
    nanda-lab.ca Fix Category Prediction:Technique #1 Fine-tune Pre-Trained Code Models i.e. CodeBERT and UniXcoder for the task of Flaky Tests Fix Classification using a Feed Forward Neural Network (FNN). 10
  • 12.
    nanda-lab.ca Fix Category Prediction:Technique #2 Since a smaller dataset of 562 tests, we used Few Shot Learning (FSL), popular with smaller datasets. Step 1: Fine-tune code models (UnixCoder and CodeBert) using Siamese Network for Flaky Tests fix category classification 11
  • 13.
    nanda-lab.ca Fix Category Prediction:Technique #2 Step 2: Evaluate the trained Model through Few-Shot Learning: 12
  • 14.
    nanda-lab.ca Prediction Results forFix Categories Using Code Models 13
  • 15.
    nanda-lab.ca Example of FlakyTest Fix Category Prediction Change Data Structure Model Flaky Test Original Fixed Flaky Test 14
  • 16.
    nanda-lab.ca Generating Fully RepairedFlaky Tests from GPT using Predicted Fix Categories Prompt 1 Prompt 2 (In-Context Learning) 16
  • 17.
    nanda-lab.ca Example of RepairedFlaky Test from GPT Flaky Test Generated Fixed Flaky Test without Fix Category Cause of Flakiness Incorrect Repair Suggested 17
  • 18.
    nanda-lab.ca Example of RepairedFlaky Test from GPT (2) Fix Generation with Fix Category Label Original Fix for Flaky Test Repaired by GPT Repaired by Developer 18
  • 19.
    nanda-lab.ca Generation Results fromGPT using Predicted Fix Category Labels 19
  • 20.
    nanda-lab.ca Generation Results fromGPT using In-Context Learning 20
  • 21.
    nanda-lab.ca GPT Generated Tests-ExecutionResults • We ran a sample of 35 generated tests: 24 Passed, 11 Failed • We conducted a series of analysis: →Overall, among passing tests average CodeBLEU Score is 94%, Higher Code BLEU scores have a higher likelihood to pass. →16 GPT- fixed tests have 100% CodeBLEU score, indicating an exact match with the developer-repaired versions. →Bootstrapping: With 95% Confidence Interval, 51% to 83% GPT-fixed tests are estimated to pass. (Helpful for Testers) →Logistic Regression: Trained on the executable 35 tests to estimate the passing test rate among the non-executable tests. 80 % Accuracy. →Edit Distance is calculated to assess the manual fixing effort for 11 failed tests. 16% average token replacement is needed. 21
  • 22.
    nanda-lab.ca GPT Generated Tests-ExecutionResults Based on the trained Logistic Regression Model, Passing Estimates for non-executable tests for both 181 and 131 test dataset: 22
  • 23.
    nanda-lab.ca Practical Implications How dowe envision our approach to be used in practice? • Deploy in Continuous Integration (CI) environments to repair a flaky test without developer’s explicit command. • Guide developers about possible causes of flakiness that need to be addressed though Test Smells and Fix Labels Information. • Reduce the manual effort to fix the tests even when the GPT repair is not fully correct, semi- automated repair approach. 23
  • 24.
  • 25.
    nanda-lab.ca nanda-lab.ca FlakyFix: Using LargeLanguage Models for Predicting Flaky Test Fix Categories and Test Code Repair Sakina Fatima, Hadi Hemmati, Lionel Briand 02-04-2025