Automated Test Case Repair Using
Language Models
Ahmadreza Saboor Yaraghi • Darren Holden • Nafiseh Kahani • Lionel Briand
IEEE Transactions on Software Engineering 2025
a.saboor@uottawa.ca • darren.holden@carleton.ca • kahani@sce.carleton.ca • lbriand@uottawa.ca
www.nanda-lab.ca
University of Ottawa
School of Electrical Engineering & Computer Science
Nanda Lab
Research supported by Huawei Canada
What is Regression Testing?
Software testing ensures that software works as expected.
It is critical for delivering reliable and functional software.
Regression testing ensures existing functionalities still work when the software
evolves (e.g., bug fixes or new features).
Why regression testing?
1. Rapid software evolution
2. Interdependent features (small changes can break features)
Motivating Challenges in Regression Testing
A test execution failure during regression testing has two possible causes:
1. A fault in the System Under Test (SUT)
2. Broken test code
Challenge: Maintenance Overhead
• Frequent test code changes in fast-evolving systems increase
development costs.
• Ignoring broken tests affects testing quality and software
reliability.
For an SUT fault: localizing and fixing the fault in the SUT.
For broken test code, the goal is Automated Test Evolution and Repair.
Broken Test Repair: Motivating Example
(Figure: an SUT and its test case; code changes (hunks) in the SUT add the currency feature, requiring a test case repair.)
Limitations of Existing Work
Most approaches target specific repair categories (e.g., assertions).
Most existing benchmarks are limited in size and diversity (e.g., only
91 broken test instances across four projects).
Most existing approaches lack reproducibility due to missing
publicly available replication packages.
Contributions (Automated Test Case Repair)
1. TaRGET (Test Repair GEneraTor)
• Using fine-tuned code language models (CLMs)
• Not limited to specific repair categories or programming languages
2. TaRBench (Test Repair Benchmark)
• Large and diverse benchmark
• Includes 45.3k instances across 59 projects
3. Addressed three research questions (RQs)
• Evaluated CLM performance and data formatting (prompting)
• Analyzed factors impacting performance
Our Approach: TaRGET
The Input and Output Format
1. Test Context
• Broken test code and broken line(s)
2. Repair Context
• SUT code changes (hunks) relevant for
the repair
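The input format above can be sketched as plain string assembly. This is an illustrative sketch, not TaRGET's exact implementation; the separator tokens `[TEST]`, `[BREAKAGE]`, and `[HUNK]` are hypothetical placeholders for whatever special tokens the fine-tuned model uses.

```python
def build_input(broken_test: str, broken_lines: list[str], hunks: list[str]) -> str:
    """Concatenate the test context and the repair context into one model input."""
    # Test context: the broken test code plus the specific broken line(s)
    test_ctx = "[TEST] " + broken_test + " [BREAKAGE] " + " ".join(broken_lines)
    # Repair context: the (prioritized) SUT hunks relevant for the repair
    repair_ctx = " ".join("[HUNK] " + h for h in hunks)
    return test_ctx + " " + repair_ctx

prompt = build_input(
    broken_test="assertEquals(10.0, cart.total());",
    broken_lines=["assertEquals(10.0, cart.total());"],
    hunks=["- double total()", "+ Money total(Currency c)"],
)
```

The model's output is then the repaired test, in one of the output representations described later.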
Repair Context Prioritization
1. Input size limitations in CLMs
2. Impact of repair context order
Prioritizing SUT changes (hunks)
based on relevancy for repair
Hunk Prioritization Heuristics:
• Call graph depth
• Change in Method/Class
• TF-IDF similarity with broken test code
• Hunk Repetition
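A minimal sketch of the prioritization idea (not the paper's exact heuristics or weights): rank hunks by lexical similarity to the broken test code, breaking ties by call-graph depth. Jaccard token overlap stands in here for the TF-IDF similarity TaRGET uses.

```python
def prioritize_hunks(hunks: list[str], broken_code: str, call_depth: dict) -> list[str]:
    """Return hunks ordered from most to least repair-relevant.

    call_depth maps a hunk's index to its depth in the test's call graph
    (smaller means closer to the test)."""
    def tokens(s: str) -> set:
        return set(s.split())

    broken = tokens(broken_code)

    def key(i: int):
        h = tokens(hunks[i])
        # Jaccard overlap as a stand-in for TF-IDF similarity
        similarity = len(h & broken) / max(1, len(h | broken))
        # Sort: higher similarity first, then shallower call-graph depth
        return (-similarity, call_depth.get(i, 99))

    return [hunks[i] for i in sorted(range(len(hunks)), key=key)]

ordered = prioritize_hunks(
    hunks=["+ Money total(Currency c)", "+ import java.util.Locale;"],
    broken_code="assertEquals(10.0, cart.total());",
    call_depth={0: 1, 1: 2},
)
```

Under this toy scoring, the hunk closer to the test in the call graph wins the tie and comes first.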
Code Hunk Representation
1) Word-Level Representation
2) Line-Level Representation
Output Representations
1) Code Sequence
2) Edit Sequence
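The difference between the two output representations can be sketched with a line diff: a code sequence emits the full repaired test, while an edit sequence emits only what changed. The `<del>`/`<add>` markers below are hypothetical, and this uses Python's stdlib `difflib` rather than the paper's exact encoding.

```python
import difflib

def to_edit_sequence(broken: list[str], repaired: list[str]) -> str:
    """Encode a repair as deleted/added lines instead of the full repaired code."""
    ops = []
    matcher = difflib.SequenceMatcher(None, broken, repaired)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            ops.append("<del> " + " ".join(broken[i1:i2]))
        if tag in ("replace", "insert"):
            ops.append("<add> " + " ".join(repaired[j1:j2]))
    return " ".join(ops)

edit_seq = to_edit_sequence(
    ["assertEquals(10.0, cart.total());"],
    ["assertEquals(10.0, cart.total(USD));"],
)
```

An edit sequence is typically much shorter than the full code sequence, which matters given the CLMs' limited output length, but it must be applied back to the broken test to obtain runnable code.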
Input-Output Formats Overview
• Repair Context
• R: All SUT hunks
• Rm ⊂ R: Method-level hunks within the test call graph
• Rc ⊂ R: Class-level hunks within the test call graph
• Hunk Prioritization
• CGD: Call Graph Depth
• CT: Context Type (method/class)
• BrkSim: TF-IDF similarity with broken code
• TSim: TF-IDF similarity with test code
• Rep: Number of hunk repetitions
TaRBench: Collecting Valid Test Repairs
• We detect valid repairs through three executions:
1. Test V1 on SUT V1 should pass
2. Test V1 on SUT V2 should fail
3. Test V2 on SUT V2 should pass
(Figure: SUT V1 evolves into SUT V2 through a code change; Test Case V1 is updated to Test Case V2, the potential repair. Numbers 1 to 3 mark the three executions.)
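The three-execution validity check can be sketched directly. Here `run_test` is a stand-in for actually checking out, building, and executing a project version, which the real pipeline must do.

```python
def is_valid_repair(run_test) -> bool:
    """Decide whether a candidate test repair is valid.

    run_test(test_version, sut_version) -> True if the test passes."""
    return (
        run_test("V1", "V1")          # 1. original test passes on the old SUT
        and not run_test("V1", "V2")  # 2. original test fails on the new SUT (it is broken)
        and run_test("V2", "V2")      # 3. updated test passes on the new SUT (it is a repair)
    )
```

A test change that keeps passing on both SUT versions (e.g., a pure refactoring) fails check 2 and is excluded, which is what filters out non-repair test edits.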
TaRBench: Benchmark Creation Process
Research Questions
RQ1: Repair Performance
RQ1.1: CLMs and Input-Output Formats
RQ1.2: TaRGET Against Baselines
RQ2: Repair Analysis
RQ2.1: Analyzing Repair Characteristics
RQ2.2: Predicting Repair Trustworthiness
RQ3: Fine-tuning analysis
RQ3.1: Analyzing Fine-tuning Data Size
RQ3.2: Assessing Model Generalization
Evaluation Metrics
Exact Match Accuracy (EM)
Measures repair candidates that exactly
match the ground truth
Plausible Repair Accuracy (PR)
Measures repair candidates that
successfully compile and pass
BLEU and CodeBLEU
Measure textual similarities between the
repair candidate and the ground truth
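Of these metrics, EM is the simplest to sketch. The whitespace normalization below is an illustrative choice, not necessarily the paper's exact matching rule; PR is not shown because it requires compiling and running each candidate.

```python
def exact_match(candidates: list[str], ground_truths: list[str]) -> float:
    """Fraction of repair candidates identical to the ground truth,
    ignoring whitespace differences."""
    def norm(s: str) -> str:
        return " ".join(s.split())  # collapse all runs of whitespace
    hits = sum(norm(c) == norm(g) for c, g in zip(candidates, ground_truths))
    return hits / len(candidates)

score = exact_match(
    ["int x = 1;", "foo( )"],
    ["int  x = 1;", "bar()"],
)
```

Note that EM is a lower bound on repair quality: a candidate can differ textually from the ground truth yet still compile and pass, which is exactly what PR captures.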
RQ1.1: CLM and Data Formatting Performance
• CodeT5+ (CT5+) with IO2 shows best overall performance
• PLBART (PLB) outperforms CodeGen (CG) and is comparable to CodeT5+
• The choice between PLBART and CodeT5+ involves a cost-benefit trade-off
• IO2 and IO3 consistently yield the best results
RQ1.2: TaRGET Against Baselines
• Baselines:
• CEPROT [1] (SOTA): Automatically detects and updates obsolete tests using a
fine-tuned CodeT5 model and introduces a new dataset.
• NoContext: Fine-tuning CodeT5+ without repair context
• SUTCopy: Replicating SUT changes in test code if applicable
[1] X. Hu, Z. Liu, X. Xia, Z. Liu, T. Xu, and X. Yang, “Identify and update test cases when production code changes: A transformer-based approach,” in 2023 38th IEEE/ACM
International Conference on Automated Software Engineering (ASE), 2023, pp. 1111–1122.
On 214 test set instances of CEPROT's benchmark:
  TaRGET: 40.6% EM
  CEPROT: 21% EM

On 7,103 test set instances of TaRBench:
  TaRGET: 66% EM
  NoContext: 29% EM
  SUTCopy: 11% EM
RQ2.1: Repair Characteristics in TaRBench
1. Repair categories
• ORC: Test oracle change
• ARG: Argument modification
• INV: Invocation modification
• OTH: Others
2. Abstract Syntax Tree (AST)
edit actions
• Number of changes in the
code structure
RQ2.1: Evaluation Based on Repair Characteristics
The model shows reduced effectiveness
in handling complex repairs.
RQ2.2: Predicting Repair Trustworthiness
• Created a Random Forest classifier utilizing test and repair context (input)
features
• Results show high prediction accuracy.
• Practical Implications
1. Enhances TaRGET's practicality
2. Saves time by avoiding low-quality repairs
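The feature side of such a predictor can be sketched as follows. The feature names below are illustrative placeholders, not the paper's exact feature set; in practice these vectors would be fed to a Random Forest classifier (e.g., scikit-learn's `RandomForestClassifier`) trained on whether past repairs were plausible.

```python
def repair_features(broken_test: str, hunks: list[str]) -> dict:
    """Extract simple input features describing a repair instance.

    Feature names are hypothetical stand-ins for the test and
    repair context features the classifier would consume."""
    test_tokens = broken_test.split()
    return {
        "test_len": len(test_tokens),                          # size of the broken test
        "num_hunks": len(hunks),                               # amount of repair context
        "context_len": sum(len(h.split()) for h in hunks),     # total context size
        "has_assert": any("assert" in t.lower() for t in test_tokens),
    }

feats = repair_features("assertEquals(1, f());", ["+ int f()", "- int f(int x)"])
```

The practical payoff is a filter: candidates the classifier flags as untrustworthy can be skipped before any compile-and-run effort is spent on them.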
RQ3.1: Impact of Fine-tuning Data Size
• Downsized fine-tuning data to 20%, 40%, 60%, and 80% of the most recent data.
• The impact of data size is significant for EM.
• For EM, each additional 7,000 fine-tuning samples yields an average increase of 1.85 percentage points.
RQ3.2: Assessing Generalization
• Stratified project exclusion from fine-tuning data:
• Creating 10 folds, each excluding six projects
• Keeping the evaluation set unchanged
• Fine-tuned 10 project-agnostic models based on the above folds
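The fold construction above can be sketched as a simple disjoint split. This is an assumption-laden toy: TaRBench has 59 projects and the paper's exclusion is stratified, whereas this sketch uses 60 placeholder project names and a plain sequential grouping just to show the shape of the setup.

```python
def make_folds(projects: list[str], n_folds: int = 10, exclude_per_fold: int = 6) -> list[list[str]]:
    """Build fine-tuning folds, each excluding a disjoint group of projects.

    The evaluation set is untouched; only the fine-tuning data shrinks."""
    folds = []
    for i in range(n_folds):
        excluded = set(projects[i * exclude_per_fold:(i + 1) * exclude_per_fold])
        folds.append([p for p in projects if p not in excluded])
    return folds

# 60 placeholder project names (the real benchmark has 59 projects)
folds = make_folds([f"proj{i}" for i in range(60)])
```

Each fold then trains one project-agnostic model, which is evaluated on the held-out projects it never saw during fine-tuning.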
• Project-agnostic models are acceptable alternatives, with an average
difference of 4.9 EM points.
• TaRGET can effectively generalize to unseen projects.
TaRGET and Advances in Foundation Models (FMs)
• TaRGET’s challenges: (1) Time-intensive data preparation for fine-tuning,
(2) Expensive fine-tuning process, (3) Limited context window
• FMs' challenges: (1) Lower task-specific performance, (2) Data privacy concerns
• Advancements in FMs since TaRGET
• Large context (up to 1M in GPT-4.1), RAG techniques, multi-agent solutions, and
reasoning capability (O3, DeepSeek R1)
• Future research can utilize FMs to
1. Overcome TaRGET’s existing challenges (e.g., aiding data preparation)
2. Improve components of TaRGET’s approach (e.g., using RAG for repair context)
3. Explore trade-offs: When is fine-tuning worth it?
Summary
1. TaRGET shows that CLMs can be effectively tailored for repairing
tests, achieving 66% EM and 80% PR.
2. TaRGET significantly outperforms the baselines, highlighting the
importance of input-output formatting and repair context selection.
3. We introduce TaRBench, the most comprehensive benchmark
available.
4. Using our proposed repair trustworthiness predictor, TaRGET can be
utilized effectively.
5. TaRGET has the capability to generalize across new projects.
Publication
• This work was accepted for publication in IEEE Transactions on
Software Engineering (TSE), 2025.
https://doi.org/10.1109/TSE.2025.3541166
Code & Data
https://github.com/Ahmadreza-SY/TaRGET
https://doi.org/10.6084/m9.figshare.25008893