Automated Test Case Repair Using
Language Models
Ahmadreza Saboor Yaraghi • Darren Holden • Nafiseh Kahani • Lionel Briand
IEEE Transactions on Software Engineering 2025
a.saboor@uottawa.ca • darren.holden@carleton.ca • kahani@sce.carleton.ca • lbriand@uottawa.ca
www.nanda-lab.ca
University of Ottawa
School of Electrical Engineering & Computer Science
Nanda Lab
Research supported by Huawei Canada
What is Regression Testing?
Software testing ensures that software works as expected.
It is critical for delivering reliable and functional software.
Regression testing ensures existing functionalities still work when the software
evolves (e.g., bug fixes or new features).
Why regression testing?
1. Rapid software evolution
2. Interdependent features (small changes can break features)
Motivating Challenges in Regression Testing
A test execution failure during regression testing has two possible causes:
1. A fault in the System Under Test (SUT)
2. Broken test code
Challenge: Maintenance Overhead
• Frequent test code changes in fast-evolving systems increase
development costs.
• Ignoring broken tests affects testing quality and software
reliability.
For an SUT fault: localizing and fixing the fault in the SUT.
For broken test code, the goal is Automated Test Evolution and Repair.
Broken Test Repair: Motivating Example
(Figure: an SUT and its test case; code changes (hunks) in the SUT add the currency feature, requiring a test case repair.)
Limitations of Existing Work
Most approaches target specific repair categories (e.g., assertions).
Most existing benchmarks are limited in size and diversity (e.g., only
91 broken test instances across four projects).
Most existing approaches lack reproducibility due to missing
publicly available replication packages.
Contributions (Automated Test Case Repair)
1. TaRGET (Test Repair GEneraTor)
• Using fine-tuned code language models (CLMs)
• Not limited to specific repair categories or programming languages
2. TaRBench (Test Repair Benchmark)
• Large and diverse benchmark
• Includes 45.3k instances across 59 projects
3. Addressed three research questions (RQs)
• Evaluated CLM performance and data formatting (prompting)
• Analyzed factors impacting performance
Our Approach: TaRGET
The Input and Output Format
1. Test Context
• Broken test code and broken line(s)
2. Repair Context
• SUT code changes (hunks) relevant for
the repair
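The input format above can be sketched as plain string assembly. This is an illustrative sketch, not TaRGET's exact implementation; the separator tokens `[TEST]`, `[BREAKAGE]`, and `[HUNK]` are hypothetical placeholders for whatever special tokens the fine-tuned model uses.

```python
def build_input(broken_test: str, broken_lines: list[str], hunks: list[str]) -> str:
    """Concatenate the test context and the repair context into one model input."""
    # Test context: the broken test code plus the specific broken line(s)
    test_ctx = "[TEST] " + broken_test + " [BREAKAGE] " + " ".join(broken_lines)
    # Repair context: the (prioritized) SUT hunks relevant for the repair
    repair_ctx = " ".join("[HUNK] " + h for h in hunks)
    return test_ctx + " " + repair_ctx

prompt = build_input(
    broken_test="assertEquals(10.0, cart.total());",
    broken_lines=["assertEquals(10.0, cart.total());"],
    hunks=["- double total()", "+ Money total(Currency c)"],
)
```

The model's output is then the repaired test, in one of the output representations described later.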
Repair Context Prioritization
1. Input size limitations in CLMs
2. Impact of repair context order
Prioritizing SUT changes (hunks)
based on relevancy for repair
Hunk Prioritization Heuristics:
• Call graph depth
• Change in Method/Class
• TF-IDF similarity with broken test code
• Hunk Repetition
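A minimal sketch of the prioritization idea (not the paper's exact heuristics or weights): rank hunks by lexical similarity to the broken test code, breaking ties by call-graph depth. Jaccard token overlap stands in here for the TF-IDF similarity TaRGET uses.

```python
def prioritize_hunks(hunks: list[str], broken_code: str, call_depth: dict) -> list[str]:
    """Return hunks ordered from most to least repair-relevant.

    call_depth maps a hunk's index to its depth in the test's call graph
    (smaller means closer to the test)."""
    def tokens(s: str) -> set:
        return set(s.split())

    broken = tokens(broken_code)

    def key(i: int):
        h = tokens(hunks[i])
        # Jaccard overlap as a stand-in for TF-IDF similarity
        similarity = len(h & broken) / max(1, len(h | broken))
        # Sort: higher similarity first, then shallower call-graph depth
        return (-similarity, call_depth.get(i, 99))

    return [hunks[i] for i in sorted(range(len(hunks)), key=key)]

ordered = prioritize_hunks(
    hunks=["+ Money total(Currency c)", "+ import java.util.Locale;"],
    broken_code="assertEquals(10.0, cart.total());",
    call_depth={0: 1, 1: 2},
)
```

Under this toy scoring, the hunk closer to the test in the call graph wins the tie and comes first.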
Code Hunk Representation
1) Word-Level Representation
2) Line-Level Representation
Output Representations
1) Code Sequence
2) Edit Sequence
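The difference between the two output representations can be sketched with a line diff: a code sequence emits the full repaired test, while an edit sequence emits only what changed. The `<del>`/`<add>` markers below are hypothetical, and this uses Python's stdlib `difflib` rather than the paper's exact encoding.

```python
import difflib

def to_edit_sequence(broken: list[str], repaired: list[str]) -> str:
    """Encode a repair as deleted/added lines instead of the full repaired code."""
    ops = []
    matcher = difflib.SequenceMatcher(None, broken, repaired)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            ops.append("<del> " + " ".join(broken[i1:i2]))
        if tag in ("replace", "insert"):
            ops.append("<add> " + " ".join(repaired[j1:j2]))
    return " ".join(ops)

edit_seq = to_edit_sequence(
    ["assertEquals(10.0, cart.total());"],
    ["assertEquals(10.0, cart.total(USD));"],
)
```

An edit sequence is typically much shorter than the full code sequence, which matters given the CLMs' limited output length, but it must be applied back to the broken test to obtain runnable code.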
Input-Output Formats Overview
• Repair Context
• R: All SUT hunks
• Rm ⊂ R: Method-level hunks within the test call graph
• Rc ⊂ R: Class-level hunks within the test call graph
• Hunk Prioritization
• CGD: Call Graph Depth
• CT: Context Type (method/class)
• BrkSim: TF-IDF similarity with broken code
• TSim: TF-IDF similarity with test code
• Rep: Number of hunk repetitions
TaRBench: Collecting Valid Test Repairs
• We detect valid repairs through three executions:
1. Test V1 on SUT V1 should pass
2. Test V1 on SUT V2 should fail
3. Test V2 on SUT V2 should pass
(Figure: SUT V1 evolves into SUT V2 through a code change; Test Case V1 is updated to Test Case V2, the potential repair. Numbers 1 to 3 mark the three executions.)
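The three-execution validity check can be sketched directly. Here `run_test` is a stand-in for actually checking out, building, and executing a project version, which the real pipeline must do.

```python
def is_valid_repair(run_test) -> bool:
    """Decide whether a candidate test repair is valid.

    run_test(test_version, sut_version) -> True if the test passes."""
    return (
        run_test("V1", "V1")          # 1. original test passes on the old SUT
        and not run_test("V1", "V2")  # 2. original test fails on the new SUT (it is broken)
        and run_test("V2", "V2")      # 3. updated test passes on the new SUT (it is a repair)
    )
```

A test change that keeps passing on both SUT versions (e.g., a pure refactoring) fails check 2 and is excluded, which is what filters out non-repair test edits.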
TaRBench: Benchmark Creation Process
Research Questions
RQ1: Repair Performance
RQ1.1: CLMs and Input-Output Formats
RQ1.2: TaRGET Against Baselines
RQ2: Repair Analysis
RQ2.1: Analyzing Repair Characteristics
RQ2.2: Predicting Repair Trustworthiness
RQ3: Fine-tuning analysis
RQ3.1: Analyzing Fine-tuning Data Size
RQ3.2: Assessing Model Generalization
Evaluation Metrics
Exact Match Accuracy (EM)
Measures repair candidates that exactly
match the ground truth
Plausible Repair Accuracy (PR)
Measures repair candidates that
successfully compile and pass
BLEU and CodeBLEU
Measure textual similarities between the
repair candidate and the ground truth
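Of these metrics, EM is the simplest to sketch. The whitespace normalization below is an illustrative choice, not necessarily the paper's exact matching rule; PR is not shown because it requires compiling and running each candidate.

```python
def exact_match(candidates: list[str], ground_truths: list[str]) -> float:
    """Fraction of repair candidates identical to the ground truth,
    ignoring whitespace differences."""
    def norm(s: str) -> str:
        return " ".join(s.split())  # collapse all runs of whitespace
    hits = sum(norm(c) == norm(g) for c, g in zip(candidates, ground_truths))
    return hits / len(candidates)

score = exact_match(
    ["int x = 1;", "foo( )"],
    ["int  x = 1;", "bar()"],
)
```

Note that EM is a lower bound on repair quality: a candidate can differ textually from the ground truth yet still compile and pass, which is exactly what PR captures.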
RQ1.1: CLM and Data Formatting Performance
• CodeT5+ (CT5+) with IO2 shows best overall performance
• PLBART (PLB) outperforms CodeGen (CG) and is comparable to CodeT5+
• The choice between PLBART and CodeT5+ involves a cost-benefit trade-off
• IO2 and IO3 consistently yield the best results
RQ1.2: TaRGET Against Baselines
• Baselines:
• CEPROT [1] (SOTA): Automatically detects and updates obsolete tests using a
fine-tuned CodeT5 model and introduces a new dataset.
• NoContext: Fine-tuning CodeT5+ without repair context
• SUTCopy: Replicating SUT changes in test code if applicable
[1] X. Hu, Z. Liu, X. Xia, Z. Liu, T. Xu, and X. Yang, “Identify and update test cases when production code changes: A transformer-based approach,” in 2023 38th IEEE/ACM
International Conference on Automated Software Engineering (ASE), 2023, pp. 1111–1122.
On 214 test set instances of CEPROT's benchmark:
  TaRGET: 40.6% EM
  CEPROT: 21% EM

On 7,103 test set instances of TaRBench:
  TaRGET: 66% EM
  NoContext: 29% EM
  SUTCopy: 11% EM
RQ2.1: Repair Characteristics in TaRBench
1. Repair categories
• ORC: Test oracle change
• ARG: Argument modification
• INV: Invocation modification
• OTH: Others
2. Abstract Syntax Tree (AST)
edit actions
• Number of changes in the
code structure
RQ2.1: Evaluation Based on Repair Characteristics
The model shows reduced effectiveness
in handling complex repairs.
RQ2.2: Predicting Repair Trustworthiness
• Created a Random Forest classifier utilizing test and repair context (input)
features
• Results show high prediction accuracy.
• Practical Implications
1. Enhances TaRGET's practicality
2. Saves time by avoiding low-quality repairs
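The feature side of such a predictor can be sketched as follows. The feature names below are illustrative placeholders, not the paper's exact feature set; in practice these vectors would be fed to a Random Forest classifier (e.g., scikit-learn's `RandomForestClassifier`) trained on whether past repairs were plausible.

```python
def repair_features(broken_test: str, hunks: list[str]) -> dict:
    """Extract simple input features describing a repair instance.

    Feature names are hypothetical stand-ins for the test and
    repair context features the classifier would consume."""
    test_tokens = broken_test.split()
    return {
        "test_len": len(test_tokens),                          # size of the broken test
        "num_hunks": len(hunks),                               # amount of repair context
        "context_len": sum(len(h.split()) for h in hunks),     # total context size
        "has_assert": any("assert" in t.lower() for t in test_tokens),
    }

feats = repair_features("assertEquals(1, f());", ["+ int f()", "- int f(int x)"])
```

The practical payoff is a filter: candidates the classifier flags as untrustworthy can be skipped before any compile-and-run effort is spent on them.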
RQ3.1: Impact of Fine-tuning Data Size
• Downsized fine-tuning data to 20%, 40%, 60%, and 80% of the most recent data.
• The impact of data size is significant for EM.
• For EM, each additional 7,000 fine-tuning samples yields an average increase of 1.85 percentage points.
RQ3.2: Assessing Generalization
• Stratified project exclusion from fine-tuning data:
• Creating 10 folds, each excluding six projects
• Keeping the evaluation set unchanged
• Fine-tuned 10 project-agnostic models based on the above folds
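The fold construction above can be sketched as a simple disjoint split. This is an assumption-laden toy: TaRBench has 59 projects and the paper's exclusion is stratified, whereas this sketch uses 60 placeholder project names and a plain sequential grouping just to show the shape of the setup.

```python
def make_folds(projects: list[str], n_folds: int = 10, exclude_per_fold: int = 6) -> list[list[str]]:
    """Build fine-tuning folds, each excluding a disjoint group of projects.

    The evaluation set is untouched; only the fine-tuning data shrinks."""
    folds = []
    for i in range(n_folds):
        excluded = set(projects[i * exclude_per_fold:(i + 1) * exclude_per_fold])
        folds.append([p for p in projects if p not in excluded])
    return folds

# 60 placeholder project names (the real benchmark has 59 projects)
folds = make_folds([f"proj{i}" for i in range(60)])
```

Each fold then trains one project-agnostic model, which is evaluated on the held-out projects it never saw during fine-tuning.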
• Project-agnostic models are acceptable alternatives, with an average
difference of 4.9 EM points.
• TaRGET can effectively generalize to unseen projects.
TaRGET and Advances in Foundation Models (FMs)
• TaRGET’s challenges: (1) Time-intensive data preparation for fine-tuning,
(2) Expensive fine-tuning process, (3) Limited context window
• FMs' challenges: (1) Lower task-specific performance, (2) Data privacy concerns
• Advancements in FMs since TaRGET
• Large context (up to 1M in GPT-4.1), RAG techniques, multi-agent solutions, and
reasoning capability (O3, DeepSeek R1)
• Future research can utilize FMs to
1. Overcome TaRGET’s existing challenges (e.g., aiding data preparation)
2. Improve components of TaRGET’s approach (e.g., using RAG for repair context)
3. Explore trade-offs: When is fine-tuning worth it?
Summary
1. TaRGET shows that CLMs can be effectively tailored for repairing
tests, achieving 66% EM and 80% PR.
2. TaRGET significantly outperforms the baselines, highlighting the
importance of input-output formatting and repair context selection.
3. We introduce TaRBench, the most comprehensive benchmark
available.
4. Using our proposed repair trustworthiness predictor, TaRGET can be
utilized effectively.
5. TaRGET has the capability to generalize across new projects.
Publication
• This work was accepted for publication in IEEE Transactions on
Software Engineering (TSE), 2025.
https://doi.org/10.1109/TSE.2025.3541166
Code & Data
https://github.com/Ahmadreza-SY/TaRGET
https://doi.org/10.6084/m9.figshare.25008893