Large Language Models
for Test Case Repair
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
Introduction
3
State of the Art
• Significant body of work
in LLMs and testing
• Software Testing with
Large Language Models:
Survey, Landscape, and
Vision (Wang et al.
2023), ArXiv
4
Excerpt (Wang et al. 2023): With their increased capacity and enhanced capabilities, LLMs have become game changers in NLP and AI and are driving progress in other fields such as coding and software testing. LLMs have been used for various coding tasks, including code generation and code completion. However, there are concerns about the correctness and reliability of LLM-generated code, as studies have shown that it may not always be correct or meet the expected software requirements. When LLMs are instead used for software testing, e.g., to generate test cases or validate software behavior, the impact of this problem is weaker: the primary goal of testing is to identify issues in the system, not to generate correct code. At worst, the corresponding defects are simply not detected. Furthermore, seemingly incorrect LLM outputs may actually correspond to corner cases and help uncover bugs.
LLMs and Test Automation
• Natural match
• Translation: Requirements to test specifications
• Synthesis: Test code generation
• Clustering: Test minimization
• Classification: Flaky tests
5
Software Testing Tasks
• Identifying and designing test scenarios / specifications
• Test data generation
• Automated test infrastructure building
• Test minimization
• Test evolution and repair
6
Applying LLMs
• For each test problem and context, we need to determine:
• How to train them (architecture, data, objective)
• How to apply them:
• Zero-shot: apply to new tasks without any training examples for
those specific tasks
• Fine-tune: adjust the entire network to perform better in the target
task
• Few-shot “in-context” learning: provide a few training examples in the input (see the sketch below)
7
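To make the distinction between the application modes concrete, here is a minimal sketch of zero-shot vs. few-shot prompt construction for a test-generation task. The generate() call and the JUnit task are illustrative assumptions, not part of the approaches discussed later.

# Minimal sketch of zero-shot vs. few-shot "in context" prompting.
# generate() stands for any text-generation call (API or local model) and is hypothetical.

def zero_shot_prompt(requirement: str) -> str:
    # No task-specific examples: the model relies only on its pre-training.
    return f"Write a JUnit test for the following requirement:\n{requirement}\n"

def few_shot_prompt(requirement: str, examples: list[tuple[str, str]]) -> str:
    # A few (requirement, test) pairs are placed in the input as demonstrations.
    shots = "\n\n".join(
        f"Requirement:\n{req}\nTest:\n{test}" for req, test in examples
    )
    return f"{shots}\n\nRequirement:\n{requirement}\nTest:\n"

# Usage: test_code = generate(few_shot_prompt(new_requirement, labelled_examples))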
Types of Models
9
[Fig. 2 (Vogelsang and Fischbach): simplified overview of using different LLM architectures to solve tasks. Encoder-only LLMs: input → embedding → task model → output. Decoder-only LLMs: prompt → output. Encoder-decoder LLMs (e.g., T5, specialized for text-to-text tasks): input → output.]
Vogelsang and Fischbach, “Using Large Language Models for Natural Language Processing Tasks in Requirements Engineering: A Systematic Guideline”, ArXiv, 2024
Tools discussed or referenced in this talk (FlakyFix, TaRGet, LTM) build on models from these families, e.g., BERT, GPT-4, and T5.
Test Case Evolution
14
Test Case Evolution
15
Software system update → test case failure. Why?
1- Fault in the System Under Test (SUT) → fix the fault
2- Broken test case → test case repair
Test Repair Example
16
System Under Test (SUT) Test Case
Code Changes (Hunks) in the SUT Test Case Repair
Why is it an Acute Problem?
17
1- Frequent Updates: rapid software updates → frequent test case breakages → high demand for test repair
2- Software Quality: test case repair → high-quality test cases → maintaining software quality → high software quality
Test Case Evolution Cost
• Frequent system updates and large test suites
• Maintaining the quality of test suites is expensive
• Automated test case evolution?
18
TaRGet (Test Repair
Generator)
Yaraghi et al., “Automated Test
Case Repair Using Language
Models”, ArXiv, 2024
Input (context):
1. Test Context
• Broken test code
2. Repair Context
• SUT code changes relevant for
the repair (hunks)
19
Inputs and Output
20
Input
Output
Repair
1. Test Context
• Broken test code and broken line(s)
2. Repair Context
• SUT code changes relevant for the repair
Repair Context Prioritization
21
Why prioritize?
1. Input size limitations in CLMs
2. The effect of repair context order
Approach: prioritize SUT changes based on their relevance for the repair, using hunk prioritization heuristics (see the sketch below):
• Call graph depth
• Change in method/class
• TF-IDF similarity with test code
• Repetition
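As a concrete illustration of one of these heuristics, here is a minimal sketch of ranking SUT hunks by TF-IDF similarity with the broken test code, assuming scikit-learn. TaRGet combines several heuristics (call graph depth, scope, repetition); only the similarity part is sketched, and the function name is illustrative.

# Rank SUT hunks by TF-IDF similarity with the broken test code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_hunks_by_similarity(test_code: str, hunks: list[str]) -> list[str]:
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]*")
    matrix = vectorizer.fit_transform([test_code] + hunks)
    # Similarity of each hunk (rows 1..n) to the test code (row 0).
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    ranked = sorted(zip(hunks, scores), key=lambda p: p[1], reverse=True)
    return [hunk for hunk, _ in ranked]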
Hunk Changes Representations
22
1) Word Level Representation
2) Line Level Representation
Code Hunk
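To illustrate the two input representations, here is a small sketch of how one code hunk could be serialized at line level and at word level. The marker tokens ([HUNK], [DEL], [ADD]) are placeholders chosen for illustration; TaRGet defines its own special tokens.

# Two ways to serialize the same code hunk for a CLM input.
hunk = {
    "deleted": ["int total = price * qty;"],
    "added":   ["int total = price * qty + tax;"],
}

# 1) Line-level representation: whole deleted/added lines with markers.
line_level = (
    "[HUNK] "
    + " ".join("[DEL] " + l for l in hunk["deleted"])
    + " "
    + " ".join("[ADD] " + l for l in hunk["added"])
)

# 2) Word-level representation: only the changed tokens within the line.
word_level = "[HUNK] int total = price * qty [DEL] ; [ADD] + tax ;"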
Output Representations
23
Test Case Repair
1) Code Sequence
2) Edit Sequence
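Analogously for the output, here is a short illustration of the two output formats a CLM can be trained to produce; the edit-sequence token names are placeholders, not TaRGet's exact syntax.

broken_test = "assertEquals(5, cart.total());"

# 1) Code sequence: the full repaired test code is generated.
code_sequence = "assertEquals(7, cart.total());"

# 2) Edit sequence: only the edits to apply to the broken test are generated,
#    which is shorter than regenerating the whole test.
edit_sequence = "[REPLACE] assertEquals(5, [WITH] assertEquals(7,"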
Overview
24
• R: All SUT hunks
• Rm ⊂ R, Rc ⊂ R: method/class-level hunks within the test call graph
• CGD: Call Graph Depth
• CT: Context Type (method/class)
• BrkSim/TSim: Breakage/Test TF-IDF similarity with test code
LLM Selection
25
Selection criteria:
• Open-source availability
• Pre-training on code
• Parameter size under 1 billion (increasing applicability)
• Highest performance on HumanEval
Selected models:
Model Name | Parameter Size | Architecture
CodeT5+ | 770 million | Encoder-decoder
CodeGen Multi | 350 million | Decoder-only
PLBART | 140 million | Encoder-decoder
TaRBench: Benchmark Creation
Process
Research Questions
27
RQ1: Evaluating Performance of CLMs and Input-output Formats
RQ2: Repair Analysis
RQ2.1: Analyzing Repair Characteristics
RQ2.2: Predicting Repair Trustworthiness
RQ3: Fine-tuning analysis
RQ3.1: Analyzing Fine-tuning Data Size
RQ3.2: Assessing Model Generalization
Evaluation Metrics
28
Exact Match Rate (EM)
Measures repair candidates that exactly match
the ground truth
Plausible Repair Rate (PR)
Measures repair candidates that successfully
compile and pass
BLEU and CodeBLEU
Measure textual similarities between the repair
candidate and the ground truth
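As a small illustration of the textual metrics, here is a sketch of computing EM and BLEU over (candidate, ground-truth) pairs, assuming NLTK. PR (compile and pass) and CodeBLEU require a build/test harness and a dedicated implementation, so they are not shown; the whitespace normalization is an assumption.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(candidate: str, ground_truth: str) -> bool:
    # Whitespace-normalized comparison; stricter definitions are possible.
    return " ".join(candidate.split()) == " ".join(ground_truth.split())

def bleu(candidate: str, ground_truth: str) -> float:
    smoothing = SmoothingFunction().method1
    return sentence_bleu([ground_truth.split()], candidate.split(),
                         smoothing_function=smoothing)

# Example: EM rate over a list of (candidate, ground-truth) pairs.
pairs = [("assertEquals(7, total());", "assertEquals(7, total());"),
         ("assertEquals(5, total());", "assertEquals(7, total());")]
em_rate = sum(exact_match(c, g) for c, g in pairs) / len(pairs)  # 0.5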
Repair Characteristics in
Benchmark
29
1. Repair categories
• ORC: Test oracle change
• ARG: Argument modification
• INV: Invocation modification
• OTH: Others
2. Abstract Syntax Tree (AST)
edit actions
• Number of changes in the
code structure
Evaluation vs Characteristics
30
• The model shows reduced effectiveness in handling complex repairs.
• Even for complex repairs, however, it still achieves a 60% PR.
Predicting Repair
Trustworthiness
• Required for making any such LLM solution widely applicable
• Created a Random Forest Classifier utilizing test and repair
context features:
• complexity of SUT changes
• complexity of test code
• similarity between test and repair contexts
31
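Below is a minimal sketch of such a trustworthiness classifier, assuming scikit-learn; the feature names in the comments are illustrative and not TaRGet's exact feature set, and the train/test split is only for demonstration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def train_trust_model(X: np.ndarray, y: np.ndarray) -> RandomForestClassifier:
    # X: one feature row per generated repair (e.g., number/size of SUT hunks,
    #    test code size, TF-IDF similarity between test and repair contexts).
    # y: 1 if the repair turned out to be plausible (compiles and passes), else 0.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    pred = clf.predict(X_test)
    print("precision:", precision_score(y_test, pred),
          "recall:", recall_score(y_test, pred))
    return clf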
Predicting Repair
Trustworthiness
• High reliability and accuracy in predicting plausible repairs
1. 90% precision (confidence in recommending repair trust)
2. 88% recall (12% miss rate of trustworthy repairs)
• Practical Implications
1. Enhances TaRGet’s practicality
2. Saves time by avoiding low-quality repairs
32
Data Leak?
1. CodeT5+ has two pre-training sets:
1. CodeSearchNet: the latest version of the paper was published in June 2020.
2. GitHub Code: the BigQuery query date was March 16th, 2022.
2. CodeGen 350M-multi:
1. One of its datasets, The Pile, dates from 2020; for the other, BigQuery, no query date is mentioned.
2. Uploaded to Hugging Face on April 11th, 2022, with no significant changes after that.
3. PLBART:
1. One of its datasets, StackExchange, is dated September 7th, 2020; for the other, BigQuery, the query date is not specified.
2. Uploaded to Hugging Face on October 6th, 2021, with no changes after that.
33
Data Leak Analysis
34
Model | Count | Test set | Exact Match | Plausible Rate | CodeBLEU | BLEU
CodeT5+ | 59% | before 2022-03-16 | 64.7 | 78.0 | 77.4 | 76.8
CodeT5+ | 41% | after 2022-03-16 | 68.1 | 83.0 | 81.8 | 83.6
PLBART | 54% | before 2021-10-06 | 56.6 | 79.0 | 76.4 | 76.8
PLBART | 46% | after 2021-10-06 | 62.0 | 79.6 | 80.4 | 82.2
CodeGen | 62% | before 2022-04-11 | 57.8 | 76.7 | 75.0 | 75.0
CodeGen | 38% | after 2022-04-11 | 58.6 | 77.4 | 76.5 | 78.1
1. On average, 40% (2.8k) of the test set postdates the pre-training data.
2. Across all models, performance on data from after the pre-training date is consistently as good as or better than on data from before.
3. Therefore, there is no indication that the evaluation is biased by data leakage.
Impact of Fine-tuning Data Size
• Downsized fine-tuning data to 20%, 40%, 60%, and 80%
of the most recent data.
• The impact of data size is
significant for EM.
• Using an additional 7,000 fine-tuning samples results in an average increase of 1.85 percentage points in EM.
35
Assessing Generalization
• Stratified project exclusion from fine-tuning data:
• Creating 10 training set folds, each excluding five or six projects
• Fine-tuned 10 project-agnostic models based on the
above folds and evaluated them on excluded projects
• Project-agnostic models are acceptable alternatives, with an
average difference of 4.9 EM points.
• TaRGet can generalize to unseen projects.
36
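As an illustration of project-level exclusion, here is a sketch using scikit-learn's GroupKFold to build folds in which whole projects are held out; this is an analogous construction, not TaRGet's exact stratification procedure, and the function name is illustrative.

from sklearn.model_selection import GroupKFold

def project_agnostic_folds(samples, projects, n_folds=10):
    # samples: repair examples; projects: the project each sample belongs to.
    gkf = GroupKFold(n_splits=n_folds)
    for train_idx, test_idx in gkf.split(samples, groups=projects):
        train = [samples[i] for i in train_idx]    # fine-tuning data (projects excluded)
        held_out = [samples[i] for i in test_idx]  # evaluation on unseen projects
        yield train, held_out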
TaRGet: Conclusions
• TaRGet shows that CLMs can be effectively tailored for test evolution, achieving a 66% exact match rate, an 80% plausible repair rate, and an 80% CodeBLEU score.
• TaRGet performs less effectively on “complex repair scenarios”.
• Utilizing a trust assessment model, we know when to trust a repair and
can thus maximize the benefits of using TaRGet.
• TaRGet has the capability to generalize across previously unseen projects.
37
Future Work
• Additional information in the repair context: unchanged lines around SUT changes, added and deleted files and methods, error messages from the broken test
• General-purpose LLMs: Larger context windows, useful in the repair
context
• More diverse benchmark
• Better evaluation metrics: test behavior and semantics, e.g., coverage
similarity between candidate repair and ground truth
38
Test Flakiness Repair
39
Test Flakiness Repair
• Flaky tests intermittently pass and fail for the same version of the source code (i.e., non-deterministic test results).
• A previous study showed that the root cause of flakiness lies in the test code in more than 70% of cases.
• Why detect and repair flaky tests?
• Test failures caused by flaky tests can be hard to reproduce, as re-running is required (computationally expensive)
• Flaky tests might hide real bugs in the source code
• Tests become unreliable
• Software releases might be delayed
• Flaky tests are hard to detect and fix manually, so developers often simply ignore them
40
FlakyFix
41
Fatima et al., “FlakyFix: Using Large Language
Models for Predicting Flaky Test Fix Categories
and Test Code Repair”, ArXiv, 2024
Focused on repairing test code
Step 1: Predicting Fix Category
42
Cause of flakiness: HashMap should be replaced with LinkedHashMap to maintain the order in which elements are stored, regardless of how many times the code is executed.
The model predicts the fix category: Change Data Structure.
Models: fine-tuned UniXcoder and CodeBERT.
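The example above is Java-specific (HashMap vs. LinkedHashMap); here is an analogous, hypothetical Python illustration of the same kind of order-dependent flakiness and its fix.

def get_tags():
    return {"fast", "unit", "db"}  # a set has no guaranteed iteration order

def test_tags_flaky():
    # May pass or fail from one run to the next (string hashing is randomized).
    assert list(get_tags()) == ["fast", "unit", "db"]

def test_tags_fixed():
    # Fix: make the order deterministic before asserting on it.
    assert sorted(get_tags()) == ["db", "fast", "unit"]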
Step 1 Overview
43
• Labelled datasets of flaky tests are usually small
• Few-shot learning (FSL) classifiers learn, from limited data, what makes elements similar or makes them belong to the same class
• Fine-tune code LLMs (UniXcoder and CodeBERT) using a Siamese network to predict the flaky test fix category (see the sketch below)
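Here is a minimal sketch of the Siamese idea: the same encoder embeds two tests, and a cosine-based loss pulls same-category pairs together and pushes different-category pairs apart; at inference, a new test is assigned the category of its nearest labelled examples. The model checkpoint, pooling, and hyperparameters are assumptions, not the paper's exact setup.

import torch
from torch.nn import CosineEmbeddingLoss
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
loss_fn = CosineEmbeddingLoss(margin=0.5)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def embed(code: str) -> torch.Tensor:
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    # Mean-pool the last hidden states into a single vector.
    return encoder(**inputs).last_hidden_state.mean(dim=1)

def training_step(test_a: str, test_b: str, same_category: bool) -> float:
    target = torch.tensor([1.0 if same_category else -1.0])
    loss = loss_fn(embed(test_a), embed(test_b), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()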
International Dataset of Flaky
Tests (IDoFT)1
44
• Largest available dataset of flaky tests where the cause
of flakiness is in the test code
• 562 Flaky Tests in Java and their fixes
• Flaky Tests belong to 96 different projects, helpful for
generalizability
1. https://mir.cs.illinois.edu/flakytests
Step 1 Results
45
Flaky Test Fix Category | Average Precision (%) | Average Recall (%)
Change Assertion | 96 | 86
Reset Variable | 94 | 97
Change Condition | 87 | 87
Reorder Data | 90 | 72
Change Data Structure | 81 | 87
Handle Exception | 98 | 89
Change Data Format | 90 | 94
Reorder Parameters | 94 | 91
Misc | 67 | 62
Accurate, except Misc, as expected
Step 2: Repair
46
Given the flaky test and its predicted fix category (e.g., Change Data Structure), the LLM generates a repaired flaky test.
Model: 175B parameters, context window > 16k tokens
Step 2 Overview
47
Prompt 1 (fix category label) vs. Prompt 2 (in-context learning)

Fig. 6: Flaky test fix using GPT-3.5 Turbo with a fix category label.
Input prompt:
“This test case is Flaky: [Flaky Code]
This test can be fixed by changing the following information in the code: [Fix Category Label]
Just provide the full fixed code of this test case only without any other text description”
Output: [Fixed Flaky Test Code]

Fig. 7: Flaky test fix using GPT-3.5 Turbo with in-context learning.
Input prompt: same as Fig. 6, followed by:
“Here are some Flaky tests examples, their fixes and fix category labels: [Examples]”
Output: [Fixed Flaky Test Code]
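To show how the two prompt variants above relate, here is a sketch of assembling them in code; call_llm is a placeholder for whatever GPT-3.5 Turbo client is used, not a real API.

def prompt_with_label(flaky_code: str, fix_category: str) -> str:
    # Fig. 6 style: flaky test plus predicted fix category label.
    return (
        f"This test case is Flaky: {flaky_code}\n"
        f"This test can be fixed by changing the following information in the code: {fix_category}\n"
        "Just provide the full fixed code of this test case only without any other text description"
    )

def prompt_with_icl(flaky_code: str, fix_category: str, examples: str) -> str:
    # Fig. 7 style: same prompt, extended with in-context examples.
    return (
        prompt_with_label(flaky_code, fix_category)
        + "\nHere are some Flaky tests examples, their fixes and fix category labels: "
        + examples
    )

# fixed_code = call_llm(prompt_with_icl(flaky_code, "Change Data Structure", examples))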
Step 2: Fix Category Labels
48
Higher CodeBLEU score with fix labels
Step 2: In-Context Learning
49
In-context learning helps for 2/3 categories
FlakyFix: Conclusions
50
• Automated repair of flakiness seems promising
• Two-step process:
• (1) predict the fix category first,
• (2) generate an actual fix using a pre-trained LLM with this
category, the test code, and few-shot learning
• High degree of similarity between the output (generated fix) and the ground truth (actual fix)
Future Work
51
• Better evaluation metrics
• Analyze the execution of fixed test cases
• Larger, more diverse benchmark dataset
• User study and feedback
• Fix causes of flakiness in production code
Conclusions
52
LLM Fine Tuning
• Fine-tuning: additional training datasets (e.g., requirements, tests, repairs) to turn a general-purpose model into a specialized one
• Availability: (1) data may not be available (in sufficient amounts), (2) data may change frequently
• Reusability: a fine-tuned model is not reusable for other tasks/domains
• Overfitting: the model may fit too closely to the training dataset and not generalize
• Fine-tuning was possible for TaRGet but not for FlakyFix
• We checked for overfitting with TaRGet
53
Prompt Engineering
• Instruction: a specific task or instruction you want the model to
perform, e.g., guidance to generate test inputs
• Context: external information or additional context that can steer
the model to better responses, e.g., (part of the) code under test
• Input Data: the input or question that we are interested in finding
a response for, e.g., requirements
• Output Indicator: the type or format of the output, e.g., test
format (e.g., XML)
• Training examples, e.g., actual test inputs and outputs (see the template sketch below)
54
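Here is a minimal sketch of a prompt template combining the elements listed above; the wording and the test-generation examples in the comments are illustrative only.

def build_prompt(instruction: str, context: str, input_data: str,
                 output_indicator: str, examples: str = "") -> str:
    parts = [
        f"Instruction: {instruction}",         # e.g., guidance to generate test inputs
        f"Context: {context}",                 # e.g., (part of) the code under test
        f"Input: {input_data}",                # e.g., the requirement of interest
        f"Output format: {output_indicator}",  # e.g., test format such as XML
    ]
    if examples:
        parts.append(f"Examples:\n{examples}")  # few-shot training examples
    return "\n\n".join(parts)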
Prompt Engineering
• Optimizing prompts, in terms of effectiveness and cost, for the
many relevant testing problems and contexts, such as prioritizing
and selecting context information (e.g., context prioritization in
TaRGet).
• FlakyFix: Optimized prompts with additional information (fix
category) and examples
• TaRGet: Prioritized SUT hunks according to relevance for test case
(changed code)
• Dynamic prompt optimization driven by multi-objective search, i.e.,
a new SBSE application.
• Static and dynamic analyses for prompt engineering
55
Empirical Evaluation
• Usually, many alternatives for automating a task
• Large evaluation space: Models * Application strategies * Hyperparameters
• Evaluation metrics are not enough
• The real question is what is the gain for the end user compared to current
practices and tools, in real development processes
• Only user studies can answer that; they are very difficult to run and are therefore rare
56
Benchmark Datasets
• Results are strongly influenced by the benchmarks used
• Often hard and expensive to build: Size and diversity
• Generalizability?
• TaRGet:
• Huge effort in building a large and diverse benchmark
• Attempt to address generalizability
58
Solution Applicability
• Data scarcity: Availability of data (labeled or not) for fine-tuning,
e.g., TaRBench versus FlakyFix.
• Size of the model (parameters) determines storage, memory, and
computational requirements, e.g., in TaRGet, we selected smaller
models.
• Automated support for prompt engineering, e.g., generating,
selecting, and prioritizing context information, e.g., TaRGet
prioritization of hunk changes.
• Dealing effectively with LLM uncertainty, e.g., TaRGet
trustworthiness model
61
Large Language Models
for Test Case Repair
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
SyMeCo Fellowship
• SyMeCo is a Marie Skłodowska-Curie postdoctoral fellowship
programme coordinated by Lero
• Co-funded by Science Foundation Ireland and the EU
• 16 fellowships of 2-year duration based in Ireland across 8
Higher Education Institutions
• Open to researchers of any nationality
Backup
65
LLMs are Moving Fast
66
[Fig. 4 (Hou et al.): distribution of the LLMs and LLM-based applications discussed in the collected papers, 2018-2024, grouped by architecture: encoder-only (e.g., BERT, RoBERTa, CodeBERT, GraphCodeBERT), encoder-decoder (e.g., T5, CodeT5, CodeT5+, PLBART, UniXcoder), and decoder-only (e.g., GPT-2/3/3.5/4, ChatGPT, Codex, Copilot, CodeGen, StarCoder, LLaMA, CodeLlama). Numbers in parentheses in the original figure indicate the count of papers in which each LLM has been used.]
Hou et al., “Large Language Models for Software Engineering:
A Systematic Literature Review ”, ArXiv, 2024
LLM Applications
67
Hou et al., “Large Language Models for Software Engineering:
A Systematic Literature Review ”, ArXiv, 2024
[Fig. 8 (Hou et al.): distribution of LLM utilization across SE activities and problem types.
(a) SE activities: software development 58.37%, software maintenance 24.89%, software quality assurance 10.30%, requirements engineering 4.72%, software design 1.29%, software management 0.43%.
(b) Problem types: generation 64.34%, classification 24.48%, recommendation 9.79%, regression 1.40%.]
