Large Language Models
for Test Case Repair
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
Introduction
3
State of the Art
• Significant body of work
in LLMs and testing
• Software Testing with
Large Language Models:
Survey, Landscape, and
Vision (Wang et al.
2023), ArXiv
4
Excerpt (Wang et al. 2023): With their increased capacity and enhanced capabilities, LLMs have become game changers in NLP and AI and are driving progress in other fields such as coding and software testing. LLMs have been used for various coding tasks, including code generation and code completion. However, there are concerns about the correctness and reliability of LLM-generated code, as studies have shown that it may not always be correct or meet the expected software requirements. When LLMs are instead used for software testing, e.g., to generate test cases or validate software behavior, the impact of this problem is weaker: the primary goal of testing is to identify issues in the system, not to generate correct code. At worst, the corresponding defects are simply not detected. Furthermore, seemingly incorrect LLM outputs may actually correspond to corner cases and help uncover bugs.
LLMs and Test Automation
• Natural match
• Translation: Requirements to test specifications
• Synthesis: Test code generation
• Clustering: Test minimization
• Classification: Flaky tests
5
Software Testing Tasks
• Identifying and designing test scenarios / specifications
• Test data generation
• Automated test infrastructure building
• Test minimization
• Test evolution and repair
6
Applying LLMs
• For each test problem and context, we need to determine:
• How to train them (architecture, data, objective)
• How to apply them:
• Zero-shot: apply to new tasks without any training examples for
those specific tasks
• Fine-tune: adjust the entire network to perform better in the target
task
• Few-shot “in-context” learning: provide a few training examples in the input (see the sketch below)
7
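To make the distinction between the application modes concrete, here is a minimal sketch of zero-shot vs. few-shot prompt construction for a test-generation task. The generate() call and the JUnit task are illustrative assumptions, not part of the approaches discussed later.

# Minimal sketch of zero-shot vs. few-shot "in context" prompting.
# generate() stands for any text-generation call (API or local model) and is hypothetical.

def zero_shot_prompt(requirement: str) -> str:
    # No task-specific examples: the model relies only on its pre-training.
    return f"Write a JUnit test for the following requirement:\n{requirement}\n"

def few_shot_prompt(requirement: str, examples: list[tuple[str, str]]) -> str:
    # A few (requirement, test) pairs are placed in the input as demonstrations.
    shots = "\n\n".join(
        f"Requirement:\n{req}\nTest:\n{test}" for req, test in examples
    )
    return f"{shots}\n\nRequirement:\n{requirement}\nTest:\n"

# Usage: test_code = generate(few_shot_prompt(new_requirement, labelled_examples))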
Types of Models
9
[Fig. 2 (Vogelsang and Fischbach): simplified overview of using different LLM architectures to solve tasks. Encoder-only LLMs: input → embedding → task model → output. Decoder-only LLMs: prompt → output. Encoder-decoder LLMs (e.g., T5, specialized for text-to-text tasks): input → output.]
Vogelsang and Fischbach, “Using Large Language Models for Natural Language Processing Tasks in Requirements Engineering: A Systematic Guideline”, ArXiv, 2024
Tools discussed or referenced in this talk (FlakyFix, TaRGet, LTM) build on models from these families, e.g., BERT, GPT-4, and T5.
Test Case Evolution
14
Test Case Evolution
15
Software system update → test case failure. Why?
1- Fault in the System Under Test (SUT) → fix the fault
2- Broken test case → test case repair
Test Repair Example
16
System Under Test (SUT) Test Case
Code Changes (Hunks) in the SUT Test Case Repair
Why is it an Acute Problem?
17
1- Frequent Updates: rapid software updates → frequent test case breakages → high demand for test repair
2- Software Quality: test case repair → high-quality test cases → maintaining software quality → high software quality
Test Case Evolution Cost
• Frequent system updates and large test suites
• Maintaining the quality of test suites is expensive
• Automated test case evolution?
18
TaRGet (Test Repair
Generator)
Yaraghi et al., “Automated Test
Case Repair Using Language
Models”, ArXiv, 2024
Input (context):
1. Test Context
• Broken test code
2. Repair Context
• SUT code changes relevant for
the repair (hunks)
19
Inputs and Output
20
Input
Output
Repair
1. Test Context
• Broken test code and broken line(s)
2. Repair Context
• SUT code changes relevant for the repair
Repair Context Prioritization
21
Why prioritize?
1. Input size limitations in CLMs
2. The effect of repair context order
Approach: prioritize SUT changes based on their relevance for the repair, using hunk prioritization heuristics (see the sketch below):
• Call graph depth
• Change in method/class
• TF-IDF similarity with test code
• Repetition
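As a concrete illustration of one of these heuristics, here is a minimal sketch of ranking SUT hunks by TF-IDF similarity with the broken test code, assuming scikit-learn. TaRGet combines several heuristics (call graph depth, scope, repetition); only the similarity part is sketched, and the function name is illustrative.

# Rank SUT hunks by TF-IDF similarity with the broken test code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_hunks_by_similarity(test_code: str, hunks: list[str]) -> list[str]:
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]*")
    matrix = vectorizer.fit_transform([test_code] + hunks)
    # Similarity of each hunk (rows 1..n) to the test code (row 0).
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    ranked = sorted(zip(hunks, scores), key=lambda p: p[1], reverse=True)
    return [hunk for hunk, _ in ranked]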
Hunk Changes Representations
22
1) Word Level Representation
2) Line Level Representation
Code Hunk
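To illustrate the two input representations, here is a small sketch of how one code hunk could be serialized at line level and at word level. The marker tokens ([HUNK], [DEL], [ADD]) are placeholders chosen for illustration; TaRGet defines its own special tokens.

# Two ways to serialize the same code hunk for a CLM input.
hunk = {
    "deleted": ["int total = price * qty;"],
    "added":   ["int total = price * qty + tax;"],
}

# 1) Line-level representation: whole deleted/added lines with markers.
line_level = (
    "[HUNK] "
    + " ".join("[DEL] " + l for l in hunk["deleted"])
    + " "
    + " ".join("[ADD] " + l for l in hunk["added"])
)

# 2) Word-level representation: only the changed tokens within the line.
word_level = "[HUNK] int total = price * qty [DEL] ; [ADD] + tax ;"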
Output Representations
23
Test Case Repair
1) Code Sequence
2) Edit Sequence
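Analogously for the output, here is a short illustration of the two output formats a CLM can be trained to produce; the edit-sequence token names are placeholders, not TaRGet's exact syntax.

broken_test = "assertEquals(5, cart.total());"

# 1) Code sequence: the full repaired test code is generated.
code_sequence = "assertEquals(7, cart.total());"

# 2) Edit sequence: only the edits to apply to the broken test are generated,
#    which is shorter than regenerating the whole test.
edit_sequence = "[REPLACE] assertEquals(5, [WITH] assertEquals(7,"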
Overview
24
• R: All SUT hunks
• Rm ⊂ R, Rc ⊂ R: method/class-level hunks within the test call graph
• CGD: Call Graph Depth
• CT: Context Type (method/class)
• BrkSim/TSim: Breakage/Test TF-IDF similarity with test code
LLM Selection
25
Selection criteria:
• Open-source availability
• Pre-training on code
• Parameter size under 1 billion (increasing applicability)
• Highest performance on HumanEval
Selected models:
Model Name | Parameter Size | Architecture
CodeT5+ | 770 million | Encoder-decoder
CodeGen Multi | 350 million | Decoder-only
PLBART | 140 million | Encoder-decoder
TaRBench: Benchmark Creation
Process
Research Questions
27
RQ1: Evaluating Performance of CLMs and Input-output Formats
RQ2: Repair Analysis
RQ2.1: Analyzing Repair Characteristics
RQ2.2: Predicting Repair Trustworthiness
RQ3: Fine-tuning analysis
RQ3.1: Analyzing Fine-tuning Data Size
RQ3.2: Assessing Model Generalization
Evaluation Metrics
28
Exact Match Rate (EM)
Measures repair candidates that exactly match
the ground truth
Plausible Repair Rate (PR)
Measures repair candidates that successfully
compile and pass
BLEU and CodeBLEU
Measure textual similarities between the repair
candidate and the ground truth
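As a small illustration of the textual metrics, here is a sketch of computing EM and BLEU over (candidate, ground-truth) pairs, assuming NLTK. PR (compile and pass) and CodeBLEU require a build/test harness and a dedicated implementation, so they are not shown; the whitespace normalization is an assumption.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(candidate: str, ground_truth: str) -> bool:
    # Whitespace-normalized comparison; stricter definitions are possible.
    return " ".join(candidate.split()) == " ".join(ground_truth.split())

def bleu(candidate: str, ground_truth: str) -> float:
    smoothing = SmoothingFunction().method1
    return sentence_bleu([ground_truth.split()], candidate.split(),
                         smoothing_function=smoothing)

# Example: EM rate over a list of (candidate, ground-truth) pairs.
pairs = [("assertEquals(7, total());", "assertEquals(7, total());"),
         ("assertEquals(5, total());", "assertEquals(7, total());")]
em_rate = sum(exact_match(c, g) for c, g in pairs) / len(pairs)  # 0.5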
Repair Characteristics in
Benchmark
29
1. Repair categories
• ORC: Test oracle change
• ARG: Argument modification
• INV: Invocation modification
• OTH: Others
2. Abstract Syntax Tree (AST)
edit actions
• Number of changes in the
code structure
Evaluation vs Characteristics
30
• The model shows reduced effectiveness in handling complex repairs.
• Even for complex repairs, however, it still achieves a 60% PR.
Predicting Repair
Trustworthiness
• Required for making any such LLM solution widely applicable
• Created a Random Forest Classifier utilizing test and repair
context features:
• complexity of SUT changes
• complexity of test code
• similarity between test and repair contexts
31
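Below is a minimal sketch of such a trustworthiness classifier, assuming scikit-learn; the feature names in the comments are illustrative and not TaRGet's exact feature set, and the train/test split is only for demonstration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def train_trust_model(X: np.ndarray, y: np.ndarray) -> RandomForestClassifier:
    # X: one feature row per generated repair (e.g., number/size of SUT hunks,
    #    test code size, TF-IDF similarity between test and repair contexts).
    # y: 1 if the repair turned out to be plausible (compiles and passes), else 0.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    pred = clf.predict(X_test)
    print("precision:", precision_score(y_test, pred),
          "recall:", recall_score(y_test, pred))
    return clf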
Predicting Repair
Trustworthiness
• High reliability and accuracy in predicting plausible repairs
1. 90% precision (confidence in recommending repair trust)
2. 88% recall (12% miss rate of trustworthy repairs)
• Practical Implications
1. Enhances TaRGet’s practicality
2. Saves time by avoiding low-quality repairs
32
Data Leak?
1. CodeT5+ has two pre-training sets:
1. CodeSearchNet: the latest version of the paper was published in June 2020.
2. GitHub Code: the BigQuery query date was March 16th, 2022.
2. CodeGen 350M-multi:
1. One of its datasets, The Pile, dates from 2020; for the other, BigQuery, no query date is mentioned.
2. Uploaded to Hugging Face on April 11th, 2022, with no significant changes after that.
3. PLBART:
1. One of its datasets, StackExchange, is dated September 7th, 2020; for the other, BigQuery, the query date is not specified.
2. Uploaded to Hugging Face on October 6th, 2021, with no changes after that.
33
Data Leak Analysis
34
Model | Count | Test set | Exact Match | Plausible Rate | CodeBLEU | BLEU
CodeT5+ | 59% | before 2022-03-16 | 64.7 | 78.0 | 77.4 | 76.8
CodeT5+ | 41% | after 2022-03-16 | 68.1 | 83.0 | 81.8 | 83.6
PLBART | 54% | before 2021-10-06 | 56.6 | 79.0 | 76.4 | 76.8
PLBART | 46% | after 2021-10-06 | 62.0 | 79.6 | 80.4 | 82.2
CodeGen | 62% | before 2022-04-11 | 57.8 | 76.7 | 75.0 | 75.0
CodeGen | 38% | after 2022-04-11 | 58.6 | 77.4 | 76.5 | 78.1
1. On average, 40% (2.8k) of the test set postdates the pre-training data.
2. Across all models, performance on data from after the pre-training date is consistently as good as or better than on data from before.
3. Therefore, there is no indication that the evaluation is biased by data leakage.
Impact of Fine-tuning Data Size
• Downsized fine-tuning data to 20%, 40%, 60%, and 80%
of the most recent data.
• The impact of data size is
significant for EM.
• Using an additional 7,000 fine-tuning samples results in an average increase of 1.85 percentage points in EM.
35
Assessing Generalization
• Stratified project exclusion from fine-tuning data:
• Creating 10 training set folds, each excluding five or six projects
• Fine-tuned 10 project-agnostic models based on the
above folds and evaluated them on excluded projects
• Project-agnostic models are acceptable alternatives, with an
average difference of 4.9 EM points.
• TaRGet can generalize to unseen projects.
36
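As an illustration of project-level exclusion, here is a sketch using scikit-learn's GroupKFold to build folds in which whole projects are held out; this is an analogous construction, not TaRGet's exact stratification procedure, and the function name is illustrative.

from sklearn.model_selection import GroupKFold

def project_agnostic_folds(samples, projects, n_folds=10):
    # samples: repair examples; projects: the project each sample belongs to.
    gkf = GroupKFold(n_splits=n_folds)
    for train_idx, test_idx in gkf.split(samples, groups=projects):
        train = [samples[i] for i in train_idx]    # fine-tuning data (projects excluded)
        held_out = [samples[i] for i in test_idx]  # evaluation on unseen projects
        yield train, held_out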
TaRGet: Conclusions
• TaRGet shows that CLMs can be effectively tailored for test evolution, achieving a 66% exact match rate, an 80% plausible repair rate, and an 80% CodeBLEU score.
• TaRGet performs less effectively on “complex repair scenarios”.
• Utilizing a trust assessment model, we know when to trust a repair and
can thus maximize the benefits of using TaRGet.
• TaRGet has the capability to generalize across previously unseen projects.
37
Future Work
• Additional information in the repair context: unchanged lines around SUT changes, added and deleted files and methods, error messages from the broken test
• General-purpose LLMs: Larger context windows, useful in the repair
context
• More diverse benchmark
• Better evaluation metrics: test behavior and semantics, e.g., coverage
similarity between candidate repair and ground truth
38
Test Flakiness Repair
39
Test Flakiness Repair
• Flaky tests intermittently pass and fail for the same version of the source code (i.e., non-deterministic test results).
• A previous study showed that the root cause of flakiness lies in the test code in more than 70% of cases.
• Why detect and repair flaky tests?
• Test failures caused by flaky tests can be hard to reproduce, as re-running is required (computationally expensive)
• Flaky tests might hide real bugs in the source code
• Tests become unreliable
• Software releases might be delayed
• Flaky tests are hard to detect and fix manually, so developers often simply ignore them
40
FlakyFix
41
Fatima et al., “FlakyFix: Using Large Language
Models for Predicting Flaky Test Fix Categories
and Test Code Repair”, ArXiv, 2024
Focused on repairing test code
Step 1: Predicting Fix Category
42
Cause of flakiness: HashMap should be replaced with LinkedHashMap to maintain the order in which elements are stored, regardless of how many times the code is executed.
The model predicts the fix category: Change Data Structure.
Models: fine-tuned UniXcoder and CodeBERT.
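The example above is Java-specific (HashMap vs. LinkedHashMap); here is an analogous, hypothetical Python illustration of the same kind of order-dependent flakiness and its fix.

def get_tags():
    return {"fast", "unit", "db"}  # a set has no guaranteed iteration order

def test_tags_flaky():
    # May pass or fail from one run to the next (string hashing is randomized).
    assert list(get_tags()) == ["fast", "unit", "db"]

def test_tags_fixed():
    # Fix: make the order deterministic before asserting on it.
    assert sorted(get_tags()) == ["db", "fast", "unit"]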
Step 1 Overview
43
• Labelled datasets of flaky tests are usually small
• Few-shot learning (FSL) classifiers learn, from limited data, what makes elements similar or makes them belong to the same class
• Fine-tune code LLMs (UniXcoder and CodeBERT) using a Siamese network to predict the flaky test fix category (see the sketch below)
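Here is a minimal sketch of the Siamese idea: the same encoder embeds two tests, and a cosine-based loss pulls same-category pairs together and pushes different-category pairs apart; at inference, a new test is assigned the category of its nearest labelled examples. The model checkpoint, pooling, and hyperparameters are assumptions, not the paper's exact setup.

import torch
from torch.nn import CosineEmbeddingLoss
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
loss_fn = CosineEmbeddingLoss(margin=0.5)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def embed(code: str) -> torch.Tensor:
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    # Mean-pool the last hidden states into a single vector.
    return encoder(**inputs).last_hidden_state.mean(dim=1)

def training_step(test_a: str, test_b: str, same_category: bool) -> float:
    target = torch.tensor([1.0 if same_category else -1.0])
    loss = loss_fn(embed(test_a), embed(test_b), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()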
International Dataset of Flaky
Tests (IDoFT)1
44
• Largest available dataset of flaky tests where the cause
of flakiness is in the test code
• 562 Flaky Tests in Java and their fixes
• Flaky Tests belong to 96 different projects, helpful for
generalizability
1. https://mir.cs.illinois.edu/flakytests
Step 1 Results
45
Flaky Test Fix Category | Average Precision (%) | Average Recall (%)
Change Assertion | 96 | 86
Reset Variable | 94 | 97
Change Condition | 87 | 87
Reorder Data | 90 | 72
Change Data Structure | 81 | 87
Handle Exception | 98 | 89
Change Data Format | 90 | 94
Reorder Parameters | 94 | 91
Misc | 67 | 62
Accurate, except Misc, as expected
Step 2: Repair
46
Given the flaky test and its predicted fix category (e.g., Change Data Structure), the LLM generates a repaired flaky test.
Model: 175B parameters, context window > 16k tokens
Step 2 Overview
47
Prompt 1 (fix category label) vs. Prompt 2 (in-context learning)

Fig. 6: Flaky test fix using GPT-3.5 Turbo with a fix category label.
Input prompt:
“This test case is Flaky: [Flaky Code]
This test can be fixed by changing the following information in the code: [Fix Category Label]
Just provide the full fixed code of this test case only without any other text description”
Output: [Fixed Flaky Test Code]

Fig. 7: Flaky test fix using GPT-3.5 Turbo with in-context learning.
Input prompt: same as Fig. 6, followed by:
“Here are some Flaky tests examples, their fixes and fix category labels: [Examples]”
Output: [Fixed Flaky Test Code]
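To show how the two prompt variants above relate, here is a sketch of assembling them in code; call_llm is a placeholder for whatever GPT-3.5 Turbo client is used, not a real API.

def prompt_with_label(flaky_code: str, fix_category: str) -> str:
    # Fig. 6 style: flaky test plus predicted fix category label.
    return (
        f"This test case is Flaky: {flaky_code}\n"
        f"This test can be fixed by changing the following information in the code: {fix_category}\n"
        "Just provide the full fixed code of this test case only without any other text description"
    )

def prompt_with_icl(flaky_code: str, fix_category: str, examples: str) -> str:
    # Fig. 7 style: same prompt, extended with in-context examples.
    return (
        prompt_with_label(flaky_code, fix_category)
        + "\nHere are some Flaky tests examples, their fixes and fix category labels: "
        + examples
    )

# fixed_code = call_llm(prompt_with_icl(flaky_code, "Change Data Structure", examples))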
Step 2: Fix Category Labels
48
Higher CodeBLEU score with fix labels
Step 2: In-Context Learning
49
In-context learning helps for 2/3 categories
FlakyFix: Conclusions
50
• Automated repair of flakiness seems promising
• Two-step process:
• (1) predict the fix category first,
• (2) generate an actual fix using a pre-trained LLM with this
category, the test code, and few-shot learning
• High degree of similarity between the output (generated fix) and the ground truth (actual fix)
Future Work
51
• Better evaluation metrics
• Analyze the execution of fixed test cases
• Larger, more diverse benchmark dataset
• User study and feedback
• Fix causes of flakiness in production code
Conclusions
52
LLM Fine Tuning
• Fine-tuning: additional training datasets (e.g., requirements, tests, repairs) to turn a general-purpose model into a specialized one
• Availability: (1) data may not be available (in sufficient amounts), (2) data may change frequently
• Reusability: a fine-tuned model is not reusable for other tasks/domains
• Overfitting: the model may fit too closely to the training dataset and not generalize
• Fine-tuning was possible for TaRGet but not for FlakyFix
• We checked for overfitting with TaRGet
53
Prompt Engineering
• Instruction: a specific task or instruction you want the model to
perform, e.g., guidance to generate test inputs
• Context: external information or additional context that can steer
the model to better responses, e.g., (part of the) code under test
• Input Data: the input or question that we are interested in finding
a response for, e.g., requirements
• Output Indicator: the type or format of the output, e.g., test
format (e.g., XML)
• Training examples, e.g., actual test inputs and outputs (see the template sketch below)
54
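Here is a minimal sketch of a prompt template combining the elements listed above; the wording and the test-generation examples in the comments are illustrative only.

def build_prompt(instruction: str, context: str, input_data: str,
                 output_indicator: str, examples: str = "") -> str:
    parts = [
        f"Instruction: {instruction}",         # e.g., guidance to generate test inputs
        f"Context: {context}",                 # e.g., (part of) the code under test
        f"Input: {input_data}",                # e.g., the requirement of interest
        f"Output format: {output_indicator}",  # e.g., test format such as XML
    ]
    if examples:
        parts.append(f"Examples:\n{examples}")  # few-shot training examples
    return "\n\n".join(parts)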
Prompt Engineering
• Optimizing prompts, in terms of effectiveness and cost, for the
many relevant testing problems and contexts, such as prioritizing
and selecting context information (e.g., context prioritization in
TaRGet).
• FlakyFix: Optimized prompts with additional information (fix
category) and examples
• TaRGet: Prioritized SUT hunks according to relevance for test case
(changed code)
• Dynamic prompt optimization driven by multi-objective search, i.e.,
a new SBSE application.
• Static and dynamic analyses for prompt engineering
55
Empirical Evaluation
• Usually, many alternatives for automating a task
• Large evaluation space: Models * Application strategies * Hyperparameters
• Evaluation metrics are not enough
• The real question is what is the gain for the end user compared to current
practices and tools, in real development processes
• Only user studies can answer that; they are very difficult to run and are therefore rare
56
Benchmark Datasets
• Results are strongly influenced by the benchmarks used
• Often hard and expensive to build: Size and diversity
• Generalizability?
• TaRGet:
• Huge effort in building a large and diverse benchmark
• Attempt to address generalizability
58
Solution Applicability
• Data scarcity: Availability of data (labeled or not) for fine-tuning,
e.g., TaRBench versus FlakyFix.
• Size of the model (parameters) determines storage, memory, and
computational requirements, e.g., in TaRGet, we selected smaller
models.
• Automated support for prompt engineering, e.g., generating,
selecting, and prioritizing context information, e.g., TaRGet
prioritization of hunk changes.
• Dealing effectively with LLM uncertainty, e.g., TaRGet
trustworthiness model
61
Large Language Models
for Test Case Repair
Lionel Briand, FACM, FIEEE, FRSC
http://www.lbriand.info
SyMeCo Fellowship
• SyMeCo is a Marie Skłodowska-Curie postdoctoral fellowship
programme coordinated by Lero
• Co-funded by Science Foundation Ireland and the EU
• 16 fellowships of 2-year duration based in Ireland across 8
Higher Education Institutions
• Open to researchers of any nationality
Backup
65
LLMs are Moving Fast
66
[Fig. 4 (Hou et al.): distribution of the LLMs and LLM-based applications discussed in the collected papers, 2018-2024, grouped by architecture: encoder-only (e.g., BERT, RoBERTa, CodeBERT, GraphCodeBERT), encoder-decoder (e.g., T5, CodeT5, CodeT5+, PLBART, UniXcoder), and decoder-only (e.g., GPT-2/3/3.5/4, ChatGPT, Codex, Copilot, CodeGen, StarCoder, LLaMA, CodeLlama). Numbers in parentheses in the original figure indicate the count of papers in which each LLM has been used.]
Hou et al., “Large Language Models for Software Engineering:
A Systematic Literature Review ”, ArXiv, 2024
LLM Applications
67
Hou et al., “Large Language Models for Software Engineering:
A Systematic Literature Review ”, ArXiv, 2024
[Fig. 8 (Hou et al.): distribution of LLM utilization across SE activities and problem types.
(a) SE activities: software development 58.37%, software maintenance 24.89%, software quality assurance 10.30%, requirements engineering 4.72%, software design 1.29%, software management 0.43%.
(b) Problem types: generation 64.34%, classification 24.48%, recommendation 9.79%, regression 1.40%.]
