Presentation of our work at ICSE 2024 (NIER track) in Lisbon
Abstract:
Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community, impacting various SE tasks from code completion to test generation, from program repair to code summarization. Despite their promise, researchers must still be careful as numerous intricate factors can influence the outcomes of experiments involving LLMs.
This paper initiates an open discussion on potential threats to the validity of LLM-based research, including issues such as reliance on closed-source models, possible data leakage between LLM training data and research evaluation, and the reproducibility of LLM-based findings.
In response, this paper proposes a set of guidelines tailored for SE researchers and Language Model (LM) providers to mitigate these concerns.
The implications of the guidelines are illustrated using existing good practices followed by LLM providers and a practical example for SE researchers in the context of test case generation.
9. Threats to Validity (1)
Training vs. Test Sets
1. Data leakage due to pre-training
• e.g., evaluating ChatGPT on GitHub projects
2. Data leakage due to fine-tuning
• e.g., two projects (one in training, one in test) sharing the same API usage
11. Example (2)
Siddiq et al., arXiv 2023
Remarks:
• HumanEval has been hosted on GitHub since 2021
• SF110 is hosted on SourceForge (not GitHub)
• EvoSuite achieves 75% branch coverage on SF110
12. Threats to Validity (1)
Guidelines
Use different sources (not only GitHub)
Use code clone detection techniques (better than edit distance) between generated code and training sets
Assess LLMs on metamorphic programs
Use open-source or data-traceable LLMs
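The clone-detection guideline can be sketched as follows. This is a minimal token-based Jaccard similarity, a stand-in for a real clone detector (e.g., NiCad or SourcererCC), which is less brittle than raw edit distance when comparing generated code against training data; the helper names are illustrative, not from the paper.

```python
import re

def tokens(code: str) -> set:
    """Crude lexical tokenization: identifiers, numbers, and operators."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard_similarity(a: str, b: str) -> float:
    """Token-set Jaccard similarity: a simple clone-detection proxy that,
    unlike edit distance, is robust to renaming and reordering."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Renamed variables barely change the score, while edit distance would grow:
generated = "def add(a, b):\n    return a + b"
training  = "def sum(x, y):\n    return x + y"
print(jaccard_similarity(generated, training))
```

A score close to 1.0 between generated code and any training-set snippet is a leakage warning sign worth reporting.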
13. Metamorphic Testing
GECCO 2023, 15-19 July 2023, Lisbon, Portugal
[Figure 2: Comparison of F1 for random and genetic search; Figure 4: Metric-m]
Metamorphic testing: changing the syntax of the code without changing its semantics (e.g., reversing if-then-else statements)
MT can easily fool code2vec, leading to a significant drop in performance
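The if-then-else reversal can be sketched with Python's `ast` module; this is an illustrative transform, not the authors' implementation. Negating the condition and swapping the branches changes the syntax while preserving the program's semantics:

```python
import ast

class ReverseIfElse(ast.NodeTransformer):
    """Semantics-preserving transform: negate the condition and
    swap the branches of every if statement that has an else."""
    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        if node.orelse:  # only rewrite ifs with an else branch
            node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
            node.body, node.orelse = node.orelse, node.body
        return node

src = """
def sign(x):
    if x >= 0:
        return 1
    else:
        return -1
"""
tree = ast.fix_missing_locations(ReverseIfElse().visit(ast.parse(src)))
print(ast.unparse(tree))  # 'if not x >= 0: return -1 else: return 1'
```

Running a model on both the original and the transformed program and comparing its outputs exposes how sensitive the model is to purely syntactic changes.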
14. Threats to Validity (2)
Reproducibility
1. Output Variability
• Different results when using different (random) seeds
2. Time-Based Output Drift
• Same setting (e.g., seed) leads to different results over time [Chen 2023]
3. Traceability
• At best, researchers release the prompts
15. Threats to Validity (2)
Guidelines
Use a fixed random seed (if possible)
Assess output variability (multiple runs, multiple seeds, different sessions, different time intervals)
Provide execution metadata:
- LLM version
- Query timestamp
- Variability analysis
- …
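The variability and metadata guidelines could be operationalized roughly as below; `query_fn` is a hypothetical wrapper around whatever LLM API is being studied, returning a numeric score per run:

```python
import statistics
from datetime import datetime, timezone

def assess_variability(query_fn, prompt, seeds=range(5)):
    """Query the model once per seed, recording execution metadata
    (seed, timestamp) so results can be reported with their spread.
    query_fn(prompt, seed) is a hypothetical model wrapper."""
    runs = []
    for seed in seeds:
        runs.append({
            "seed": seed,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "score": query_fn(prompt, seed),
        })
    scores = [r["score"] for r in runs]
    return {
        "runs": runs,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# Toy stand-in for a real (non-deterministic) model call:
report = assess_variability(lambda p, s: 0.70 + 0.01 * (s % 3),
                            "generate tests for Foo")
print(report["mean"], report["stdev"])
```

Reporting the mean with its standard deviation (and archiving the per-run metadata) is what lets others judge whether a later replication actually diverges or merely falls within normal output variability.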
16. Threats to Validity (3)
Reproducibility
1. Model Evolution Unpredictability
• Closed-source models with opaque evolution
2. Attributing Improvement
• Is Approach 1 better than Approach 2?
3. Privacy Implications
• Is private data safe?
• Licensing?
17. Threats to Validity (3)
Guidelines
Enhance model transparency (for LLM providers)
Use versioning information (for LLM providers)
Perform comparative analysis (for researchers)
- Open-source vs. closed-source models
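For the "attributing improvement" concern, a stdlib-only permutation test is one way to check whether Approach 1's advantage over Approach 2 survives output variability; the per-project coverage scores below are hypothetical, purely for illustration:

```python
import random

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the mean difference: estimates how
    often a gap at least as large as the observed one arises by chance
    if the two approaches were interchangeable."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm  # estimated p-value

# Hypothetical branch-coverage scores per project for two approaches:
a = [0.81, 0.79, 0.84, 0.80, 0.82]
b = [0.71, 0.69, 0.74, 0.70, 0.72]
print(permutation_test(a, b))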
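For the "attributing improvement" concern, a stdlib-only permutation test is one way to check whether Approach 1's advantage over Approach 2 survives output variability; the per-project coverage scores below are hypothetical, purely for illustration:

```python
import random

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the mean difference: estimates how
    often a gap at least as large as the observed one arises by chance
    if the two approaches were interchangeable."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm  # estimated p-value

# Hypothetical branch-coverage scores per project for two approaches:
a = [0.81, 0.79, 0.84, 0.80, 0.82]
b = [0.71, 0.69, 0.74, 0.70, 0.72]
print(permutation_test(a, b))
```

When the compared approaches sit on top of a closed-source model, rerunning this comparison at different points in time also helps separate genuine improvement from silent model evolution.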
18. Additional Remarks
Current evaluations of pre-trained models present various threats to validity.
As a community, we have to establish new guidelines on how to fairly assess LLMs vs. well-established (unsupervised) approaches.
NLP evaluation metrics do not adequately reflect SE performance for non-NLP tasks.
E.g., the percentage of compiling tests is not a valid SE metric for (LLM-based) test case generation approaches.
19. Breaking the Silence: the Threats of Using LLMs in SE Research
Annibale Panichella
Thomas Durieux
June Sallou