Presentation of our work at ICSE 2024 (NIER track) in Lisbon
Abstract:
Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community, impacting various SE tasks from code completion to test generation, from program repair to code summarization. Despite their promise, researchers must still be careful as numerous intricate factors can influence the outcomes of experiments involving LLMs.
This paper initiates an open discussion on potential threats to the validity of LLM-based research, including issues such as reliance on closed-source models, possible data leakage between LLM training data and research evaluation, and the reproducibility of LLM-based findings.
In response, this paper proposes a set of guidelines tailored for SE researchers and Language Model (LM) providers to mitigate these concerns.
The implications of the guidelines are illustrated using existing good practices followed by LLM providers and a practical example for SE researchers in the context of test case generation.
9. Threats to Validity (1)
Training vs. Test Sets
1. Data leakage due to pre-training
• e.g., evaluating ChatGPT on GitHub projects
2. Data leakage due to fine-tuning
• e.g., two projects (one in training, one in test) sharing the same API usage
11. Example (2)
Siddiq et al., arXiv 2023
Remarks:
• HumanEval has been hosted on GitHub since 2021
• SF110 is hosted on SourceForge (not GitHub)
• EvoSuite achieves 75% branch coverage on SF110
12. Threats to Validity (1)
Guidelines
Use different sources (not only GitHub)
Use code clone detection techniques (better than edit distance) between generated code and training sets
Assess LLMs on metamorphic programs
Use open-source or data-traceable LLMs
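The clone-detection guideline can be sketched as follows. This is a minimal token-based Jaccard similarity, a stand-in for a real clone detector (e.g., NiCad or SourcererCC), which is less brittle than raw edit distance when comparing generated code against training data; the helper names are illustrative, not from the paper.

```python
import re

def tokens(code: str) -> set:
    """Crude lexical tokenization: identifiers, numbers, and operators."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard_similarity(a: str, b: str) -> float:
    """Token-set Jaccard similarity: a simple clone-detection proxy that,
    unlike edit distance, is robust to renaming and reordering."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Renamed variables barely change the score, while edit distance would grow:
generated = "def add(a, b):\n    return a + b"
training  = "def sum(x, y):\n    return x + y"
print(jaccard_similarity(generated, training))
```

A score close to 1.0 between generated code and any training-set snippet is a leakage warning sign worth reporting.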
13. Metamorphic Testing
GECCO 2023, 15-19 July 2023, Lisbon, Portugal
[Figure 2: Comparison of F1 for random and genetic search; Figure 4: Metric-m]
Metamorphic testing: changing the syntax of the code without changing its semantics (e.g., reversing if-then-else statements)
MT can easily fool code2vec, leading to a significant drop in performance
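The if-then-else reversal can be sketched with Python's `ast` module; this is an illustrative transform, not the authors' implementation. Negating the condition and swapping the branches changes the syntax while preserving the program's semantics:

```python
import ast

class ReverseIfElse(ast.NodeTransformer):
    """Semantics-preserving transform: negate the condition and
    swap the branches of every if statement that has an else."""
    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        if node.orelse:  # only rewrite ifs with an else branch
            node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
            node.body, node.orelse = node.orelse, node.body
        return node

src = """
def sign(x):
    if x >= 0:
        return 1
    else:
        return -1
"""
tree = ast.fix_missing_locations(ReverseIfElse().visit(ast.parse(src)))
print(ast.unparse(tree))  # 'if not x >= 0: return -1 else: return 1'
```

Running a model on both the original and the transformed program and comparing its outputs exposes how sensitive the model is to purely syntactic changes.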
14. Threats to Validity (2)
Reproducibility
1. Output Variability
• Different results when using different (random) seeds
2. Time-Based Output Drift
• Same setting (e.g., seed) leads to different results over time [Chen 2023]
3. Traceability
• At best, researchers release the prompts
15. Threats to Validity (2)
Guidelines
Use a fixed random seed (if possible)
Assess output variability (multiple runs, multiple seeds, different sessions, different time intervals)
Provide execution metadata:
- LLM version
- Query timestamp
- Variability analysis
- …
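The variability and metadata guidelines could be operationalized roughly as below; `query_fn` is a hypothetical wrapper around whatever LLM API is being studied, returning a numeric score per run:

```python
import statistics
from datetime import datetime, timezone

def assess_variability(query_fn, prompt, seeds=range(5)):
    """Query the model once per seed, recording execution metadata
    (seed, timestamp) so results can be reported with their spread.
    query_fn(prompt, seed) is a hypothetical model wrapper."""
    runs = []
    for seed in seeds:
        runs.append({
            "seed": seed,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "score": query_fn(prompt, seed),
        })
    scores = [r["score"] for r in runs]
    return {
        "runs": runs,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# Toy stand-in for a real (non-deterministic) model call:
report = assess_variability(lambda p, s: 0.70 + 0.01 * (s % 3),
                            "generate tests for Foo")
print(report["mean"], report["stdev"])
```

Reporting the mean with its standard deviation (and archiving the per-run metadata) is what lets others judge whether a later replication actually diverges or merely falls within normal output variability.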
16. Threats to Validity (3)
Reproducibility
1. Model Evolution Unpredictability
• Closed-source models with opaque evolution
2. Attributing Improvement
• Is Approach 1 better than Approach 2?
3. Privacy Implications
• Is private data safe?
• Licensing?
17. Threats to Validity (3)
Guidelines
Enhance model transparency (for LLM providers)
Use versioning information (for LLM providers)
Perform comparative analysis (for researchers)
- Open-source vs. closed-source models
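For the "attributing improvement" concern, a stdlib-only permutation test is one way to check whether Approach 1's advantage over Approach 2 survives output variability; the per-project coverage scores below are hypothetical, purely for illustration:

```python
import random

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the mean difference: estimates how
    often a gap at least as large as the observed one arises by chance
    if the two approaches were interchangeable."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm  # estimated p-value

# Hypothetical branch-coverage scores per project for two approaches:
a = [0.81, 0.79, 0.84, 0.80, 0.82]
b = [0.71, 0.69, 0.74, 0.70, 0.72]
print(permutation_test(a, b))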
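For the "attributing improvement" concern, a stdlib-only permutation test is one way to check whether Approach 1's advantage over Approach 2 survives output variability; the per-project coverage scores below are hypothetical, purely for illustration:

```python
import random

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the mean difference: estimates how
    often a gap at least as large as the observed one arises by chance
    if the two approaches were interchangeable."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm  # estimated p-value

# Hypothetical branch-coverage scores per project for two approaches:
a = [0.81, 0.79, 0.84, 0.80, 0.82]
b = [0.71, 0.69, 0.74, 0.70, 0.72]
print(permutation_test(a, b))
```

When the compared approaches sit on top of a closed-source model, rerunning this comparison at different points in time also helps separate genuine improvement from silent model evolution.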
18. Additional Remarks
Current evaluations of pre-trained models present various threats to validity.
As a community, we have to establish new guidelines on how to fairly assess LLMs vs. well-established (unsupervised) approaches.
NLP evaluation metrics do not adequately reflect SE performance for non-NLP tasks.
E.g., the percentage of compiling tests is not a valid SE metric for (LLM-based) test case generation approaches.
19. Breaking the Silence: the Threats of Using LLMs in SE Research
Annibale Panichella
Thomas Durieux
June Sallou