Breaking the Silence: the Threats of Using LLMs in SE Research
Annibale Panichella
Thomas Durieux
June Sallou
How Did We Get Here?
Tasty sandwich
Precise
Have you read this paper?
ChatGPT on Defects4J?
Multiple runs?
Our Discussion
Threats to validity
Guidelines
From Threats to Opportunities
Threats to Validity (1)
Training vs. Test Sets
1. Data leakage due to pre-training
• e.g., evaluating ChatGPT on GitHub projects
2. Data leakage due to fine-tuning
• e.g., two projects (one training, one test) sharing the same API usage
Example
ChatGPT knows Defects4J (GitHub)
Does it make sense to evaluate ChatGPT on Defects4J data?
Example (2)
Siddiq et al., arXiv 2023
Remarks:
• HumanEval has been hosted on GitHub since 2021
• SF110 is hosted on SourceForge (not GitHub)
EvoSuite achieves 75% branch coverage on SF110
Threats to Validity (1)
Guidelines
Use different sources (not only GitHub)
Use code clone detection techniques (better than edit distance) between generated code and training sets
Assess LLMs on metamorphic programs
Use open-source or data-traceable LLMs
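The clone-detection guideline can be illustrated with a minimal sketch. It assumes a token-based approach: identifiers and numbers are abstracted away, then overlapping token 3-grams are compared with Jaccard similarity, so a renamed copy still scores as a clone where plain edit distance would see many differences. The function names, the 0.8 threshold, and the regex tokenizer are illustrative assumptions, not a technique named in the slides; a real study would likely use a dedicated clone detector.

```python
import keyword
import re

# Crude tokenizer (assumption): identifiers, numbers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def normalize(code):
    """Tokenize and abstract identifiers/numbers so renamed clones still match."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if keyword.iskeyword(tok) else "ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def shingles(tokens, n=3):
    """Set of overlapping token n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def clone_similarity(a, b, n=3):
    """Jaccard similarity between the normalized token n-grams of two snippets."""
    sa, sb = shingles(normalize(a), n), shingles(normalize(b), n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def flag_leaks(generated, corpus, threshold=0.8):
    """Indices of corpus snippets that look like clones of the generated code."""
    return [i for i, ref in enumerate(corpus)
            if clone_similarity(generated, ref) >= threshold]
```

For instance, `def add(a, b): return a + b` and `def plus(x, y): return x + y` normalize to the same token sequence, so they are flagged even though every identifier differs.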
Metamorphic Testing
[Figures from the GECCO 2023 paper (15-19 July 2023, Lisbon, Portugal). Figure 2: Comparison of F1 for random and genetic search. Figure 4: Metric-m]
Metamorphic testing: changing the syntax of the code without changing its semantics (e.g., reversing if-then-else statements)
MT can easily fool code2vec, leading to a substantial drop in performance
GECCO 2023
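The if-then-else reversal mentioned above can be sketched as a small program transformation. This example assumes Python sources and the standard `ast` module (Python 3.9+ for `ast.unparse`); the work on the slide targets code2vec, which typically operates on Java, so this only illustrates the idea and is not the authors' tooling. `InvertIfElse` and `metamorphic_variant` are hypothetical names.

```python
import ast


class InvertIfElse(ast.NodeTransformer):
    """Negate the condition of every if/else and swap its two branches."""

    def visit_If(self, node):
        self.generic_visit(node)  # transform nested if statements first
        if not node.orelse:
            return node  # no else branch: nothing to swap
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        node.body, node.orelse = node.orelse, node.body
        return node


def metamorphic_variant(source):
    """Return source code with different syntax but the same behaviour."""
    tree = InvertIfElse().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


ORIGINAL = """
def sign(x):
    if x >= 0:
        return "non-negative"
    else:
        return "negative"
"""

if __name__ == "__main__":
    print(metamorphic_variant(ORIGINAL))
```

A model that has learned the semantics should give the same answer for `ORIGINAL` and its variant; a large gap between the two suggests it relies on surface syntax or memorized code.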
Threats to Validity (2)
Reproducibility
1. Output Variability
• Different results when using different (random) seeds
2. Time-Based Output Drift
• Same setting (e.g., seed) leads to different results over time [Chen 2023]
3. Traceability
• At best, researchers release the prompts
Guidelines
Use a fixed random seed (if possible)
Assess output variability (multiple runs, multiple seeds, different sessions, different time intervals)
Provide execution metadata
- LLM version
- Query timestamp
- Variability analysis
- …
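One possible way to follow these guidelines is to store the metadata next to every result. The sketch below is a minimal illustration under stated assumptions: `fake_llm_score` is a hypothetical stand-in for a real LLM-based pipeline (no actual model or API is called), `example-model-v0` is a placeholder version string, and the chosen summary statistics are only one option for a variability analysis.

```python
import json
import random
import statistics
from datetime import datetime, timezone


def fake_llm_score(prompt, seed):
    """Hypothetical stand-in for an LLM-based pipeline; returns a score in [0.5, 0.7]."""
    rng = random.Random(f"{prompt}|{seed}")  # same prompt and seed give the same score
    return round(0.6 + rng.uniform(-0.1, 0.1), 3)


def run_with_metadata(prompt, seeds, model_version="example-model-v0"):
    """Run once per seed and keep the metadata needed to trace the results."""
    runs = []
    for seed in seeds:
        runs.append({
            "seed": seed,
            "timestamp": datetime.now(timezone.utc).isoformat(),  # query timestamp
            "score": fake_llm_score(prompt, seed),
        })
    scores = [run["score"] for run in runs]
    return {
        "model_version": model_version,  # placeholder; record the exact version queried
        "prompt": prompt,
        "runs": runs,
        "variability": {
            "n_runs": len(scores),
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "min": min(scores),
            "max": max(scores),
        },
    }


if __name__ == "__main__":
    report = run_with_metadata("Generate a unit test for Stack.push", seeds=range(5))
    print(json.dumps(report, indent=2))
```

Releasing such a JSON report together with the prompts lets readers see how much results vary across seeds and when each query was made, which matters when the same settings drift over time.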
Threats to Validity (3)
Reproducibility
1. Model evolution unpredictability
• Closed-source models with opaque evolution
2. Attributing improvement
• Is Approach1 better than Approach2?
3. Privacy Implications
• Is private data safe?
• Licensing?
Guidelines
Enhance model transparency (for LLM providers)
Use versioning information (for LLM providers)
Perform comparative analysis (for researchers)
- Open-source vs. closed-source models
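To make "Is Approach1 better than Approach2?" concrete, a comparative analysis could repeat both approaches several times on each underlying model and report an effect size per model. The sketch below uses the Vargha-Delaney Â12 statistic, which is common in SE experiments; the slides do not name a specific statistic, so this choice, the function names, and the input layout are assumptions.

```python
def a12(x, y):
    """Vargha-Delaney A12: chance that a value drawn from x beats one from y.

    Ties count as half a win; 0.5 means no difference between the samples.
    """
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))


def compare_per_model(results):
    """Compute A12 of approach1 over approach2 separately for each model.

    `results` maps a model name to {"approach1": [...], "approach2": [...]},
    where each list holds one score per repeated run.
    """
    return {
        model: a12(scores["approach1"], scores["approach2"])
        for model, scores in results.items()
    }
```

If Approach1 wins on a closed-source model but not on an open-source one, the improvement may come from the model rather than from the approach, which is the attribution threat listed above.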
Additional Remarks
Current evaluations of pre-trained models present various threats to validity
As a community, we have to make new guidelines on how to fairly assess LLMs vs. well-established (unsupervised) approaches
NLP evaluation metrics do not adequately reflect the SE performance metrics for non-NLP tasks
E.g., % of compiling tests is not a valid SE metric for test case generation approaches (based on LLMs)