Can LLMs Make Software Testing Greener?
Xutong Liu, Andy Zaidman
09-02-2026
An Empirical Study on JUnit Test Energy Reengineering
Presentation by: Xutong Liu
2.
Xutong Liu
• Postdoc, SERG (Software Engineering Research Group) at TU Delft
• Research interests: Sustainable Software Engineering, Green Testing
• PhD in Computer Science, Nanjing University
• Major: Software Quality Assurance
• Thesis topic: Defect Prediction
3.
Green Testing? Why?
Elasticsearch yearly builds ≈ 9.7% of the average household energy consumption in the EU [1]
Elasticsearch yearly builds CO₂ ≈ driving a petrol car 222 km [1]
Testing ≈ 50% of total build energy consumption [2]
1. Zaidman, Andy. "An inconvenient truth in software engineering? The environmental impact of testing open source Java projects." AST 2024.
2. Arntzenius, R. F. "Measuring Energy Consumption during Continuous Integration of Open-Source Java Projects." (2024).
Software testing = essential for reliability
But… test suites consume significant energy ⚡🌍
5.
LLM-based code refactoring
1. Deljouyi, Amirhossein, et al. "Leveraging large language models for enhancing the understandability of generated unit tests." ICSE 2025.
2. Apsan, Radu, et al. "Generating Energy-Efficient Code via Large-Language Models--Where are we now?." arXiv preprint arXiv:2509.10099 (2025).
LLMs at test refactoring:
Enhancing the understandability of generated unit tests [1]
LLMs at energy-saving task:
Copilot can generate energy-efficient non-test code [2]
Research gap:
Can LLMs reengineer unit test code to reduce energy consumption?
8.
LLM-based unit test reengineering
Factors of unit test energy consumption
Unit Under Test + Unit Test + Environment
Research Questions
RQ1. How much energy can be saved by LLM-based test reengineering?
RQ2. What strategies do LLMs attempt in reengineering?
RQ3. What is the impact on code coverage?
Discussion: 💡 Why does reengineering succeed or fail?
This study is: an exploratory study with a basic LLM setup, investigating whether current LLMs can reduce the energy consumption of unit tests.
This study is not: an LLM prompt-optimization study that keeps tuning the LLM until clear energy savings appear.
JoularJX: Energy Measurement Tool
Advantages ✅
Provides energy consumption at Java method granularity
Requires only one execution to collect results
Easy integration into CI pipeline
Limitations ⚠️
Cannot capture methods with duration < 1 ms
Not all test methods can be captured by JoularJX.
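As a rough illustration of what method-level granularity and the capture limitation mean in practice, the sketch below reads per-method energy records and lists the test methods that never show up in them. The two-column `method,joules` line format, the class name `MethodEnergyReport`, and the example method names are assumptions made for illustration; the actual JoularJX output layout may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MethodEnergyReport {
    /** Parses lines of the ASSUMED form "fully.qualified.method,joules". */
    static Map<String, Double> parse(List<String> csvLines) {
        Map<String, Double> energy = new LinkedHashMap<>();
        for (String line : csvLines) {
            int comma = line.lastIndexOf(',');
            if (comma < 0) {
                continue; // skip malformed lines
            }
            String method = line.substring(0, comma).trim();
            double joules = Double.parseDouble(line.substring(comma + 1).trim());
            energy.merge(method, joules, Double::sum); // sum repeated entries
        }
        return energy;
    }

    /** Test methods with no energy record, e.g. because they ran for under 1 ms. */
    static List<String> unmeasured(List<String> allTests, Map<String, Double> energy) {
        List<String> missing = new ArrayList<>();
        for (String test : allTests) {
            if (!energy.containsKey(test)) {
                missing.add(test);
            }
        }
        return missing;
    }
}
```

Tests returned by `unmeasured` would have to be excluded from any before/after energy comparison, which is one reason the number of measurable tests is smaller than the number of buildable tests.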
14.
RQ1: Can an LLM effectively reduce the energy consumption of unit tests?
9,705 tests processed → 3,386 buildable (34.9% runnable rate)
1,999 measurable → 1,471 unique (after deduplication)
799 with reduced energy → 8.2% of processed tests became more energy efficient
Project level: 4 out of 8 projects reduced the energy consumption of their test suite
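The percentages on this slide can be re-derived from the reported counts. The short sketch below does the arithmetic; it confirms that both the 34.9% and the 8.2% figures use the 9,705 processed tests as the denominator. The class name `TestFunnel` is hypothetical.

```java
import java.util.Locale;

final class TestFunnel {
    // Counts as reported on the slide.
    static final int PROCESSED = 9_705;
    static final int BUILDABLE = 3_386;
    static final int UNIQUE = 1_471;
    static final int REDUCED = 799;

    /** Share of part in whole, as a percentage rounded to one decimal. */
    static String percent(int part, int whole) {
        return String.format(Locale.ROOT, "%.1f", 100.0 * part / whole);
    }

    public static void main(String[] args) {
        System.out.println("runnable rate: " + percent(BUILDABLE, PROCESSED) + "%");   // 34.9%
        System.out.println("energy reduced: " + percent(REDUCED, PROCESSED) + "%");    // 8.2%
        System.out.println("of unique measurable: " + percent(REDUCED, UNIQUE) + "%"); // 54.3%
    }
}
```

The last line is a derived figure, not one stated on the slide: among the 1,471 unique measurable tests, a little over half showed lower energy after modification.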
15.
RQ2: Patterns behind energy savings
799 energy ↓
Strict inclusion criteria (joules diff < 0 && p < 0.05 && non-negligible effect size)
Result: 28 test methods selected
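The three-part inclusion criterion can be sketched as a single predicate. The slide names neither the statistical test nor the effect-size measure, so this sketch makes two assumptions: the p-value is computed elsewhere and passed in, and effect size is Cliff's delta with the conventional 0.147 cutoff for "negligible". The class and method names are hypothetical.

```java
final class InclusionCriteria {
    static final double ALPHA = 0.05;
    // ASSUMPTION: conventional "negligible" cutoff for Cliff's delta.
    static final double NEGLIGIBLE_DELTA = 0.147;

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Cliff's delta: (#pairs with a > b minus #pairs with a < b) / (|a| * |b|). */
    static double cliffsDelta(double[] a, double[] b) {
        int greater = 0;
        int less = 0;
        for (double x : a) {
            for (double y : b) {
                if (x > y) greater++;
                else if (x < y) less++;
            }
        }
        return (double) (greater - less) / (a.length * b.length);
    }

    /** True when energy dropped, the drop is significant, and the effect is not negligible. */
    static boolean selected(double[] originalJoules, double[] modifiedJoules, double pValue) {
        double joulesDiff = mean(modifiedJoules) - mean(originalJoules);
        double delta = cliffsDelta(modifiedJoules, originalJoules);
        return joulesDiff < 0 && pValue < ALPHA && Math.abs(delta) >= NEGLIGIBLE_DELTA;
    }
}
```

Requiring all three conditions at once is what shrinks the 799 candidates to a small set of cases that are unlikely to be measurement noise.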
✅ Let’s Focus on Positive Cases in this RQ
Human-reviewed categories
A. Invalid UUT invocation (4)
• Original test: Invokes IOUtils.skipFully()
• Validates skipping bytes & exception handling for invalid input
• LLM-modified test: Removes call to skipFully()
• Replaces with manual loop using InputStream.read()
Summary: LLM modifications in this category may reduce test complexity by replacing utility calls with simpler constructs, but they undermine the test’s intent by bypassing the actual UUT and weakening coverage and correctness.
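A minimal sketch of why this category is invalid, under stated assumptions: the local `skipFully` below is a hypothetical stand-in for the unit under test (it is not the library implementation), and the two methods only mimic the shape of the original and modified tests. A call counter makes the problem visible: both variants read the same byte, but the modified one never touches the UUT.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class InvalidUutInvocationSketch {
    static int uutCalls = 0; // counts invocations of the stand-in UUT

    /** HYPOTHETICAL stand-in for the unit under test. */
    static void skipFully(InputStream in, long toSkip) throws IOException {
        uutCalls++;
        if (toSkip < 0) {
            throw new IllegalArgumentException("negative skip: " + toSkip);
        }
        for (long remaining = toSkip; remaining > 0; remaining--) {
            if (in.read() < 0) {
                throw new EOFException("stream ended early");
            }
        }
    }

    /** Shape of the original test: exercises the UUT, then reads the next byte. */
    static int originalStyle() {
        try {
            InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
            skipFully(in, 3);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Shape of the modified test: same observable result, UUT never called. */
    static int modifiedStyle() {
        try {
            InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
            for (int i = 0; i < 3; i++) {
                in.read(); // manual loop replaces the utility call
            }
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test in the second style would keep passing even if `skipFully` were broken or deleted, so any energy it saves comes from no longer testing the method.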
18.
B. Assertion Reduction Modifications
B1. Reduce intermediate assertion (4)
• Original test: Verifies that applyPatchForFeatures() swaps the values of total and failed features
• Assertions: 3 assertions check ordering and correct swapping
• LLM-modified test: Removes assertion (1), keeps only (2) & (3)
Summary: Category B reduces redundancy and improves efficiency by eliminating seemingly subsumed assertions, but at the cost of weakening test completeness and robustness.
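One way a seemingly subsumed assertion can still matter is sketched below. Everything here is hypothetical: the `Report` class, the `applyPatch` stand-in for applyPatchForFeatures(), and in particular the content of assertion (1), which the slide does not spell out. The sketch assumes (1) checks that the fixture makes a swap observable at all.

```java
final class AssertionReductionSketch {
    /** HYPOTHETICAL holder for the two counters mentioned on the slide. */
    static final class Report {
        int total;
        int failed;

        Report(int total, int failed) {
            this.total = total;
            this.failed = failed;
        }
    }

    /** HYPOTHETICAL stand-in for applyPatchForFeatures(): swaps the counters. */
    static void applyPatch(Report r) {
        int tmp = r.total;
        r.total = r.failed;
        r.failed = tmp;
    }

    static void check(boolean condition, String label) {
        if (!condition) {
            throw new AssertionError(label);
        }
    }

    /** Three assertions: (1) guards the fixture, (2) and (3) check the swap. */
    static void originalTest(Report r) {
        int total = r.total;
        int failed = r.failed;
        check(total != failed, "(1) fixture must make a swap observable");
        applyPatch(r);
        check(r.total == failed, "(2) total holds the old failed value");
        check(r.failed == total, "(3) failed holds the old total value");
    }

    /** Same test with assertion (1) removed. */
    static void reducedTest(Report r) {
        int total = r.total;
        int failed = r.failed;
        applyPatch(r);
        check(r.total == failed, "(2) total holds the old failed value");
        check(r.failed == total, "(3) failed holds the old total value");
    }
}
```

With a fixture such as `new Report(5, 5)`, the reduced test passes even if `applyPatch` did nothing, while the original test fails at (1). On well-formed fixtures both variants behave the same, which is why the removed assertion looks redundant.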
19.
C. Syntax-Level Refactoring (6)
• Original test: FileTime.from()
• LLM-modified test: Replaced with FileTime.fromMillis(), reducing the temporal precision to milliseconds
• Impact: Potentially weakens the test by narrowing the scope of boundary validation
Summary: Instead of performing deliberate energy-oriented optimizations, in Category C the LLM treats the task as a general code refactoring problem, leveraging its strength in restructuring code but without explicitly focusing on reducing energy consumption.
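The precision loss can be shown with the standard `java.nio.file.attribute.FileTime` API. The sketch assumes the original test used the `FileTime.from(Instant)` overload, which the slide does not specify; the class name is hypothetical.

```java
import java.nio.file.attribute.FileTime;
import java.time.Instant;

final class FileTimePrecisionSketch {
    /** Round trip through FileTime.from(Instant): sub-millisecond digits survive. */
    static Instant viaFrom(Instant instant) {
        return FileTime.from(instant).toInstant();
    }

    /** Round trip through FileTime.fromMillis(long): truncated to milliseconds. */
    static Instant viaFromMillis(Instant instant) {
        return FileTime.fromMillis(instant.toEpochMilli()).toInstant();
    }

    public static void main(String[] args) {
        Instant precise = Instant.ofEpochSecond(1, 123_456_789);
        System.out.println(viaFrom(precise).getNano());       // 123456789
        System.out.println(viaFromMillis(precise).getNano()); // 123000000
    }
}
```

A test built on the millisecond variant can no longer detect bugs that only affect the micro- or nanosecond part of a timestamp.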
20.
D. Iteration-focused Optimization (2)
Efficiency vs. Thoroughness
Summary: Category D changes aim to improve performance or efficiency, but can also compromise the thoroughness or precision of the test if not carefully constrained.
21.
E. Concurrency Optimization (2)
• Context: Test simulating concurrent file deletion during file tree traversal to trigger race conditions
• Original test:
  • Single-thread sequential deletion (10,000 files)
  • Artificial delay + polling with atomic flags
• LLM-modified test:
  • 10,000 independent tasks submitted to executor
  • No polling overhead, simpler concurrency handling
Summary: Concurrency-focused changes reduce energy use and runtime overhead, but at the cost of test reliability in detecting subtle concurrency bugs.
22.
How do LLM-generated modifications impact the code coverage of unit tests?
Findings
• Instruction coverage
  • 5/8 projects show small decreases (-1% to -2%)
  • 3 projects unchanged (cucumber-reporting, rabbitmq-mock, spring-petclinic)
• Branch coverage
  • 4 projects show slight decreases (-1% to -5%)
  • 4 projects unchanged
Summary:
1. Overall, code coverage remains stable or slightly declines after LLM modifications.
2. Energy-saving changes come at a modest cost in coverage.
23.
Where can LLMs learn energy-saving knowledge?
Low success rate → Lack of knowledge?
Sources:
1. Inside projects
2. Stack Overflow
3. GitHub
24.
Where can LLMs learn energy-saving knowledge?
📂 Inside projects
• 20K+ commits scanned
• 0 energy-saving commits
• 2,123 “refactor/optimize” (generic)
💬 Stack Overflow
• 233 questions (2008–2025)
• 47.6% score ≤ 0 (low value)
• Mostly mobile apps; few Java/test
📦 GitHub
• 3M+ commits, 600K+ files scanned
• 239 entries (41/100 repos)
• Examples: Linux driver “power opt”, hardware-accelerated
Summary:
Energy-related knowledge is rare & domain-specific
LLM training lacks knowledge → falls back to generic refactoring
25.
Threats to Validity
Internal Validity
• Energy reduction may stem from external noise (system activity, JVM randomness)
• Mitigation: controlled “zen” mode, repeated runs, noise filtering, statistical tests
Construct Validity
• Energy measured with JoularJX (1 ms resolution limit)
• Very short tests may not be recorded reliably
• Assumed low impact, but precision limits remain
External Validity
• Study scope: 8 open-source Java projects (JUnit)
• Hardware limitation
26.
Summary
📊 Findings
• Only 8.2% of 9K+ tests showed energy reduction
• Code coverage mostly unchanged
💡 Insight
• Overall gains are modest
• Current effectiveness is limited
• Reason lies in knowledge scarcity
🚀 Future Work
• Add runtime energy feedback into LLM workflow
• Build datasets of code + energy profiles
#16 In cases where LLM-generated changes significantly reduced energy consumption, what types of modifications do we observe?
#25 Collected all 233 posts under the energy tag
47.6% scored ≤ 0 → low perceived value
Top co-tags: android, python, ios
Collected Top 100 repos (by stars)
Defined 45 energy-related keywords
Scanned 3.3M+ commits, 600K+ files, 1.0M+ PRs, 1.2M+ issues
Matches: 239 entries in 41 repos
Examples: Linux driver “power opt”, hardware-accelerated
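The keyword-based scan described in these notes can be approximated by a case-insensitive substring filter over commit messages. The study's 45 keywords are not listed on the slides, so the three keywords below are placeholders, and the class name is hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class EnergyKeywordScan {
    // PLACEHOLDER keywords; the study's actual 45-keyword list is not shown.
    static final List<String> KEYWORDS = List.of("energy", "power consumption", "battery");

    /** True when the text contains any keyword, ignoring case. */
    static boolean matches(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        return KEYWORDS.stream().anyMatch(lower::contains);
    }

    /** Returns the commit messages that mention at least one keyword. */
    static List<String> scan(List<String> commitMessages) {
        return commitMessages.stream()
                .filter(EnergyKeywordScan::matches)
                .collect(Collectors.toList());
    }
}
```

Plain substring matching over-reports (for example, "energy" also matches identifiers that merely contain the word), so matches from a scan like this still need manual review before they count as energy-saving changes.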