Can LLMs Make Software Testing Greener?
Xutong Liu, Andy Zaidman
09-02-2026
An Empirical Study on JUnit Test Energy Reengineering
Presentation by: Xutong Liu
Xutong Liu
• Postdoc, SERG (Software Engineering Research Group) at TU Delft
• Research interests: Sustainable Software Engineering, Green Testing
• PhD in Computer Science, Nanjing University
  • Major: Software Quality Assurance
  • Thesis topic: Defect Prediction
Green Testing? Why?
Elasticsearch yearly builds ≈ 9.7% of the average household energy consumption in the EU [1]
Elasticsearch yearly builds' CO₂ ≈ driving a petrol car 222 km [1]
Testing ≈ 50% of total build energy consumption [2]
Software testing = essential for reliability
But… test suites consume significant energy ⚡🌍
1. Zaidman, Andy. "An Inconvenient Truth in Software Engineering? The Environmental Impact of Testing Open Source Java Projects." AST 2024.
2. Arntzenius, R. F. "Measuring Energy Consumption during Continuous Integration of Open-Source Java Projects." 2024.
LLM-based code refactoring
LLMs at test refactoring: enhancing the understandability of generated unit tests [1]
LLMs at energy-saving tasks: Copilot can generate energy-efficient non-test code [2]
Research gap: Can LLMs reengineer unit test code to reduce energy consumption?
1. Deljouyi, Amirhossein, et al. "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests." ICSE 2025.
2. Apsan, Radu, et al. "Generating Energy-Efficient Code via Large Language Models: Where Are We Now?" arXiv preprint arXiv:2509.10099, 2025.
LLM-based Unit Test Reengineering
Factors of unit test energy consumption: Unit Under Test + Unit Test + Environment
Research Questions
RQ1. How much energy can be saved by LLM-based test reengineering?
RQ2. What strategies do LLMs attempt in reengineering?
RQ3. What is the impact on code coverage?
Discussion: 💡 Why does reengineering succeed or fail?
This study is: an exploratory study with a basic LLM setup to investigate whether current LLMs can reduce the energy consumption of unit tests.
This study is not: a prompt-optimization study that keeps improving the LLM until clear energy savings of unit tests appear.
Prompt Design and Pipeline
[Figure: system pipeline diagram]
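The pipeline itself is only shown as a diagram here; as a rough, hedged sketch of what such a per-test loop could look like, the Java below uses invented interfaces (Llm, Project) and helper names that are not the study's actual implementation, only placeholders consistent with the steps described on these slides.

import java.util.Optional;

// A minimal sketch of the per-test pipeline as described at a high level on this slide.
// The Llm and Project interfaces and all method names are hypothetical placeholders,
// not the authors' actual implementation.
public class ReengineeringPipelineSketch {

    interface Llm { String reengineer(String testSource, String prompt); }

    interface Project {
        void applyPatch(String testName, String newSource);
        void revert();
        boolean buildAndRun();                            // compile + execute the test suite
        Optional<Double> measureJoules(String testName);  // per-method energy via JoularJX
    }

    static Optional<Double> energySavedJoules(Project project, Llm llm,
                                              String testName, String originalSource) {
        Optional<Double> before = project.measureJoules(testName);
        String modified = llm.reengineer(originalSource,
                "Rewrite this JUnit test to consume less energy while preserving its intent.");
        project.applyPatch(testName, modified);
        if (!project.buildAndRun()) {      // non-buildable or failing rewrites are discarded
            project.revert();
            return Optional.empty();
        }
        Optional<Double> after = project.measureJoules(testName);
        project.revert();
        if (before.isEmpty() || after.isEmpty()) {
            return Optional.empty();       // JoularJX could not capture the method (< 1 ms)
        }
        return Optional.of(before.get() - after.get());  // positive value = energy reduction
    }
}

The key steps mirror the slides: reengineer with the LLM, rebuild and re-run, then compare energy measurements before and after.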
JoularJX: Energy Measurement Tool
Advantages ✅
• Provides energy consumption at Java method granularity
• Requires only one execution to collect results
• Easy integration into a CI pipeline
Limitations ⚠️
• Cannot capture methods with a duration < 1 ms
Note: not all test methods can be captured by JoularJX.
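To make the 1 ms limitation concrete, here is a small, hypothetical JUnit 5 example (not from the studied projects): a near-instant test like the first method would typically fall below JoularJX's resolution and get no per-method reading, while the second does enough work to be attributable.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class JoularJxGranularityExampleTest {

    @Test
    void tinyTest_likelyBelowOneMillisecond() {
        // Finishes in well under 1 ms, so JoularJX would likely report no energy for it.
        assertEquals(4, 2 + 2);
    }

    @Test
    void longerTest_likelyMeasurable() {
        // Does enough work to run for tens of milliseconds, so a per-method
        // energy reading can be attributed to it.
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) {
            sum += i;
        }
        assertEquals(1_249_999_975_000_000L, sum);
    }
}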
RQ1: Can an LLM effectively reduce the energy consumption of unit tests?
• 9,705 tests processed → 3,386 buildable (34.9% runnable rate)
• 1,999 measurable → 1,471 unique (after deduplication)
• 799 tests with energy ↓ → 8.2% more energy efficient
• Project level: 4 out of 8 projects reduced energy at the test-suite level
RQ2: Patterns behind energy savings
• 799 tests with energy ↓
• Strict inclusion criteria: joules diff < 0, p < 0.05, and a non-negligible effect size (a sketch of this filter follows this slide)
• Result: 28 test methods selected
✅ Let's focus on positive cases in this RQ
Human-reviewed categories
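A minimal sketch of that inclusion filter, assuming a hypothetical data model and a commonly used negligible-effect threshold; the slides do not state which effect-size measure or threshold was actually applied.

import java.util.List;

// Hypothetical representation of one measured test; the actual data model is not shown.
record MeasuredTest(String name, double joulesDiff, double pValue, double effectSize) {}

class PositiveCaseFilter {
    // The "non-negligible" threshold is an assumption; conventions (e.g. |Cliff's delta|
    // >= 0.147) vary between studies.
    private static final double NEGLIGIBLE_EFFECT = 0.147;

    static List<MeasuredTest> select(List<MeasuredTest> measured) {
        return measured.stream()
                .filter(t -> t.joulesDiff() < 0)                            // consumes less energy
                .filter(t -> t.pValue() < 0.05)                             // statistically significant
                .filter(t -> Math.abs(t.effectSize()) >= NEGLIGIBLE_EFFECT) // non-negligible effect
                .toList();
    }
}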
Categories of LLM-optimized unit tests
A. Invalid UUT Invocation (4)
• Original test: invokes IOUtils.skipFully() and validates byte skipping & exception handling for invalid input
• LLM-modified test: removes the call to skipFully() and replaces it with a manual loop using InputStream.read()
Summary: LLM modifications in this category may reduce test complexity by replacing utility calls with simpler constructs, but they undermine the test's intent by bypassing the actual UUT and weakening coverage and correctness.
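A hedged before/after illustration of this pattern (simplified, not the actual test from the studied project): the rewrite drops the IOUtils.skipFully() call, so the unit under test is never exercised even though the assertion still passes.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;
import org.junit.jupiter.api.Test;

class SkipFullyExampleTest {

    @Test
    void original_exercisesTheUnitUnderTest() throws Exception {
        InputStream in = new ByteArrayInputStream(new byte[100]);
        IOUtils.skipFully(in, 40);          // the UUT is actually invoked
        assertEquals(60, in.available());
    }

    @Test
    void llmStyleRewrite_bypassesTheUnitUnderTest() throws Exception {
        InputStream in = new ByteArrayInputStream(new byte[100]);
        long remaining = 40;
        while (remaining > 0 && in.read() != -1) {  // manual loop replaces skipFully()
            remaining--;
        }
        assertEquals(60, in.available());   // still passes, but skipFully() is never tested
    }
}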
B. Assertion-Reduction Modifications
B1. Reduce intermediate assertion (4)
• Original test: verifies that applyPatchForFeatures() swaps the values of total and failed features
• Assertions: 3 assertions check ordering and correct swapping
• LLM-modified test: removes assertion (1), keeps only (2) & (3)
Summary: Category B reduces redundancy and improves efficiency by eliminating seemingly subsumed assertions, but at the cost of weakening test completeness and robustness.
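A hedged illustration of this assertion-reduction pattern; the Stats record and its swap() method below are invented stand-ins for the real applyPatchForFeatures() scenario, not the studied project's code.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class AssertionReductionExampleTest {

    // Hypothetical stand-in for the patched report statistics.
    record Stats(int total, int failed) {
        Stats swap() { return new Stats(failed, total); }  // stand-in for applyPatchForFeatures()
    }

    @Test
    void original_threeAssertions() {
        Stats before = new Stats(10, 3);
        Stats after = before.swap();
        assertEquals(10, before.total());   // (1) intermediate check on the original state
        assertEquals(3, after.total());     // (2) swapped value
        assertEquals(10, after.failed());   // (3) swapped value
    }

    @Test
    void llmStyleRewrite_dropsIntermediateAssertion() {
        Stats after = new Stats(10, 3).swap();
        assertEquals(3, after.total());     // (2)
        assertEquals(10, after.failed());   // (3); assertion (1) is gone, weakening the oracle
    }
}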
C. Syntax-Level Refactoring (6)
• Original test: uses FileTime.from()
• LLM-modified test: replaced with FileTime.fromMillis(), reducing temporal precision to milliseconds
• Impact: potentially weakens the test; narrower scope of boundary validation
Summary: Instead of performing deliberate energy-oriented optimizations, in Category C the LLM treats the task as a general code refactoring problem, leveraging its strength in restructuring code but without explicitly focusing on reducing energy consumption.
D. Iteration-Focused Optimization (2)
Efficiency vs. thoroughness
Summary: Category D modifications aim to improve performance or efficiency, but they can also compromise the thoroughness or precision of a test if not carefully constrained.
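The slide does not show the underlying code; purely as a hypothetical illustration of the efficiency-versus-thoroughness trade-off, a rewrite in this category might shrink the input range a loop-based test explores, as below (parity() is an invented unit under test).

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class IterationReductionExampleTest {

    // Hypothetical unit under test.
    static int parity(int n) { return Math.floorMod(n, 2); }

    @Test
    void original_exhaustiveRange() {
        for (int i = -10_000; i <= 10_000; i++) {        // thorough but costly
            assertEquals(Math.abs(i) % 2, parity(i));
        }
    }

    @Test
    void llmStyleRewrite_sampledRange() {
        for (int i = -10_000; i <= 10_000; i += 1_000) { // cheaper, but skips most inputs
            assertEquals(Math.abs(i) % 2, parity(i));
        }
    }
}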
E. Concurrency Optimization (2)
• Context: a test simulating concurrent file deletion during file-tree traversal to trigger race conditions
• Original test:
  • single-threaded sequential deletion (10,000 files)
  • artificial delay + polling with atomic flags
• LLM-modified test:
  • 10,000 independent tasks submitted to an executor
  • no polling overhead, simpler concurrency handling
Summary: Concurrency-focused changes reduce energy use and runtime overhead, but at the cost of the test's reliability in detecting subtle concurrency bugs.
RQ3: How do LLM-generated modifications impact the code coverage of unit tests?
Findings
• Instruction coverage
  • 5/8 projects show small decreases (-1% to -2%)
  • 3 projects unchanged (cucumber-reporting, rabbitmq-mock, spring-petclinic)
• Branch coverage
  • 4 projects show slight decreases (-1% to -5%)
  • 4 projects unchanged
Summary:
1. Overall, code coverage remains stable or declines slightly after LLM modifications.
2. Energy-saving changes come at a modest cost in coverage.
Where can LLMs learn energy-saving knowledge?
Low success rate → lack of knowledge?
Sources examined:
1. Inside projects
2. Stack Overflow
3. GitHub
Where can LLMs learn energy-saving knowledge?
📂 Inside projects:
• 20K+ commits scanned
• 0 energy-saving commits
• 2,123 "refactor/optimize" commits (generic)
💬 Stack Overflow:
• 233 questions (2008–2025)
• 47.6% score ≤ 0 (low value)
• mostly mobile apps; few Java/testing questions
📦 GitHub:
• 3M+ commits, 600K+ files scanned
• 239 matching entries (in 41 of 100 repos)
• mostly Linux driver "power opt" and hardware-accelerated changes
Summary:
• Energy-related knowledge is rare & domain-specific
• LLM training lacks this knowledge → falls back to generic refactoring
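The slides do not show the mining scripts; as a minimal sketch of the kind of keyword scan described here, the following uses JGit (org.eclipse.jgit) with an invented, much shorter keyword list and a placeholder repository path (the study reports 45 energy-related keywords, which are not listed).

import java.io.File;
import java.util.List;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.revwalk.RevCommit;

// Sketch of a keyword-based commit-message scan; keywords and path are illustrative only.
public class EnergyCommitScan {

    private static final List<String> KEYWORDS =
            List.of("energy", "power consumption", "battery", "joule");

    public static void main(String[] args) throws Exception {
        try (Git git = Git.open(new File("/path/to/cloned/repo"))) {
            for (RevCommit commit : git.log().call()) {
                String msg = commit.getFullMessage().toLowerCase();
                if (KEYWORDS.stream().anyMatch(msg::contains)) {
                    System.out.println(commit.getName() + " " + commit.getShortMessage());
                }
            }
        }
    }
}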
Threats to Validity
Internal validity
• Energy reduction may stem from external noise (system activity, JVM randomness)
• Mitigation: controlled "zen" mode, repeated runs, noise filtering, statistical tests
Construct validity
• Energy measured with JoularJX (1 ms resolution limit)
• Very short tests may not be recorded reliably
• Assumed low impact, but precision limits remain
External validity
• Study scope: 8 open-source Java projects (JUnit)
• Hardware limitations
Summary
📊 Findings:
• Only 8.2% of 9K+ tests showed energy reduction
• Code coverage mostly unchanged
• Overall gains are modest
💡 Insight:
• Current effectiveness is limited
• The reason lies in knowledge scarcity
🚀 Future Work:
• Add runtime energy feedback into the LLM workflow
• Build datasets of code + energy profiles
Thanks!
Xutong Liu
09-02-2026
Editor's Notes

• #16: In cases where LLM-generated changes significantly reduced energy consumption, what types of modifications do we observe?
• #25: Stack Overflow: collected all 233 posts under the energy tag; 47.6% scored ≤ 0 → low perceived value; top co-tags: android, python, ios. GitHub: collected the top 100 repos (by stars), defined 45 energy-related keywords, and scanned 3.3M+ commits, 600K+ files, 1.0M+ PRs, and 1.2M+ issues; matches: 239 entries in 41 repos. Examples: Linux driver "power opt", hardware-accelerated changes.