Can LLMs Make Software Testing Greener?
Xutong Liu, Andy Zaidman
09-02-2026
An Empirical Study on JUnit Test Energy Reengineering
Presentation by: Xutong Liu
Xutong Liu
• Postdoc, SERG (Software Engineering Research Group) at TU Delft
• Research interests: Sustainable Software Engineering, Green Testing
• PhD in Computer Science, Nanjing University
  • Major: Software Quality Assurance
  • Thesis topic: Defect Prediction
Green Testing? Why?
Elasticsearch yearly builds ≈ 9.7% of the average household energy consumption in the EU [1]
Elasticsearch yearly builds' CO₂ ≈ driving a petrol car 222 km [1]
Testing ≈ 50% of total build energy consumption [2]
Software testing = essential for reliability
But… test suites consume significant energy ⚡🌍
1. Zaidman, Andy. "An Inconvenient Truth in Software Engineering? The Environmental Impact of Testing Open Source Java Projects." AST 2024.
2. Arntzenius, R. F. "Measuring Energy Consumption during Continuous Integration of Open-Source Java Projects." 2024.
LLM-based code refactoring
LLMs at test refactoring: enhancing the understandability of generated unit tests [1]
LLMs at energy-saving tasks: Copilot can generate energy-efficient non-test code [2]
Research gap: Can LLMs reengineer unit test code to reduce energy consumption?
1. Deljouyi, Amirhossein, et al. "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests." ICSE 2025.
2. Apsan, Radu, et al. "Generating Energy-Efficient Code via Large Language Models: Where Are We Now?" arXiv preprint arXiv:2509.10099, 2025.
LLM-based Unit Test Reengineering
Factors of unit test energy consumption: Unit Under Test + Unit Test + Environment
Research Questions
RQ1. How much energy can be saved by LLM-based test reengineering?
RQ2. What strategies do LLMs attempt in reengineering?
RQ3. What is the impact on code coverage?
Discussion: 💡 Why does reengineering succeed or fail?
This study is: an exploratory study with a basic LLM setup to investigate whether current LLMs can reduce the energy consumption of unit tests.
This study is not: a prompt-optimization study that keeps improving the LLM until clear energy savings of unit tests appear.
Prompt Design and Pipeline
[Figure: system pipeline diagram]
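The pipeline itself is only shown as a diagram here; as a rough, hedged sketch of what such a per-test loop could look like, the Java below uses invented interfaces (Llm, Project) and helper names that are not the study's actual implementation, only placeholders consistent with the steps described on these slides.

import java.util.Optional;

// A minimal sketch of the per-test pipeline as described at a high level on this slide.
// The Llm and Project interfaces and all method names are hypothetical placeholders,
// not the authors' actual implementation.
public class ReengineeringPipelineSketch {

    interface Llm { String reengineer(String testSource, String prompt); }

    interface Project {
        void applyPatch(String testName, String newSource);
        void revert();
        boolean buildAndRun();                            // compile + execute the test suite
        Optional<Double> measureJoules(String testName);  // per-method energy via JoularJX
    }

    static Optional<Double> energySavedJoules(Project project, Llm llm,
                                              String testName, String originalSource) {
        Optional<Double> before = project.measureJoules(testName);
        String modified = llm.reengineer(originalSource,
                "Rewrite this JUnit test to consume less energy while preserving its intent.");
        project.applyPatch(testName, modified);
        if (!project.buildAndRun()) {      // non-buildable or failing rewrites are discarded
            project.revert();
            return Optional.empty();
        }
        Optional<Double> after = project.measureJoules(testName);
        project.revert();
        if (before.isEmpty() || after.isEmpty()) {
            return Optional.empty();       // JoularJX could not capture the method (< 1 ms)
        }
        return Optional.of(before.get() - after.get());  // positive value = energy reduction
    }
}

The key steps mirror the slides: reengineer with the LLM, rebuild and re-run, then compare energy measurements before and after.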
JoularJX: Energy Measurement Tool
Advantages ✅
• Provides energy consumption at Java method granularity
• Requires only one execution to collect results
• Easy integration into a CI pipeline
Limitations ⚠️
• Cannot capture methods with a duration < 1 ms
Note: not all test methods can be captured by JoularJX.
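To make the 1 ms limitation concrete, here is a small, hypothetical JUnit 5 example (not from the studied projects): a near-instant test like the first method would typically fall below JoularJX's resolution and get no per-method reading, while the second does enough work to be attributable.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class JoularJxGranularityExampleTest {

    @Test
    void tinyTest_likelyBelowOneMillisecond() {
        // Finishes in well under 1 ms, so JoularJX would likely report no energy for it.
        assertEquals(4, 2 + 2);
    }

    @Test
    void longerTest_likelyMeasurable() {
        // Does enough work to run for tens of milliseconds, so a per-method
        // energy reading can be attributed to it.
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) {
            sum += i;
        }
        assertEquals(1_249_999_975_000_000L, sum);
    }
}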
RQ1: Can an LLM effectively reduce the energy consumption of unit tests?
• 9,705 tests processed → 3,386 buildable (34.9% runnable rate)
• 1,999 measurable → 1,471 unique (after deduplication)
• 799 tests with energy ↓ → 8.2% more energy efficient
• Project level: 4 out of 8 projects reduced energy at the test-suite level
RQ2: Patterns behind energy savings
• 799 tests with energy ↓
• Strict inclusion criteria: joules diff < 0, p < 0.05, and a non-negligible effect size (a sketch of this filter follows this slide)
• Result: 28 test methods selected
✅ Let's focus on positive cases in this RQ
Human-reviewed categories
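A minimal sketch of that inclusion filter, assuming a hypothetical data model and a commonly used negligible-effect threshold; the slides do not state which effect-size measure or threshold was actually applied.

import java.util.List;

// Hypothetical representation of one measured test; the actual data model is not shown.
record MeasuredTest(String name, double joulesDiff, double pValue, double effectSize) {}

class PositiveCaseFilter {
    // The "non-negligible" threshold is an assumption; conventions (e.g. |Cliff's delta|
    // >= 0.147) vary between studies.
    private static final double NEGLIGIBLE_EFFECT = 0.147;

    static List<MeasuredTest> select(List<MeasuredTest> measured) {
        return measured.stream()
                .filter(t -> t.joulesDiff() < 0)                            // consumes less energy
                .filter(t -> t.pValue() < 0.05)                             // statistically significant
                .filter(t -> Math.abs(t.effectSize()) >= NEGLIGIBLE_EFFECT) // non-negligible effect
                .toList();
    }
}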
Categories of LLM-optimized unit tests
A. Invalid UUT Invocation (4)
• Original test: invokes IOUtils.skipFully() and validates byte skipping & exception handling for invalid input
• LLM-modified test: removes the call to skipFully() and replaces it with a manual loop using InputStream.read()
Summary: LLM modifications in this category may reduce test complexity by replacing utility calls with simpler constructs, but they undermine the test's intent by bypassing the actual UUT and weakening coverage and correctness.
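A hedged before/after illustration of this pattern (simplified, not the actual test from the studied project): the rewrite drops the IOUtils.skipFully() call, so the unit under test is never exercised even though the assertion still passes.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;
import org.junit.jupiter.api.Test;

class SkipFullyExampleTest {

    @Test
    void original_exercisesTheUnitUnderTest() throws Exception {
        InputStream in = new ByteArrayInputStream(new byte[100]);
        IOUtils.skipFully(in, 40);          // the UUT is actually invoked
        assertEquals(60, in.available());
    }

    @Test
    void llmStyleRewrite_bypassesTheUnitUnderTest() throws Exception {
        InputStream in = new ByteArrayInputStream(new byte[100]);
        long remaining = 40;
        while (remaining > 0 && in.read() != -1) {  // manual loop replaces skipFully()
            remaining--;
        }
        assertEquals(60, in.available());   // still passes, but skipFully() is never tested
    }
}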
B. Assertion-Reduction Modifications
B1. Reduce intermediate assertion (4)
• Original test: verifies that applyPatchForFeatures() swaps the values of total and failed features
• Assertions: 3 assertions check ordering and correct swapping
• LLM-modified test: removes assertion (1), keeps only (2) & (3)
Summary: Category B reduces redundancy and improves efficiency by eliminating seemingly subsumed assertions, but at the cost of weakening test completeness and robustness.
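A hedged illustration of this assertion-reduction pattern; the Stats record and its swap() method below are invented stand-ins for the real applyPatchForFeatures() scenario, not the studied project's code.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class AssertionReductionExampleTest {

    // Hypothetical stand-in for the patched report statistics.
    record Stats(int total, int failed) {
        Stats swap() { return new Stats(failed, total); }  // stand-in for applyPatchForFeatures()
    }

    @Test
    void original_threeAssertions() {
        Stats before = new Stats(10, 3);
        Stats after = before.swap();
        assertEquals(10, before.total());   // (1) intermediate check on the original state
        assertEquals(3, after.total());     // (2) swapped value
        assertEquals(10, after.failed());   // (3) swapped value
    }

    @Test
    void llmStyleRewrite_dropsIntermediateAssertion() {
        Stats after = new Stats(10, 3).swap();
        assertEquals(3, after.total());     // (2)
        assertEquals(10, after.failed());   // (3); assertion (1) is gone, weakening the oracle
    }
}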
C. Syntax-Level Refactoring (6)
• Original test: uses FileTime.from()
• LLM-modified test: replaced with FileTime.fromMillis(), reducing temporal precision to milliseconds
• Impact: potentially weakens the test; narrower scope of boundary validation
Summary: Instead of performing deliberate energy-oriented optimizations, in Category C the LLM treats the task as a general code refactoring problem, leveraging its strength in restructuring code but without explicitly focusing on reducing energy consumption.
D. Iteration-Focused Optimization (2)
Efficiency vs. thoroughness
Summary: Category D modifications aim to improve performance or efficiency, but they can also compromise the thoroughness or precision of a test if not carefully constrained.
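The slide does not show the underlying code; purely as a hypothetical illustration of the efficiency-versus-thoroughness trade-off, a rewrite in this category might shrink the input range a loop-based test explores, as below (parity() is an invented unit under test).

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class IterationReductionExampleTest {

    // Hypothetical unit under test.
    static int parity(int n) { return Math.floorMod(n, 2); }

    @Test
    void original_exhaustiveRange() {
        for (int i = -10_000; i <= 10_000; i++) {        // thorough but costly
            assertEquals(Math.abs(i) % 2, parity(i));
        }
    }

    @Test
    void llmStyleRewrite_sampledRange() {
        for (int i = -10_000; i <= 10_000; i += 1_000) { // cheaper, but skips most inputs
            assertEquals(Math.abs(i) % 2, parity(i));
        }
    }
}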
E. Concurrency Optimization (2)
• Context: a test simulating concurrent file deletion during file-tree traversal to trigger race conditions
• Original test:
  • single-threaded sequential deletion (10,000 files)
  • artificial delay + polling with atomic flags
• LLM-modified test:
  • 10,000 independent tasks submitted to an executor
  • no polling overhead, simpler concurrency handling
Summary: Concurrency-focused changes reduce energy use and runtime overhead, but at the cost of the test's reliability in detecting subtle concurrency bugs.
RQ3: How do LLM-generated modifications impact the code coverage of unit tests?
Findings
• Instruction coverage
  • 5/8 projects show small decreases (-1% to -2%)
  • 3 projects unchanged (cucumber-reporting, rabbitmq-mock, spring-petclinic)
• Branch coverage
  • 4 projects show slight decreases (-1% to -5%)
  • 4 projects unchanged
Summary:
1. Overall, code coverage remains stable or declines slightly after LLM modifications.
2. Energy-saving changes come at a modest cost in coverage.
Where can LLMs learn energy-saving knowledge?
Low success rate → lack of knowledge?
Sources examined:
1. Inside projects
2. Stack Overflow
3. GitHub
Where can LLMs learn energy-saving knowledge?
📂 Inside projects:
• 20K+ commits scanned
• 0 energy-saving commits
• 2,123 "refactor/optimize" commits (generic)
💬 Stack Overflow:
• 233 questions (2008–2025)
• 47.6% score ≤ 0 (low value)
• mostly mobile apps; few Java/testing questions
📦 GitHub:
• 3M+ commits, 600K+ files scanned
• 239 matching entries (in 41 of 100 repos)
• mostly Linux driver "power opt" and hardware-accelerated changes
Summary:
• Energy-related knowledge is rare & domain-specific
• LLM training lacks this knowledge → falls back to generic refactoring
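The slides do not show the mining scripts; as a minimal sketch of the kind of keyword scan described here, the following uses JGit (org.eclipse.jgit) with an invented, much shorter keyword list and a placeholder repository path (the study reports 45 energy-related keywords, which are not listed).

import java.io.File;
import java.util.List;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.revwalk.RevCommit;

// Sketch of a keyword-based commit-message scan; keywords and path are illustrative only.
public class EnergyCommitScan {

    private static final List<String> KEYWORDS =
            List.of("energy", "power consumption", "battery", "joule");

    public static void main(String[] args) throws Exception {
        try (Git git = Git.open(new File("/path/to/cloned/repo"))) {
            for (RevCommit commit : git.log().call()) {
                String msg = commit.getFullMessage().toLowerCase();
                if (KEYWORDS.stream().anyMatch(msg::contains)) {
                    System.out.println(commit.getName() + " " + commit.getShortMessage());
                }
            }
        }
    }
}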
Threats to Validity
Internal validity
• Energy reduction may stem from external noise (system activity, JVM randomness)
• Mitigation: controlled "zen" mode, repeated runs, noise filtering, statistical tests
Construct validity
• Energy measured with JoularJX (1 ms resolution limit)
• Very short tests may not be recorded reliably
• Assumed low impact, but precision limits remain
External validity
• Study scope: 8 open-source Java projects (JUnit)
• Hardware limitations
Summary
📊 Findings:
• Only 8.2% of 9K+ tests showed energy reduction
• Code coverage mostly unchanged
• Overall gains are modest
💡 Insight:
• Current effectiveness is limited
• The reason lies in knowledge scarcity
🚀 Future Work:
• Add runtime energy feedback into the LLM workflow
• Build datasets of code + energy profiles
Thanks!
Xutong Liu
09-02-2026
Editor's Notes

• #16: In cases where LLM-generated changes significantly reduced energy consumption, what types of modifications do we observe?
• #25: Stack Overflow: collected all 233 posts under the energy tag; 47.6% scored ≤ 0 → low perceived value; top co-tags: android, python, ios. GitHub: collected the top 100 repos (by stars), defined 45 energy-related keywords, and scanned 3.3M+ commits, 600K+ files, 1.0M+ PRs, and 1.2M+ issues; matches: 239 entries in 41 repos. Examples: Linux driver "power opt", hardware-accelerated changes.