On-Device or Remote? On the Energy Efficiency of
Fetching LLM-Generated Content

¹Vince Nguyen, ¹Vidya Dhopate, ¹Hieu Huynh, ¹Hiba Bouhlal, ¹Anusha Annengala,
²Gian Luca Scoccia, ³Matias Martinez, ¹Vincenzo Stoico, ¹Ivano Malavolta
¹ Vrije Universiteit Amsterdam, The Netherlands, ² Gran Sasso Science Institute, Italy,
³ Universitat Politècnica de Catalunya, Spain
{x.nguyenthanhvinh | v.dhopate | x.huynhthaihieu | h.bouhlal | a.annengala}@student.vu.nl,
gianluca.scoccia@gssi.it, matias.martinez@upc.edu, v.stoico@vu.nl, i.malavolta@vu.nl
Introduction
2
Advantages of fetching content locally
✓ Enhanced privacy: no HTTP requests to remote servers → reduced data exposure
✓ Offline functionality: continuous access without connectivity
✓ Latency reduction: local processing enables real-time responses
curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1",
    "input": "Write a one-sentence bedtime story about a unicorn."
  }'
Fetching Remote Data
OpenAI developer platform, https://platform.openai.com/docs/overview?lang=curl
Martine Paris, Forbes, https://shorturl.at/BkAKH
Goal
3
High computational demands
⚠ Remote hosting: LLMs require cloud servers for inference (high computational demands)
⚠ Dual impact of moving inference on-device:
● Network energy savings from eliminated server requests
● Increased computation costs from local inference

The goal of this study is to compare the energy usage of fetching LLM-generated content from a remote server via HTTP requests versus generating the content with on-device LLMs.
Study Design
4
RQ1: How does the energy usage of the client machine differ when fetching LLM-generated content on-device versus remotely?
RQ2: What is the correlation between performance metrics and the energy usage of the client machine when fetching LLM-generated content on-device versus remotely?
- RQ2.1: Execution Time
- RQ2.2: CPU Usage
- RQ2.3: GPU Usage
- RQ2.4: Memory Usage
(2 Fetching Methods x 7 LLMs x 3 Output Lengths) x 30 Repetitions = 1260 Runs
Blocking Factor - Output Length:
Short (100 words), Medium (500 words), and Long (1000 words)
Yuan, Y. et al. The Impact of Knowledge Distillation on the
Energy Consumption and Runtime Efficiency of NLP Models, CAIN
2024, https://shorturl.at/1Uqvb
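The full-factorial design above can be sketched in a few lines of Python. This is an illustration of the run count and randomization, not the study's actual script; the model tags are placeholders, not the exact identifiers used.

```python
import random
from itertools import product

# Experimental factors (model tags are illustrative placeholders)
FETCHING_MODES = ["on-device", "remote"]
LLMS = ["llama3.1:8b", "gemma:2b", "gemma:7b",
        "qwen2:1.5b", "qwen2:7b", "phi3", "mistral:7b"]
OUTPUT_LENGTHS = ["short", "medium", "long"]  # 100 / 500 / 1000 words
REPETITIONS = 30

# Enumerate every (mode, model, length) treatment 30 times
runs = [
    (mode, llm, length, rep)
    for (mode, llm, length) in product(FETCHING_MODES, LLMS, OUTPUT_LENGTHS)
    for rep in range(REPETITIONS)
]
random.shuffle(runs)  # randomized run order mitigates ordering effects

print(len(runs))  # (2 x 7 x 3) x 30 = 1260
```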
Subjects Selection
5
7 Quantized LLMs via Ollama
● 10 to 30 seconds to generate content
● More than 3 million downloads as of October 2024
Llama 3.1 8B
Gemma 1.1 2B and 7B
Qwen2 1.5B and 7B
Phi3 3B
Mistral 0.3 7B
Prompt Topics
● Most-viewed Wikipedia pages from Dec. 1, 2007 to Jan. 1, 2023
● Extracted 101 topics
“Please give me information about [topic]”
Experiment Execution
6
HTTP Requests:
- On-device: the Python code makes HTTP requests to http://localhost/api/generate
- Remote: we expose the server's IP address publicly so that it can be called by our test machine at http://[server_IP_address]/api/generate
Experiment Runner: a Python-based framework to define and execute the steps of an experiment on any platform
Steps to mitigate confounding factors:
- Randomize the order of all individual runs
- Cooldown period of 90 seconds between runs
- Fixed fan speed (6,000 rotations per minute)
S2 Group, Experiment Runner, https://github.com/S2-group/experiment-runner
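The two fetching modes above differ only in the target host. A minimal sketch of such a request, assuming the request shape of Ollama's documented /api/generate endpoint (JSON body with `model`, `prompt`, and `stream` fields); this is not the study's actual measurement script:

```python
import json
import urllib.request


def build_generate_request(host: str, model: str, prompt: str):
    """Build the URL and JSON body for an Ollama /api/generate call.

    `host` is "localhost" for on-device inference, or the remote
    server's IP address for remote fetching -- the two modes differ
    only in this value.
    """
    url = f"http://{host}/api/generate"
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return url, body


def fetch_generated(host: str, model: str, prompt: str) -> str:
    """Send the request and return the generated text (blocking)."""
    url, body = build_generate_request(host, model, prompt)
    req = urllib.request.Request(
        url, data=body.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```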
7
(Results) RQ1: Energy Usage
On-Device Processing Uses 4–9× More Energy Than Remote

Energy Usage (J) - Stats
Output Length | Fetching Mode | Mean   | Median | SD
Short         | On-device     |  52.82 |  55.00 |  20.94
Short         | Remote        |  15.18 |  14.30 |   5.86
Medium        | On-device     | 349.34 | 403.80 | 179.15
Medium        | Remote        |  41.01 |  47.55 |  14.18
Long          | On-device     | 431.97 | 462.50 | 246.92
Long          | Remote        |  48.56 |  47.80 |  19.86

Statistically significant difference (Wilcoxon test, p < 0.05)
High variability (on-device SDs 6-7× higher)
Large effect size (Cliff's Delta: 0.94–0.96)
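The effect size reported above can be computed directly from per-run energy samples. A minimal sketch of Cliff's delta (the sample values below are toy numbers for illustration, not the study's measurements):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size: P(X > Y) - P(X < Y) over all pairs.

    Values near +1 or -1 mean one group's measurements almost always
    dominate the other's, as with the 0.94-0.96 reported above.
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    smaller = sum(1 for x in xs for y in ys if x < y)
    return (greater - smaller) / (len(xs) * len(ys))


# Toy illustration: on-device energy samples consistently above
# remote ones give a delta of exactly 1.0 (every pair dominates).
on_device = [52.8, 55.0, 49.1, 60.2]
remote = [15.2, 14.3, 16.0, 13.9]
print(cliffs_delta(on_device, remote))  # 1.0
```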
8
(Results) RQ2: Energy vs Performance Correlation

RQ2.1: Energy Usage vs Execution Time
The longer a fetch takes to execute, the more energy is consumed on the client device
Strong positive correlation (Spearman's rank correlation, p < 0.001)

RQ2.2: Energy Usage vs CPU Usage
The higher the CPU usage, the lower the energy usage
Linear negative correlation (Spearman's rank correlation, p < 0.001)
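Spearman's rank correlation used above is simply Pearson correlation applied to ranks. A minimal sketch assuming untied data (a real analysis would use scipy.stats.spearmanr, which also handles ties and reports p-values):

```python
def ranks(values):
    """1-based ranks; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)


def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank variables."""
    return pearson(ranks(xs), ranks(ys))


# A perfectly monotonic relationship (e.g., energy rising with
# execution time) yields rho = 1 even when it is non-linear.
print(spearman([1.0, 2.0, 3.0, 4.0], [2.1, 4.9, 30.0, 31.5]))  # ~1.0
```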
9
(Results) RQ2: Energy vs Performance Correlation

RQ2.3: Energy Usage vs GPU Usage
GPU usage has a greater impact for longer, more complex generated content
Significant positive correlation in On-Device (p < 0.001)
Statistically significant negative correlation in Remote (p < 0.001)

RQ2.4: Energy Usage vs Memory Usage
No evident pattern between energy and memory usage
Small positive difference in On-Device
No significant correlation in Remote (p in [0.57, 0.89])
(Discussion) RQ1: Energy Usage
10
On-Device Processing Uses 4–9× More Energy Than Remote
LLM characteristics can play a role in inference energy efficiency

Influencing Factors
▸ Network conditions
- latency benefits from client-server proximity and fewer network hops
- latency can increase under different traffic conditions (requests from multiple client devices)
▸ Hardware
- the remote server is an AI server (CPU: 24 cores, RAM: 64 GB, GPU: Nvidia RTX 4070, 2023)
11
(Discussion) RQ2: Correlation Performance - Energy

Average CPU and GPU Usage (%)
Length | Fetching Mode | CPU Mean | CPU Median | CPU SD | GPU Mean | GPU Median | GPU SD
Short  | On-device     |  4.93 |  4.43 | 1.84 | 77.95 | 77.81 | 11.65
Short  | Remote        |  9.95 |  8.60 | 4.10 |  0.72 |  0.72 |  0.27
Medium | On-device     |  3.54 |  3.37 | 0.76 | 93.06 | 92.79 | 94.23
Medium | Remote        |  5.12 |  4.53 | 2.14 |  0.29 |  0.23 |  0.15
Long   | On-device     |  3.52 |  3.36 | 0.68 | 93.23 | 93.72 |  4.25
Long   | Remote        |  4.51 |  4.22 | 1.68 |  0.26 |  0.22 |  0.13

Average Memory Usage (%)
Length | Fetching Mode | Mean  | Median | SD
Short  | On-device     | 70.13 | 71.25 | 7.21
Short  | Remote        | 57.03 | 56.00 | 5.86
Medium | On-device     | 71.27 | 72.85 | 7.20
Medium | Remote        | 56.81 | 56.48 | 5.04
Long   | On-device     | 71.26 | 72.65 | 7.31
Long   | Remote        | 56.63 | 56.22 | 5.06

▸ Linearity between energy usage and execution time
▸ Higher CPU usage, lower energy usage
▸ Correlation between output length and GPU usage on-device
▸ High memory usage on-device
▸ No significant correlation between energy and memory in Remote mode
Conclusion
12
On-Device Processing Uses 4–9× More Energy Than Remote

Implications
▸ Low battery (<10%)? Prioritize cloud-based LLMs to reduce client energy consumption
▸ Privacy/offline needs? Use on-device LLMs but limit response length (e.g., ≤100 words) to minimize energy drain
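The two implications above amount to a simple client-side decision rule. A hypothetical sketch, where the function name, threshold, and word cap are illustrative and not part of the study:

```python
def choose_fetching_mode(battery_pct: float,
                         needs_privacy_or_offline: bool,
                         max_on_device_words: int = 100):
    """Pick a fetching mode following the implications above.

    Returns (mode, word_limit); word_limit is None when unconstrained.
    """
    if needs_privacy_or_offline:
        # Privacy/offline needs win, but cap response length to
        # limit the energy drain of local inference.
        return ("on-device", max_on_device_words)
    # Otherwise prefer remote fetching, especially on low battery
    # (<10%): it used far less client energy in this study.
    return ("remote", None)
```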
Optimization Strategies
▸ Quantization testing: compare 4-bit/8-bit/16-bit models for accuracy-energy tradeoffs
▸ Focus on resource utilization: the interplay between energy and usage of other computational resources (e.g., network, GPU, CPU)

On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Content (CAIN 2025)
