Dr. Tamar Eilam discusses sustainable computing and AI sustainability. Deep learning requires a lot of computation and energy to train large models. The demand for AI is growing exponentially, as are the sizes of language models. Foundation models are becoming more common, where a broad pre-trained model is adapted for specific tasks. However, continuously training larger models risks increasing energy consumption significantly. Sustainable AI research aims to dynamically track energy and carbon usage, while helping data scientists determine optimal model training strategies based on transparency around computational costs and model performance.
3. Approximate and partial list of contributors, in arbitrary order:
• Energy Modeling and Quantification: Marcelo Amaral, Huamin Chen, Tatsuhiro Chiba, Rina Nakazawa, Sunyanan Choochotkaew, Eun K. Lee, Umamaheswari Devi, Aanchal Goyal
• Workload Classification: Xi Yang, Rohan R. Arora, Chandra Narayanaswami, Cheuk Lam, Jerrold Leichter, Yu Deng, Daby Sow
• Energy-Aware Optimization: Tayebeh Bahreini, Asser Tantawi, Alaa Youssef, Chen Wang
• AI Systems: Jeffrey Burns, Leland Chang, Ankur Agrawal, Kailash Gopalakrishnan, Pradip Bose
• AI Quantification and Metrics: Pedro Bello-Maldonado, Bishwaranjan Bhattacharjee, Carlos Costa
• AI Infrastructure Innovation: Seetharami Seelam
• Model Architecture Innovation: David Cox, Rameswar Panda, Rogerio Feris, Leonid Karlinsky
8. The Computer Energy Problem
We are at an inflection point:
1. Demand is growing at exponential scale.
2. The emergence of energy-demanding workloads (AI): AI power consumption doubles every 3-4 months.*
3. The end of Dennard scaling means we can't keep up.
Some predict that electricity consumed by data centers will increase to 8% of global demand by 2030. This is a golden era for chip design.
"How to stop data centers from gobbling up the world's electricity" https://www.nature.com/articles/d41586-018-06610-y
* Green AI, R. Schwartz, J. Dodge, N. A. Smith, O. Etzioni, 2019
13. Carbon Assessment & Reduction Framework: An Approach for Sustainable Computing
• Estimate: energy and CFP per workload, tenant, VM, container, service, etc.
• Assess: identify hotspots and applicable strategies; calculate potential savings.
• Act: a set of controllers to dynamically optimize the carbon footprint at operation; design efficient systems.
• Report: report CFP across your entire organization in a consistent fashion, factoring in requirements.
14. Energy Quantification Challenge
• How do you estimate the power consumption of applications running on shared servers?
• How do you do that when you do not have on-line power measurement at the server level?
• How do you do that if you do not know what else is running on the machine?
15. Energy Quantification Challenge
• How do you estimate the power consumption of applications running on shared servers? => ratio-based approach
• How do you do that when you do not have on-line power measurement at the server level? => power modeling
• How do you do that if you do not know what else is running on the machine? => dynamic power estimation only
• How do you scale the approach to developing power models (combinatorial explosion problem)?
The Kepler Project: https://github.com/sustainable-computing-io/kepler
17. Kepler Deployment Approaches
Ratio power model for dynamic CPU power:
• With hardware counters:
  DynPower_process_i = (CPU cycles_process_i / Σ CPU cycles) × DynPower_host_CPU
• Without hardware counters:
  DynPower_process_i = (BPF CPU time_process_i / Σ BPF CPU time) × DynPower_host_CPU
• Per container: DynPower_container_j = Σ_{i ∈ j} DynPower_process_i
• Idle power is distributed evenly: IdlePower_container_j = IdlePower_host_CPU / numContainers
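The ratio model above can be written down in a few lines. This is a hypothetical standalone rendering for illustration (Kepler itself is implemented in Go); process and container names are made up:

```python
def dyn_power_per_process(cpu_cycles, host_dyn_power_watts):
    """Attribute the host's dynamic CPU power to processes in proportion
    to their share of CPU cycles (or, equivalently, BPF CPU time)."""
    total = sum(cpu_cycles.values())
    if total == 0:
        return {pid: 0.0 for pid in cpu_cycles}
    return {pid: cycles / total * host_dyn_power_watts
            for pid, cycles in cpu_cycles.items()}

def container_power(process_power, pid_to_container, idle_power_watts, num_containers):
    """Sum per-process dynamic power per container, then spread the host's
    idle power evenly across containers."""
    containers = {}
    for pid, watts in process_power.items():
        c = pid_to_container[pid]
        containers[c] = containers.get(c, 0.0) + watts
    idle_share = idle_power_watts / num_containers
    return {c: w + idle_share for c, w in containers.items()}
```

For example, two processes with 600 and 400 CPU cycles on a host drawing 50 W of dynamic power get 30 W and 20 W respectively; a container holding both, on a host with 10 W idle power split across two containers, is attributed 55 W.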
18. Kepler Model Server Project
Facilitates training power models for servers without a power meter.
[Diagram: on a bare-metal server with a power meter, Kepler reads real power metrics (RAPL, ACPI/sensors, Redfish/IPMI, GPU via nvml) and applies the ratio power model to attribute process/container power consumption; this data trains a power model. On a VM or bare-metal server without a power meter, Kepler uses the trained power model from the Kepler Model Server to estimate system power metrics, then applies the ratio power model.]
Motivation:
• No power measurement is exposed or instrumented in some running systems.
Challenges:
• No (or not enough) data to train power models specific to all available metrics and to emerging system platforms and settings (e.g., variety of CPU architectures, frequency governors).
• Dynamicity of control-plane processes.
Pipeline: Collect Data → Train Model → Export Model → Serve Model → Estimate Power.
19. Core of the Kepler Model Server
Pipeline framework (one extractor, one isolator, multiple trainers):
• Extract: pull the energy metric and energy-related metrics from Prometheus query results.
• Isolate: separate background power from the extracted data.
• Train: fit node-level power models (with background power) and container-level power models (without background power).
https://www.cncf.io/blog/2023/10/11/exploring-keplers-potentials-unveiling-cloud-application-power-consumption/
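A minimal sketch of the extract → isolate → train stages, under simplifying assumptions: a single CPU-utilization feature, a plain least-squares trainer, and made-up sample dictionaries (the real model server consumes Prometheus query results and supports multiple trainer implementations):

```python
def extract(samples):
    """Pull (cpu_utilization, measured_power) pairs out of raw samples.
    The 'cpu_util'/'power_watts' keys are assumed for this sketch."""
    return [(s["cpu_util"], s["power_watts"]) for s in samples]

def isolate(pairs, background_watts):
    """Remove background (idle / control-plane) power so the trainer
    sees only dynamic power."""
    return [(u, max(p - background_watts, 0.0)) for u, p in pairs]

def train(pairs):
    """Fit dynamic_power ~= slope * cpu_util + intercept by ordinary
    least squares, returning the trained model as a callable."""
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    cov = sum((u - mean_u) * (p - mean_p) for u, p in pairs)
    var = sum((u - mean_u) ** 2 for u, _ in pairs)
    slope = cov / var
    intercept = mean_p - slope * mean_u
    return lambda util: slope * util + intercept
```

Given samples where measured power is 100 W of background plus 2 W per utilization point, isolating 100 W of background and training recovers the 2 W/point dynamic model.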
20. The Issue with Third-Party Clouds
• No server power metric is available.
• No knowledge of what else is running on the machine: how do you split idle power?
• Limited knowledge of the architecture and configuration of the bare-metal servers: a challenge for applying separately trained power models.
• All existing cloud-native calculators are too coarse-grained to be useful for optimization.
22. Carbon Assessment & Reduction Framework (recap of slide 13): Estimate, Assess, Act, Report.
23. Workload Classification: Motivation
Detect non-productive workloads among:
• Virtual machines
• Cloud-native deployments
• Cloud services
Can schedules be drawn up for a few (if not all) productive workloads?
25. Carbon Assessment & Reduction Framework (recap of slide 13): Estimate, Assess, Act, Report.
26. CARE: Carbon Quantification & Reduction
A coordinated set of controllers to dynamically quantify and optimize the carbon footprint at every level of the hybrid cloud stack, in and across on-prem and off-prem data centers.
Controllers within each data center: container right-sizing, energy-aware scheduler, VM placement, power management. Dynamic dispatching coordinates workloads across data centers.
CFP = E_IT × PUE × CI
• Leverage renewable energy when and where it is available across data centers.
• Efficiency in container resource consumption within a data center.
• Efficient infrastructure with VM and power management.
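The CFP formula (IT energy, scaled up by facility overhead via PUE, times the grid's carbon intensity) can be applied directly. The PUE and carbon-intensity values in the example are illustrative assumptions, not measured figures:

```python
def carbon_footprint_kg(it_energy_kwh, pue, grid_ci_kg_per_kwh):
    """CFP = E_IT x PUE x CI.
    it_energy_kwh: energy consumed by the IT equipment (kWh)
    pue: power usage effectiveness of the facility (>= 1.0)
    grid_ci_kg_per_kwh: carbon intensity of the grid (kg CO2e per kWh)"""
    return it_energy_kwh * pue * grid_ci_kg_per_kwh
```

For example, a 450 kWh training run in a facility with an assumed PUE of 1.5 on a grid at an assumed 0.4 kg CO2e/kWh yields 270 kg CO2e.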
29. The Energy Cost of AI
Deep learning is computationally intensive, and time-consuming even with high-performance computing resources.
Example: training an image-recognition model (dataset: ImageNet-22K; network: ResNet-101)
• 256 GPUs: 7 hours, ~450 kWh
• 4 GPUs: 16 days, ~385 kWh
One model training run is roughly two weeks of home energy consumption.
https://arxiv.org/abs/1708.02188
30. AI Demand Keeps Surging
Training requirements are doubling every 3.5 months.
Sources: Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. CoRR abs/1906.02243. arXiv:1906.02243; Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. arXiv:1907.10597 [cs.CY]
31. The Emergence of Foundation Models
Homogenization: a broad foundation model is adapted to perform specific tasks. Almost all state-of-the-art NLP models are now adapted from one of a few foundation models, such as BERT, RoBERTa, BART, T5, etc. Multi-modal and cross-domain models are next.
Source: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, et al. 2022. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs.LG]
32. Sizes and Training Costs of Language Models
GPT-3 needed 1,024 A100 GPUs for 34 days of training!
Large language models are getting larger. Some say this is okay because they are re-used for multiple tasks*, but this claim is yet to be substantiated by a sound analysis.
* E.g., David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon Emissions and Large Neural Network Training.
33. Data Scientist Dilemma: To Adapt or Not to Adapt
• To adapt from a broad model, or to train a smaller model on a more specific data set?
• How much data to use?
• Can I synthesize a few smaller models?
• Neural architecture search? Hyperparameter optimization? Is it worth the cost? Well, it depends...
• What is the optimal frequency of re-training: daily or weekly? What data shall I use for re-training: incremental or complete?
34. Sustainable AI Platform Principles
• Transparency: dynamically track energy and carbon across the data and model life cycle.
• Traceability and Governance: track the 'supply chain' of models and data-sets and their associated energy and carbon.
• Energy Efficiency: innovation across all layers of the stack.
• Meaningful Metrics.
10/17/2023
35. Meaningful Metrics Categories
Products = data-set | model. Core metrics fall into two categories: life-cycle and efficiency.
• Life-cycle: spans construction (pre-training) and operation (re-training, inference). Factor in the provenance of models and data-sets and their associated energy and carbon footprint (Life-Cycle Assessment principles).
• Efficiency: efficiency = cost / work produced. What goes into 'cost'? Compute for inference + training + a bill-of-material 'tax'.
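The efficiency metric can be sketched as a cost-per-unit-of-work calculation, with 'cost' decomposed as the slide suggests. All parameter names and numbers here are purely illustrative assumptions:

```python
def efficiency(work_produced, inference_energy_kwh, training_energy_kwh, bom_tax_kwh):
    """efficiency = cost / work produced (lower is better), where cost
    folds in inference compute, training compute, and an amortized
    bill-of-material 'tax' inherited from upstream products."""
    cost = inference_energy_kwh + training_energy_kwh + bom_tax_kwh
    return cost / work_produced
```

For example, 1,000 inference requests served at a total cost of 50 + 40 + 10 kWh gives 0.1 kWh per request.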
36. A Holistic Approach to Sustainable AI
• Factor in the entire life cycle of models.
• Sustainable strategy exploration and what-if analysis.
• Provenance, governance, and reporting.
• Holistic impact analysis and tradeoff-based planning.
• AI sustainability metrics.
38. The Life Cycle of a Model as a State Machine
Each 'state transition' is associated with a significant energy/carbon cost and involves critical decisions that affect the cost of this and downstream tasks.
• There are tradeoffs between accuracy, time-to-value, and energy/carbon.
• The cost of one phase may depend on decisions taken at a prior stage: save now, pay later...
• The particulars of the target task are important to factor in early on.
39. On-Line Fine-Grained Monitoring of Energy and Carbon with Kepler
• An open-source project pioneered by Red Hat and IBM Research to quantify the energy/carbon of cloud-native applications.
• On the road map to deliver in OCP and integrate in ROSA.
• Adrian Cockcroft advocates the use of Kepler across all cloud providers: "Real Time Energy and Carbon Standard for Cloud Providers".
40. SusQL: Context-Aware Aggregation and Energy Accounting
Infrastructure: a Kubernetes controller with its own CRD that gets data from Kepler for aggregation. The susql-controller maintains a map[labels] -> energy table.
apiVersion: …
kind: LabelGroup
metadata: …
spec:
  labels:
  - <label-1>
  - <label-2>
  - <label-3>
  - <label-4>
status:
  totalEnergy: <total energy>
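The map[labels] -> energy table kept by the susql-controller can be illustrated with a toy aggregator. This is a hypothetical Python stand-in (the actual controller is a Kubernetes operator reconciling LabelGroup resources), accumulating energy deltas reported by Kepler per label set:

```python
class LabelGroupAggregator:
    """Toy stand-in for the susql-controller's labels -> energy table."""

    def __init__(self):
        self.total_energy = {}  # tuple of sorted labels -> joules

    def add_sample(self, labels, joules_delta):
        # Sorting makes the key order-insensitive, so ["a", "b"] and
        # ["b", "a"] accumulate into the same LabelGroup.
        key = tuple(sorted(labels))
        self.total_energy[key] = self.total_energy.get(key, 0.0) + joules_delta

    def total(self, labels):
        return self.total_energy.get(tuple(sorted(labels)), 0.0)
```

Two samples of 5 J and 7 J tagged with the same labels (in any order) roll up to a totalEnergy of 12 J for that LabelGroup.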
42. A 'Supply Chain' of Models
Models are created ('manufactured'), distilled, fine-tuned, and re-used (adapted) to create new models. Deployment is just the beginning of the journey. How do we reason about the life-cycle cost of models?
43. Product Life-Cycle Assessment Principles for Sustainable AI
Products = data-set | model
We need to factor in the cost of the bill of material used in the creation of a new model: if B (a product or a service) is used in the process of creating A1, A2, ..., An, then the carbon cost of B is inherited by A1, A2, ..., An in proportion to their use.
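The proportional-inheritance rule can be written down directly. This is a sketch under the assumption that "use" is expressed in some consistent accounting unit (GPU-hours, tokens processed, etc.) chosen by the organization:

```python
def inherit_carbon(b_carbon_kg, usage_shares):
    """If product B is used to create A1..An, B's carbon cost is
    inherited by each Ai in proportion to Ai's use of B.
    usage_shares: {product_name: usage in any consistent unit}"""
    total = sum(usage_shares.values())
    return {name: b_carbon_kg * share / total
            for name, share in usage_shares.items()}
```

For example, if a foundation model B cost 100 kg CO2e and downstream models A1 and A2 use it in a 3:1 ratio, A1 inherits 75 kg and A2 inherits 25 kg.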
46. Efficiency at Every Layer of the AI Stack
Every layer of the FM stack offers opportunities for efficiency gains:
• Model: quantization, architecture innovation
• Tools: dynamic batching
• Platform: multiplexing, dispatching
• Infrastructure: DVFS, power parameter optimization, caching
• Systems: approximate computing and other system innovations
Empower the data scientist to make choices and explore tradeoffs between accuracy, performance, and energy, and to reason about life-cycle strategies: e.g., if/what/when to re-use, and how much to retrain.
49. Vision for AI Performance Scaling
• Applying approximate computing techniques to AI compute.
• Critical requirement: maintain model accuracy.
• Advantage: quadratic improvement in performance (scaling bit precision yields quadratic gains at iso-accuracy).
• IBM Research has been at the forefront of every major technical advancement in bit-precision scaling:
  • 16-bit training (ICML 2015)
  • 8-bit training (NeurIPS 2018, 2019)
  • 4-bit training (X. Sun et al., NeurIPS 2020)
  • 4-bit inference ASICs (J. Choi et al., https://arxiv.org/pdf/1805.06085.pdf; J. McKinstry et al., https://arxiv.org/abs/1809.04191)
  • 2-bit inference ASICs (J. Choi et al., SysML 2019)
• Complemented by sparsity support, analog computing, and 3D stacking.
[Figure: bit precision of digital AI cores for training and inference, 2012-2024, falling from 32-bit to 2/4-bit.]
https://research.ibm.com/blog/ibm-artificial-intelligence-unit-aiu
https://research.ibm.com/blog/ai-chip-precision-scaling
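A toy illustration of what reduced bit precision means in practice. This is generic symmetric quantization, not IBM's actual training or inference scheme: FP32 values are mapped to 4-bit integers in [-8, 7] with a single shared scale, cutting storage and arithmetic width by 8x:

```python
def quantize_int4(values):
    """Map floats to 4-bit signed integers [-8, 7] with a shared scale."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 1.0
    scale = peak / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the 4-bit codes."""
    return [x * scale for x in q]
```

Round-tripping [0.0, 0.7, -0.3, 0.2] gives codes [0, 7, -3, 2], which dequantize back to the originals within the quantization step of 0.1.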
51. Vela: A Cloud-Native Supercomputer for the Foundation Model Age (Kepler inside)
System specifications:
– Nodes with 8 x A100 GPUs (80 GB), interconnected with NVLink and NVSwitch
– Cascade Lake CPUs, 1.5 TB of DRAM, four 3.2 TB NVMe drives
– Redundant connections between nodes, TORs, and spines; 2 x 100G NICs per node (NCCL benchmarks show we drive close to line rate)
Cloud-native qualities:
– Configure resources through software (APIs)
– Broad ecosystem of available cloud services; leverage data sets on Cloud Object Store
– Standard, flexible, scalable infrastructure design (vs. traditional HPC)
– Near bare-metal performance (within 5%, single node)
How do you evolve from a specialized (monolithic), costly, and inflexible HPC stack to a cloud-native stack without compromising efficiency? Programmability, scalability, re-use, observability, agility, democratization.
https://research.ibm.com/blog/AI-supercomputer-Vela-GPU-cluster
53. Dispatching of Jobs Based on Renewable Energy
Motivation:
• The carbon intensity of the energy mix of different regions of IBM data centers varies over time.
• Renewable energy is not available at all times and in all places.
Workload optimization: placement and scheduling of workloads based on carbon-free energy availability. Ideal dispatching: high CPU utilization when carbon intensity is low, and low CPU utilization when carbon intensity is high.
Challenge: ideal dispatching might be practically infeasible.
• Short jobs may have short deadlines.
• Some jobs are not interruptible.
• Jobs have heterogeneous resource demands.
• Obtaining the optimal packing is intractable.
T. Bahreini, A. Tantawi and A. Youssef, "An Approximation Algorithm for Minimizing the Cloud Carbon Footprint through Workload Scheduling," 2022 IEEE 15th International Conference on Cloud Computing (CLOUD), 2022, pp. 522-531.
54. In a single data center: how to order batch jobs for minimum carbon while meeting deadlines. Polynomial approximation algorithms that work across data centers (space x time).
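The single-data-center ordering problem can be illustrated with a simple greedy heuristic (this is not the approximation algorithm from the cited paper, and the one-job-per-slot model is a deliberate simplification): handle tightest deadlines first, and give each job the greenest time slot that is still feasible:

```python
def schedule(jobs, carbon_intensity):
    """jobs: list of (name, energy_kwh, deadline_slot).
    carbon_intensity: kg CO2e/kWh for each time slot.
    Assumes one job per slot. Returns (slot assignment, total carbon)."""
    assignment, used = {}, set()
    # Tightest deadlines first; each job takes the lowest-carbon free slot.
    for name, kwh, deadline in sorted(jobs, key=lambda j: j[2]):
        feasible = [s for s in range(deadline + 1) if s not in used]
        if not feasible:
            raise ValueError(f"no feasible slot for job {name}")
        best = min(feasible, key=lambda s: carbon_intensity[s])
        assignment[name] = best
        used.add(best)
    total_kg = sum(kwh * carbon_intensity[assignment[name]]
                   for name, kwh, _ in jobs)
    return assignment, total_kg
```

With slot intensities [0.5, 0.2, 0.4, 0.1], a job due by slot 1 lands in slot 1 (0.2) and a job due by slot 3 lands in slot 3 (0.1), matching the "run when carbon intensity is low" intuition from slide 53.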
58. Call to Action
• AI platform providers: build in transparency and governance; incorporate platform and system innovation for efficiency.
• Academia and industry: focus your research on efficiency, not just accuracy.
• Data scientists / practitioners: develop a sustainability mind-set. Re-use where it makes sense; domain-specific, smaller models are better! Explore tradeoffs (accuracy vs. cost).
…energy efficiency throughout the memory hierarchy.
Since different tasks require a different system composition for best utilization, data centers need to be rearchitected in the future using disaggregation and composability. This allows flexible composition and system configuration to optimally serve a particular task. Considering that the various technical components (CPUs, GPUs, memory, and storage) have different lifecycles, disaggregation additionally improves system performance and reduces cost, as components can be replaced separately. Common memory systems for AI/ML applications include on-chip memory, high-bandwidth memory (HBM), and GDDR, and all have different architectural implications. A universal goal is to realize memory technology with much higher bandwidth and lower latency, while consuming less energy. While HBM DRAMs are already very power-efficient, roughly 2/3 of the power budget is still spent moving data between an SoC and the DRAM (Figure 2.4). Reducing the volume of data moved provides an opportunity for large improvement, but this requires further research.
Different concepts for disaggregation of memory and storage have already been proposed, but more research is needed to identify the best way to use disaggregation to achieve TCO benefits at scale and improve latency. To generate these benefits, a multi-tiered memory approach that includes the use of storage-class memories is needed. The new architectures can pose a challenge but can also provide an opportunity for application development. The impact on legacy code needs to be understood and mitigated.
Foundation models have led to an unprecedented level of homogenization: almost all state-of-the-art NLP models are now adapted from one of a few foundation models, such as BERT, RoBERTa, BART, T5, etc.
Training GPT-3, which is a single general-purpose AI program that can generate language and has many different uses, took 1.287 gigawatt hours, according to a research paper published in 2021, or about as much electricity as 120 US homes would consume in a year. That training generated 502 tons of carbon emissions, according to the same paper, or about as much as 110 US cars emit in a year. That’s for just one program, or “model.” While training a model has a huge upfront power cost, researchers found in some cases it’s only about 40% of the power burned by the actual use of the model, with billions of requests pouring in for popular programs. Plus, the models are getting bigger. OpenAI’s GPT-3 uses 175 billion parameters, or variables, that the AI system has learned through its training and retraining. Its predecessor used just 1.5 billion.
Another relative measure comes from Google, where researchers found that artificial intelligence made up 10 to 15% of the company’s total electricity consumption, which was 18.3 terawatt hours in 2021. That would mean that Google’s AI burns around 2.3 terawatt hours annually, about as much electricity each year as all the homes in a city the size of Atlanta.
https://www.bloomberg.com/news/articles/2023-03-09/how-much-energy-do-ai-and-chatgpt-use-no-one-knows-for-sure
Packaged as a PCIe card, for ease of integration into virtually any on-premises or cloud system. Integration into the IBM Watson software stack is underway, to power the AI inference infrastructure of IBM Research's Foundation Model big bet. Enabled in the Red Hat software stack, including PyTorch and TensorFlow integration.
We can drop from 32-bit floating-point arithmetic to bit-formats holding a quarter as much information. This simplified format dramatically cuts the amount of number crunching needed to train and run an AI model, without sacrificing accuracy. We leverage key IBM breakthroughs from the last five years to find the best tradeoff between speed and accuracy. This is not a chip we designed entirely from scratch; rather, it is the scaled version of an already proven AI accelerator built into our Telum chip, which powers the IBM z16 system.
So, we asked ourselves: how do we deliver bare-metal performance inside of a VM? Following a significant amount of research and discovery, we devised a way to expose all of the capabilities on the node (GPUs, CPUs, networking, and storage) into the VM so that the virtualization overhead is less than 5%, which is the lowest overhead in the industry that we’re aware of. This work includes configuring the bare-metal host for virtualization with support for Virtual Machine Extensions (VMX), single-root IO virtualization (SR-IOV), and huge pages. We also needed to faithfully represent all devices and their connectivity inside the VM, such as which network cards are connected to which CPUs and GPUs, how GPUs are connected to the CPU sockets, and how GPUs are connected to each other. These, along with other hardware and software configurations, enabled our system to achieve close to bare metal performance.
Bare metal vs. VMs || Ethernet vs. InfiniBand || OpenShift scheduling (MCAD) vs. LSF
We enabled SR-IOV for the network interface cards on each node, thereby exposing each 100G link directly into the VMs via virtual functions.
We can hide the communication time over the network behind compute time occurring on the GPUs. This approach is aided by our choice of GPUs with 80 GB of memory (discussed above), which allows us to use bigger batch sizes (compared to the 40 GB model) and leverage the Fully Sharded Data Parallel (FSDP) training strategy more efficiently.
Next, we'll be rolling out an implementation of remote direct memory access (RDMA) over converged Ethernet (RoCE) at scale, and GPU Direct RDMA (GDR), to deliver the performance benefits of RDMA and GDR while minimizing adverse impact on other traffic. Our lab measurements indicate that this will cut latency in half.