AI Sustainability
Dr. Tamar Eilam
IBM Fellow, Chief Scientist Sustainable Computing, IBM Research
Approximate and partial list of contributors, in arbitrary order:
• Energy modeling and quantification: Marcelo Amaral, Huamin Chen, Tatsuhiro Chiba, Rina Nakazawa, Sunyanan Choochotkaew, Eun K Lee, Umamaheswari Devi, Aanchal Goyal
• Workload classification: Xi Yang, Rohan R Arora, Chandra Narayanaswami, Cheuk Lam, Jerrold Leichter, Yu Deng, Daby Sow
• Energy-aware optimization: Tayebeh Bahreini, Asser Tantawi, Alaa Youssef, Chen Wang
• AI systems: Jeffrey Burns, Leland Chang, Ankur Agrawal, Kailash Gopalakrishnan, Pradip Bose
• AI quantification and metrics: Pedro Bello-Maldonado, Bishwaranjan Bhattacharjee, Carlos Costa
• AI infrastructure innovation: Seetharami Seelam
• Model architecture innovation: David Cox, Rameswar Panda, Rogerio Feris, Leonid Karlinsky
The Climate Impact Chain
Human activity → increased greenhouse gas (GHG) in the atmosphere → global warming → global climate change → physical and biological impact → human socio-economic impact.
• $150 billion: average cost in damages per year.
• 100M+: increase in population facing hunger.
Mitigation approaches include carbon capture, geoengineering, and reducing carbon emissions; sustainable computing contributes to the last of these.
Part 1: Sustainable Computing
What is Sustainable Computing?
The ability to measure, quantify, and ultimately reduce the carbon footprint at every layer of the computing stack, within and across data centers, and across the entire life cycle.
The Computer Energy Problem
We are at an inflection point:
1. Demand is growing at exponential scale. Some predict that electricity consumed by data centers will increase to 8% of global demand by 2030. ("How to stop data centers from gobbling up the world's electricity," https://www.nature.com/articles/d41586-018-06610-y)
2. Energy-demanding workloads (AI) are emerging. AI power consumption doubles every 3-4 months. (Green AI, R. Schwartz, J. Dodge, N. A. Smith, O. Etzioni, 2019)
3. The end of Dennard scaling means we can't keep up; this is a golden era for chip design.
Ever-rising energy demands for computing versus global energy production are creating new risks, and new opportunities for radically different computing to drastically improve efficiency.
• 31%: the yearly energy-consumption increase trend for hyperscalers in North America.
• >10% of the world's power will be consumed by hyperscalers by 2030.
Sustainable Computing Epochs
1. Making the current state more sustainable:
• Understanding the as-is
• Hot-spot detection
• Remediation and optimization
• Coupling power and cloud
• Cooling, data center planning, etc.
• Storage auto-tiering
2. Introducing accelerators (digital) and hardware/software co-design and co-optimization:
• HW and SW co-design (scalable approach)
• Reduced-precision chips: 8-bit precision approximate computing
• Voltage scaling with error correction
• Runtime management of disaggregated and composable heterogeneous data centers
3. New computational models (beyond digital):
• Models that completely break the relationship between energy and computation: neuromorphic, analog AI, data-centric, quantum, etc.
https://research.ibm.com/blog/telum-processor
https://www.esp.cs.columbia.edu
https://research.ibm.com/blog/the-hardware-behind-analog-ai
https://www.zurich.ibm.com/sto/memory/
Modeling the Data Center Carbon Footprint
Carbon intensity (CI): the emission rate, in grams of carbon dioxide released per megajoule of energy produced.
• With coal power stations, the carbon intensity is high, as CO2 is produced as part of the power-generation process: CI is >1 kg/kWh for coal.
• Renewable sources such as hydro or solar produce almost no emissions, so their carbon intensity is very low: CI is ~0 for solar/wind.
Power usage effectiveness (PUE): the predominant metric used to measure the energy efficiency of a data center.
• PUE = (Total Facility Energy) / (IT Equipment Energy)
• Efficiency improves as the quotient decreases toward 1: a PUE of 1 is optimal, 2 is very bad.
Total carbon footprint (CFP): the total amount of carbon dioxide (CO2) and equivalent greenhouse-gas emissions associated with powering a data center; CFP >= 0.
Carbon Footprint = IT Equipment Energy × Power Usage Effectiveness × Carbon Intensity
CFP = E_IT × PUE × CI
[Figure: an example data center energy breakdown.]
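To make the formula concrete, here is a minimal sketch in Python; all the input numbers below are illustrative assumptions, not measurements.

# Minimal sketch of the data center carbon footprint model: CFP = E_IT x PUE x CI.
# All inputs below are illustrative assumptions, not measured values.

def carbon_footprint_kg(e_it_kwh: float, pue: float, ci_kg_per_kwh: float) -> float:
    """Carbon footprint in kg CO2e for a given IT energy draw."""
    return e_it_kwh * pue * ci_kg_per_kwh

# Example: 1000 kWh of IT load in a facility with PUE 1.5,
# on a coal-heavy grid (~1 kg CO2e/kWh) vs. a low-carbon grid (~0.05 kg CO2e/kWh).
print(carbon_footprint_kg(1000, 1.5, 1.0))   # 1500.0 kg CO2e
print(carbon_footprint_kg(1000, 1.5, 0.05))  # 75.0 kg CO2e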
Reducing the Data Center Carbon Footprint: Research Opportunities
Carbon Footprint = IT Equipment Energy × Power Usage Effectiveness × Carbon Intensity
CFP = E_IT × ERE × CI, where ERE (energy reuse effectiveness) replaces PUE once waste-heat reuse is considered.
Reducing facility overhead (the PUE/ERE factor):
• Data center design, cooling, and heat reuse
• Rack design to optimize power conversion, and direct liquid cooling
• Improving power conversion in the data center
Reducing IT energy (E_IT):
• Energy-aware scheduling, vertical scaling, dispatching
• Power management
• Accelerators for green AI: trade-offs between accuracy and efficiency
• Chip design
Reducing effective carbon intensity (CI):
• Dispatching batch workloads, such as AI training jobs, across time and space to maximize renewable-energy use
• Forecasting of renewable energy (time-series decomposition)
• Can the cloud sense renewable energy and adapt?
https://research.ibm.com/blog/ibm-artificial-intelligence-unit-aiu
https://www.zurich.ibm.com/st/energy_efficiency/zeroemission.html
Carbon Assessment & Reduction Framework: An Approach for Sustainable Computing
• Estimate: energy and CFP per workload, tenant, VM, container, service, etc.
• Assess: identify hotspots and applicable strategies; calculate potential savings.
• Act: a set of controllers to dynamically optimize the carbon footprint during operation; design efficient systems.
• Report: report CFP across your entire organization in a consistent fashion, factoring in requirements.
The sections that follow walk through these phases, starting with Estimate.
Energy Quantification Challenge
• How do you estimate the power consumption of applications running on shared servers? => ratio-based approach
• How do you do that when you do not have online power measurement at the server level? => power modeling
• How do you do that if you do not know what else is running on the machine? => dynamic power estimation only
• How do you scale the approach to developing power models (the combinatorial-explosion problem)?
The Kepler Project
https://github.com/sustainable-computing-io/kepler
Kepler Architecture
• eBPF metrics: hardware counters, CPU time, and soft IRQs
• System power metrics from bare-metal machines (BMs) and VMs
• Ratio power model for containers
• Trained power model to estimate a VM's per-component power consumption
Kepler Deployment Approaches
Ratio power model for dynamic CPU power:
• With hardware counters:
  DynPower_{process i} = (CPU cycles_{process i} / Σ CPU cycles) × DynPower_{host CPU}
• Without hardware counters:
  DynPower_{process i} = (BPF CPU time_{process i} / Σ BPF CPU time) × DynPower_{host CPU}
• Container power is the sum over its processes:
  DynPower_{container j} = Σ_{i ∈ j} DynPower_{process i}
• Idle power is distributed evenly across containers:
  IdlePower_{container j} = IdlePower_{host CPU} / numContainers
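A minimal sketch of the ratio model in Python follows. The variable names and numbers are illustrative; Kepler itself obtains these inputs from eBPF (CPU cycles or BPF CPU time) and from RAPL or a trained model (host power).

# Hedged sketch of Kepler's ratio power model. Inputs are illustrative;
# Kepler reads per-process activity via eBPF and host power via RAPL
# or a trained model.

def dyn_power_per_process(activity: dict[str, float], host_dyn_power_w: float) -> dict[str, float]:
    """Split host dynamic CPU power across processes in proportion to activity
    (CPU cycles if hardware counters are available, else BPF CPU time)."""
    total = sum(activity.values())
    return {pid: host_dyn_power_w * a / total for pid, a in activity.items()}

def container_power(process_power: dict[str, float],
                    container_of: dict[str, str],
                    host_idle_power_w: float) -> dict[str, float]:
    """Sum per-process dynamic power per container, then spread idle power evenly."""
    power: dict[str, float] = {}
    for pid, p in process_power.items():
        c = container_of[pid]
        power[c] = power.get(c, 0.0) + p
    idle_share = host_idle_power_w / len(power)
    return {c: p + idle_share for c, p in power.items()}

# Example: two processes in container "a", one in "b"; 40 W dynamic, 20 W idle.
proc = dyn_power_per_process({"p1": 3e9, "p2": 1e9, "p3": 4e9}, 40.0)
print(container_power(proc, {"p1": "a", "p2": "a", "p3": "b"}, 20.0))
# -> {'a': 30.0, 'b': 30.0}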
Kepler Model Server Project
Facilitates training power models for servers without a power meter. Two deployment modes:
• Server with a power meter (bare metal): Kepler reads power directly from RAPL, ACPI/sensors, Redfish/IPMI, and GPU (NVML), and applies the ratio power model to attribute process/container power consumption.
• Server without a power meter (e.g., a VM): Kepler applies a trained power model to estimate system power metrics, then applies the ratio power model for process/container power consumption.
Motivation:
• No power measurement is exposed or instrumented in some running systems.
Challenges:
• No, or not enough, data to train power models specific to all available metrics and to emerging system platforms and settings (e.g., variety of CPU architectures, frequency governors)
• Dynamicity of control-plane processes
The core of the Kepler Model Server is a pipeline: Collect Data → Train Model → Export Model → Serve Model → Estimate Power.
Pipeline framework (one extractor, one isolator, multiple trainers):
• Extract: turn Prometheus query results (the energy metric plus energy-related metrics) into extracted data.
• Isolate: derive isolated data (without background power) from extracted data (with background power).
• Train: node-level power models are trained on extracted data; container-level power models are trained on isolated data.
https://www.cncf.io/blog/2023/10/11/exploring-keplers-potentials-unveiling-cloud-application-power-consumption/
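A minimal sketch of that extract → isolate → train flow follows; the function names, the idle estimate, and the linear model are illustrative assumptions, not the project's actual pipeline API.

import numpy as np

# Hedged sketch of the model-server pipeline: one extractor, one isolator,
# then node-level and container-level training. Names/shapes are illustrative.

def extract(prom_rows):
    """Extractor: turn raw Prometheus-style samples into (feature, target) arrays."""
    t = np.array([r["cpu_time"] for r in prom_rows], dtype=float)
    p = np.array([r["power_watts"] for r in prom_rows], dtype=float)
    return t, p

def isolate(t, p):
    """Isolator: strip background (idle) power so container-level models
    see dynamic power only. min() is a crude idle estimate for this sketch."""
    idle = p.min()
    return t, p - idle, idle

rows = [{"cpu_time": x, "power_watts": 50 + 0.8 * x} for x in range(0, 100, 10)]
t, p = extract(rows)
node_slope, node_icept = np.polyfit(t, p, 1)   # node-level model (with background power)
tc, pc, idle = isolate(t, p)
cont_slope, _ = np.polyfit(tc, pc, 1)          # container-level model (without background)
print(round(node_slope * 50 + node_icept, 1), round(idle, 1))  # -> 90.0 50.0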
The Issue with Third-Party Clouds
• No server power metric is available.
• No knowledge of what else is running on the machine, so how do we split idle power?
• Limited knowledge of the architecture and configuration of the bare-metal servers, which makes it challenging to apply separately trained power models.
• All cloud-native calculators are too coarse-grained to be useful for optimization.
Adrian Cockcroft, "Proposal for a Real-Time Carbon Footprint Standard":
https://adrianco.medium.com/proposal-for-a-realtime-carbon-footprint-standard-60b71c269948
How can we get to real-time monitoring of application carbon consumption in third-party clouds that is consistent, trustworthy, transparent, and explainable?
Can Kepler help? What else do we need?
Work in progress: a reference implementation to be open-sourced.
Carbon Assessment & Reduction Framework (recap). Next, the Assess phase: identify hotspots and applicable strategies, and calculate potential savings.
Workload Classification: Motivation
Detect non-productive workloads among:
• Virtual machines
• Cloud-native deployments
• Cloud services
Can schedules be drawn up for a few (if not all) productive workloads?
Methodology
Workload classification proceeds by phase abstraction: each workload's metrics are abstracted into inactive/active phases over an observation window, then classified:
• Non-productive (remaining in the inactive phase) → candidate for termination.
• Constantly productive (remaining in the active phase) → no action.
• Alternating (switching between the two phases) → if the pattern is repeatable, workload timetabling makes it a candidate for parking; if non-repeatable, no action.
[Figure: metrics for VM_1 … VM_N over a 7/14/21-day window ending at time T, with classification window T − w_c.]
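A toy sketch of this phase-based classification follows; the activity threshold, window length, and repeatability test are illustrative assumptions, not the method's actual parameters.

# Toy sketch of phase-based workload classification. Threshold, window,
# and repeatability test are illustrative assumptions.

def to_phases(cpu_util: list[float], active_threshold: float = 0.05) -> list[int]:
    """Abstract a utilization series into phases: 1 = active, 0 = inactive."""
    return [1 if u >= active_threshold else 0 for u in cpu_util]

def classify(phases: list[int]) -> str:
    active = sum(phases) / len(phases)
    if active == 0.0:
        return "non-productive -> candidate for termination"
    if active == 1.0:
        return "constantly productive -> no action"
    # Alternating: crude repeatability check against a daily (24-sample) period.
    day = 24
    repeatable = all(phases[i] == phases[i % day] for i in range(len(phases)))
    return ("alternating, repeatable -> candidate for parking (timetable)"
            if repeatable else "alternating, non-repeatable -> no action")

idle_vm = [0.0] * 48
office_vm = ([0.0] * 8 + [0.5] * 10 + [0.0] * 6) * 2   # same schedule two days running
print(classify(to_phases(idle_vm)))
print(classify(to_phases(office_vm)))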
Carbon Assessment & Reduction Framework (recap). Next, the Act phase: a set of controllers to dynamically optimize the carbon footprint during operation.
CARE: Carbon Quantification & Reduction
A coordinated set of controllers to dynamically quantify and optimize the carbon footprint at every level of the hybrid-cloud stack, in and across on- and off-premises data centers.
CFP = E_IT × PUE × CI, and the controllers attack each factor:
• CI: dynamic dispatching to leverage renewable energy when and where it is available across data centers.
• E_IT: efficiency in container resource consumption within a data center, via container right-sizing and an energy-aware scheduler.
• PUE: efficient infrastructure through VM placement and power management.
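A small what-if sketch showing how the three levers compound multiplicatively; all baseline numbers and improvement factors are illustrative assumptions.

# What-if sketch: the three CARE levers compound multiplicatively in
# CFP = E_IT x PUE x CI. All numbers are illustrative assumptions.

baseline = {"e_it_kwh": 10_000, "pue": 1.6, "ci": 0.4}  # kWh, ratio, kg CO2e/kWh

def cfp(s):
    return s["e_it_kwh"] * s["pue"] * s["ci"]

levers = {
    "right-sizing cuts E_IT 20%":  {**baseline, "e_it_kwh": 8_000},
    "better facility, PUE 1.2":    {**baseline, "pue": 1.2},
    "carbon-aware dispatch, CI/2": {**baseline, "ci": 0.2},
    "all three combined":          {"e_it_kwh": 8_000, "pue": 1.2, "ci": 0.2},
}
print(f"baseline: {cfp(baseline):,.0f} kg CO2e")
for name, s in levers.items():
    print(f"{name}: {cfp(s):,.0f} kg CO2e")
# combined: 0.8 * 0.75 * 0.5 = a 70% reduction vs. baseline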
Part 2: AI Sustainability
The Energy Cost of AI
• Deep learning is computationally intensive.
• Training is time-consuming even with high-performance computing resources.
Take, for example, training an image-recognition model (dataset: ImageNet-22K; network: ResNet-101):
• 256 GPUs: 7 hours, ~450 kWh
• 4 GPUs: 16 days, ~385 kWh
One model-training run is roughly two weeks of home energy consumption (~450 kWh at a typical ~30 kWh/day US household).
https://arxiv.org/abs/1708.02188
AI demand keeps surging: training requirements are doubling every 3.5 months.
Sources: Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. arXiv:1906.02243. Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. arXiv:1907.10597 [cs.CY].
The Emergence of Foundation Models
Homogenization: a broad foundation model is adapted to perform specific tasks. Almost all state-of-the-art NLP models are now adapted from one of a few foundation models, such as BERT, RoBERTa, BART, T5, etc. Multi-modal and cross-domain models are next.
Source: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, et al. 2022. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs.LG].
Sizes and Training Cost of Language Models
Large language models are getting larger: GPT-3 needs 1024 A100 GPUs for 34 days of training!
Some say that this is okay because the models are re-used for multiple tasks.* This claim is yet to be substantiated by a sound analysis.
*E.g., David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon Emissions and Large Neural Network Training.
Data Scientist Dilemma: To Adapt or Not to Adapt
• Adapt from a broad model, or train a smaller model on a more specific data set?
• How much data to use?
• Can I synthesize a few smaller models?
• Neural architecture search? Hyperparameter optimization? Is it worth the cost? Well, it depends….
• What is the optimal frequency of re-training: daily? weekly? And what data should be used for re-training: incremental or complete?
Sustainable AI Platform Principles
• Transparency: dynamically track energy and carbon across the data and model life cycle.
• Traceability and governance: track the 'supply chain' of models and data sets and the associated energy and carbon.
• Energy-efficiency innovation across all layers of the stack.
• Meaningful metrics.
Meaningful Metrics Categories
Products = data sets and models. Three metric categories apply to them:
• Core metric: measured per life-cycle stage, across construction (pre-training) and operation (re-training and inference).
• Life cycle: factor in the provenance of models and data sets and their associated energy and carbon footprint (life-cycle-assessment principles).
• Efficiency: efficiency = cost / work produced. What goes into 'cost'? Compute for inference; plus training; plus a bill-of-material 'tax'.
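A small sketch of that cost accounting follows; the cost components and all the energy numbers are illustrative assumptions.

# Sketch of the efficiency metric with a progressively more inclusive 'cost'.
# All numbers are illustrative assumptions.

tokens_served = 1e9           # 'work produced' for this example
inference_kwh = 2_000.0       # compute for inference
training_kwh = 50_000.0       # + training
bom_tax_kwh = 10_000.0        # + amortized bill-of-material 'tax' (upstream models/data)

for label, cost in [
    ("inference only", inference_kwh),
    ("+ training", inference_kwh + training_kwh),
    ("+ BOM tax", inference_kwh + training_kwh + bom_tax_kwh),
]:
    print(f"{label}: {cost / tokens_served * 1e6:.2f} kWh per million tokens")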
A Holistic Approach to Sustainable AI
• Factor in the entire life cycle of models.
• Sustainable-strategy exploration and what-if analysis.
• Provenance, governance, and reporting.
• Holistic impact analysis and trade-off-based planning.
• AI sustainability metrics.
Transparency
The Life Cycle of a Model as a State Machine
Each 'state transition' is associated with a significant energy/carbon cost and involves critical decisions that will affect the cost of this and downstream tasks.
• There are trade-offs between accuracy, time-to-value, and energy/carbon.
• The cost of one phase may depend on decisions taken at a prior stage: save now, pay later….
• The particulars of the target task are important to factor in early on.
On-Line Fine-Grained Monitoring of Energy and Carbon with Kepler
• An open-source project pioneered by Red Hat and IBM Research to quantify the energy/carbon of cloud-native applications.
• On the road map to deliver in OCP and integrate into ROSA.
• Adrian Cockcroft advocates the use of Kepler across all cloud providers: "Real Time Energy and Carbon Standard for Cloud Providers."
SusQL: Context-Aware Aggregation and Energy Accounting
Infrastructure: a Kubernetes controller with its own CRD that gets data from Kepler for aggregation. The susql-controller maintains a map[labels] -> energy table, keyed by label groups:

apiVersion: …
kind: LabelGroup
metadata: …
spec:
  labels:
    - <label-1>
    - <label-2>
    - <label-3>
    - <label-4>
status:
  totalEnergy: <total energy>
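A toy sketch of the label-keyed aggregation the controller performs; the names and shapes are illustrative, not the SusQL API.

# Toy sketch of label-group energy aggregation, in the spirit of SusQL's
# map[labels] -> energy table. Names/shapes are illustrative, not the real API.

from collections import defaultdict

def aggregate(samples, group_keys):
    """Sum per-pod energy into label groups defined by group_keys."""
    table = defaultdict(float)
    for s in samples:
        key = tuple(s["labels"].get(k, "") for k in group_keys)
        table[key] += s["joules"]
    return dict(table)

samples = [
    {"labels": {"app": "train", "team": "nlp"}, "joules": 120.0},
    {"labels": {"app": "train", "team": "nlp"}, "joules": 80.0},
    {"labels": {"app": "serve", "team": "nlp"}, "joules": 30.0},
]
print(aggregate(samples, ["app", "team"]))
# -> {('train', 'nlp'): 200.0, ('serve', 'nlp'): 30.0}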
Governance
A 'Supply Chain' of Models
Models are created ('manufactured'), distilled, fine-tuned, and re-used (adapted) to create new models. Deployment is just the beginning of the journey. How do we reason about the life-cycle cost of models?
Product Life-Cycle Assessment Principles for Sustainable AI
Products = data-set | model
We need to factor in the cost of the bill of material used in the creation of a new model: if B (a product or a service) is used in the process of creating A1, A2, …, An, then the carbon cost of B is inherited by A1, A2, …, An in proportion to their use.
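A worked sketch of that inheritance rule; all the carbon numbers and usage shares below are hypothetical.

# Sketch of proportional carbon inheritance: if product B is used to create
# A1..An, B's carbon is inherited in proportion to use. Numbers are hypothetical.

def inherit(b_carbon_kg: float, usage_shares: dict[str, float]) -> dict[str, float]:
    """Split B's embodied carbon across its consumers, proportional to use."""
    total = sum(usage_shares.values())
    return {a: b_carbon_kg * u / total for a, u in usage_shares.items()}

# A base foundation model cost 500,000 kg CO2e to train. Three fine-tuned
# descendants consumed it in a 2:1:1 ratio (e.g., by adaptation compute).
print(inherit(500_000, {"A1": 2.0, "A2": 1.0, "A3": 1.0}))
# -> {'A1': 250000.0, 'A2': 125000.0, 'A3': 125000.0}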
The Governance Chain
Efficiency at Every Layer
Efficiency at Every Layer of the AI Stack
Every layer of the FM stack offers opportunities for efficiency gains:
• Model: quantization, architecture innovation
• Tools: dynamic batching
• Platform: multiplexing, dispatching
• Infrastructure: DVFS, power-parameter optimization, caching
• Systems: approximate computing and other system innovations
In addition:
• Empower the data scientist to make choices and explore trade-offs between accuracy, performance, and energy.
• Empower the data scientist to reason about life-cycle strategies: e.g., if/what/when to re-use, and how much to retrain.
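As one concrete tools-layer lever, here is a toy sketch of dynamic batching; the timings, names, and service model are illustrative assumptions.

# Toy sketch of dynamic batching: amortize per-batch overhead by grouping
# requests that arrive close together. Parameters are illustrative.

import time
from queue import Queue, Empty

def batcher(requests: Queue, max_batch: int = 8, max_wait_s: float = 0.01):
    """Collect up to max_batch requests, waiting at most max_wait_s for stragglers."""
    batch = [requests.get()]                 # block for the first request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break
    return batch                             # run one fused inference call per batch

q: Queue = Queue()
for i in range(5):
    q.put(f"req-{i}")
print(batcher(q))  # -> ['req-0', 'req-1', 'req-2', 'req-3', 'req-4']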
Systems Innovation
IBM Research's Artificial Intelligence Unit (AIU)
• Chip architecture optimized for enterprise AI workloads
• Enabled for foundation models
• Enabled in the Red Hat software stack
• Supports multi-precision inference (and training): FP16, FP8, INT8, INT4, INT2
• Implemented in leading-edge 5nm technology
• The SoC implements IBM's leadership innovations in low-precision AI arithmetic and algorithms
https://research.ibm.com/blog/ibm-artificial-intelligence-unit-aiu
Vision for AI Performance Scaling
• Applying approximate-computing techniques to AI compute.
• Critical requirement: maintain model accuracy.
• Advantage: quadratic improvement in performance (multiplier cost shrinks roughly quadratically as operand width halves).
• IBM Research has been at the forefront of every major technical advancement in bit-precision scaling:
  • 16-bit training (ICML 2015)
  • 8-bit training (NeurIPS 2018, 2019)
  • 4-bit training (X. Sun et al., NeurIPS 2020)
  • 4-bit inference (J. Choi et al., arXiv 2018, https://arxiv.org/pdf/1805.06085.pdf; J. McKinstry et al., https://arxiv.org/abs/1809.04191)
  • 2-bit inference (J. Choi et al., SysML 2019)
• Complemented by sparsity support, analog computing, and 3D stacking.
Digital AI cores: scaling precision for quadratic gains in performance with iso-accuracy.
[Figure: bit precision over 2012-2024, with training dropping 32 → 16 → 8 → 4 bits and inference 16 → 8 → 4 → 2 bits.]
https://research.ibm.com/blog/ibm-artificial-intelligence-unit-aiu
https://research.ibm.com/blog/ai-chip-precision-scaling
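To illustrate the kind of low-precision arithmetic this enables, here is a generic symmetric INT8 quantization sketch; it is a textbook scheme for illustration, not the AIU's actual arithmetic or data format.

import numpy as np

# Generic symmetric INT8 quantization sketch. Illustrates reduced-precision
# inference in general; NOT IBM's AIU arithmetic or data format.

def quantize(x: np.ndarray, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = quantize(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"max abs quantization error: {err:.4f}")  # small relative to the weight range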
Infrastructure Innovation
Vela: A Cloud-Native Supercomputer for the Foundation-Model Age (Kepler inside)
How do you evolve from a specialized (monolithic), costly, and inflexible HPC stack to a cloud-native stack, without compromising efficiency? Goals: programmability, scalability, re-use, observability, agility, democratization.
System specifications:
• Nodes with 8 x A100 GPUs (80GB), interconnected with NVLink and NVSwitch
• Cascade Lake CPUs, 1.5TB of DRAM, four 3.2TB NVMe drives
• Redundant connections between nodes, TORs, and spines
• 2 x 100G NICs per node; NCCL benchmarks show we drive close to line rate
Cloud-native benefits:
• Configure resources through software (APIs)
• Broad ecosystem of available cloud services
• Leverage data sets on Cloud Object Store
• Standard, flexible, scalable infrastructure design (vs. traditional HPC)
• Near bare-metal performance (within 5%, single node)
https://research.ibm.com/blog/AI-supercomputer-Vela-GPU-cluster
Platform Innovation
Dispatching of Jobs Based on Renewable Energy
Motivation:
• The carbon intensity of the energy mix in the different regions hosting IBM data centers varies over time.
• Renewable energy is not available at all times and in all places.
Workload optimization: placement and scheduling of workloads based on carbon-free energy availability.
Ideal dispatching: high CPU utilization when carbon intensity is low, and low CPU utilization when carbon intensity is high.
Challenge: ideal dispatching might be practically infeasible.
• Short jobs may have short deadlines.
• Some jobs are not interruptible.
• Jobs have heterogeneous resource demands.
• Obtaining the optimal packing is intractable.
T. Bahreini, A. Tantawi and A. Youssef, "An Approximation Algorithm for Minimizing the Cloud Carbon Footprint through Workload Scheduling," 2022 IEEE 15th International Conference on Cloud Computing (CLOUD), 2022, pp. 522-531.
Within a single data center: how to order batch jobs for minimum carbon while meeting deadlines. Polynomial approximation algorithms that work across data centers (space x time).
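A greedy toy sketch of carbon-aware ordering under deadlines follows; it illustrates the objective only and is not the approximation algorithm from Bahreini et al. 2022.

# Toy sketch of carbon-aware batch ordering: run jobs in low-carbon-intensity
# hours first, subject to deadlines. A greedy heuristic for illustration;
# NOT the approximation algorithm from Bahreini et al. 2022.

def schedule(jobs, ci_by_hour):
    """jobs: list of (name, duration_h, deadline_h). One job-hour per slot."""
    # Consider hours in order of increasing carbon intensity.
    hours = sorted(range(len(ci_by_hour)), key=lambda h: ci_by_hour[h])
    plan, used = {}, set()
    # Most urgent jobs pick their cheap hours first.
    for name, dur, deadline in sorted(jobs, key=lambda j: j[2]):
        slots = [h for h in hours if h < deadline and h not in used][:dur]
        if len(slots) < dur:
            raise ValueError(f"{name} cannot meet its deadline")
        used.update(slots)
        plan[name] = sorted(slots)
    return plan

ci = [0.5, 0.45, 0.2, 0.15, 0.3, 0.6]          # kg CO2e/kWh per hour (illustrative)
jobs = [("train-A", 2, 4), ("train-B", 2, 6)]  # (name, hours needed, deadline hour)
print(schedule(jobs, ci))
# -> {'train-A': [2, 3], 'train-B': [1, 4]}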
Models & Tools Innovation
Call to Action
• AI platform providers: build in transparency and governance; incorporate platform and system innovation for efficiency.
• Academia and industry: focus your research on efficiency, not just accuracy.
• Data scientists / practitioners: develop a sustainability mindset. Re-use where it makes sense; domain-specific, smaller models are better! Explore trade-offs (accuracy vs. cost).
IBM Research
Labs: Tokyo, Shin-Kawasaki, Delhi, Bangalore, Singapore, Nairobi, Haifa, Zurich, Warrington, Dublin, Cambridge, Albany, Yorktown, Almaden, Rio de Janeiro, Sao Paulo, Johannesburg.
6 Nobel Laureates, 10 Medals of Technology, 5 National Medals of Science, 6 Turing Awards.
Questions?
Editor's Notes

• #10: Energy efficiency is needed throughout the memory hierarchy. Since different tasks require a different system composition for best utilization, data centers need to be rearchitected using disaggregation and composability, allowing flexible composition and system configuration to optimally serve a particular task. Considering that the various components (CPUs, GPUs, memory, and storage) have different life cycles, disaggregation additionally improves system performance and reduces cost, as components can be replaced separately. Common memory systems for AI/ML applications include on-chip memory, high-bandwidth memory (HBM), and GDDR, all with different architectural implications. A universal goal is to realize memory technology with much higher bandwidth and lower latency while consuming less energy. While HBM DRAMs are already very power-efficient, roughly two thirds of the power budget is still spent moving data between an SoC and the DRAM, so reducing the volume of data moved provides an opportunity for large improvement; this requires further research. Different concepts for disaggregation of memory and storage have been proposed, but more research is needed to identify the best way to use disaggregation to achieve TCO benefits at scale and improve latency. To generate these benefits, a multi-tiered memory approach that includes storage-class memories is needed. The new architectures can pose a challenge but can also provide an opportunity for application development; the impact on legacy code needs to be understood and mitigated.

• #33: Foundation models have led to an unprecedented level of homogenization: almost all state-of-the-art NLP models are now adapted from one of a few foundation models, such as BERT, RoBERTa, BART, T5, etc.

• #34: Training GPT-3, a single general-purpose AI program that can generate language and has many different uses, took 1.287 gigawatt-hours according to a research paper published in 2021, or about as much electricity as 120 US homes consume in a year. That training generated 502 tons of carbon emissions, according to the same paper, or about as much as 110 US cars emit in a year. That is for just one program, or "model." While training a model has a huge upfront power cost, researchers found that in some cases it is only about 40% of the power burned by the actual use of the model, with billions of requests pouring in for popular programs. Plus, the models are getting bigger: OpenAI's GPT-3 uses 175 billion parameters, or variables, that the AI system has learned through its training and retraining; its predecessor used just 1.5 billion. Another relative measure comes from Google, where researchers found that artificial intelligence made up 10 to 15% of the company's total electricity consumption, which was 18.3 terawatt-hours in 2021. That would mean Google's AI burns around 2.3 terawatt-hours annually, about as much electricity each year as all the homes in a city the size of Atlanta. https://www.bloomberg.com/news/articles/2023-03-09/how-much-energy-do-ai-and-chatgpt-use-no-one-knows-for-sure

• #50: Packaged as a PCIe card, for ease of integration into virtually any on-premises or cloud system. Integration into the IBM Watson software stack is underway, to power the AI inference infrastructure of IBM Research's Foundation Model big bet. Enabled in the Red Hat software stack, including PyTorch and TensorFlow integration.

• #51: We can drop from 32-bit floating-point arithmetic to bit formats holding a quarter as much information. This simplified format dramatically cuts the amount of number crunching needed to train and run an AI model, without sacrificing accuracy. We leverage key IBM breakthroughs from the last five years to find the best trade-off between speed and accuracy. This is not a chip we designed entirely from scratch; rather, it is the scaled version of an already proven AI accelerator built into our Telum chip, which powers the IBM z16 system.

• #53: So we asked ourselves: how do we deliver bare-metal performance inside of a VM? (Bare metal vs. VMs; Ethernet vs. InfiniBand; OpenShift scheduling (MCAD) vs. LSF.) Following a significant amount of research and discovery, we devised a way to expose all of the capabilities on the node (GPUs, CPUs, networking, and storage) into the VM so that the virtualization overhead is less than 5%, which is the lowest overhead in the industry that we are aware of. This work includes configuring the bare-metal host for virtualization with support for Virtual Machine Extensions (VMX), single-root I/O virtualization (SR-IOV), and huge pages, and enabling SR-IOV for the network interface cards on each node, thereby exposing each 100G link directly into the VMs via virtual functions. We also needed to faithfully represent all devices and their connectivity inside the VM, such as which network cards are connected to which CPUs and GPUs, how GPUs are connected to the CPU sockets, and how GPUs are connected to each other. These, along with other hardware and software configurations, enabled our system to achieve close to bare-metal performance. We can hide communication time over the network behind compute time occurring on the GPUs, an approach aided by our choice of GPUs with 80GB of memory, which allows us to use bigger batch sizes (compared to the 40GB model) and leverage the Fully Sharded Data Parallel (FSDP) training strategy more efficiently. Next we will roll out an implementation of remote direct memory access (RDMA) over converged Ethernet (RoCE) at scale and GPU Direct RDMA (GDR), to deliver the performance benefits of RDMA and GDR while minimizing adverse impact to other traffic; our lab measurements indicate this will cut latency in half.