Advanced High-Performance
Computing Features of the
OpenPOWER ISA
Workshop on RISC-V and OpenPOWER
International Conference on Supercomputing
June 20, 2020
José Moreira – IBM Research
Many thanks to: Jeff Stuecheli
Edmund Gieske
Brian Thompto
© 2020 IBM Corporation
Open architectures for supercomputing
• Since the dawn of supercomputing with the CDC 6600, through the many
generations of Seymour Cray machines, massively parallel processors like Blue
Gene and the current champion Fugaku, high performance computing has been
about three things:
▪ Number crunching: perform as many arithmetic operations as possible
▪ Memory bandwidth: read/write as much data from/to memory as possible
▪ Interconnect: communicate between elements as fast as possible
• Through the OpenPOWER initiative, IBM has made a variety of computing
technologies openly available to the community, including the three
HPC-enabling technologies listed above
• In this talk, we will focus on recent OpenPOWER developments in the areas of
number crunching and memory bandwidth
• But don’t forget the interconnect if you are building a supercomputer ☺
POWER family memory architecture
• Scale Out: Direct Attach Memory
▪ Low latency access
▪ Commodity packaging form factor
• Scale Up: Buffered Memory
▪ Superior RAS, high bandwidth, high capacity
▪ Agnostic interface for alternate memory innovations
• OpenCAPI Agnostic Buffered Memory (OMI): same Open Memory Interface used for all systems and memory technologies
▪ Near Tier: extreme bandwidth, lower capacity
▪ Commodity: low latency, low cost
▪ Enterprise: RAS, capacity, bandwidth
▪ Storage Class: extreme capacity, persistence
Primary tier memory options
[Figure: memory attach options. Signal labels: 72b ~2666 MHz bidirectional for the DDR4 RDIMM and LRDIMM channels; 8b ~25G differential links through a buffer (BUF) for the OMI DIMMs; 1024b on-module Si interposer for HBM. System labels: Same System, Same System, Unique System.]

Option             Capacity          Bandwidth
DDR4 RDIMM         ~256 GB           ~150 GB/sec
DDR4 LRDIMM        ~2 TB             ~150 GB/sec
DDR4 OMI DIMM      ~256 GB → 4 TB    ~320 GB/sec
BW Opt OMI DIMM    ~128 → 512 GB     ~650 GB/sec
On Module HBM      ~16 → 32 GB       ~1 TB/sec

• OMI strategy: only 5-10 ns higher load-to-use latency than RDIMM (< 5 ns for LRDIMM)
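The signal labels on this slide allow a rough raw-rate estimate per channel. The sketch below is back-of-the-envelope arithmetic, not a figure from the talk. It assumes that 64 of the 72 DDR4 bits carry data (the rest being ECC), that "~2666 MHz" means 2666 mega-transfers per second, and that "~25G" means 25 Gbit/s per lane. It ignores encoding and protocol overhead, and the function names are hypothetical.

```python
def ddr4_channel_gbytes_per_s(data_bits=64, mega_transfers_per_s=2666):
    """Raw peak of one DDR4 channel in GB/s (assumes 64 data bits out of 72)."""
    bytes_per_transfer = data_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000


def omi_link_gbytes_per_s(lanes=8, gbit_per_lane=25):
    """Raw peak of one OMI link in GB/s, per direction (assumes 25 Gbit/s lanes)."""
    return lanes * gbit_per_lane / 8


ddr4_raw = ddr4_channel_gbytes_per_s()
omi_raw = omi_link_gbytes_per_s()
print(f"DDR4 channel: {ddr4_raw:.1f} GB/s raw, OMI link: {omi_raw:.1f} GB/s raw per direction")
```

Under these assumptions an 8-lane serial link offers a raw per-direction rate comparable to a 72-bit parallel DDR4 channel. Sustained system bandwidth, as quoted on the slide, depends on channel count and overheads not modelled here.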
POWER connectivity variants
• Direct Attach Memory
▪ Max capacity: 2 TB
▪ Max bandwidth: 150 GB/s
▪ Limited system interconnect
▪ Figure labels: x24 system attach (twice), 4 x DDR4 memory (twice), 2 x local SMP, interconnect
• OMI Buffered Memory
▪ Max capacity: 4 TB
▪ Max bandwidth: 650 GB/s
▪ Enhanced system interconnect
▪ Figure labels: x48 system attach (twice), 8 x OMI buffered memory (twice), 3 x local SMP, advanced interconnect
• OMI signaling delivers 6x the bandwidth per mm² of DDR signaling
DRAM DIMM comparison
• Technology agnostic
• Low cost
• Ultra-scale system density
• Enterprise reliability
• Low-latency
• High bandwidth
[Figure: approximate-scale size comparison of a JEDEC DDR DIMM, an IBM Centaur DIMM, and an OMI DDIMM]
Matrix-Multiply Assist (MMA) instructions
• The latest version of Power ISA (for POWER10) is now publicly available
• The Matrix-Multiply Assist instructions lead to very efficient implementations for
high performance computing
• These instructions are a natural match for implementing dense numerical linear
algebra computations
• We have also shown application to other computations such as convolution
• Various other computations require additional work and research, including
arbitrary precision arithmetic, discrete Fourier transform, …
Power ISA Vector-Scalar Registers (VSRs)
Accumulators
• Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the
64-bit extension later)
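$$
A =
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{bmatrix}
$$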
• The elements can be either 32-bit signed integers (int32) or 32-bit
single-precision floating-point numbers (fp32)
• Each accumulator is a software-managed shadow of a set of 4 consecutive
VSRs (8 architected accumulators, ACC[0:7]); software must choose between
using an accumulator and using its associated VSRs
• State must be explicitly transferred between accumulators and VSRs using
VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator
(xxmtacc)
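The shadowing relationship can be illustrated with a small software model. This is a minimal sketch, not the architected behaviour. It assumes that ACC[i] shadows VSRs 4i through 4i+3, that each 128-bit VSR holds one row of four 32-bit elements, and that a plain copy is enough to model the moves. The class `MmaState` and helper `vsrs_for_acc` are hypothetical names, and the model ignores the rules about which copy is valid at a given time.

```python
VSR_BITS = 128
ACC_ROWS, ACC_COLS, ELEM_BITS = 4, 4, 32
# A 4 x 4 array of 32-bit elements is 512 bits, i.e. 4 VSRs of 128 bits.
VSRS_PER_ACC = ACC_ROWS * ACC_COLS * ELEM_BITS // VSR_BITS
NUM_ACCS = 8
NUM_VSRS = 64


def vsrs_for_acc(i):
    """VSR numbers shadowed by ACC[i], assuming ACC[i] maps to VSR[4i : 4i+3]."""
    if not 0 <= i < NUM_ACCS:
        raise ValueError("accumulator index must be in 0..7")
    return list(range(VSRS_PER_ACC * i, VSRS_PER_ACC * (i + 1)))


class MmaState:
    """Toy register state: 64 VSRs of four fp32 lanes and 8 accumulators."""

    def __init__(self):
        self.vsr = [[0.0] * 4 for _ in range(NUM_VSRS)]
        self.acc = [[[0.0] * 4 for _ in range(4)] for _ in range(NUM_ACCS)]

    def xxmtacc(self, i):
        """Model of Move To Accumulator: copy the 4 shadowed VSRs into ACC[i]."""
        self.acc[i] = [self.vsr[v][:] for v in vsrs_for_acc(i)]

    def xxmfacc(self, i):
        """Model of Move From Accumulator: copy ACC[i] back into its 4 VSRs."""
        for row, v in zip(self.acc[i], vsrs_for_acc(i)):
            self.vsr[v] = row[:]
```

In this model a kernel would load or clear an accumulator, apply its updates, and only then call `xxmfacc` before storing the four VSRs to memory.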
Outer-product (xv<type>ger<rank-𝑘>) instructions
• Accumulators are updated by rank-𝑘 update instructions:
▪ Input: 1 accumulator (𝐴) + 2 VSRs (𝑋, 𝑌)
▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
• Operation: 𝐴 ← ±𝐴 ± 𝑋𝑌ᵀ
▪ For 32-bit data, 𝑋 and 𝑌 are 4 × 1 arrays of elements
▪ For 16-bit data, 𝑋 and 𝑌 are 4 × 2 arrays of elements
▪ For 8-bit data, 𝑋 and 𝑌 are 4 × 4 arrays of elements
▪ For 4-bit data, 𝑋 and 𝑌 are 4 × 8 arrays of elements
▪ This way 𝑋𝑌ᵀ always has a 4 × 4 shape, compatible with the accumulator
Instruction    𝑨                𝑿                𝒀ᵀ               # of madds/instruction
xvf32ger       4 × 4 (fp32)     4 × 1 (fp32)     1 × 4 (fp32)     16
xvf16ger2      4 × 4 (fp32)     4 × 2 (fp16)     2 × 4 (fp16)     32
xvi16ger2      4 × 4 (int32)    4 × 2 (int16)    2 × 4 (int16)    32
xvi8ger4       4 × 4 (int32)    4 × 4 (int8)     4 × 4 (int8)     64
xvi4ger8       4 × 4 (int32)    4 × 8 (int4)     8 × 4 (int4)     128
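The arithmetic behind the table can be checked with a plain reference model of the rank-𝑘 update. This is a sketch of the mathematics only: the function name `ger_update` and its sign parameters are hypothetical, and it does not model element-type conversion, rounding, saturation, or any other instruction detail.

```python
def ger_update(acc, x, y, acc_sign=1, prod_sign=1):
    """Apply A <- acc_sign * A + prod_sign * X @ Y^T in place.

    acc is a 4 x 4 list of lists; x and y are 4 x k lists of lists.
    Returns the number of multiply-adds performed (16 * k).
    """
    k = len(x[0])
    assert len(x) == 4 and len(y) == 4
    assert all(len(row) == k for row in x + y)
    madds = 0
    for i in range(4):
        for j in range(4):
            dot = 0
            for p in range(k):
                # Element (i, j) of X @ Y^T is the dot product of row i of X and row j of Y.
                dot += x[i][p] * y[j][p]
                madds += 1
            acc[i][j] = acc_sign * acc[i][j] + prod_sign * dot
    return madds


# Rank-2 example (the shape used for 16-bit inputs): 4 x 2 operands.
acc = [[0] * 4 for _ in range(4)]
x = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [[1, 0], [0, 1], [1, 1], [2, 0]]
count = ger_update(acc, x, y)
```

With 𝑘 = 1, 2, 4 and 8 the returned count is 16, 32, 64 and 128, matching the last column of the table.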
Extension to 64-bit
• Accumulators are 4 × 2 arrays of 64-bit floating-point elements
𝐴 =
𝑎00 𝑎01
𝑎10 𝑎11
𝑎20 𝑎21
𝑎30 𝑎31
• Accumulators are updated by outer-product instructions:
▪ Input: 1 accumulator (𝐴) + 3 VSRs (𝑋1, 𝑋2, 𝑌)
▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
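• Operation:
$$
A \leftarrow A + \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^{T}
$$
▪ $X_1$, $X_2$ and $Y$ are 2 × 1 arrays of elements
▪ $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^{T}$ always has a 4 × 2 shape, compatible with the accumulator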
Instruction    𝑨                𝑿                𝒀ᵀ               # of madds/instruction
xvf64ger       4 × 2 (fp64)     4 × 1 (fp64)     1 × 2 (fp64)     8
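The 64-bit case follows the same pattern with a 4 × 2 accumulator. The sketch below is a reference model of the mathematics only, with a hypothetical function name. It stacks the two 2-element inputs into a 4-element column and multiplies by the 2-element row.

```python
def f64ger_update(acc, x1, x2, y):
    """Apply A <- A + [X1; X2] @ Y^T in place for a 4 x 2 accumulator.

    x1, x2 and y are lists of 2 floats. Returns the multiply-add count.
    """
    x = list(x1) + list(x2)  # stack X1 over X2 to form a 4 x 1 column
    madds = 0
    for i in range(4):
        for j in range(2):
            acc[i][j] += x[i] * y[j]
            madds += 1
    return madds


acc64 = [[0.0, 0.0] for _ in range(4)]
madds64 = f64ger_update(acc64, [1.0, 2.0], [3.0, 4.0], [10.0, 20.0])
```

Each call performs 8 multiply-adds, as listed in the table.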
Load pressure and unit latency (32-bit results)
• 8 accumulators (8 × 16 result)
▪ 6 VSR loads / 8 xv_ger instructions
▪ 0.75 VSR load/xv_ger
▪ Tolerates the most latency
• 4 accumulators (8 × 8 result)
▪ 4 VSR loads / 4 xv_ger instructions
▪ 1 VSR load/xv_ger
▪ Could work well in SMT modes
• 2 accumulators (8 × 4 result)
▪ 3 VSR loads / 2 xv_ger instructions
▪ 1.5 VSR load/xv_ger
• 1 accumulator (4 × 4 result)
▪ 2 VSR loads / 1 xv_ger instruction
▪ 2 VSR load/xv_ger
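[Figure: accumulator tilings. 8 accumulators: ACC[0] to ACC[7] in a 2 × 4 grid fed by X[0], X[1] and Y[0] to Y[3]. 4 accumulators: ACC[0] to ACC[3] in a 2 × 2 grid fed by X[0], X[1] and Y[0], Y[1]. 2 accumulators: ACC[0] and ACC[1] fed by X[0], X[1] and Y[0].]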
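The load-per-instruction ratios on this slide follow from the shape of the accumulator grid. The sketch below assumes a grid of `acc_rows` by `acc_cols` accumulators, where each step loads one X VSR per grid row and one Y VSR per grid column, then issues one outer-product instruction per accumulator. The function name and the grid parameterisation are illustrative, not taken from the talk.

```python
def load_pressure(acc_rows, acc_cols):
    """Return (VSR loads, ger instructions, loads per ger) for one step."""
    vsr_loads = acc_rows + acc_cols          # one X per grid row, one Y per grid column
    ger_instructions = acc_rows * acc_cols   # one outer product per accumulator
    return vsr_loads, ger_instructions, vsr_loads / ger_instructions


# The four tilings from the slide, as (grid rows, grid columns).
TILINGS = [(2, 4), (2, 2), (2, 1), (1, 1)]

for rows, cols in TILINGS:
    loads, gers, ratio = load_pressure(rows, cols)
    print(f"{rows * cols} accumulators ({4 * rows} x {4 * cols} result): "
          f"{loads} loads / {gers} gers = {ratio:.2f} loads per ger")
```

Larger grids reuse each loaded VSR across more accumulators, which is why the 8-accumulator tiling has the lowest load pressure.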
The micro-kernel of GEMM: $C_{m \times n} \mathrel{+}= A_{m \times k} \times B_{k \times n}$
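One way to see why outer-product instructions fit this micro-kernel is that the product can be written as a sum of $k$ rank-1 updates, one per column of $A$ and matching row of $B$. This is the standard identity; the symbols $a_p$ (column $p$ of $A$) and $b_p^{T}$ (row $p$ of $B$) are introduced here for illustration and do not appear on the slide.

```latex
C_{m \times n} \leftarrow C_{m \times n} + \sum_{p=0}^{k-1} a_p \, b_p^{T},
\qquad a_p \in \mathbb{R}^{m \times 1}, \quad b_p^{T} \in \mathbb{R}^{1 \times n}
```

Each term $a_p b_p^{T}$ is an $m \times n$ outer product, so a tile of $C$ can stay resident in accumulators while the loop over $p$ streams through $A$ and $B$.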
SGEMM micro-kernel using 𝟖 × 𝟖 virtual accumulator
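The slide's code listing did not survive extraction. The following is a reference sketch of the idea, not the original kernel. It assumes the 8 × 8 virtual accumulator is built from a 2 × 2 grid of 4 × 4 accumulators, and that each step of the loop over the inner dimension loads two 4-element pieces of a column of A and two 4-element pieces of a row of B, then applies four rank-1 updates. All function names and the panel layout are hypothetical.

```python
def outer4(acc, x, y):
    """Rank-1 update of one 4 x 4 accumulator: acc += x @ y^T."""
    for i in range(4):
        for j in range(4):
            acc[i][j] += x[i] * y[j]


def sgemm_8x8_microkernel(a_panel, b_panel):
    """Return the 8 x 8 product of an 8 x k panel of A and a k x 8 panel of B.

    a_panel[p] holds column p of the A panel (8 values);
    b_panel[p] holds row p of the B panel (8 values).
    """
    # 2 x 2 grid of 4 x 4 accumulators, all starting at zero.
    accs = [[[[0.0] * 4 for _ in range(4)] for _ in range(2)] for _ in range(2)]
    for a_col, b_row in zip(a_panel, b_panel):
        x = [a_col[0:4], a_col[4:8]]  # two VSR-sized pieces of the A column
        y = [b_row[0:4], b_row[4:8]]  # two VSR-sized pieces of the B row
        for r in range(2):
            for c in range(2):
                outer4(accs[r][c], x[r], y[c])  # four rank-1 updates per step
    # Reassemble the four accumulators into one 8 x 8 tile.
    return [[accs[i // 4][j // 4][i % 4][j % 4] for j in range(8)]
            for i in range(8)]
```

The caller would add the returned tile to the corresponding 8 × 8 block of C. Each loop step uses 4 loads for 4 outer products, the 1 load per instruction ratio of the 4-accumulator tiling.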
Conclusions
• OpenPOWER delivers the three essentials of a supercomputer:
▪ Number crunching: Through its SIMD and MMA instructions
▪ Memory bandwidth: Through OMI
▪ Interconnect: Through OpenCAPI (not discussed in this talk)
• The MMA instructions provide a new level of performance for dense linear
algebra and related computations
• OMI provides a new level of memory bandwidth for computing systems while
delivering low cost, versatility and capacity
• Both are scalable, offering room to grow in both dimensions
• Together with Open Source software, we see a clear road ahead for OpenHPC
systems that combine the best innovation from the community!
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Advanced High-Performance Computing Features of the OpenPOWER ISA

  • 1. Advanced High-Performance Computing Features of the OpenPOWER ISA. Workshop on RISC-V and OpenPOWER, International Conference on Supercomputing, June 20, 2020. José Moreira, IBM Research. Many thanks to: Jeff Stuecheli, Edmund Gieske, Brian Thompto.
  • 2. © 2020 IBM Corporation. Open architectures for supercomputing
      • Since the dawn of supercomputing with the CDC 6600, through the many generations of Seymour Cray machines, massively parallel processors like Blue Gene, and the current champion Fugaku, high-performance computing has been about three things:
          ▪ Number crunching: perform as many arithmetic operations as possible
          ▪ Memory bandwidth: read/write as much data from/to memory as possible
          ▪ Interconnect: communicate between elements as fast as possible
      • Through the OpenPOWER initiative, IBM has made a variety of computing technologies openly available to the community, including the three HPC-enabling technologies listed above
      • In this talk, we will focus on recent OpenPOWER developments in the areas of number crunching and memory bandwidth
      • But don’t forget the interconnect if you are building a supercomputer ☺
  • 3. POWER family memory architecture
      • Scale Out, Direct Attach Memory: low latency access; commodity packaging form factor
      • Scale Up, Buffered Memory: superior RAS, high bandwidth, high capacity; agnostic interface for alternate memory innovations
      • OpenCAPI Agnostic Buffered Memory (OMI): same Open Memory Interface used for all systems and memory technologies
          ▪ Near Tier: extreme bandwidth, lower capacity
          ▪ Commodity: low latency, low cost
          ▪ Enterprise: RAS, capacity, bandwidth
          ▪ Storage Class: extreme capacity, persistence
  • 4. Primary tier memory options
      | Memory option | Capacity | Bandwidth |
      |---|---|---|
      | DDR4 RDIMM | ~256 GB | ~150 GB/sec |
      | DDR4 LRDIMM | ~2 TB | ~150 GB/sec |
      | DDR4 OMI DIMM | ~256 GB → 4 TB | ~320 GB/sec |
      | BW Opt OMI DIMM | ~128 → 512 GB | ~650 GB/sec |
      | On Module HBM | ~16 → 32 GB | ~1 TB/sec |
      • Diagram labels: 72b ~2666 MHz bidirectional links (DDR4 DIMMs); 8b ~25G differential links to a buffer (BUF) (OMI DIMMs); 1024b on-module Si interposer (HBM)
      • Diagram grouping labels: Same System, Same System, Unique System
      • OMI strategy: only 5-10 ns higher load-to-use latency than RDIMM (< 5 ns for LRDIMM)
  • 5. POWER connectivity variants
      • Direct Attach Memory: max capacity 2 TB; max bandwidth 150 GB/s; limited system interconnect
          ▪ Diagram: 2 × (x24 system attach), 2 × (4 x DDR4 memory), 2 x local SMP interconnect
      • OMI Buffered Memory: max capacity 4 TB; max bandwidth 650 GB/s; enhanced system interconnect
          ▪ Diagram: 2 × (x48 system attach), 2 × (8 x OMI buffered memory), 3 x local SMP, advanced interconnect
          ▪ 6x the bandwidth per mm² of DDR signaling
  • 6. DRAM DIMM comparison
      • Technology agnostic
      • Low cost
      • Ultra-scale system density
      • Enterprise reliability
      • Low latency
      • High bandwidth
      • Figure (approximate scale): JEDEC DDR DIMM, IBM Centaur DIMM, OMI DDIMM
  • 7. Matrix-Multiply Assist (MMA) instructions
      • The latest version of the Power ISA (for POWER10) is now publicly available
      • The Matrix-Multiply Assist instructions lead to very efficient implementations for high-performance computing
      • These instructions are a natural match for implementing dense numerical linear algebra computations
      • We have also shown application to other computations such as convolution
      • Various other computations require additional work and research, including arbitrary-precision arithmetic, discrete Fourier transform, …
  • 8. Power ISA Vector-Scalar Registers (VSRs)
  • 9. Accumulators
      • Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the 64-bit extension later):
        $A = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{bmatrix}$
      • The elements can be either 32-bit signed integers (int32) or 32-bit single-precision floating-point numbers (fp32)
      • Each accumulator is a software-managed shadow of a set of 4 consecutive VSRs (8 architected accumulators, ACC[0:7]); software must choose between using the accumulator or the associated VSRs
      • State must be explicitly transferred between accumulators and VSRs using VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator (xxmtacc)
  • 10. Outer-product (xv<type>ger<rank-k>) instructions
      • Accumulators are updated by rank-k update instructions:
          ▪ Input: 1 accumulator ($A$) + 2 VSRs ($X$, $Y$)
          ▪ Output: 1 accumulator (same as input, to reduce instruction encoding space)
          ▪ Operation: $A \leftarrow \pm A \pm XY^T$
      • For 32-bit data, $X$ and $Y$ are 4 × 1 arrays of elements
      • For 16-bit data, $X$ and $Y$ are 4 × 2 arrays of elements
      • For 8-bit data, $X$ and $Y$ are 4 × 4 arrays of elements
      • For 4-bit data, $X$ and $Y$ are 4 × 8 arrays of elements
      • This way $XY^T$ always has a 4 × 4 shape, compatible with the accumulator
      | Instruction | $A$ | $X$ | $Y^T$ | # of madds/instruction |
      |---|---|---|---|---|
      | xvf32ger | 4 × 4 (fp32) | 4 × 1 (fp32) | 1 × 4 (fp32) | 16 |
      | xvf16ger2 | 4 × 4 (fp32) | 4 × 2 (fp16) | 2 × 4 (fp16) | 32 |
      | xvi16ger2 | 4 × 4 (int32) | 4 × 2 (int16) | 2 × 4 (int16) | 32 |
      | xvi8ger4 | 4 × 4 (int32) | 4 × 4 (int8) | 4 × 4 (int8) | 64 |
      | xvi4ger8 | 4 × 4 (int32) | 4 × 8 (int4) | 8 × 4 (int4) | 128 |
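The rank-k update described on slide 10 can be modeled in a few lines of plain Python. This is only a sketch of the arithmetic, not of the instructions themselves: the names `ger_update` and `madds_per_instruction` are hypothetical, only the accumulating form A ← A + XYᵀ is shown, and element types, rounding and saturation are ignored. X and Y are passed as 4 × k nested lists.

```python
def ger_update(acc, x, y):
    """Model A <- A + X * Y^T for a 4x4 accumulator.

    acc is a 4x4 nested list; x and y are 4 x k nested lists, where k is
    the rank of the update (1 for 32-bit data, 2 for 16-bit, and so on).
    Hypothetical helper for illustration only.
    """
    k = len(x[0])
    for i in range(4):
        for j in range(4):
            # Each accumulator element receives k multiply-adds.
            acc[i][j] += sum(x[i][p] * y[j][p] for p in range(k))
    return acc


def madds_per_instruction(k):
    """Multiply-adds in one rank-k update of a 4x4 accumulator."""
    return 4 * 4 * k


# Rank-2 example: every row of X is (1, 2), every row of Y is (3, 4).
acc = [[0] * 4 for _ in range(4)]
ger_update(acc, [[1, 2]] * 4, [[3, 4]] * 4)
```

With k = 1, 2, 4 and 8, `madds_per_instruction` reproduces the 16, 32, 64 and 128 figures in the table above.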
  • 11. Extension to 64-bit
      • Accumulators are 4 × 2 arrays of 64-bit floating-point elements:
        $A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \\ a_{20} & a_{21} \\ a_{30} & a_{31} \end{bmatrix}$
      • Accumulators are updated by outer-product instructions:
          ▪ Input: 1 accumulator ($A$) + 3 VSRs ($X_1$, $X_2$, $Y$)
          ▪ Output: 1 accumulator (same as input, to reduce instruction encoding space)
      • Operation: $A \leftarrow A + \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^T$
          ▪ $X_1$, $X_2$ and $Y$ are 2 × 1 arrays of elements
          ▪ $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^T$ always has a 4 × 2 shape, compatible with the accumulator
      | Instruction | $A$ | $X$ | $Y^T$ | # of madds/instruction |
      |---|---|---|---|---|
      | xvf64ger | 4 × 2 (fp64) | 4 × 1 (fp64) | 1 × 2 (fp64) | 8 |
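The 64-bit case on slide 11 differs only in shape: two VSRs are stacked to form the 4-element X, and Y supplies 2 elements. A minimal sketch of that arithmetic follows, with the hypothetical name `ger64_update` and plain Python floats standing in for fp64 values.

```python
def ger64_update(acc, x1, x2, y):
    """Model A <- A + [X1; X2] * Y^T for a 4x2 accumulator.

    acc is a 4x2 nested list; x1, x2 and y are 2-element lists, one per
    VSR. Hypothetical helper for illustration only.
    """
    x = list(x1) + list(x2)  # stack X1 on top of X2 to get a 4x1 column
    for i in range(4):
        for j in range(2):
            acc[i][j] += x[i] * y[j]  # 4 * 2 = 8 multiply-adds in total
    return acc


acc64 = [[0.0, 0.0] for _ in range(4)]
ger64_update(acc64, [1.0, 2.0], [3.0, 4.0], [10.0, 100.0])
```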
  • 12. Load pressure and unit latency (32-bit results)
      • 8 accumulators (8 × 16 result)
          ▪ 6 VSR loads / 8 xv_ger instructions
          ▪ 0.75 VSR load/xv_ger
          ▪ Tolerates the most latency
          ▪ Diagram: ACC[0:7] fed by X[0:1] and Y[0:3]
      • 4 accumulators (8 × 8 result)
          ▪ 4 VSR loads / 4 xv_ger instructions
          ▪ 1 VSR load/xv_ger
          ▪ Could work well in SMT modes
          ▪ Diagram: ACC[0:3] fed by X[0:1] and Y[0:1]
      • 2 accumulators (8 × 4 result)
          ▪ 3 VSR loads / 2 xv_ger instructions
          ▪ 1.5 VSR loads/xv_ger
          ▪ Diagram: ACC[0:1] fed by X[0:1] and Y[0]
      • 1 accumulator (4 × 4 result)
          ▪ 2 VSR loads / 1 xv_ger instruction
          ▪ 2 VSR loads/xv_ger
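The load-to-instruction ratios on slide 12 follow from simple counting: a virtual accumulator built from r × c accumulators needs r VSR loads for X and c for Y per step, and issues r·c outer-product instructions. The sketch below reproduces the slide's numbers under that assumption; the function name `load_pressure` is hypothetical.

```python
from fractions import Fraction


def load_pressure(rows, cols):
    """Return (result shape, VSR loads, ger instructions, loads per ger)
    for a virtual accumulator made of rows x cols 4x4 accumulators.
    Hypothetical helper for illustration only.
    """
    loads = rows + cols   # one VSR per block of X plus one per block of Y
    gers = rows * cols    # one outer-product instruction per accumulator
    return (4 * rows, 4 * cols), loads, gers, Fraction(loads, gers)


# The four configurations on the slide: 8, 4, 2 and 1 accumulators.
configs = [load_pressure(r, c) for r, c in [(2, 4), (2, 2), (2, 1), (1, 1)]]
```

Larger virtual accumulators amortize each loaded VSR over more instructions, which is why the 8-accumulator layout has the lowest load pressure.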
  • 13. The micro-kernel of GEMM: $C_{m \times n} \mathrel{+}= A_{m \times k} \times B_{k \times n}$
  • 14. SGEMM micro-kernel using $8 \times 8$ virtual accumulator
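The transcript keeps only the titles of slides 13 and 14, so the following is a reconstruction of the idea rather than the code shown in the talk. It sketches, in plain Python, how an 8 × 8 virtual accumulator made of four 4 × 4 accumulators could compute C += A × B one rank-1 step at a time. The function name `sgemm_micro_kernel_8x8`, the mapping of accumulators to (X, Y) pairs, and the use of Python floats in place of fp32 are all assumptions.

```python
def sgemm_micro_kernel_8x8(a, b, c):
    """c (8x8) += a (8xk) * b (kx8) using four modeled 4x4 accumulators.

    Assumed layout: accumulator 2*yi + xi holds the block built from
    X[xi] and Y[yi]. Hypothetical helper for illustration only.
    """
    k = len(b)
    acc = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
    for p in range(k):
        # 4 "VSR loads" per step: two halves of a column of A, two of a row of B.
        x = [[a[i][p] for i in range(0, 4)], [a[i][p] for i in range(4, 8)]]
        y = [b[p][0:4], b[p][4:8]]
        # 4 rank-1 updates per step, one per accumulator.
        for yi in range(2):
            for xi in range(2):
                t = acc[2 * yi + xi]
                for i in range(4):
                    for j in range(4):
                        t[i][j] += x[xi][i] * y[yi][j]
    # Move the accumulators out and add them into the matching blocks of c.
    for yi in range(2):
        for xi in range(2):
            t = acc[2 * yi + xi]
            for i in range(4):
                for j in range(4):
                    c[4 * xi + i][4 * yi + j] += t[i][j]
    return c
```

Each iteration over `p` performs 4 loads and 4 updates, matching the 1 VSR load per xv_ger figure quoted for the 4-accumulator configuration.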
  • 15. Conclusions
      • OpenPOWER delivers the three essentials of a supercomputer:
          ▪ Number crunching: through its SIMD and MMA instructions
          ▪ Memory bandwidth: through OMI
          ▪ Interconnect: through OpenCAPI (not discussed in this talk)
      • The MMA instructions provide a new level of performance for dense linear algebra and related computations
      • OMI provides a new level of memory bandwidth for computing systems while delivering low cost, versatility and capacity
      • Both are scalable, offering room to grow in both dimensions
      • Together with open-source software, we see a clear road ahead for OpenHPC systems that combine the best innovation from the community!