OpenPOWER Innovations for
High-Performance Computing
OpenPOWER Workshop
CINECA, Italy
July 9, 2020
José Moreira – IBM Research
Many thanks to: Jeff Stuecheli
Edmund Gieske
Brian Thompto
© 2020 IBM Corporation
Open architectures for supercomputing
• Since the dawn of supercomputing with the CDC 6600, through the many
generations of Seymour Cray machines, massively parallel processors like Blue
Gene and the current champion Fugaku, high performance computing has been
about three things:
▪ Number crunching: perform as many arithmetic operations as possible
▪ Memory bandwidth: read/write as much data from/to memory as possible
▪ Interconnect: communicate between elements as fast as possible
• IBM servers are now being designed and built around a variety of computing
technologies that IBM has made openly available to the community, including the
three supercomputing technologies listed above
• In this talk, we will focus on recent OpenPOWER developments in those areas
Power ISA – foundation of ecosystem
Instruction heritage    Note        # instr.   Cum. instr.   Open ISA
POWER (P1)              Base        218        218           Contributing
POWER (P2)                          6          224           Contributing
PowerPC (P3)            64b         119        343           Contributing
PowerPC 2.00 (P4)                   7          350           Contributing
PowerPC 2.01                        2          352           Contributing
PowerPC 2.02 (P5)                   14         366           Contributing
Power ISA 2.03          SIMD-VMX    171        537           Contributing
Power ISA 2.05 (P6)                 105        642           Contributing
Power ISA 2.06 (P7)     SIMD-VSX    189        831           Contributing
Power ISA 2.07 (P8)                 111        942           Contributing
Power ISA 3.0 (P9)                  231        1173          Compliance
Power ISA 3.1 (P10)     Prefix      246        1419          Compliance
• Abbreviated lineage of Power ISA
− Greater than 30 years of innovation and a developed ecosystem
− Instruction heritage shown for Power ISA 3.1
[Figure: Power ISA timeline marking 1985, 1990, 1997, 2003, 2006, 2017 and 2020. Labels: America, POWER, POWER 2, PowerPC, PowerPC Embedded (PPC-E), PPC 64, PPC 2.00, PPC 2.02, Power.org Power ISA 2.03, 2.07 and 2.07B, 3.0, Open Power ISA 3.0C and 3.1, IBM Server, Custom Extensions, Embedded Features.]
Power ISA 3.1: Foundation for expansion
Power ISA 3.1 includes a number of new features (see the specification preface for details):
• General: byte reverse instructions, vector integer multiply/divide/modulo instructions, 128-bit binary
integer operations, set boolean extension, string operations, test LSB by byte operation, VSX scalar
minimum/maximum/compare quad-precision
• SIMD: VSX 32-byte storage access operations, SIMD permute-class operations, bit-manipulation
operations, VSX load/store rightmost element operations, VSX mask manipulation operations, VSX PCV
generate operations for expand/compress
• Translation management extensions
• Copy/paste extensions
• Persistent storage / store synchronization
• Pause / wait-reserve
• Hypervisor interrupt location control
• Matrix math operations (more details later)
• Debug: BHRB filtering updates, multiple DEAW, new performance monitor SPRs, performance monitor
facility sampling security
• Instruction prefix support: 8-byte and modifying opcodes
Power ISA 3.1 prefix architecture
• Prefix architecture, primary opcode=1
▪ RISC-friendly variable length instructions:
o New 8-byte instruction space lays the foundation for future ISA expansion
o Always 4-byte instruction alignment
▪ Two forms: modifying (M=1) and 8-byte opcode (M=0)
o Modifying: prefix extends function of existing instructions
o 8-byte opcode: provides new opcode space for expansion, multi-operand instructions, etc.
▪ PC-relative addressing: reduced path-length with new Power ABI support
▪ MMA lane masking : mask by lane for MMA operations
▪ Additional instructions and capabilities
• Generous room for expanded capabilities including opcode space and modifiers
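The variable-length scheme above can be sketched in a few lines: a decoder only has to look at the primary opcode of each 4-byte word to tell whether it starts an 8-byte prefixed instruction. This is a minimal illustration, not decoder code from the talk. The helper names are hypothetical, the position of the M bit (immediately after the primary opcode) is an assumption, and the example words are chosen only for their opcode bits.

```python
PREFIX_PRIMARY_OPCODE = 1  # from the slide: prefix words use primary opcode 1


def primary_opcode(word: int) -> int:
    """Return the 6-bit primary opcode, i.e. the most significant 6 bits of a 32-bit word."""
    return (word >> 26) & 0x3F


def is_prefix(word: int) -> bool:
    """True if this 4-byte word is the prefix half of an 8-byte instruction."""
    return primary_opcode(word) == PREFIX_PRIMARY_OPCODE


def is_modifying(prefix_word: int) -> bool:
    """True for the modifying form (M=1), False for the 8-byte opcode form (M=0).

    Assumption: M is the bit right after the primary opcode (bit 6, MSB-first).
    """
    return ((prefix_word >> 25) & 1) == 1


def instruction_lengths(words):
    """Walk a list of 32-bit words and return the byte length of each instruction."""
    lengths = []
    i = 0
    while i < len(words):
        if is_prefix(words[i]):
            if i + 1 >= len(words):
                raise ValueError("prefix word without a suffix word")
            lengths.append(8)  # prefix + suffix
            i += 2
        else:
            lengths.append(4)  # ordinary 4-byte instruction
            i += 1
    return lengths


# Hypothetical stream: a plain word, a prefix/suffix pair, another plain word.
stream = [0x60000000, 0x06000000, 0x38600001, 0x60000000]
print(instruction_lengths(stream))  # → [4, 8, 4]
```

Because every instruction stays 4-byte aligned, the walk never needs to resynchronize: it advances by one or two words at a time.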
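[Figure: layout of an 8-byte prefixed instruction. A 4-byte prefix word (bits 0 to 31, primary opcode PO=1 in bits 0 to 5, followed by the M field) precedes a 4-byte suffix word (bits 0 to 31, with its own PO in bits 0 to 5), which provides the new suffix opcode space.]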
Matrix-Multiply Assist (MMA) instructions
• The latest version of Power ISA (for POWER10) is now publicly available
• The Matrix-Multiply Assist instructions lead to very efficient implementations for
key algorithms in technical computing, machine learning, deep learning and
business analytics
• These instructions are a natural match for implementing dense numerical linear
algebra computations
• We have also shown application to other computations such as convolution
• Various other computations require additional work and research, including
arbitrary precision arithmetic, discrete Fourier transform, …
Power ISA Vector-Scalar Registers (VSRs)
Accumulators
• Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the
64-bit extension later)
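$$
A = \begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{bmatrix}
$$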
• The elements can be either 32-bit signed-integers (int32) or 32-bit single-
precision floating-point numbers (fp32)
• Each accumulator is a software-managed shadow to a set of 4 consecutive
VSRs (8 architected accumulators – ACC[0:7]) – must choose between
using accumulator or associated VSRs
• State must be explicitly transferred between accumulators and VSRs using
VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator
(xxmtacc)
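As a rough illustration of the shadowing described above, the sketch below maps each accumulator to its group of four VSRs and checks that the sizes agree. The function name is hypothetical, and the numbering (ACC[i] covering VSR[4i] through VSR[4i+3]) is stated here as an assumption rather than taken from the slide, which only says "4 consecutive VSRs".

```python
NUM_ACCUMULATORS = 8       # ACC[0:7], from the slide
VSRS_PER_ACCUMULATOR = 4   # each accumulator shadows 4 consecutive VSRs
VSR_BITS = 128             # a VSR holds four 32-bit elements
ELEMENT_BITS = 32          # accumulator element width (int32 or fp32)


def vsrs_for_accumulator(acc: int) -> range:
    """Return the VSR numbers shadowed by accumulator `acc`.

    Assumption: ACC[i] is associated with VSR[4*i] .. VSR[4*i + 3].
    """
    if not 0 <= acc < NUM_ACCUMULATORS:
        raise ValueError("accumulator index must be in 0..7")
    first = acc * VSRS_PER_ACCUMULATOR
    return range(first, first + VSRS_PER_ACCUMULATOR)


# A 4 x 4 array of 32-bit elements needs exactly the bits of four VSRs.
accumulator_bits = 4 * 4 * ELEMENT_BITS
assert accumulator_bits == VSRS_PER_ACCUMULATOR * VSR_BITS

for acc in (0, 7):
    print(acc, list(vsrs_for_accumulator(acc)))
```

While an accumulator is in use, code should treat its four associated VSRs as unavailable until the state has been moved back with xxmfacc.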
Outer-product (xv<type>ger<rank-𝑘>) instructions
• Accumulators are updated by rank-𝑘 update instructions:
• Input: 1 accumulator (𝐴) + 2 VSRs (𝑋, 𝑌)
• Output: 1 accumulator (same as input to reduce instruction encoding space)
• Operation: 𝐴 ← ±𝐴 ± 𝑋𝑌ᵀ
• For 32-bit data, 𝑋 and 𝑌 are 4 × 1 arrays of elements
• For 16-bit data, 𝑋 and 𝑌 are 4 × 2 arrays of elements
• For 8-bit data, 𝑋 and 𝑌 are 4 × 4 arrays of elements
• For 4-bit data, 𝑋 and 𝑌 are 4 × 8 arrays of elements
• This way 𝑋𝑌ᵀ always has a 4 × 4 shape, compatible with the accumulator
Instruction   𝑨               𝑿               𝒀ᵀ              madds/instruction
xvf32ger      4 × 4 (fp32)    4 × 1 (fp32)    1 × 4 (fp32)    16
xvf16ger2     4 × 4 (fp32)    4 × 2 (fp16)    2 × 4 (fp16)    32
xvi16ger2     4 × 4 (int32)   4 × 2 (int16)   2 × 4 (int16)   32
xvi8ger4      4 × 4 (int32)   4 × 4 (int8)    4 × 4 (int8)    64
xvi4ger8      4 × 4 (int32)   4 × 8 (int4)    8 × 4 (int4)    128
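The arithmetic behind the table can be emulated with plain lists. The sketch below is a functional model only: it computes ±𝐴 ± 𝑋𝑌ᵀ for a 4 × 4 accumulator and 4 × 𝑘 inputs and counts the multiply-adds. It ignores element encodings, rounding, saturation and masking, and the function names are hypothetical.

```python
def rank_k_update(acc, x, y, sign_acc=1, sign_prod=1):
    """Return sign_acc * acc + sign_prod * (x @ y^T).

    acc is a 4 x 4 list of lists; x and y are 4 x k lists of lists,
    so that x @ y^T is always 4 x 4 regardless of k.
    """
    if len(acc) != 4 or any(len(row) != 4 for row in acc):
        raise ValueError("accumulator must be 4 x 4")
    if len(x) != 4 or len(y) != 4:
        raise ValueError("x and y must have 4 rows")
    k = len(x[0])
    if any(len(row) != k for row in x + y):
        raise ValueError("x and y must both be 4 x k")
    return [
        [
            sign_acc * acc[i][j]
            + sign_prod * sum(x[i][p] * y[j][p] for p in range(k))
            for j in range(4)
        ]
        for i in range(4)
    ]


def madds_per_instruction(k: int) -> int:
    """Each of the 16 accumulator elements receives k multiply-adds."""
    return 4 * 4 * k


# Rank-1 update (k = 1, the shape used for 32-bit inputs).
zero = [[0] * 4 for _ in range(4)]
x = [[1], [2], [3], [4]]
y = [[1], [10], [100], [1000]]
print(rank_k_update(zero, x, y)[1])  # → [2, 20, 200, 2000]

# Multiply-add counts for k = 1, 2, 4, 8.
print([madds_per_instruction(k) for k in (1, 2, 4, 8)])  # → [16, 32, 64, 128]
```

The counts reproduce the last column of the table: halving the input element width doubles 𝑘 and therefore doubles the work per instruction.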
Rank-𝑘 update
[Figure: four panels showing a 4 × 4 accumulator updated (±) by the product of a 4 × 𝑘 array and a 𝑘 × 4 array: rank-1 update for 32-bit input elements, rank-2 update for 16-bit input elements, rank-4 update for 8-bit input elements, rank-8 update for 4-bit input elements.]
Extension to 64-bit
• Accumulators are 4 × 2 arrays of 64-bit floating-point elements
$$
A = \begin{bmatrix}
a_{00} & a_{01} \\
a_{10} & a_{11} \\
a_{20} & a_{21} \\
a_{30} & a_{31}
\end{bmatrix}
$$
• Accumulators are updated by outer-product instructions:
▪ Input: 1 accumulator (𝐴) + 3 VSRs (𝑋1, 𝑋2, 𝑌)
▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
• Operation: $A \leftarrow A + \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^{T}$
▪ 𝑋1, 𝑋2 and 𝑌 are 2 × 1 arrays of elements
▪ $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^{T}$ always has a 4 × 2 shape, compatible with the accumulator
Instruction   𝑨              𝑿              𝒀ᵀ             madds/instruction
xvf64ger      4 × 2 (fp64)   4 × 1 (fp64)   1 × 2 (fp64)   8
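The 64-bit case can be modeled the same way. The sketch below stacks 𝑋1 and 𝑋2 into a 4-element column and forms its outer product with the 2-element 𝑌, giving the 4 × 2 update and the 8 multiply-adds listed in the table. It is a functional model with a hypothetical function name, not a description of the hardware data path.

```python
def fp64_outer_product_update(acc, x1, x2, y):
    """Return acc + [x1; x2] @ y^T for a 4 x 2 accumulator.

    x1, x2 and y each hold two fp64 elements (one VSR each).
    """
    if len(acc) != 4 or any(len(row) != 2 for row in acc):
        raise ValueError("accumulator must be 4 x 2")
    if len(x1) != 2 or len(x2) != 2 or len(y) != 2:
        raise ValueError("x1, x2 and y must each have 2 elements")
    x = list(x1) + list(x2)  # stacked 4 x 1 column
    return [[acc[i][j] + x[i] * y[j] for j in range(2)] for i in range(4)]


acc = [[0.0, 0.0] for _ in range(4)]
result = fp64_outer_product_update(acc, [1.0, 2.0], [3.0, 4.0], [10.0, 100.0])
for row in result:
    print(row)

# One multiply-add per accumulator element: 4 rows x 2 columns.
print(len(result) * len(result[0]))  # → 8
```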
Load pressure and unit latency (32-bit results)
• 8 accumulators (8 × 16 result)
▪ 6 VSR loads per 8 xv_ger instructions
▪ 0.75 VSR loads per xv_ger
▪ Tolerates the most latency
• 4 accumulators (8 × 8 result)
▪ 4 VSR loads per 4 xv_ger instructions
▪ 1 VSR load per xv_ger
▪ Could work well in SMT modes
• 2 accumulators (8 × 4 result)
▪ 3 VSR loads per 2 xv_ger instructions
▪ 1.5 VSR loads per xv_ger
• 1 accumulator (4 × 4 result)
▪ 2 VSR loads per 1 xv_ger instruction
▪ 2 VSR loads per xv_ger
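[Figure: accumulator tilings. 8 accumulators: ACC[0], ACC[2], ACC[4], ACC[6] in the top row and ACC[1], ACC[3], ACC[5], ACC[7] in the bottom row, fed by X[0], X[1] and Y[0] to Y[3]. 4 accumulators: ACC[0], ACC[2] over ACC[1], ACC[3], fed by X[0], X[1] and Y[0], Y[1]. 2 accumulators: ACC[0] over ACC[1], fed by X[0], X[1] and Y[0].]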
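The load ratios above follow from a simple count: a grid of accumulators that is fed by some number of X registers and some number of Y registers needs one load per register but issues one outer-product instruction per accumulator. A minimal sketch of that count, with a hypothetical function name and the simplifying assumption that each X or Y register is loaded exactly once per step:

```python
from fractions import Fraction


def loads_per_ger(x_regs: int, y_regs: int) -> Fraction:
    """VSR loads per xv_ger instruction for an x_regs-by-y_regs grid of accumulators."""
    accumulators = x_regs * y_regs
    if x_regs < 1 or y_regs < 1 or accumulators > 8:
        raise ValueError("grid must use between 1 and 8 accumulators")
    return Fraction(x_regs + y_regs, accumulators)


# (X registers, Y registers) for the four configurations on the slide.
for x_regs, y_regs in [(2, 4), (2, 2), (2, 1), (1, 1)]:
    rows, cols = 4 * x_regs, 4 * y_regs
    ratio = loads_per_ger(x_regs, y_regs)
    print(f"{x_regs * y_regs} accumulators ({rows} x {cols} result): "
          f"{float(ratio)} VSR loads per xv_ger")
```

The larger the grid, the more each loaded register is reused, which is why the 8-accumulator configuration puts the least pressure on the load pipeline.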
The micro-kernel of GEMM: $C_{m \times n} \mathrel{+}= A_{m \times k} \times B_{k \times n}$
SGEMM micro-kernel using 𝟖 × 𝟖 virtual accumulator
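The slide's own listing did not survive in the text, so the following is only an illustrative model of the idea, not the code from the talk. It builds an 8 × 8 virtual accumulator out of four 4 × 4 accumulators and, for each step along 𝑘, applies one rank-1 update to each of them using two X registers and two Y registers. The function name and the panel layouts (A as 8 × 𝑘 row lists, B as 𝑘 × 8 row lists) are assumptions.

```python
def sgemm_micro_kernel_8x8(a_panel, b_panel, c_tile):
    """Compute c_tile (8 x 8) += a_panel (8 x k) @ b_panel (k x 8) in place."""
    k = len(b_panel)
    # acc[r][c] is the 4 x 4 accumulator covering rows 4r..4r+3, columns 4c..4c+3.
    acc = [[[[0.0] * 4 for _ in range(4)] for _ in range(2)] for _ in range(2)]

    for p in range(k):
        # Two X registers: halves of column p of A (4 elements each).
        x = [[a_panel[4 * r + i][p] for i in range(4)] for r in range(2)]
        # Two Y registers: halves of row p of B (4 elements each).
        y = [[b_panel[p][4 * c + j] for j in range(4)] for c in range(2)]
        # Four rank-1 updates, one per accumulator (4 loads feed 4 updates).
        for r in range(2):
            for c in range(2):
                for i in range(4):
                    for j in range(4):
                        acc[r][c][i][j] += x[r][i] * y[c][j]

    # Move the accumulators out and add them into C.
    for r in range(2):
        for c in range(2):
            for i in range(4):
                for j in range(4):
                    c_tile[4 * r + i][4 * c + j] += acc[r][c][i][j]
    return c_tile


k = 3
a = [[float(i + p) for p in range(k)] for i in range(8)]
b = [[float(j - p) for j in range(8)] for p in range(k)]
c = [[1.0] * 8 for _ in range(8)]
sgemm_micro_kernel_8x8(a, b, c)
print(c[0])
```

Keeping the four accumulators live across the whole 𝑘 loop means C is read and written only once per tile, while each loaded X and Y register is used by two updates.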
POWER family memory architecture
Scale Out: Direct Attach Memory
▪ Low latency access
▪ Commodity packaging form factor
Scale Up: Buffered Memory
▪ Superior RAS, high bandwidth, high capacity
▪ Agnostic interface for alternate memory innovations
Same Open Memory Interface used for all systems and memory technologies
OpenCAPI Agnostic Buffered Memory (OMI)
▪ Near tier: extreme bandwidth, lower capacity
▪ Commodity: low latency, low cost
▪ Enterprise: RAS, capacity, bandwidth
▪ Storage class: extreme capacity, persistence
Primary tier memory options
Option            Capacity          Bandwidth      Attachment
DDR4 RDIMM        ~256 GB           ~150 GB/sec    72b, ~2666 MHz, bidirectional
DDR4 LRDIMM       ~2 TB             ~150 GB/sec    72b, ~2666 MHz, bidirectional
DDR4 OMI DIMM     ~256 GB → 4 TB    ~320 GB/sec    8b, ~25G differential, buffered (BUF)
BW Opt OMI DIMM   ~128 → 512 GB     ~650 GB/sec    8b, ~25G differential, buffered (BUF)
On Module HBM     ~16 → 32 GB       ~1 TB/sec      1024b, on-module Si interposer
[Figure labels group the options as "Same System", "Same System" and "Unique System".]
OMI strategy: only 5-10 ns higher load-to-use latency than RDIMM (< 5 ns for LRDIMM)
POWER connectivity variants
Direct Attach Memory
▪ Max capacity: 2 TB
▪ Max bandwidth: 150 GB/s
▪ Limited system interconnect
▪ Figure labels: 2 × x24 system attach, 2 × (4 x DDR4 memory), 2 x local SMP interconnect
OMI Buffered Memory
▪ Max capacity: 4 TB
▪ Max bandwidth: 650 GB/s
▪ Enhanced system interconnect
▪ Figure labels: 2 × x48 system attach, 2 × (8 x OMI buffered memory), 3 x local SMP, advanced interconnect
▪ 6x the bandwidth per mm² of DDR signaling
DRAM DIMM comparison
• Technology agnostic
• Low cost
• Ultra-scale system density
• Enterprise reliability
• Low-latency
• High bandwidth
[Figure: approximate-scale size comparison of a JEDEC DDR DIMM, an IBM Centaur DIMM and an OMI DDIMM.]
OpenCAPI key attributes
[Figure: an OpenCAPI-enabled processor (TL/DL) connected over 25 Gb I/O to an accelerated OpenCAPI device (TLx/DLx) that hosts the accelerated function.]
1. Architecture agnostic bus – Applicable with any system/microprocessor architecture
2. Optimized for high bandwidth and low latency
3. High performance 25Gbps PHY design
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no kernel, hypervisor or firmware involvement; security benefit
6. Wide range of use cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both classic memory and emerging advanced storage class memory
9. Minimal OpenCAPI design overhead (Less than 5% of a Xilinx VU3P FPGA)
[Figure labels: caches; application; device memory; advanced SCM solutions; buffered system memory through OpenCAPI memory buffers. Attached devices: storage/compute/network; ASIC/FPGA/FFSA; FPGA, SoC, GPU accelerator; load/store or block access.]
OpenCAPI adapters
Mellanox Innova2: Network + FPGA
• Xilinx US+ KU15P FPGA
• Mellanox CX5 NIC
• 16 GB DDR4
• 2 × 25 Gb SFP cages
• x8 25 Gb/s OpenCAPI support
• Use cases: network acceleration (NFV, packet classification), security acceleration
Nallatech 250-SoC: Multipurpose Converged Network / Storage
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 ARM
• PCIe Gen3 x16, CAPI2
• 4 x8 OCuLink ports support NVMe, network, or OpenCAPI
• 2 × 100 Gb QSFP28 cages
• Use cases: NVMe-oF target, high-bandwidth storage server
AlphaData ADM-9H7: Large FPGA with 8 GB HBM
• Xilinx US+ VU37P FPGA + HBM
• 8 GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 x8 25 Gb/s OpenCAPI ports (support up to 50 GB/s)
• 4 × 100 Gb QSFP28 cages
• Use cases: ML/DL, inference, system modeling, HPC
AlphaData ADM-9H3: Medium FPGA with 8 GB HBM
• Xilinx Virtex US+ VU33P-3 FPGA + HBM
• 8 GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 x8 25 Gb/s OpenCAPI port
• 1 × 2x100 Gb QSFP28-DD cage
• Use cases: ML/DL, inference, system modeling, HPC
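The "up to 50 GB/s" figure quoted for the ADM-9H7 is consistent with simple lane arithmetic, sketched below. This is a back-of-the-envelope check under stated assumptions: it counts raw line rate in one direction and ignores encoding and protocol overhead, and the function name is hypothetical. Whether the slide's figure was derived this way is not stated in the source.

```python
def raw_port_bandwidth_gbytes(lanes: int, gbits_per_lane: float) -> float:
    """Raw one-direction bandwidth of a port in GB/s (8 bits per byte, no overhead)."""
    return lanes * gbits_per_lane / 8


per_port = raw_port_bandwidth_gbytes(lanes=8, gbits_per_lane=25)
two_ports = 2 * per_port

print(per_port)   # → 25.0
print(two_ports)  # → 50.0
```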
Conclusions
• OpenPOWER delivers the three essential technologies of supercomputing:
▪ Number crunching: Through its SIMD and MMA instructions
▪ Memory bandwidth and capacity: Through OMI
▪ System interconnect: Through OpenCAPI
• The MMA instructions provide a new level of performance for dense linear
algebra and related computations
• OMI provides a new level of memory bandwidth for computing systems while
delivering low cost, versatility and capacity
• OpenCAPI provides an enhanced system interconnect for acceleration and
additional functionality
• All three are scalable, offering room to grow in all dimensions
• Together with Open Source software, we see a clear road ahead for high-
performance computing systems that combine the best innovation from the
community!
21

More Related Content

What's hot

LCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 PlenaryLCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 PlenaryLinaro
 
AMD64 (EM64T) architecture
AMD64 (EM64T) architectureAMD64 (EM64T) architecture
AMD64 (EM64T) architecturePVS-Studio
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCGanesan Narayanasamy
 
PCI Slot
PCI SlotPCI Slot
PCI Slotiyinyan
 
Lec 03 ia32 architecture
Lec 03  ia32 architectureLec 03  ia32 architecture
Lec 03 ia32 architectureAbdul Khan
 
PCIe BUS: A State-of-the-Art-Review
PCIe BUS: A State-of-the-Art-ReviewPCIe BUS: A State-of-the-Art-Review
PCIe BUS: A State-of-the-Art-ReviewIOSRJVSP
 
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Altera Corporation
 
13. peripheral component interconnect (pci)
13. peripheral component interconnect (pci)13. peripheral component interconnect (pci)
13. peripheral component interconnect (pci)Rumah Belajar
 
Chapt 02 ia-32 processer architecture
Chapt 02   ia-32 processer architectureChapt 02   ia-32 processer architecture
Chapt 02 ia-32 processer architecturebushrakainat214
 
Slideshare - PCIe
Slideshare - PCIeSlideshare - PCIe
Slideshare - PCIeJin Wu
 
IP PCIe
IP PCIeIP PCIe
IP PCIeSILKAN
 
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Hsien-Hsin Sean Lee, Ph.D.
 

What's hot (20)

Pcie drivers basics
Pcie drivers basicsPcie drivers basics
Pcie drivers basics
 
LCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 PlenaryLCE12: LCE12 ARMv8 Plenary
LCE12: LCE12 ARMv8 Plenary
 
AMD64 (EM64T) architecture
AMD64 (EM64T) architectureAMD64 (EM64T) architecture
AMD64 (EM64T) architecture
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
What is-32-bit-and-64-bit
What is-32-bit-and-64-bitWhat is-32-bit-and-64-bit
What is-32-bit-and-64-bit
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
 
PCI Slot
PCI SlotPCI Slot
PCI Slot
 
PCI & ISA bus
PCI & ISA busPCI & ISA bus
PCI & ISA bus
 
Lec 03 ia32 architecture
Lec 03  ia32 architectureLec 03  ia32 architecture
Lec 03 ia32 architecture
 
PCIe BUS: A State-of-the-Art-Review
PCIe BUS: A State-of-the-Art-ReviewPCIe BUS: A State-of-the-Art-Review
PCIe BUS: A State-of-the-Art-Review
 
64 bits for developers
64 bits for developers64 bits for developers
64 bits for developers
 
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
 
13. peripheral component interconnect (pci)
13. peripheral component interconnect (pci)13. peripheral component interconnect (pci)
13. peripheral component interconnect (pci)
 
Chapt 02 ia-32 processer architecture
Chapt 02   ia-32 processer architectureChapt 02   ia-32 processer architecture
Chapt 02 ia-32 processer architecture
 
Pcie basic
Pcie basicPcie basic
Pcie basic
 
Slideshare - PCIe
Slideshare - PCIeSlideshare - PCIe
Slideshare - PCIe
 
Eisa
EisaEisa
Eisa
 
IP PCIe
IP PCIeIP PCIe
IP PCIe
 
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2
 

Similar to Advanced High-Performance Computing Features of the OpenPOWER ISA

M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...Michael Gschwind
 
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...Red_Hat_Storage
 
IBM Power Systems E850C and S824
IBM Power Systems E850C and S824IBM Power Systems E850C and S824
IBM Power Systems E850C and S824David Spurway
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...Filipe Miranda
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
April 2014 IBM announcement webcast
April 2014 IBM announcement webcastApril 2014 IBM announcement webcast
April 2014 IBM announcement webcastHELP400
 
Db2 analytics accelerator on ibm integrated analytics system technical over...
Db2 analytics accelerator on ibm integrated analytics system   technical over...Db2 analytics accelerator on ibm integrated analytics system   technical over...
Db2 analytics accelerator on ibm integrated analytics system technical over...Daniel Martin
 
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...xKinAnx
 
Ibm power systems facts and features power 8
Ibm power systems facts and features  power 8 Ibm power systems facts and features  power 8
Ibm power systems facts and features power 8 Diego Alberto Tamayo
 
IBM Power Systems Announcement Update
IBM Power Systems Announcement UpdateIBM Power Systems Announcement Update
IBM Power Systems Announcement UpdateDavid Spurway
 
Challenges in Embedded Computing
Challenges in Embedded ComputingChallenges in Embedded Computing
Challenges in Embedded ComputingPradeep Kumar TS
 
Fujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU SpecificationsFujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU Specificationsinside-BigData.com
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxAkshitAgiwal1
 
Flexing Network Muscle with IBM Flex System Fabric Technology
Flexing Network Muscle with IBM Flex System Fabric TechnologyFlexing Network Muscle with IBM Flex System Fabric Technology
Flexing Network Muscle with IBM Flex System Fabric TechnologyBrocade
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 
High-Density Top-Loading Storage for Cloud Scale Applications
High-Density Top-Loading Storage for Cloud Scale Applications High-Density Top-Loading Storage for Cloud Scale Applications
High-Density Top-Loading Storage for Cloud Scale Applications Rebekah Rodriguez
 
Interview with Anatoliy Kuznetsov, the author of BitMagic C++ library
Interview with Anatoliy Kuznetsov, the author of BitMagic C++ libraryInterview with Anatoliy Kuznetsov, the author of BitMagic C++ library
Interview with Anatoliy Kuznetsov, the author of BitMagic C++ libraryPVS-Studio
 

Similar to Advanced High-Performance Computing Features of the OpenPOWER ISA (20)

M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
 
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
Red Hat Storage Day Seattle: Supermicro Solutions for Red Hat Ceph and Red Ha...
 
IBM Power Systems E850C and S824
IBM Power Systems E850C and S824IBM Power Systems E850C and S824
IBM Power Systems E850C and S824
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
April 2014 IBM announcement webcast
April 2014 IBM announcement webcastApril 2014 IBM announcement webcast
April 2014 IBM announcement webcast
 
Db2 analytics accelerator on ibm integrated analytics system technical over...
Db2 analytics accelerator on ibm integrated analytics system   technical over...Db2 analytics accelerator on ibm integrated analytics system   technical over...
Db2 analytics accelerator on ibm integrated analytics system technical over...
 
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
Ibm spectrum scale fundamentals workshop for americas part 5 ess gnr-usecases...
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
Ibm power systems facts and features power 8
Ibm power systems facts and features  power 8 Ibm power systems facts and features  power 8
Ibm power systems facts and features power 8
 
IBM Power Systems Announcement Update
IBM Power Systems Announcement UpdateIBM Power Systems Announcement Update
IBM Power Systems Announcement Update
 
Challenges in Embedded Computing
Challenges in Embedded ComputingChallenges in Embedded Computing
Challenges in Embedded Computing
 
Fujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU SpecificationsFujitsu Presents Post-K CPU Specifications
Fujitsu Presents Post-K CPU Specifications
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
Flexing Network Muscle with IBM Flex System Fabric Technology
Flexing Network Muscle with IBM Flex System Fabric TechnologyFlexing Network Muscle with IBM Flex System Fabric Technology
Flexing Network Muscle with IBM Flex System Fabric Technology
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
High-Density Top-Loading Storage for Cloud Scale Applications
High-Density Top-Loading Storage for Cloud Scale Applications High-Density Top-Loading Storage for Cloud Scale Applications
High-Density Top-Loading Storage for Cloud Scale Applications
 
Interview with Anatoliy Kuznetsov, the author of BitMagic C++ library
Interview with Anatoliy Kuznetsov, the author of BitMagic C++ libraryInterview with Anatoliy Kuznetsov, the author of BitMagic C++ library
Interview with Anatoliy Kuznetsov, the author of BitMagic C++ library
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency programGanesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISAGanesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsGanesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsGanesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
 

Advanced High-Performance Computing Features of the OpenPOWER ISA

  • 1. OpenPOWER Innovations for High-Performance Computing
      OpenPOWER Workshop, CINECA, Italy, July 9, 2020
      José Moreira – IBM Research
      Many thanks to: Jeff Stuecheli, Edmund Gieske, Brian Thompto
  • 2. Open architectures for supercomputing
      • Since the dawn of supercomputing with the CDC 6600, through the many generations of Seymour Cray machines, massively parallel processors like Blue Gene, and the current champion Fugaku, high-performance computing has been about three things:
        ▪ Number crunching: perform as many arithmetic operations as possible
        ▪ Memory bandwidth: read/write as much data from/to memory as possible
        ▪ Interconnect: communicate between elements as fast as possible
      • IBM servers are now being designed and built around a variety of computing technologies that IBM has made openly available to the community, including the three supercomputing technologies listed above
      • In this talk, we will focus on recent OpenPOWER developments in those areas
  • 3. Power ISA – foundation of ecosystem

      Instruction heritage   Note       # Instr.   Cum. Instr.   Open ISA
      POWER (P1)             Base       218        218           Contributing
      POWER (P2)                        6          224           Contributing
      PowerPC (P3)           64b        119        343           Contributing
      PowerPC 2.00 (P4)                 7          350           Contributing
      PowerPC 2.01                      2          352           Contributing
      PowerPC 2.02 (P5)                 14         366           Contributing
      Power ISA 2.03         SIMD-VMX   171        537           Contributing
      Power ISA 2.05 (P6)               105        642           Contributing
      Power ISA 2.06 (P7)    SIMD-VSX   189        831           Contributing
      Power ISA 2.07 (P8)               111        942           Contributing
      Power ISA 3.0 (P9)                231        1173          Compliance
      Power ISA 3.1 (P10)    Prefix     246        1419          Compliance

      • Abbreviated lineage of Power ISA
        − More than 30 years of innovation and a developed ecosystem
        − Instruction heritage shown for Power ISA 3.1
      [Figure: timeline of the Power ISA lineage from 1985 to 2020, running from America, POWER, POWER2, and PowerPC (including PowerPC Embedded) through Power ISA 2.03 (Power.org) and 2.07B to the open Power ISA 3.0C and 3.1, with custom extensions and embedded features]
  • 4. Power ISA 3.1: Foundation for expansion
      Power ISA 3.1 includes a number of new features (see the specification preface for details):
      • General: byte reverse instructions, vector integer multiply/divide/modulo instructions, 128-bit binary integer operations, set boolean extension, string operations, test LSB by byte operation, VSX scalar minimum/maximum/compare quad-precision
      • SIMD: VSX 32-byte storage access operations, SIMD permute-class operations, bit-manipulation operations, VSX load/store rightmost element operations, VSX mask manipulation operations, VSX PCV generate operations for expand/compress
      • Translation management extensions
      • Copy/paste extensions
      • Persistent storage / store synchronization
      • Pause / wait-reserve
      • Hypervisor interrupt location control
      • Matrix math operations (more details later)
      • Debug: BHRB filtering updates, multiple DEAW, new performance monitor SPRs, performance monitor facility sampling security
      • Instruction prefix support: 8-byte and modifying opcodes
  • 5. Power ISA 3.1 prefix architecture
      • Prefix architecture, primary opcode = 1
        ▪ RISC-friendly variable-length instructions:
          o New 8-byte instruction space lays the foundation for future ISA expansion
          o Always 4-byte instruction alignment
        ▪ Two forms: modifying (M=1) and 8-byte opcode (M=0)
          o Modifying: prefix extends function of existing instructions
          o 8-byte opcode: provides new opcode space for expansion, multi-operand instructions, etc.
        ▪ PC-relative addressing: reduced path length with new Power ABI support
        ▪ MMA lane masking: mask by lane for MMA operations
        ▪ Additional instructions and capabilities
      • Generous room for expanded capabilities, including opcode space and modifiers
      [Figure: layout of a prefixed instruction, a 4-byte prefix (PO=1, M) followed by a 4-byte suffix (PO), with the new suffix opcode space marked]
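The variable-length scheme on this slide can be illustrated with a small decoder sketch. It assumes only what the slide states (a prefix word carries primary opcode 1 and is followed by one suffix word) plus the usual Power convention that the primary opcode sits in the six most significant bits of a 32-bit instruction word. The function names and the sample words are made up for illustration; this is not code from the talk.

```python
# Minimal sketch: find instruction boundaries in a stream of 32-bit words.
# Assumption: the primary opcode (PO) is the top 6 bits of each word, and
# a word with PO == 1 is a prefix that is followed by one suffix word.

PREFIX_PRIMARY_OPCODE = 1


def primary_opcode(word):
    """Return the 6 most significant bits of a 32-bit instruction word."""
    return (word >> 26) & 0x3F


def instruction_lengths(words):
    """Return the length in bytes (4 or 8) of each instruction in order."""
    lengths = []
    i = 0
    while i < len(words):
        if primary_opcode(words[i]) == PREFIX_PRIMARY_OPCODE:
            if i + 1 >= len(words):
                raise ValueError("prefix word without a suffix word")
            lengths.append(8)  # prefix + suffix form one 8-byte instruction
            i += 2
        else:
            lengths.append(4)  # ordinary 4-byte instruction
            i += 1
    return lengths


# Illustrative (hypothetical) stream: word, prefix, suffix, word.
stream = [0x38600001, 0x04000000, 0x38600002, 0x60000000]
print(instruction_lengths(stream))
```

Because every instruction stays 4-byte aligned, a decoder only has to inspect the primary opcode of the first word to know whether a second word belongs to the same instruction.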
  • 6. Matrix-Multiply Assist (MMA) instructions
      • The latest version of Power ISA (for POWER10) is now publicly available
      • The Matrix-Multiply Assist instructions lead to very efficient implementations for key algorithms in technical computing, machine learning, deep learning and business analytics
      • These instructions are a natural match for implementing dense numerical linear algebra computations
      • We have also shown application to other computations such as convolution
      • Various other computations require additional work and research, including arbitrary precision arithmetic, discrete Fourier transform, …
  • 7. Power ISA Vector-Scalar Registers (VSRs)
  • 8. Accumulators
      • Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the 64-bit extension later)

        $$A = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{bmatrix}$$

      • The elements can be either 32-bit signed integers (int32) or 32-bit single-precision floating-point numbers (fp32)
      • Each accumulator is a software-managed shadow to a set of 4 consecutive VSRs (8 architected accumulators, ACC[0:7]); software must choose between using the accumulator or the associated VSRs
      • State must be explicitly transferred between accumulators and VSRs using VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator (xxmtacc)
  • 9. Outer-product (xv<type>ger<rank-$k$>) instructions
      • Accumulators are updated by rank-$k$ update instructions:
        ▪ Input: 1 accumulator ($A$) + 2 VSRs ($X$, $Y$)
        ▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
      • Operation: $A \leftarrow \pm A \pm X Y^{T}$
        ▪ For 32-bit data, $X$ and $Y$ are 4 × 1 arrays of elements
        ▪ For 16-bit data, $X$ and $Y$ are 4 × 2 arrays of elements
        ▪ For 8-bit data, $X$ and $Y$ are 4 × 4 arrays of elements
        ▪ For 4-bit data, $X$ and $Y$ are 4 × 8 arrays of elements
      • This way $X Y^{T}$ always has a 4 × 4 shape, compatible with the accumulator

      Instruction   A               X               Y^T             # of madds/instruction
      xvf32ger      4 × 4 (fp32)    4 × 1 (fp32)    1 × 4 (fp32)    16
      xvf16ger2     4 × 4 (fp32)    4 × 2 (fp16)    2 × 4 (fp16)    32
      xvi16ger2     4 × 4 (int32)   4 × 2 (int16)   2 × 4 (int16)   32
      xvi8ger4      4 × 4 (int32)   4 × 4 (int8)    4 × 4 (int8)    64
      xvi4ger8      4 × 4 (int32)   4 × 8 (int4)    8 × 4 (int4)    128
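The rank-$k$ update on this slide can be modelled in a few lines of plain Python. This is a reference-semantics sketch only, assuming exact arithmetic on Python numbers; it ignores the rounding, saturation, and element-packing details of the real instructions, and the names `ger_update` and `madds_per_instruction` are hypothetical.

```python
# Sketch of the rank-k update A <- (+/-)A (+/-) X * Y^T on a 4x4 accumulator.
# x and y are 4 x k nested lists (k = 1, 2, 4 or 8 depending on element width).

def ger_update(acc, x, y, sign_acc=1, sign_prod=1):
    """Return sign_acc * acc + sign_prod * (x @ y^T) as a new 4x4 list."""
    k = len(x[0])
    return [
        [
            sign_acc * acc[i][j]
            + sign_prod * sum(x[i][p] * y[j][p] for p in range(k))
            for j in range(4)
        ]
        for i in range(4)
    ]


def madds_per_instruction(k):
    """Each of the 16 accumulator elements receives k multiply-adds."""
    return 4 * 4 * k


# Rank-1 example (32-bit case): X and Y are 4 x 1.
acc = [[0] * 4 for _ in range(4)]
x = [[1], [2], [3], [4]]
y = [[1], [10], [100], [1000]]
acc = ger_update(acc, x, y)
print(acc[1][2])  # x[1] * y[2] = 2 * 100
print([madds_per_instruction(k) for k in (1, 2, 4, 8)])
```

The multiply-add counts returned for k = 1, 2, 4, 8 line up with the last column of the table above, which is why narrower input types deliver more work per instruction while the accumulator shape stays fixed.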
  • 10. Rank-$k$ update
      [Figure: four panels showing a 4 × 4 accumulator updated by a rank-1 update (32-bit input elements), a rank-2 update (16-bit input elements), a rank-4 update (8-bit input elements), and a rank-8 update (4-bit input elements)]
  • 11. Extension to 64-bit
      • Accumulators are 4 × 2 arrays of 64-bit floating-point elements

        $$A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \\ a_{20} & a_{21} \\ a_{30} & a_{31} \end{bmatrix}$$

      • Accumulators are updated by outer-product instructions:
        ▪ Input: 1 accumulator ($A$) + 3 VSRs ($X_1$, $X_2$, $Y$)
        ▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
      • Operation: $A \leftarrow A + \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^{T}$
        ▪ $X_1$, $X_2$ and $Y$ are 2 × 1 arrays of elements
        ▪ $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^{T}$ always has a 4 × 2 shape, compatible with the accumulator

      Instruction   A              X              Y^T            # of madds/instruction
      xvf64ger      4 × 2 (fp64)   4 × 1 (fp64)   1 × 2 (fp64)   8
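For the 64-bit case, the same idea can be sketched with the two X registers stacked into one 4-element column. This is only a model of the arithmetic described on the slide, with a hypothetical function name, and it says nothing about how the hardware pairs the VSRs.

```python
# Sketch of the 64-bit update A <- A + [X1; X2] * Y^T on a 4x2 accumulator.
# x1, x2 and y are 2-element lists standing in for 2 x 1 fp64 vectors.

def ger64_update(acc, x1, x2, y):
    """Return acc + outer(x1 stacked on x2, y) as a new 4x2 nested list."""
    x = list(x1) + list(x2)  # stack X1 on top of X2 to get a 4 x 1 column
    return [[acc[i][j] + x[i] * y[j] for j in range(2)] for i in range(4)]


acc = [[0.0, 0.0] for _ in range(4)]
acc = ger64_update(acc, [1.0, 2.0], [3.0, 4.0], [10.0, 100.0])
print(acc)
```

Each call performs one multiply-add per accumulator element, that is 4 × 2 = 8, matching the table entry for xvf64ger.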
  • 12. Load pressure and unit latency (32-bit results)
      • 8 accumulators (8 × 16 result)
        ▪ 6 VSR loads / 8 xv_ger instructions
        ▪ 0.75 VSR loads per xv_ger
        ▪ Tolerates the most latency
      • 4 accumulators (8 × 8 result)
        ▪ 4 VSR loads / 4 xv_ger instructions
        ▪ 1 VSR load per xv_ger
        ▪ Could work well in SMT modes
      • 2 accumulators (8 × 4 result)
        ▪ 3 VSR loads / 2 xv_ger instructions
        ▪ 1.5 VSR loads per xv_ger
      • 1 accumulator (4 × 4 result)
        ▪ 2 VSR loads / 1 xv_ger instruction
        ▪ 2 VSR loads per xv_ger
      [Figure: accumulator tilings for the multi-accumulator cases: ACC[0] to ACC[7] fed by X[0], X[1] and Y[0] to Y[3]; ACC[0] to ACC[3] fed by X[0], X[1] and Y[0], Y[1]; ACC[0], ACC[1] fed by X[0], X[1] and Y[0]]
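The load-pressure figures on this slide follow from simple counting when the accumulators are arranged as a grid: each grid row shares one X register and each grid column shares one Y register. The sketch below reproduces the four cases under that assumption; the grid shapes (2 × 4, 2 × 2, 2 × 1, 1 × 1) are inferred from the result sizes on the slide, and the function name is hypothetical.

```python
# Sketch: VSR loads per xv_ger instruction for a rows x cols grid of
# 4x4 accumulators, assuming one X load per grid row and one Y load
# per grid column for each step of the inner loop.

def load_pressure(rows, cols):
    loads = rows + cols          # X vectors + Y vectors loaded per step
    gers = rows * cols           # one outer-product instruction per accumulator
    result_shape = (4 * rows, 4 * cols)
    return loads, gers, loads / gers, result_shape


for rows, cols in [(2, 4), (2, 2), (2, 1), (1, 1)]:
    loads, gers, ratio, shape = load_pressure(rows, cols)
    print(f"{gers} accumulators, result {shape[0]} x {shape[1]}: "
          f"{loads} loads / {gers} gers = {ratio} loads per ger")
```

Larger grids amortize each loaded register over more outer products, which is why the 8-accumulator arrangement has the lowest load pressure.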
  • 13. The micro-kernel of GEMM: $C_{m \times n} \mathrel{+}= A_{m \times k} \times B_{k \times n}$
  • 14. SGEMM micro-kernel using 8 × 8 virtual accumulator
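The code shown on this slide is not part of the transcript. As a stand-in, here is a plain-Python sketch of how an 8 × 8 "virtual accumulator" could be built from a 2 × 2 grid of 4 × 4 accumulators, each updated by one rank-1 outer product per step of the k loop. It models only the arithmetic; it is not the author's kernel, it uses no MMA built-ins, and all names are hypothetical.

```python
# Sketch of an 8x8 micro-kernel: C(8x8) += A(8xk) * B(kx8), computed as
# k steps of four rank-1 updates on a 2x2 grid of 4x4 accumulators.

def microkernel_8x8(a, b):
    """a is 8 x k, b is k x 8 (nested lists); returns the 8 x 8 product."""
    k = len(b)
    # acc[r][c] is the 4x4 accumulator covering rows 4r..4r+3, cols 4c..4c+3.
    acc = [[[[0] * 4 for _ in range(4)] for _ in range(2)] for _ in range(2)]
    for p in range(k):
        # Two X "registers" (halves of column p of A), two Y "registers"
        # (halves of row p of B): 4 loads feed 4 outer products.
        x = [[a[i][p] for i in range(0, 4)], [a[i][p] for i in range(4, 8)]]
        y = [b[p][0:4], b[p][4:8]]
        for r in range(2):
            for c in range(2):
                for i in range(4):
                    for j in range(4):
                        acc[r][c][i][j] += x[r][i] * y[c][j]
    # Assemble the four 4x4 tiles into one 8x8 result.
    return [[acc[i // 4][j // 4][i % 4][j % 4] for j in range(8)]
            for i in range(8)]


# Small integer example so the arithmetic is exact.
a = [[i + p for p in range(3)] for i in range(8)]   # 8 x 3
b = [[p * j for j in range(8)] for p in range(3)]   # 3 x 8
c = microkernel_8x8(a, b)
print(c[7])
```

A real kernel would keep the four accumulators resident for the whole k loop and transfer them to VSRs only once at the end, since moving state between accumulators and VSRs is explicit.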
  • 15. POWER family memory architecture
      • Scale Out: Direct Attach Memory
        ▪ Low latency access
        ▪ Commodity packaging form factor
      • Scale Up: Buffered Memory
        ▪ Superior RAS, high bandwidth, high capacity
        ▪ Agnostic interface for alternate memory innovations
      • OpenCAPI Agnostic Buffered Memory (OMI): same Open Memory Interface used for all systems and memory technologies
        ▪ Near Tier: extreme bandwidth, lower capacity
        ▪ Commodity: low latency, low cost
        ▪ Enterprise: RAS, capacity, bandwidth
        ▪ Storage Class: extreme capacity, persistence
  • 16. Primary tier memory options

      Memory option      Capacity          Bandwidth
      DDR4 RDIMM         ~256 GB           ~150 GB/sec
      DDR4 LRDIMM        ~2 TB             ~150 GB/sec
      DDR4 OMI DIMM      ~256 GB → 4 TB    ~320 GB/sec
      BW Opt OMI DIMM    ~128 → 512 GB     ~650 GB/sec
      On Module HBM      ~16 → 32 GB       ~1 TB/sec

      • OMI strategy: only 5-10 ns higher load-to-use latency than RDIMM (< 5 ns for LRDIMM)
      [Figure: attachment diagrams for each option, with interface labels "72b ~2666 MHz bidi", "8b ~25G diff" (through a buffer, BUF), and "1024b on-module Si interposer", grouped as "Same System", "Same System", and "Unique System"]
  • 17. POWER connectivity variants
      • Direct Attach Memory
        ▪ Max capacity: 2 TB
        ▪ Max bandwidth: 150 GB/s
        ▪ Limited system interconnect
      • OMI Buffered Memory
        ▪ Max capacity: 4 TB
        ▪ Max bandwidth: 650 GB/s
        ▪ Enhanced system interconnect
      • OMI signaling delivers 6x the bandwidth per mm² of DDR signaling
      [Figure: two chip diagrams. Direct attach: two x24 system attach links, two groups of 4 x DDR4 memory, 2 x local SMP interconnect. OMI: two x48 system attach links, two groups of 8 x OMI buffered memory, 3 x local SMP, advanced interconnect]
  • 18. DRAM DIMM comparison
      • Technology agnostic
      • Low cost
      • Ultra-scale system density
      • Enterprise reliability
      • Low latency
      • High bandwidth
      [Figure: approximate-scale comparison of a JEDEC DDR DIMM, an IBM Centaur DIMM, and an OMI DDIMM]
  • 19. OpenCAPI key attributes
      1. Architecture-agnostic bus – applicable with any system/microprocessor architecture
      2. Optimized for high bandwidth and low latency
      3. High-performance 25 Gbps PHY design
      4. Coherency – attached devices operate natively within the application's user space and coherently with the host microprocessor
      5. Virtual addressing enables low overhead with no kernel, hypervisor or firmware involvement; security benefit
      6. Wide range of use cases and access semantics
      7. CPU-coherent device memory (Home Agent Memory)
      8. Architected for both classic memory and emerging advanced storage class memory
      9. Minimal OpenCAPI design overhead (less than 5% of a Xilinx VU3P FPGA)
      [Figure: an OpenCAPI-enabled processor (caches, application, TL/DL) connected over 25Gb I/O to an accelerated OpenCAPI device (TLx/DLx, accelerated function, device memory). Device examples: storage/compute/network; ASIC/FPGA/FFSA; FPGA, SoC, GPU accelerator; load/store or block access. Memory side: advanced SCM solutions and buffered system memory behind OpenCAPI memory buffers]
  • 20. OpenCAPI adapters
      • Mellanox Innova2: Network + FPGA
        ▪ Xilinx US+ KU15P FPGA
        ▪ Mellanox CX5 NIC
        ▪ 16 GB DDR4
        ▪ 2 25Gb SFP cages
        ▪ x8 25 Gb/s OpenCAPI support
        ▪ Uses: network acceleration (NFV, packet classification), security acceleration
      • Nallatech 250-SoC: Multipurpose converged network / storage
        ▪ Xilinx Zynq US+ ZU19EG FPGA
        ▪ 8/16 GB DDR4, 4/8 GB DDR4 ARM
        ▪ PCIe Gen3 x16, CAPI2
        ▪ 4 x8 Oculink ports support NVMe, network, or OpenCAPI
        ▪ 2 100Gb QSFP28 cages
        ▪ Uses: NVMe-oF target, high-bandwidth storage server
      • AlphaData ADM-9H7: Large FPGA with 8 GB HBM
        ▪ Xilinx US+ VU37P FPGA + HBM
        ▪ 8 GB High Bandwidth Memory
        ▪ PCIe Gen4 x8 or Gen3 x16, CAPI2
        ▪ 2 x8 25 Gb/s OpenCAPI ports (support up to 50 GB/s)
        ▪ 4 100Gb QSFP28 cages
        ▪ Uses: ML/DL, inference, system modeling, HPC
      • AlphaData ADM-9H3: Medium FPGA with 8 GB HBM
        ▪ Xilinx Virtex US+ VU33P-3 FPGA + HBM
        ▪ 8 GB High Bandwidth Memory
        ▪ PCIe Gen4 x8 or Gen3 x16, CAPI2
        ▪ 1 x8 25 Gb/s OpenCAPI port
        ▪ 1 2x100Gb QSFP28-DD cage
        ▪ Uses: ML/DL, inference, system modeling, HPC
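The "up to 50 GB/s" figure quoted for the ADM-9H7 is consistent with a raw lane-rate calculation, sketched below. This reading is an assumption: it multiplies ports, lanes, and the per-lane signaling rate, and ignores encoding and protocol overhead, so it is an upper bound per direction rather than a measured throughput. The function name is hypothetical.

```python
# Sketch: raw link bandwidth from port count, lane count and lane rate.
# Assumption: 8 bits per byte, no encoding or protocol overhead.

def raw_bandwidth_gbytes_per_s(ports, lanes_per_port, gbits_per_lane):
    return ports * lanes_per_port * gbits_per_lane / 8


# One x8 port at 25 Gb/s per lane, then the two-port ADM-9H7 configuration.
print(raw_bandwidth_gbytes_per_s(1, 8, 25))
print(raw_bandwidth_gbytes_per_s(2, 8, 25))
```

Under the same assumption, a single x8 port, as on the Innova2 or the ADM-9H3, works out to half of the two-port figure.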
  • 21. Conclusions
      • OpenPOWER delivers the three essential technologies of supercomputing:
        ▪ Number crunching: through its SIMD and MMA instructions
        ▪ Memory bandwidth and capacity: through OMI
        ▪ System interconnect: through OpenCAPI
      • The MMA instructions provide a new level of performance for dense linear algebra and related computations
      • OMI provides a new level of memory bandwidth for computing systems while delivering low cost, versatility and capacity
      • OpenCAPI provides an enhanced system interconnect for acceleration and additional functionality
      • All three are scalable, offering room to grow in all dimensions
      • Together with open source software, we see a clear road ahead for high-performance computing systems that combine the best innovation from the community!