Advanced High-Performance Computing Features of the OpenPOWER ISA

OpenPOWER Innovations for
High-Performance Computing
OpenPOWER Workshop
CINECA, Italy
July 9, 2020
José Moreira – IBM Research
Many thanks to: Jeff Stuecheli, Edmund Gieske, Brian Thompto

© 2020 IBM Corporation

Open architectures for supercomputing
• Since the dawn of supercomputing with the CDC 6600, through the many
generations of Seymour Cray machines, massively parallel processors like Blue
Gene and the current champion Fugaku, high performance computing has been
about three things:
▪ Number crunching: perform as many arithmetic operations as possible
▪ Memory bandwidth: read/write as much data from/to memory as possible
▪ Interconnect: communicate between elements as fast as possible
• IBM servers are now being designed and built around a variety of computing
technologies that IBM has made openly available to the community, including the
three supercomputing technologies listed above
• In this talk, we will focus on recent OpenPOWER developments in those areas

Power ISA – foundation of ecosystem
Instruction heritage     Note        # Instr.   Cum. instr.   Open ISA
POWER (P1)               Base        218        218           Contributing
POWER (P2)                           6          224           Contributing
PowerPC (P3)             64b         119        343           Contributing
PowerPC 2.00 (P4)                    7          350           Contributing
PowerPC 2.01                         2          352           Contributing
PowerPC 2.02 (P5)                    14         366           Contributing
Power ISA 2.03           SIMD-VMX    171        537           Contributing
Power ISA 2.05 (P6)                  105        642           Contributing
Power ISA 2.06 (P7)      SIMD-VSX    189        831           Contributing
Power ISA 2.07 (P8)                  111        942           Contributing
Power ISA 3.0 (P9)                   231        1173          Compliance
Power ISA 3.1 (P10)      Prefix      246        1419          Compliance
• Abbreviated lineage of Power ISA
− Greater than 30 years of innovation and a developed ecosystem
− Instruction heritage shown for Power ISA 3.1
[Figure: Power ISA lineage timeline with year markers 1985, 1990, 1997, 2003, 2006, 2017 and 2020. Labels: America; POWER; POWER 2; PowerPC; PPC-E; PowerPC Embedded; PPC 64; PPC 2.00; PPC 2.02; IBM Server; Power.org Power ISA 2.03 (2006); 2.07; 2.07B; 3.0 (2017); Open Power ISA 3.0C and 3.1; Custom Extensions; Embedded Features.]

Power ISA 3.1: Foundation for expansion
Power ISA 3.1 includes a number of new features (see the specification preface for details):
• General: byte reverse instructions, vector integer multiply/divide/modulo instructions, 128-bit binary
integer operations, set boolean extension, string operations, test LSB by byte operation, VSX scalar
minimum/maximum/compare quad-precision
• SIMD: VSX 32-byte storage access operations, SIMD permute-class operations, bit-manipulation
operations, VSX load/store rightmost element operations, VSX mask manipulation operations, VSX PCV
generate operations for expand/compress
• Translation management extensions
• Copy/paste extensions
• Persistent storage / store synchronization
• Pause / wait-reserve
• Hypervisor interrupt location control
• Matrix math operations (more details later)
• Debug: BHRB filtering updates, multiple DEAW, new performance monitor SPRs, performance monitor
facility sampling security
• Instruction prefix support: 8-byte opcodes and modifying prefixes

Power ISA 3.1 prefix architecture
• Prefix architecture, primary opcode=1
▪ RISC-friendly variable length instructions:
o New 8-byte instruction space lays the foundation for future ISA expansion
o Always 4-byte instruction alignment
▪ Two forms: modifying (M=1) and 8-byte opcode (M=0)
o Modifying: prefix extends function of existing instructions
o 8-byte opcode: provides new opcode space for expansion, multi-operand instructions, etc.
▪ PC-relative addressing: reduced path-length with new Power ABI support
▪ MMA lane masking: mask by lane for MMA operations
▪ Additional instructions and capabilities
• Generous room for expanded capabilities including opcode space and modifiers
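[Figure: layout of a prefixed instruction. A 32-bit prefix word (primary opcode PO=1 in bits 0-5, followed by the M field) precedes a 32-bit suffix word (PO in bits 0-5). The suffix provides the new opcode space.]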
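
The prefix rules above can be illustrated with a small decoder. This is a minimal sketch, not an assembler or a complete decoder: the function names are hypothetical, words are assumed to be 32-bit integers already read in big-endian bit order (bit 0 is the most significant bit), and M is assumed to sit at bit 6 of the prefix, directly after the primary opcode.

```python
from typing import Optional


def classify_prefix(word: int) -> Optional[str]:
    """Return the prefix form of a 32-bit instruction word, or None if it is not a prefix.

    Assumption: primary opcode occupies bits 0-5 and M occupies bit 6,
    using Power's big-endian bit numbering (bit 0 = most significant bit).
    """
    primary_opcode = (word >> 26) & 0x3F   # bits 0-5
    if primary_opcode != 1:
        return None                        # an ordinary 4-byte instruction
    m = (word >> 25) & 0x1                 # bit 6
    return "modifying" if m == 1 else "8-byte opcode"


def instruction_length(first_word: int) -> int:
    """A prefixed instruction is prefix + suffix (8 bytes); anything else is 4 bytes."""
    return 8 if classify_prefix(first_word) is not None else 4


if __name__ == "__main__":
    for w in (0x06000000, 0x04000000, 0x38000000):
        print(hex(w), classify_prefix(w), instruction_length(w))
```

Because only the first word needs to be inspected, a fetch unit can determine instruction length from a single 6-bit field, which is one way to read the "RISC-friendly" claim.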

Matrix-Multiply Assist (MMA) instructions
• The latest version of Power ISA (for POWER10) is now publicly available
• The Matrix-Multiply Assist instructions lead to very efficient implementations for
key algorithms in technical computing, machine learning, deep learning and
business analytics
• These instructions are a natural match for implementing dense numerical linear
algebra computations
• We have also shown application to other computations such as convolution
• Various other computations require additional work and research, including
arbitrary precision arithmetic, discrete Fourier transform, …

Power ISA Vector-Scalar Registers (VSRs)

Accumulators
• Accumulators are 4 × 4 arrays of 32-bit elements (we will briefly mention the
64-bit extension later)
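$$
A = \begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{bmatrix}
$$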
• The elements can be either 32-bit signed integers (int32) or 32-bit single-precision
floating-point numbers (fp32)
• Each accumulator is a software-managed shadow of a set of 4 consecutive
VSRs (8 architected accumulators, ACC[0:7]); software must choose between
using an accumulator and using its associated VSRs
• State must be explicitly transferred between accumulators and VSRs using
VSX Move From Accumulator (xxmfacc) and VSX Move To Accumulator
(xxmtacc)
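
The shadowing rule above can be modeled in a few lines. This is a toy sketch under stated assumptions, not a description of the hardware: the class name `MmaState` is hypothetical, ACC[i] is assumed to shadow VSR[4i] through VSR[4i+3] with one VSR per accumulator row, and each VSR is modeled as a list of four 32-bit elements.

```python
class MmaState:
    """Toy model of 64 VSRs and 8 accumulators with explicit state transfer."""

    def __init__(self):
        # Each VSR is 128 bits wide; modeled here as four 32-bit elements.
        self.vsr = [[0.0] * 4 for _ in range(64)]
        # None means the accumulator holds no state (the VSRs are in use).
        self.acc = [None] * 8

    def xxmtacc(self, i):
        """Model of 'move to accumulator': copy 4 consecutive VSRs into ACC[i]."""
        # Assumption: ACC[i] is associated with VSR[4*i] .. VSR[4*i + 3].
        self.acc[i] = [row[:] for row in self.vsr[4 * i:4 * i + 4]]

    def xxmfacc(self, i):
        """Model of 'move from accumulator': copy ACC[i] back to its 4 VSRs."""
        if self.acc[i] is None:
            raise ValueError("accumulator %d holds no state" % i)
        self.vsr[4 * i:4 * i + 4] = [row[:] for row in self.acc[i]]


state = MmaState()
state.vsr[4][0] = 1.5        # write a VSR that belongs to ACC[1]
state.xxmtacc(1)             # explicitly move VSR state into the accumulator
state.acc[1][0][0] += 1.0    # work on the accumulator copy
vsr_before_transfer = state.vsr[4][0]
state.xxmfacc(1)             # explicitly move the result back to the VSRs
```

In this model the VSR keeps its old value until `xxmfacc` runs, which is the point of the slide: the two copies are not kept coherent automatically.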

Outer-product (xv<type>ger<rank-𝑘>) instructions
• Accumulators are updated by rank-𝑘 update instructions:
• Input: 1 accumulator (𝐴) + 2 VSRs (𝑋, 𝑌)
• Output: 1 accumulator (same as input to reduce instruction encoding space)
• Operation: $A \leftarrow \pm A \pm XY^T$
• For 32-bit data, $X$ and $Y$ are 4 × 1 arrays of elements
• This way $XY^T$ always has a 4 × 4 shape, compatible with the accumulator
Instruction   A                X                Y^T              # of madds/instruction
xvf32ger      4 × 4 (fp32)     4 × 1 (fp32)     1 × 4 (fp32)     16
xvf16ger2     4 × 4 (fp32)     4 × 2 (fp16)     2 × 4 (fp16)     32
xvi16ger2     4 × 4 (int32)    4 × 2 (int16)    2 × 4 (int16)    32
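
The madd counts in the table follow directly from the shapes. The sketch below is a plain-Python model of the accumulate form A ← A + XYᵀ only; the function name `rank_k_update` is hypothetical, the negated variants are not modeled, and narrow input types (fp16, int16) are represented by ordinary Python numbers without rounding.

```python
def rank_k_update(acc, x, y):
    """Model A <- A + X * Y^T for a 4x4 accumulator.

    x and y are 4 x k lists of lists (so Y^T is k x 4).
    Returns the number of multiply-adds performed, which is 16 * k.
    """
    k = len(x[0])
    madds = 0
    for i in range(4):
        for j in range(4):
            for r in range(k):
                acc[i][j] += x[i][r] * y[j][r]
                madds += 1
    return madds


# Rank-1 update (xvf32ger-like shapes): X is 4 x 1, Y^T is 1 x 4.
acc1 = [[0.0] * 4 for _ in range(4)]
madds_rank1 = rank_k_update(acc1, [[1.0], [2.0], [3.0], [4.0]],
                            [[1.0], [10.0], [100.0], [1000.0]])

# Rank-2 update (xvf16ger2-like shapes): X is 4 x 2, Y^T is 2 x 4.
acc2 = [[0.0] * 4 for _ in range(4)]
madds_rank2 = rank_k_update(acc2, [[1.0, 1.0]] * 4, [[1.0, 2.0]] * 4)
```

Halving the element width doubles k for the same pair of 128-bit inputs, which is why the 16-bit instructions perform 32 multiply-adds instead of 16.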

Rank-𝑘 update
[Figure: rank-1, rank-2, rank-4 and rank-8 updates for 32-bit input elements. In each panel a 4 × 4 accumulator is updated (±) with the product of a 4 × k operand and a k × 4 operand, for k = 1, 2, 4 and 8.]

Extension to 64-bit
• Accumulators are 4 × 2 arrays of 64-bit floating-point elements
$$
A = \begin{bmatrix}
a_{00} & a_{01} \\
a_{10} & a_{11} \\
a_{20} & a_{21} \\
a_{30} & a_{31}
\end{bmatrix}
$$
• Accumulators are updated by outer-product instructions:
▪ Input: 1 accumulator (𝐴) + 3 VSRs (𝑋1, 𝑋2, 𝑌)
▪ Output: 1 accumulator (same as input to reduce instruction encoding space)
▪ Operation: $A \leftarrow A + \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^T$
▪ $X_1$, $X_2$ and $Y$ are 2 × 1 arrays of elements
▪ $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} Y^T$ always has a 4 × 2 shape, compatible with the accumulator

Instruction   A                X                Y^T              # of madds/instruction
xvf64ger      4 × 2 (fp64)     4 × 1 (fp64)     1 × 2 (fp64)     8
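
The 64-bit case can be modeled the same way. This is a sketch of the shapes only, assuming the accumulate form shown above; the function name `rank1_update_fp64` is hypothetical, and Python floats stand in for fp64 elements.

```python
def rank1_update_fp64(acc, x1, x2, y):
    """Model A <- A + [X1; X2] * Y^T for a 4x2 accumulator of 64-bit elements.

    x1, x2 and y are lists of 2 values each (one modeled VSR apiece).
    Returns the number of multiply-adds performed.
    """
    x = x1 + x2                 # stack X1 on top of X2 to form a 4 x 1 column
    madds = 0
    for i in range(4):
        for j in range(2):
            acc[i][j] += x[i] * y[j]
            madds += 1
    return madds


acc64 = [[0.0, 0.0] for _ in range(4)]
madds_fp64 = rank1_update_fp64(acc64, [1.0, 2.0], [3.0, 4.0], [10.0, 100.0])
```

Two VSRs are needed on the X side because a 128-bit register holds only two fp64 values, while the accumulator keeps four rows.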

Load pressure and unit latency (32-bit results)
• 8 accumulators (8 × 16 result): ACC[0:7] arranged 2 × 4, operands X[0:1] and Y[0:3]
▪ 6 VSR loads/8 xv_ger instructions
▪ 0.75 VSR load/xv_ger
▪ Tolerates the most latency
• 4 accumulators (8 × 8 result): ACC[0:3] arranged 2 × 2, operands X[0:1] and Y[0:1]
▪ 1 VSR load/xv_ger
▪ Could work well in SMT modes
• 2 accumulators (8 × 4 result): ACC[0:1] arranged 2 × 1, operands X[0:1] and Y[0]
▪ 1.5 VSR load/xv_ger
• 1 accumulator (4 × 4 result)
▪ 2 VSR loads/1 xv_ger instruction
▪ 2 VSR load/xv_ger
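
The load-per-instruction ratios on this slide follow from a simple count. The sketch below assumes 32-bit results, where each xv_ger consumes one X VSR and one Y VSR, and assumes the accumulators are tiled in a grid so that every X register is reused across a row of accumulators and every Y register across a column. The function name is hypothetical.

```python
def vsr_loads_per_ger(acc_rows, acc_cols):
    """VSR loads per xv_ger for accumulators tiled acc_rows x acc_cols.

    One X VSR is loaded per accumulator row and one Y VSR per accumulator
    column; one xv_ger is issued per accumulator in the grid.
    """
    loads = acc_rows + acc_cols
    gers = acc_rows * acc_cols
    return loads / gers


ratios = {
    "2x4 (8 accumulators)": vsr_loads_per_ger(2, 4),
    "2x2 (4 accumulators)": vsr_loads_per_ger(2, 2),
    "2x1 (2 accumulators)": vsr_loads_per_ger(2, 1),
    "1x1 (1 accumulator)": vsr_loads_per_ger(1, 1),
}
```

Larger grids amortize each load over more outer products, which is why the 8-accumulator layout places the least pressure on the load units.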

The micro-kernel of GEMM: $C_{m \times n}$ += $A_{m \times k} \times B_{k \times n}$

SGEMM micro-kernel using 8 × 8 virtual accumulator
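
To make the idea of a virtual accumulator concrete, here is an illustrative plain-Python model. It is not the kernel shown in the talk. It assumes the 8 × 8 virtual accumulator is formed from four 4 × 4 accumulators in a 2 × 2 grid, that A is supplied as k columns of 8 values and B as k rows of 8 values, and the function name is hypothetical.

```python
def sgemm_microkernel_8x8(a_panel, b_panel, c):
    """Accumulate C (8x8) += A (8xk) * B (kx8) using four modeled 4x4 accumulators.

    a_panel[p] is column p of A (8 values); b_panel[p] is row p of B (8 values).
    """
    # Four accumulators laid out 2 x 2 form the 8 x 8 virtual accumulator.
    acc = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
    for a_col, b_row in zip(a_panel, b_panel):
        x = (a_col[0:4], a_col[4:8])   # 2 modeled VSR loads from the A panel
        y = (b_row[0:4], b_row[4:8])   # 2 modeled VSR loads from the B panel
        for bi in range(2):
            for bj in range(2):
                tile = acc[2 * bi + bj]
                # One modeled rank-1 update: tile += x[bi] * y[bj]^T
                for i in range(4):
                    for j in range(4):
                        tile[i][j] += x[bi][i] * y[bj][j]
    # Add the accumulated tiles into C once, after the loop over k.
    for bi in range(2):
        for bj in range(2):
            tile = acc[2 * bi + bj]
            for i in range(4):
                for j in range(4):
                    c[4 * bi + i][4 * bj + j] += tile[i][j]
    return c


k = 3
a_panel = [[float(i + p) for i in range(8)] for p in range(k)]
b_panel = [[float(j - p) for j in range(8)] for p in range(k)]
c = sgemm_microkernel_8x8(a_panel, b_panel, [[0.0] * 8 for _ in range(8)])
```

Each iteration of the outer loop performs 4 modeled loads and 4 rank-1 updates, that is, 1 VSR load per update.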

POWER family memory architecture
• Scale Out: Direct Attach Memory
▪ Low latency access
▪ Commodity packaging form factor
• Scale Up: Buffered Memory
▪ Superior RAS, high bandwidth, high capacity
▪ Agnostic interface for alternate memory innovations
• OpenCAPI Agnostic Buffered Memory (OMI): same Open Memory Interface used for all systems and memory technologies
▪ Near Tier: extreme bandwidth, lower capacity
▪ Commodity: low latency, low cost
▪ Enterprise: RAS, capacity, bandwidth
▪ Storage Class: extreme capacity, persistence

Primary tier memory options
Memory option      Interface (from figure)                Capacity          Bandwidth
DDR4 RDIMM         72b ~2666 MHz bidi                     ~256 GB           ~150 GB/sec
DDR4 LRDIMM        72b ~2666 MHz bidi                     ~2 TB             ~150 GB/sec
DDR4 OMI DIMM      8b ~25G diff, buffered (BUF)           ~256 GB → 4 TB    ~320 GB/sec
BW Opt OMI DIMM    8b ~25G diff, buffered (BUF)           ~128 → 512 GB     ~650 GB/sec
On Module HBM      1024b, on-module Si interposer         ~16 → 32 GB       ~1 TB/sec

[Figure labels: the options are grouped under "Same System" (twice) and "Unique System".]

OMI strategy: only 5-10 ns higher load-to-use latency than RDIMM (< 5 ns for LRDIMM)

POWER connectivity variants
Direct Attach Memory
• Max capacity: 2 TB
• Max bandwidth: 150 GB/s
• Limited system interconnect
[Figure labels: x24 system attach (twice); 4 x DDR4 memory (twice); 2 x local SMP; Interconnect]

OMI Buffered Memory
• Max capacity: 4 TB
• Max bandwidth: 650 GB/s
• Enhanced system interconnect
• 6x the bandwidth per mm² of DDR signaling
[Figure labels: x48 system attach (twice); 8 x OMI buffered memory (twice); 3 x local SMP; Advanced interconnect]

DRAM DIMM comparison
• Technology agnostic
• Low cost
• Ultra-scale system density
• Enterprise reliability
• Low-latency
• High bandwidth
[Figure: approximate-scale size comparison of a JEDEC DDR DIMM, an IBM Centaur DIMM and an OMI DDIMM.]

OpenCAPI key attributes
[Figure: an OpenCAPI-enabled processor (TL/DL) connected over 25 Gb I/O to an accelerated OpenCAPI device (TLx/DLx) hosting the accelerated function.]
1. Architecture agnostic bus – Applicable with any system/microprocessor architecture
2. Optimized for high bandwidth and low latency
3. High performance 25Gbps PHY design
4. Coherency - Attached devices operate natively within application’s user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no kernel, hypervisor or firmware involvement; security benefit
6. Wide range of use cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both classic memory and emerging advanced storage class memory
9. Minimal OpenCAPI design overhead (Less than 5% of a Xilinx VU3P FPGA)
[Figure labels: Caches; Application; accelerated device types (Storage/Compute/Network etc.; ASIC/FPGA/FFSA; FPGA, SOC, GPU accelerator; load/store or block access); advanced SCM solutions; buffered system memory; OpenCAPI memory buffers; device memory.]

OpenCAPI adapters

Mellanox Innova2: Network + FPGA
• Xilinx US+ KU15P FPGA
• Mellanox CX5 NIC
• 16 GB DDR4
• 2 25Gb SFP Cages
• X8 25Gb/s OpenCAPI Support
Use cases: Network Acceleration (NFV, Packet Classification), Security Acceleration

Nallatech 250-SoC: Multipurpose Converged Network / Storage
• Xilinx Zynq US+ ZU19EG FPGA
• 8/16 GB DDR4, 4/8 GB DDR4 ARM
• PCIe Gen3 x16, CAPI2
• 4 x8 Oculink Ports support NVMe, Network, or OpenCAPI
• 2 100Gb QSFP28 Cages
Use cases: NVMEoF Target, High BW Storage Server

AlphaData ADM-9H7: Large FPGA with 8GB HBM
• Xilinx US+ VU37P FPGA + HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 2 x8 25 Gb/s OpenCAPI Ports (support up to 50 GB/s)
• 4 100Gb QSFP28 Cages
Use cases: ML/DL, Inference, System Modeling, HPC

AlphaData ADM-9H3: Medium FPGA with 8GB HBM
• Xilinx Virtex US+ VU33P-3 FPGA + HBM
• 8GB High Bandwidth Memory
• PCIe Gen4 x8 or Gen3 x16, CAPI2
• 1 x8 25 Gb/s OpenCAPI Ports
• 1 2x100Gb QSFP28-DD Cage
Use cases: ML/DL, Inference, System Modeling, HPC
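
The "up to 50 GB/s" figure quoted for two x8 ports at 25 Gb/s per lane can be checked with simple arithmetic. The sketch below computes a raw signaling rate only; it assumes one direction of traffic and ignores encoding and protocol overhead, and the function name is hypothetical.

```python
def raw_link_bandwidth_gbytes(ports, lanes_per_port, gbits_per_lane):
    """Raw one-direction link bandwidth in GB/s: total Gb/s divided by 8 bits per byte."""
    return ports * lanes_per_port * gbits_per_lane / 8


two_port_bw = raw_link_bandwidth_gbytes(ports=2, lanes_per_port=8, gbits_per_lane=25)
one_port_bw = raw_link_bandwidth_gbytes(ports=1, lanes_per_port=8, gbits_per_lane=25)
```

Under these assumptions a single x8 port carries 25 GB/s and two ports carry 50 GB/s, matching the adapter listing.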

Conclusions
• OpenPOWER delivers the three essential technologies of supercomputing:
▪ Number crunching: Through its SIMD and MMA instructions
▪ Memory bandwidth and capacity: Through OMI
▪ System interconnect: Through OpenCAPI
• The MMA instructions provide a new level of performance for dense linear
algebra and related computations
• OMI provides a new level of memory bandwidth for computing systems while
delivering low cost, versatility and capacity
• OpenCAPI provides an enhanced system interconnect for acceleration and
additional functionality
• All three are scalable, offering room to grow in all dimensions
• Together with Open Source software, we see a clear road ahead for high-
performance computing systems that combine the best innovation from the
community!
