POWER9 AC922 Newell System - HPC & AI
1. POWER Systems
AC922 Newell System:
The AI & HPC Platform
Anand Haridass
IBM Cognitive Systems
anharida@in.ibm.com
Client Briefing – Q1 2018
Charts from Chris Mann, Michael Fisher,
Dylan Boday & Performance teams
2. IBM Systems
IBM POWER HPC & ML/DL Platform Strategy
High-performance computing and high-performance analytics drive a common platform design
Servers will be predominantly 2-socket designs
Developing deeper relationships with technology partners (see OpenPOWER)
Majority of floating-point performance will come from GPUs
OpenCAPI / Accelerators
Utilize industry-standard 19” racks and electronics enclosures
Air and water cooling options
Platforms will be based on a common enclosure form factor
Enclosure provides a working envelope that we will continue to enhance with the latest technology from IBM, NVIDIA, Mellanox and other OpenPOWER partners
Enclosure provides sufficient power and cooling capability to support these enhancements
3. An Acceleration Superhighway:
POWER9 is IBM’s Latest Processor
POWER7 (45 nm, 1H10) – Enterprise
- 8 cores, SMT4
- eDRAM L3 cache
POWER7+ (32 nm, 2H12) – Enterprise
- 2.5x larger L3 cache
- On-die acceleration
- Zero-power core idle state
POWER8 Family (22 nm, 1H14 – 2H16) – Enterprise & Big Data Optimized
- Up to 12 cores, SMT8
- CAPI acceleration
- High-bandwidth GPU attach
POWER9 Family (14 nm, 2H17 – 2H18+) – Built for the Cognitive Era
- Only processor with NVLink, PCIe Gen 4 advanced I/O interfaces, and coherence
- Premier platform for accelerated computing
- Processor family with scale-up and scale-out optimized silicon
4. IBM Systems
POWER9 Processor Family
Scale-Out – 2-Socket Optimized
• Robust 2-socket SMP system
• Direct memory attach
  - Up to 8 DDR4 ports
  - Up to 170 GB/s memory BW
  - Commodity packaging form factor
• SMT4 core; 24 SMT4 cores / chip
• Linux ecosystem optimized
Scale-Up – 4+-Socket Optimized
• Scalable system topology / capacity
  - Large multi-socket
• Buffered memory attach
  - 8 buffered channels
  - Up to 230 GB/s memory BW
• SMT8 core; 12 SMT8 cores / chip
• PowerVM ecosystem continuity
The two variants trade off core count / size, SMP scalability, and memory subsystem design.
5. An Acceleration Superhighway:
POWER9 offers a variety of Acceleration Options
State of the Art I/O and Acceleration Attachment Signaling
– PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth
– 25G Link x 48 lanes – 300 GB/s duplex bandwidth
Robust Accelerated Compute Options with OPEN standards
– On-Chip Acceleration – Gzip x1, 842 Compression x2, AES/SHA x2
– CAPI 2.0 – 4x bandwidth of POWER8 using PCIe Gen 4
– OpenCAPI 3.0 – High bandwidth, low latency and open interface using 25G Link
– NVLink 2.0 – Next generation of GPU/CPU bandwidth and integration
POWER9
PowerAccel
• Extreme Processor / Accelerator Bandwidth and Reduced Latency
• Coherent Memory and Virtual Addressing Capability for all Accelerators
• OpenPOWER Community Enablement – Robust Accelerated Compute Options
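The duplex bandwidth figures above follow directly from the per-lane signaling rates; a minimal sanity-check sketch (nominal rates, before encoding and protocol overhead):

```python
# Nominal duplex bandwidth from per-lane signaling rate and lane count.
def duplex_gb_s(gbits_per_lane: float, lanes: int) -> float:
    per_direction = gbits_per_lane * lanes / 8  # Gbit/s -> GB/s
    return 2 * per_direction                    # count both directions

print(duplex_gb_s(16, 48))  # PCIe Gen 4 @ 16 GT/s: 192.0 GB/s duplex
print(duplex_gb_s(25, 48))  # 25G link:             300.0 GB/s duplex
```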
6. An Acceleration Superhighway:
POWER9 Introduces Acceleration Innovations
Extreme CPU/Accelerator Bandwidth: removes the system bottleneck (only available with POWER)
Seamless CPU/Accelerator Interaction
• Coherent memory sharing
• Enhanced virtual address translation
Broader Use of Heterogeneous Compute
• Designed for efficient programming models
• Accelerate complex analytic / cognitive applications
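As an illustration of what coherent memory sharing buys the programmer: on CUDA platforms this is commonly exercised through managed (unified) memory. A minimal CuPy sketch of the idea follows; this is a generic CUDA-level illustration, not AC922-specific code:

```python
import cupy as cp

# Route CuPy allocations through CUDA managed memory so a GPU array can
# exceed device HBM and page between system RAM and the GPU on demand
# (over NVLink 2.0 on POWER9, rather than PCIe).
cp.cuda.set_allocator(cp.cuda.malloc_managed)

x = cp.zeros((4096, 4096), dtype=cp.float32)  # could be far larger than HBM
y = (x + 1.0).sum()                           # kernel runs on the GPU
print(float(y))                               # 16777216.0
```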
7. IBM Systems
IBM POWER GPU Systems Roadmap
(Timeline: 2015 · 2016 · 2017 - 2018 · 2019 - 2020 · Future)
POWER S822LC
• 2 POWER8 processors
  - 190W Turismo module
• 2 x16 Gen3 FHFL PCIe slots
  - Supports 2 NVIDIA K80 GPUs
  - Supports 2 PCIe adapters
• 1 x8 Gen3 HHHL PCIe, CAPI
• 1 x16 Gen3 HHHL PCIe, CAPI
• 1 x8 Gen3 PCIe
• 32 DDR3 IS DIMMs
  - 4, 8, 16, 32GB DIMMs
  - 32 - 1024GB memory capacity
• 2 SATA SFF HDD / SSD
• 2 1300W power supplies
  - 200VAC input
• BMC support structure
  - IPMI, USB, EN, VGA
• Air cooled
POWER S822LC for HPC
• 2 POWER8 with NVLink processors
  - 190W module
• 1, 2, or 4 NVIDIA “Pascal” GPUs
  - 300W, SXM2 form factor, NVLink 1.0
• 2 x16 Gen3 HHHL PCIe, CAPI enabled
• 1 x8 Gen3 HHHL PCIe, CAPI enabled
• 32 DDR4 IS DIMMs
  - 4, 8, 16, 32GB DIMMs
• 2 SATA SFF HDD / SSD
• Pluggable NVMe storage adapter
  - 1.6, 3.2TB capacity
• 2 1300W power supplies
  - 200VAC input
• BMC support structure
  - IPMI, USB, EN, VGA
• Air and water cooled options
POWER AC922
• 2 POWER9 processors
  - 190W, 250W modules
• 4-6 NVIDIA “Volta” GPUs
  - 300W, SXM2 form factor, NVLink 2.0
  - 6-GPU configuration: water cooled
  - 4-GPU configuration: air or water cooled
• 2 Gen4 x16 HHHL PCIe, CAPI enabled
• 1 Gen4 x4 HHHL PCIe
• 1 Gen4 shared x8 PCIe adapter
• 16 IS DIMMs
  - 8, 16, 32, 64, 128GB DIMMs
• 2 SATA SFF HDD / SSD
• 2 2200W power supplies
  - 200VAC, 277VAC, 400VDC input
  - N+1 redundant
• Second-generation BMC support structure
• Pluggable NVMe storage adapter option
SWIFT (Preliminary)
• 2 Axone processors
  - 190W, 250W modules
  - OpenCAPI 3.0
• 4 NVIDIA “Volta F.O.” GPUs
  - 300W, SXM3 form factor, NVLink 2.0
• 2 Gen4 x16 HHHL PCIe, CAPI enabled
• 2 Gen4 x8 HHHL PCIe
• 1 Gen4 x8 FHHL PCIe adapter
• 16 buffered DIMMs
  - x16 OMI interface
  - 8, 16, 32, 64, 128GB DIMMs
• 4 SATA SFF HDD / SSD
• 2 NVMe SSD
• 2 2200W power supplies
  - N+1 redundant
• Second-generation BMC support structure
• Next-generation HPC platform
• Air and water cooled
DEEP EDDY (Preliminary)
• 2 P10 processors
  - 190W, 250W modules
  - OpenCAPI 4.0
• 4 NVIDIA “Future” GPUs
  - 300W, SXM3 form factor, NVLink 3.0
• 2 Gen4 x16 HHHL PCIe, CAPI enabled
• 2 Gen4 x8 HHHL PCIe
• 1 Gen4 x8 FHHL PCIe adapter
• 16 buffered DIMMs
  - x16 OMI interface
  - 8, 16, 32, 64, 128GB, 256GB DIMMs
• 4 SATA SFF HDD / SSD
• 2 NVMe SSD
• 2 2200W power supplies
  - N+1 redundant
• Third-generation BMC support structure
• Next-generation HPC platform
• Air and water cooled
8. IBM Systems
High-Level System Overview
2-socket, 2U packaging
40 POWER9 processor cores
4 NVIDIA “Volta” GPUs with NVLink 2.0
1 TB memory (16x 64GB DIMMs)
4 PCIe Gen4 Slots
2x SFF (HDD/SSD), SATA, Up to 7.7 TB storage
Supports 1.6TB and 3.2TB NVMe Adapters
Redundant Hot Swap Power Supplies and Fans
Default 3 year 9x5 warranty, 100% CRU
AC922 Newell - POWER9 with increased GPU and IO bandwidth for differentiation
Realize unprecedented performance and application gains with POWER9 and NVLink 2.0
• 2 POWER9 CPUs and up to 4 “Volta” NVLink 2.0 GPUs in a versatile 2U Linux server
• PCIe Gen4 bus doubles I/O bandwidth vs. PCIe Gen3
• CPU Turbo and GPU Boost enabled, for improved data center efficiency and sustained high performance
9. AC922 NVLink Configurations
4-GPU configuration: 150 GB/s CPU-GPU bandwidth
• NVLink 2.0: 3 bricks per GPU, 150 GB/s per CPU-GPU link
• Coherent access to system memory (2TB); 170 GB/s DDR4 per CPU
• PCIe Gen 4 and CAPI 2.0 to InfiniBand
• Air and water cooled options
6-GPU configuration: 100 GB/s CPU-GPU bandwidth
• NVLink 2.0: 2 bricks per GPU, 100 GB/s per CPU-GPU link
• Coherent access to system memory (2TB); 170 GB/s DDR4 per CPU
• PCIe Gen 4 and CAPI 2.0 to InfiniBand
• Water cooled only
(NVIDIA V100 GPUs in both configurations; per-link bandwidth arithmetic sketched below)
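The per-link numbers fall out of how the bricks are ganged; a small sketch of the arithmetic, assuming the commonly cited ~50 GB/s bidirectional per NVLink 2.0 brick:

```python
# Each POWER9 socket exposes 6 NVLink 2.0 bricks (~50 GB/s bidirectional
# each); they are ganged evenly across the GPUs attached to that socket.
BRICKS_PER_SOCKET = 6
GB_S_PER_BRICK = 50  # 25 GB/s per direction

for gpus_per_socket in (2, 3):  # 4-GPU and 6-GPU AC922 configurations
    bricks = BRICKS_PER_SOCKET // gpus_per_socket
    print(f"{2 * gpus_per_socket} GPUs: {bricks * GB_S_PER_BRICK} GB/s per CPU-GPU link")
# 4 GPUs: 150 GB/s per CPU-GPU link
# 6 GPUs: 100 GB/s per CPU-GPU link
```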
10. IBM Systems
POWER AC922 Design – 4 GPU
POWER9 Processor (2x)
• 18C or 22C water cooled
• 16C or 20C air cooled
PCIe slots (4x)
• Gen4 PCIe
• 2x x16 HHHL adapter
• 1x shared slot
• 1x x8 HHHL adapter
Memory DIMMs (16x)
• 8 DDR4 IS DIMMs per socket
• 8, 16, 32, 64, 128GB DIMMs
NVIDIA Volta GPU
• 2 per socket
• SXM2 form factor
• 300W
• NVLink 2.0
• Air/water cooled
Power Supplies (2x)
• 2200W
• 200VAC, 277VAC, 400VDC input
BMC Card
• IPMI
• 1 Gb Ethernet
• VGA
• 1 USB 3.0
11. IBM Systems
Mechanical Overview
Operator Interface
• 1 USB 3.0
• Power Button
• Service LEDs
4X - Cooling Fans
• Counter-rotating
• Hot swap
• 80mm
Memory DIMMs (16x)
• 8 DDR4 IS DIMMs per socket
POWER9 Processor (2x)
• 190W & 250W
BMC (Service Processor Card)
• IPMI
• 2x 1 Gb Ethernet
• 1 VGA
• 1 USB 3.0
PCIe slots (4x)
• Gen4 PCIe
• 2x x16 HHHL adapter
• 1x x8/x8 shared HHHL adapter
• 1x x4 HHHL adapter
NVIDIA Volta GPU
• 2 per socket
• SXM2 form factor
• 300W
• NVLink 2.0
• Air Cooled
Power Supplies (2x)
• 2200W
• Configuration limits for redundancy
• Hot Swap
• 200VAC, 277VAC, 400VDC input
Storage
• Optional 2x SFF SATA Disk
• Optional 2x SFF SATA SSD
• Disks are tray-based for hot swap
Note: Front Bezel removed
12. IBM Systems
Front & Rear Details
Front
Rear
80mm CR Cooling Fans (4x)
Note: Front bezel is removed in this illustration
USB 3.0
SFF-4 Carrier (2X)
• SFF SATA HDD or SSD
Service Indicators
USB 3.0
1Gb Eth (2x)
IPMI
VGA
PCIe Slot 2
• Gen4 Shared x8,x8
• HHHL Slot
• CAPI Enabled
PCIe Slot 1
• Gen4 x4 (x8 Connector)
• HHHL Slot
Power Supplies (2X)
Water lines (optional)
Service Indicators
Power Button
PCIe Slot 3 & 4
• Gen4 x16
• HHHL Slot
• CAPI Enabled
13. IBM Systems
Witherspoon (2 GPUs / socket) – logical topology
• 2x P9 CPUs linked by a 4B X-bus; 8 DIMMs per socket
• NV Links: 3 bricks per GPU, 2 GPUs per socket
• Mellanox IB EDR NIC in the shared slot (x8 + x8 from the two CPUs)
• 1 PCIe Gen4 x16 slot per CPU, CAPI enabled
• PCIe switch fans out to USB, storage controller, and 4 x2 PCIe buses (one per GPU)
• BMC on PCIe Gen4 x4
14. IBM Systems
NVIDIA Volta GPU Features
NVIDIA Volta specifications:
• Peak double-precision floating-point performance: 7.8 TFLOPS
• Memory bandwidth: 900 GB/s
• GPU memory size: 16 GB
• NVLink “bricks” (8-lane interfaces): 6
• NVLink interconnect, bidirectional: 300 GB/s
• Maximum power: 300W
(the sketch below reproduces these figures from the chip parameters)
https://www.nvidia.com/en-us/data-center/tesla-v100/
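A quick sketch of where the headline numbers come from; the SM count, unit counts, and clocks used here are NVIDIA's published V100 SXM2 parameters, not figures from this deck:

```python
# Peak FP64: SMs x FP64 units/SM x 2 FLOPs per FMA x boost clock
print(80 * 32 * 2 * 1.53 / 1e3, "TFLOPS FP64")  # ~7.8

# HBM2 bandwidth: bus width (bits) x data rate (Gbps) / 8
print(4096 * 1.75 / 8, "GB/s")                  # ~896, quoted as 900

# NVLink: 6 bricks x 25 GB/s per direction x 2 directions
print(6 * 25 * 2, "GB/s bidirectional")         # 300
```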
15. IBM Systems
NVIDIA® Volta GPU Accelerator – module details (top and bottom views)
• Power regulation
• 2x 400-pin connectors
• 2x grounding pads
• Steel stiffener
• Multi-chip module
• 4x extraction springs
16. IBM Systems
POWER AC922 Design – 6 GPU
POWER9 Processor (2x)
• 18C or 22C water cooled
• 16C or 20C air cooled
PCIe slots (4x)
• Gen4 PCIe
• 2x x16 HHHL adapter
• 1x shared slot
• 1x x8 HHHL adapter
Memory DIMMs (16x)
• 8 DDR4 IS DIMMs per socket
• 8, 16, 32, 64, 128GB DIMMs
NVIDIA Volta GPU
• 3 per socket
• SXM2 form factor
• 300W
• NVLink 2.0
• Water cooled
Power Supplies (2x)
• 2200W
• 200VAC, 277VAC, 400VDC input
BMC Card
• IPMI
• 1 Gb Ethernet
• VGA
• 1 USB 3.0
17. IBM Systems
Witherspoon (3 GPUs / socket) – logical topology
• 2x P9 CPUs linked by a 4B X-bus; 8 DIMMs per socket
• NV Links: 2 bricks per GPU, 3 GPUs per socket
• Mellanox IB EDR NIC in the shared slot (x8 + x8 from the two CPUs)
• 1 PCIe Gen4 x16 slot per CPU, CAPI enabled
• PCIe switch fans out to USB, storage controller, and 6 x2 PCIe buses (one per GPU)
• BMC on PCIe Gen4 x4
18. IBM Systems
I/O Attachment Evolution in POWER HPC
2016: IB-EDR NIC in a CAPI-enabled shared slot
2017 - 2018: Mellanox “Multi-Host Socket Direct” – one EDR-IB adapter attached to both POWER9 sockets (PCIe Gen4 x8 + x8), with the X-bus (4B @ 16Gbps) linking the CPUs
• First industry implementation of Gen4 PCIe
• Multi-host attachment of POWER9 and the Mellanox EDR-IB adapter
19. The IO Difference – Faster Data Movement
• P9 with 2nd-generation NVLink enables 5.6x faster CPU-GPU data movement in the 4-GPU system
• The 6-GPU configuration balances compute with data throughput
[Chart: average bandwidth (Gbit/s) vs message size (bytes) for dual-port IB, comparing PCIe 3.0 vs PCIe 4.0 attach]
• ~2x faster PCIe Gen 4 interconnect to IB network
cards
• Best server for clusters leveraging networking and
other devices as they become PCIe Gen4 ready
• Results are based on IBM Internal Measurements running the CUDA H2D Bandwidth Test
• Hardware: Power AC922; 32 cores (2 x 16c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU;
Ubuntu 16.04. S822LC for HPC; 20 cores (2 x 10c chips), POWER8 with NVLink; 2.86 GHz, 512 GB memory, Tesla P100 GPU
• Competitive HW: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 1024 GB memory, 4x Tesla V100 GPUs, Ubuntu 16.04
GPU Attach Bandwidth Comparison, PCIe Gen3 versus NVLink
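The footnoted CUDA H2D bandwidth test can be approximated in a few lines of CuPy; a rough sketch of such a microbenchmark (not the exact harness used for these results):

```python
import time
import numpy as np
import cupy as cp

def h2d_bandwidth_gb_s(nbytes: int, iters: int = 50) -> float:
    """Host-to-device copy bandwidth using pinned host memory."""
    pinned = cp.cuda.alloc_pinned_memory(nbytes)
    host = np.frombuffer(pinned, dtype=np.uint8, count=nbytes)
    dev = cp.empty(nbytes, dtype=cp.uint8)
    cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dev.set(host)                   # one H2D transfer
    cp.cuda.Stream.null.synchronize()
    elapsed = time.perf_counter() - start
    return nbytes * iters / elapsed / 1e9

# On a PCIe Gen3 x16 host this saturates near the bus limit; on the AC922
# the same copy rides NVLink 2.0, which is the gap the slide quantifies.
print(h2d_bandwidth_gb_s(256 << 20), "GB/s")
```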
20. Evolving from Compute Systems to Cognitive Systems
Across the P8, P9, and P10 generations:
• Open frameworks, partnerships, industry alignment, developer ecosystem
• Accelerator roadmaps, open accelerator interfaces
Not just about hardware design: it’s about hardware + software co-optimization (with IBM Software) which just works for ML, DL, and AI.
22. Designed for the AI era: Chainer provides a 3.7X reduction in AI model training time vs tested x86 systems
Maximize research productivity running training for
medical/satellite images with Chainer on the AC922
• 3.7X reduction vs tested x86 systems in the runtime of 1000 training iterations on medical/satellite images
• Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, clustering, etc. operate on more than just the GPU memory
• Large Model Support: use system memory and GPU memory to support more complex and higher-resolution data
• Results are based on IBM internal measurements running 1000 iterations of the Enlarged GoogLeNet model on the Enlarged ImageNet dataset (2560x2560).
• Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4x Tesla V100 GPUs; Pegas 1.0. Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 1024 GB memory, 4x Tesla V100 GPUs, Ubuntu 16.04.
• Software: Chainer v3 / LMS / Out-of-Core with CUDA 9 / cuDNN 7, with patches found at https://github.com/cupy/cupy/pull/694 and https://github.com/chainer/chainer/pull/3762
[Chart: Chainer training progress over equal wall-clock time – where the Xeon 4xV100 system completes one iteration, the AC922 4xV100 completes 3.7 (three full iterations plus ~70% of a fourth)]
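Large Model Support in this stack hinges on the managed-memory allocator from the patches cited above; a hypothetical minimal setup for a Chainer training script (MyEnlargedGoogLeNet is an illustrative stand-in name, not from the source):

```python
import cupy as cp
import chainer

# Allocate all GPU arrays as CUDA managed memory so oversized activations
# spill to system RAM over NVLink instead of raising out-of-memory.
cp.cuda.set_allocator(cp.cuda.malloc_managed)

model = MyEnlargedGoogLeNet()      # hypothetical model class (stand-in)
model.to_gpu(0)
optimizer = chainer.optimizers.MomentumSGD()
optimizer.setup(model)
# ...the usual training loop follows; inputs larger than GPU HBM now fit.
```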
23. Maximize research productivity running training for medical/satellite images with Caffe on the AC922
• 3.8X reduction vs tested x86 systems in the runtime of 1000 iterations training on 2k x 2k images
• Critical machine learning (ML) capabilities such as regression,
nearest neighbor, recommendation systems, clustering, etc.
operate on more than just the GPU memory
• NVLink 2.0 enables enhanced Host to GPU communication
• Large Model Support - use system memory and GPU memory to
support more complex and higher resolution data
Designed for the AI era: Caffe provides a 3.8X reduction in AI model training time vs tested x86 systems
Results are based on IBM internal measurements running 1000 iterations of the Enlarged GoogLeNet model (mini-batch size=5) on the Enlarged ImageNet dataset (2240x2240).
Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4x Tesla V100 GPUs; Pegas 1.0. Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 1024 GB memory, 4x Tesla V100 GPUs, Ubuntu 16.04.
Software: IBM Caffe with LMS. Source code: https://github.ibm.com/TUNG/trlcaffe/tree/1.0-ibm-blc-bm-fix-hang+-p9collateral based on the branch "1.0-ibm-blc-bm-fix-hang+" (base for PowerAI R4) and PR#5972 from BVLC/Caffe (for cuDNN 7 support).
[Chart: Caffe training progress over equal wall-clock time – where the Xeon 4xV100 system completes one iteration, the AC922 4xV100 completes 3.8 (three full iterations plus ~80% of a fourth)]
24. AC922 Exceptional Performance for accelerated workloads:
All results are based on running CPMD, a parallelized plane-wave / pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was used, with runs made on a 256-water box with RANDOM initialization. Results are reported in execution time (seconds). Effective measured data rate: 10 GB/s on the PCIe bus and 50 GB/s on NVLink 2.0.
IBM Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU; Pegas 1.0 with ESSL PRPQ; Spectrum MPI: PRPQ release, XLF: 15.16, CUDA 9.1
IBM Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla P100 GPU; RHEL 7.4.with ESSL 5.3.2.0; PE2.2; XLF: 15.1,
CUDA 8.0
2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, Ubuntu 16.04 with OPENBLAS 0.2.18, OpenMPI: 1.10.2, GNU-5.4.0, CUDA-8.0
Molecular Dynamics (CPMD) runtime in seconds (lower is better):
• Xeon x86 E5-2640 v4, 2x10c + 4x Tesla P100: 917
• Power S822LC, 2x10c POWER8 + 4x Tesla P100: 673
• Power AC922, 2x20c POWER9 + 4x Tesla V100: 351
– POWER9 with NVLink 2.0 unlocks the performance of the GPU-accelerated version of CPMD by enabling lightning-fast CPU-GPU data transfers
• IBM Power System AC922 delivers a 2.6X reduction in execution time vs tested x86 systems
2.6X faster running CPMD compared to tested x86 systems
Kinetica system throughput in queries/min (higher is better):
• Power S822LC w/ 4x Tesla P100: 3,093
• Power AC922 w/ 4x Tesla V100: 5,737
1.8X faster running accelerated databases
Improved application performance with Kinetica filtering Twitter tweets
– 80% more throughput on the Power AC922 than the Power System S822LC for HPC
• Throughput results are based on running Kinetica “Filter by geographic area” queries on data set of 280 million simulated Tweets with 80 to 600 concurrent clients each with 0 think time.
• Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU. For more information on
Power Systems performance on Kinetica and other workloads see https://developer.ibm.com/linuxonpower/perfcol/
• Power System AC922; 32 cores (2 x 16c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4x Tesla V100 GPUs.
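The throughput methodology above (80-600 concurrent clients, zero think time) can be reproduced with a generic harness; a sketch with a hypothetical run_query stand-in for the actual Kinetica geographic-filter call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(client_id: int) -> None:
    # Hypothetical stand-in: a real harness would issue one Kinetica
    # "filter by geographic area" query against the Tweet table here.
    pass

def throughput_qpm(clients: int, duration_s: float = 60.0) -> float:
    """Aggregate queries/min from N clients issuing queries back to back."""
    deadline = time.monotonic() + duration_s

    def worker(cid: int) -> int:
        count = 0
        while time.monotonic() < deadline:
            run_query(cid)   # zero think time between queries
            count += 1
        return count

    with ThreadPoolExecutor(max_workers=clients) as pool:
        totals = pool.map(worker, range(clients))
    return sum(totals) * 60.0 / duration_s

print(throughput_qpm(clients=80))
```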