POWER Systems
AC922 Newell System:
The AI & HPC Platform
Anand Haridass
IBM Cognitive Systems
anharida@in.ibm.com
Client Briefing – Q1 2018
Charts from Chris Mann, Michael Fisher,
Dylan Boday & Performance teams
IBM Systems
IBM POWER HPC & ML/DL Platform Strategy
• High-performance computing and high-performance analytics drive a common platform design
• Servers will be predominantly 2-socket designs
• Developing deeper relationships with technology partners (ref. OpenPOWER)
• Majority of floating-point performance will come from GPUs
• OpenCAPI / Accelerators
• Utilize industry-standard compliant 19” racks and electronics enclosures
• Air and water cooling options
• Platforms will be based on a common enclosure form factor
• Enclosure provides a working envelope that we will continue to enhance with the latest technology from IBM, NVIDIA, Mellanox and other OpenPOWER partners
• Enclosure provides a platform with sufficient power and cooling capability to support these enhancements
An Acceleration Superhighway:
POWER9 is IBM’s Latest Processor
POWER7 (1H10), 45 nm, Enterprise
- 8 Cores
- SMT4
- eDRAM L3 Cache
POWER7+ (2H12), 32 nm, Enterprise
- 2.5x Larger L3 cache
- On-die acceleration
- Zero-power core idle state
POWER8 Family (1H14 – 2H16), 22nm, Enterprise & Big Data Optimized
- Up to 12 Cores
- SMT8
- CAPI Acceleration
- High Bandwidth GPU Attach
POWER9 Family (2H17 – 2H18+), 14nm, Built for the Cognitive Era
- Only processor with NVLink, PCIe Gen 4 advanced IO interfaces and coherence
- Premier Platform for Accelerated Computing
- Processor Family with Scale-Up and Scale-Out Optimized Silicon
POWER9 Processor Family
Scale-Out – 2 Socket Optimized
Robust 2 socket SMP system
Direct Memory Attach
• Up to 8 DDR4 ports
• Up to 170 GB/s memory BW
• Commodity packaging form factor
Scale-Up – 4+-Socket Optimized
Scalable System Topology / Capacity
• Large multi-socket
Buffered Memory Attach
• 8 Buffered channels
• Up to 230 GB/s memory BW
SMT4 Core
24 SMT4 Cores / Chip
Linux Ecosystem Optimized
SMT8 Core
12 SMT8 Cores / Chip
PowerVM Ecosystem Continuity
Core Count / Size
SMP scalability / Memory subsystem
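As a quick check on the two core variants listed above, the hardware thread count per chip is the core count multiplied by the SMT ways per core. This is a minimal sketch; the function name `hardware_threads` is illustrative and does not come from the slides.

```python
def hardware_threads(cores: int, smt_ways: int) -> int:
    """Hardware threads per chip: cores multiplied by SMT ways per core."""
    return cores * smt_ways


# Core variants from the slide: 24 SMT4 cores or 12 SMT8 cores per chip.
smt4_chip = hardware_threads(cores=24, smt_ways=4)
smt8_chip = hardware_threads(cores=12, smt_ways=8)
print(smt4_chip, smt8_chip)  # prints: 96 96
```

Both variants expose the same number of hardware threads per chip; they differ in how those threads are grouped into cores.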
An Acceleration Superhighway:
POWER9 offers a variety of Acceleration Options
State of the Art I/O and Acceleration Attachment Signaling
– PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth
– 25G Link x 48 lanes – 300 GB/s duplex bandwidth
Robust Accelerated Compute Options with OPEN standards
– On-Chip Acceleration – Gzip x1, 842 Compression x2, AES/SHA x2
– CAPI 2.0 – 4x bandwidth of POWER8 using PCIe Gen 4
– OpenCAPI 3.0 – High bandwidth, low latency and open interface using 25G Link
– NVLink 2.0 – Next generation of GPU/CPU bandwidth and integration
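The duplex bandwidth figures quoted for the two signaling options can be reproduced with simple arithmetic. The sketch below assumes nominal per-lane rates (16 Gbit/s for PCIe Gen 4, 25 Gbit/s for the 25G Link) and ignores encoding and protocol overhead; the helper name `duplex_gb_per_s` is hypothetical.

```python
def duplex_gb_per_s(lanes: int, gbit_per_lane: float) -> float:
    """Nominal aggregate bandwidth in GB/s, summing both directions.

    Assumption: raw signaling rate only, no encoding or protocol overhead.
    """
    one_direction = lanes * gbit_per_lane / 8.0  # bits to bytes
    return 2 * one_direction


pcie_gen4 = duplex_gb_per_s(lanes=48, gbit_per_lane=16.0)  # PCIe Gen 4: 16 GT/s per lane
link_25g = duplex_gb_per_s(lanes=48, gbit_per_lane=25.0)   # 25G Link: 25 Gbit/s per lane
print(pcie_gen4, link_25g)  # prints: 192.0 300.0
```

Delivered bandwidth will be lower than these nominal values once overheads are included.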
POWER9
PowerAccel
• Extreme Processor / Accelerator Bandwidth and Reduced Latency
• Coherent Memory and Virtual Addressing Capability for all Accelerators
• OpenPOWER Community Enablement – Robust Accelerated Compute Options
Extreme CPU/Accelerator Bandwidth
System bottleneck
Only Available with POWER
An Acceleration Superhighway:
POWER9 Introduces Acceleration Innovations
Seamless CPU/Accelerator Interaction
• Coherent memory sharing
• Enhanced virtual address translation
Broader Use of Heterogeneous Compute
• Designed for efficient programming models
• Accelerate complex analytic / cognitive applications
IBM POWER GPU Systems Roadmap
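Timeline: POWER S822LC (2015), POWER S822LC for HPC (2016), POWER AC922 (2017 - 2018), SWIFT (2019 - 2020), DEEP EDDY (Future)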
POWER S822LC
• 2 POWER8 Processors
- 190W Turismo module
• 2 x16 Gen 3 FHFL PCIe slots
- Supports 2 NVIDIA K80 GPUs
- Supports 2 PCIe adapters
• 1 x8 Gen 3 HHHL PCIe, CAPI
• 1 x16 Gen 3 HHHL PCIe, CAPI
• 1 x8 Gen 3 PCIe
• 32 DDR3 IS DIMMs
- 4, 8, 16, 32GB DIMMs
- 32 – 1024GB Memory Capacity
• 2 SATA SFF HDD / SSD
• 2 1300W Power Supplies
- 200VAC Input
• BMC support structure
- IPMI, USB, EN, VGA
• Air cooled
POWER S822LC for HPC
• 2 POWER8 w/ NVLink Processors
- 190W module
• 1, 2, 4 NVIDIA “Pascal” GPUs
- 300W, SXM2 Form Factor, NVLink 1.0
• 2 x16 Gen 3 HHHL PCIe, CAPI enabled
• 1 x8 Gen3 HHHL PCIe, CAPI enabled
• 32 DDR4 IS DIMMs
- 4, 8, 16, 32GB DIMMs
• 2 SATA SFF HDD / SSD
• Pluggable NVMe storage adapter
- 1.6, 3.2TB Capacity
• 2 1300W power supplies
- 200VAC Input
• BMC Support Structure
- IPMI, USB, EN, VGA
• Air and water cooled options
POWER AC922
• 2 POWER9 Processors
- 190, 250W modules
• 4-6 NVIDIA “Volta” GPUs
- 300W, SXM2 Form Factor, NVLink 2.0
• 6 GPU configuration, water cooled
• 4 GPU configuration, air or water cooled
• 2 Gen4 x16 HHHL PCIe, CAPI enabled
• 1 Gen4 x4 HHHL PCIe
• 1 Gen4 Shared x8 PCIe adapter
• 16 IS DIMMs
- 8, 16, 32, 64, 128GB DIMMs
• 2 SATA SFF HDD / SSD
• 2 2200W power supplies
- 200 VAC, 277VAC, 400VDC input
- N+1 Redundant
• Second generation BMC Support Structure
• Pluggable NVMe storage adapter option
2015 2016 Future
SWIFT (Preliminary)
• 2 Axone Processors
- 190, 250W modules
- OpenCAPI 3.0
• 4 NVIDIA “Volta F.O.” GPUs
- 300W, SXM3 Form Factor, NVLink 2.0
• 2 Gen4 x16 HHHL PCIe, CAPI enabled
• 2 Gen4 x8 HHHL PCIe
• 1 Gen4 x8 FHHL PCIe adapter
• 16 Buffered DIMMs
- x16 OMI interface
- 8, 16, 32, 64, 128GB DIMMs
• 4 SATA SFF HDD / SSD
• 2 NVME SSD
• 2 2200W power supplies
- N+1 Redundant
• Second generation BMC Support Structure
• Next Generation HPC platform
• Air and water cooled
DEEP EDDY (Preliminary)
• 2 P10 Processors
- 190, 250W modules
- OpenCAPI 4.0
• 4 NVIDIA “Future” GPUs
- 300W, SXM3 Form Factor, NVLink 3.0
• 2 Gen4 x16 HHHL PCIe, CAPI enabled
• 2 Gen4 x8 HHHL PCIe
• 1 Gen4 x8 FHHL PCIe adapter
• 16 Buffered DIMMs
- x16 OMI interface
- 8, 16, 32, 64, 128GB, 256GB DIMMs
• 4 SATA SFF HDD / SSD
• 2 NVME SSD
• 2 2200W power supplies
- N+1 Redundant
• Third generation BMC Support Structure
• Next Generation HPC platform
• Air and water cooled
2019 - 2020
High level System Overview
• 2-Socket, 2U Packaging
• 40 P9 Processor cores
• 4 NVIDIA Volta GPUs with NVLink 2.0
• 1 TB Memory (16x 64GB DIMMs)
• 4 PCIe Gen4 Slots
• 2x SFF (HDD/SSD), SATA, Up to 7.7 TB storage
• Supports 1.6TB and 3.2TB NVMe Adapters
• Redundant Hot Swap Power Supplies and Fans
• Default 3 year 9x5 warranty, 100% CRU
AC922 Newell - POWER9 with increased GPU and IO bandwidth for differentiation
Realize unprecedented performance and application gains with POWER9 and NVLink 2.0
• 2 POWER9 CPUs and up to 4 “Volta” NVLink 2.0 GPUs in a versatile 2U Linux server
• PCIe Gen4 bus has double the I/O bandwidth of PCIe Gen3
• CPU (Turbo) and GPU (Boost) modes enabled to improve data center efficiency and keep performance at high levels
4 GPU configuration: 150GB/s CPU to GPU bandwidth
• NVLink links at 150GB/s to each NVIDIA V100
• Coherent access to system memory (2TB), DDR4 at 170GB/s
• PCIe Gen 4 and CAPI 2.0 to InfiniBand (IB)
• Air and water cooled options
6 GPU configuration: 100GB/s CPU to GPU bandwidth
• NVLink links at 100GB/s to each NVIDIA V100
• Coherent access to system memory (2TB), DDR4 at 170GB/s
• PCIe Gen 4 and CAPI 2.0 to InfiniBand (IB)
• Water cooled only
POWER AC922 Design – 4 GPU
Power 9 Processor (2x)
• 18, 22C water cooled
• 16, 20C air cooled
PCIe slot (4x)
• Gen4 PCIe
• 2, x16 HHHL Adapter
• 1, Shared slot
• 1 x8 HHHL Adapter
Memory DIMMs (16x)
• 8 DDR4 IS DIMMs per socket
• 8, 16, 32, 64, 128GB DIMMs
NVIDIA Volta GPU
• 2 per socket
• SXM2 form factor
• 300W
• NVLink 2.0
• Air/Water Cooled
Power Supplies (2x)
• 2200W
• 200VAC, 277VAC, 400VDC input
BMC Card
• IPMI
• 1 Gb Ethernet
• VGA
• 1 USB 3.0
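The memory capacities quoted in this deck (1 TB with 64GB DIMMs, 2TB of coherent system memory) follow from the 16 DIMM slots and the listed DIMM sizes. A minimal sketch, assuming every slot is populated with the same DIMM size; the names below are illustrative only.

```python
DIMM_SLOTS = 16                       # 8 DDR4 IS DIMMs per socket, 2 sockets
DIMM_SIZES_GB = [8, 16, 32, 64, 128]  # DIMM sizes listed on the slide


def total_memory_gb(dimm_gb: int, slots: int = DIMM_SLOTS) -> int:
    """Total system memory in GB, assuming all slots hold identical DIMMs."""
    return dimm_gb * slots


for size in DIMM_SIZES_GB:
    print(f"{size:>3} GB DIMMs -> {total_memory_gb(size)} GB total")
```

With 64GB DIMMs this gives 1024 GB, and with 128GB DIMMs it gives 2048 GB.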
Mechanical Overview
Operator Interface
• 1 USB 3.0
• Power Button
• Service LEDs
4X - Cooling Fans
• Counter-rotating
• Hot swap
• 80mm
Memory DIMMs (16x)
• 8 DDR4 IS DIMMs per socket
Power 9 Processor (2x)
• 190W & 250W
BMC (Service Processor Card)
• IPMI
• 2x 1 Gb Ethernet
• 1 VGA
• 1 USB 3.0
PCIe slot (4x)
• Gen4 PCIe
• 2, x16 HHHL Adapter
• 1, x8,x8 Shared HHHL Adapter
• 1 x4 HHHL Adapter
NVIDIA Volta GPU
• 2 per socket
• SXM2 form factor
• 300W
• NVLink 2.0
• Air Cooled
Power Supplies (2x)
• 2200W
• Configuration limits for redundancy
• Hot Swap
• 200VAC, 277VAC, 400VDC input
Storage
• Optional 2x SFF SATA Disk
• Optional 2x SFF SATA SSD
• Disks are tray-based for hot swap
Note: Front Bezel removed
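The "configuration limits for redundancy" bullet can be illustrated with a rough nameplate sum of the CPU and GPU power figures given in this deck. This is only a sketch under stated assumptions: it ignores memory, fans, adapters and conversion losses, and the actual redundancy rules are not given in the source. The function names are hypothetical.

```python
GPU_WATTS = 300      # per Volta SXM2 GPU (from the slides)
SUPPLY_WATTS = 2200  # per power supply (from the slides)
CPU_SOCKETS = 2


def cpu_gpu_watts(gpus: int, cpu_module_watts: int) -> int:
    """Nameplate CPU + GPU load only; all other components are ignored."""
    return gpus * GPU_WATTS + CPU_SOCKETS * cpu_module_watts


def within_one_supply(load_watts: int) -> bool:
    """True if a single supply could, nominally, carry the load alone."""
    return load_watts <= SUPPLY_WATTS


for gpus in (4, 6):
    for cpu_w in (190, 250):
        load = cpu_gpu_watts(gpus, cpu_w)
        print(f"{gpus} GPUs, {cpu_w}W CPUs: {load} W, one supply enough: {within_one_supply(load)}")
```

Under these assumptions only the 6 GPU configuration with 250W modules exceeds the rating of a single supply, which is one plausible reason a redundancy limit would apply to the largest configurations.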
Front & Rear Details
Front
Rear
80mm CR Cooling Fans (4x)
Note: Front bezel is removed in this illustration
USB 3.0
SFF-4 Carrier (2X)
• SFF SATA HDD or SSD
Service Indicators
USB 3.0
1Gb Eth (2x)
IPMI
VGA
PCIe Slot 2
• Gen4 Shared x8,x8
• HHHL Slot
• CAPI Enabled
PCIe Slot 1
• Gen4 x4 (x8 Connector)
• HHHL Slot
Power Supplies (2X)
Water lines
(Option)
Service Indicators
Power Button
PCIe Slot 3 & 4
• Gen4 x16
• HHHL Slot
• CAPI Enabled
Witherspoon (2 GPUs / socket)
Block diagram labels:
• P9 (2x), joined by X Bus 4B; 8 DIMMs per P9
• GPU (4x); NV Links (3 Bricks ea)
• Mellanox IB EDR NIC in Shared Slot, x8 x8
• PCIe Gen4 x16 (2x); CAPI (2x)
• PCIe Gen4 x4
• PCIe Switch; 4 x2 PCIe Buses, one per GPU
• USB, Storage Ctlr, BMC
NVIDIA Volta GPU Features
• Peak double precision floating point performance: 7.8 TFLOPS
• Memory bandwidth: 900 GB/sec
• GPU Memory Size: 16 GB
• NVLink “Bricks” (8 lane interface): 6
• NVLink Interconnect Bi-Directional: 300GB/s
• Maximum Power: 300W
NVIDIA Volta Specifications
https://www.nvidia.com/en-us/data-center/tesla-v100/
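The 150GB/s and 100GB/s link figures quoted for the 4 GPU and 6 GPU configurations are consistent with the brick counts in this deck: 300GB/s spread over 6 bricks, with 3 or 2 bricks per link. A minimal sketch of that arithmetic; the function names are illustrative.

```python
BRICKS_PER_GPU = 6         # NVLink "Bricks" per Volta GPU
TOTAL_NVLINK_GB_S = 300.0  # bi-directional, all bricks combined


def per_brick_gb_s() -> float:
    """Bi-directional bandwidth of a single brick."""
    return TOTAL_NVLINK_GB_S / BRICKS_PER_GPU


def link_gb_s(bricks: int) -> float:
    """Bi-directional bandwidth of a link built from the given number of bricks."""
    return bricks * per_brick_gb_s()


print(per_brick_gb_s(), link_gb_s(3), link_gb_s(2))  # prints: 50.0 150.0 100.0
```

Three bricks per link corresponds to the 4 GPU configuration and two bricks per link to the 6 GPU configuration.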
NVIDIA® Volta GPU Accelerator
Power Regulation
2x 400 Pin
Connectors
2x Grounding Pads
Bottom Side
Steel Stiffener
Multi Chip Module
4x Extraction
Springs
GPU Details
Top Side
POWER AC922 Design – 6 GPU
Power 9 Processor (2x)
• 18, 22C water cooled
• 16, 20C air cooled
PCIe slot (4x)
• Gen4 PCIe
• 2, x16 HHHL Adapter
• 1, Shared slot
• 1 x8 HHHL Adapter
Memory DIMMs (16x)
• 8 DDR4 IS DIMMs per socket
• 8, 16, 32, 64, 128GB DIMMs
NVIDIA Volta GPU
• 3 per socket
• SXM2 form factor
• 300W
• NVLink 2.0
• Water Cooled
Power Supplies (2x)
• 2200W
• 200VAC, 277VAC, 400VDC input
BMC Card
• IPMI
• 1 Gb Ethernet
• VGA
• 1 USB 3.0
Witherspoon (3 GPUs / socket)
Block diagram labels:
• P9 (2x), joined by X Bus 4B; 8 DIMMs per P9
• GPU (6x); NV Links (2 Bricks ea)
• Mellanox IB EDR NIC in Shared Slot, x8 x8
• PCIe Gen4 x16 (2x); CAPI (2x)
• PCIe Gen4 x4
• PCIe Switch; 6 x2 PCIe Buses, one per GPU
• USB, Storage Ctlr, BMC
PCIe Gen4 x16 PCIe Gen4 x8
PCIe Gen4 x8
PCIe Gen4 x16 PCIe Gen4 x4
I/O Attachment Evolution in POWER HPC
IB-EDR NIC
Shared slot
CAPI
Mellanox “Multi-Host Socket Direct”.
X-Bus 4B @ 16Gbps
2016 2017-2018
First industry implementation of Gen4 PCIe
Multi-host attachment of POWER9 and the Mellanox EDR-IB adapter
The IO Difference – Faster Data Movement
• P9 with 2nd Gen NVLink enables 5.6x faster data
movement from CPU-GPU in 4 GPU system
• 6 GPU provide balance with compute and data
throughput
Chart: Comparing PCIe 3.0 vs 4.0 IB Dual Port Bidirectional Bandwidth (x-axis: size in bytes; y-axis: average Gbits/s, 0 to 450)
• ~2x faster PCIe Gen 4 interconnect to IB network
cards
• Best server for clusters leveraging networking and
other devices as they become PCIe Gen4 ready
• Results are based on IBM Internal Measurements running the CUDA H2D Bandwidth Test
• Hardware: Power AC922; 32 cores (2 x 16c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU;
Ubuntu 16.04. S822LC for HPC; 20 cores (2 x 10c chips), POWER8 with NVLink; 2.86 GHz, 512 GB memory, Tesla P100 GPU
• Competitive HW: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory,
4xTesla V100 GPU, Ubuntu 16.04
GPU Attach Bandwidth Comparison, PCIe Gen3 versus NVLink
Evolving from Compute Systems to Cognitive Systems
P8 P9 P10
Open Frameworks
Partnerships
Industry Alignment
Dev Ecosystem
Accelerator Roadmaps
Open Accelerator
Interfaces
Not Just About Hardware Design
It’s about co-optimization
which just works for ML, DL, and AI
IBM Software
• hardware + software
• enterprise-ready software distribution built on open source
• tools for ease of development
• performance: faster training times for data scientists
Designed for the AI era: Chainer provides a 3.7X reduction in AI model training time vs tested x86 systems
Maximize research productivity running training for medical/satellite images with Chainer on the AC922
• 3.7X reduction vs tested x86 systems in the runtime of 1000 iterations to train on medical/satellite images
• Critical machine learning (ML) capabilities such as regression,
nearest neighbor, recommendation systems, clustering, etc.
operate on more than just the GPU memory
• Large Model Support - use system memory and GPU memory to
support more complex and higher resolution data
• Results are based on IBM Internal Measurements running 1000 iterations of Enlarged GoogleNet model on Enlarged Imagenet Dataset (2560x2560).
• Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU Pegas 1.0. Competitive stack: 2x Xeon
E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04.
• Software: Chainer v3 /LMS/Out of Core with CUDA 9 / CuDNN7 with patches found at https://github.com/cupy/cupy/pull/694 and
https://github.com/chainer/chainer/pull/3762
Chainer: More Accuracy (3.7 iterations vs 1)
Chart: accuracy reached (1 run to 4 run) in the same elapsed time
• Xeon 4xV100: one iteration
• AC922 4xV100: three iterations + 70% of an iteration
Maximize research productivity running training for
medical/satellite images with Caffe with the AC922
• 3.8X reduction vs tested x86 systems in the runtime of 1000 iterations to train on 2k x 2k images
• Critical machine learning (ML) capabilities such as regression,
nearest neighbor, recommendation systems, clustering, etc.
operate on more than just the GPU memory
• NVLink 2.0 enables enhanced Host to GPU communication
• Large Model Support - use system memory and GPU memory to
support more complex and higher resolution data
Designed for the AI era: Caffe provides a 3.8X reduction in AI model training time vs tested x86 systems
Results are based on IBM Internal Measurements running 1000 iterations of Enlarged GoogleNet model (mini-batch size=5) on Enlarged Imagenet Dataset (2240x2240).
Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU Pegas 1.0. Competitive stack: 2x
Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04.
Software: IBM Caffe with LMS Source code: https://github.ibm.com/TUNG/trlcaffe/tree/1.0-ibm-blc-bm-fix-hang+-p9collateral based on the branch "1.0-ibm-
blc-bm-fix-hang+" (base for PowerAI R4) and a PR#5972 from BVLC/Caffe (for supporting cudnn7).
Caffe: More Accuracy (3.8 iterations vs 1)
Chart: accuracy reached (1 run to 4 run) in the same elapsed time
• Xeon 4xV100: one iteration
• AC922 4xV100: three iterations + 80% of an iteration
AC922 Exceptional Performance for accelerated workloads:
All results are based on running CPMD, a parallelized plane wave / pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (e.g. MPI + OPENMP + GPU + streams) was implemented, and runs were made for the 256-Water Box with RANDOM initialization. Results are reported in Execution Time (seconds). Effective measured data rate on PCIe bus of 10 GB/s and on NVLink 2.0 of 50GB/s.
IBM Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU; Pegas 1.0 with ESSL PRPQ; Spectrum MPI: PRPQ release, XLF: 15.16, CUDA 9.1
IBM Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla P100 GPU; RHEL 7.4 with ESSL 5.3.2.0; PE2.2; XLF: 15.1, CUDA 8.0
2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, Ubuntu 16.04 with OPENBLAS 0.2.18, OpenMPI: 1.10.2, GNU-5.4.0, CUDA-8.0
Molecular Dynamics (CPMD): Runtime (secs), lower is better
• Xeon x86 E5-2640 v4 2x10c + 4xTesla P100: 917
• Power S822LC 2x10c POWER8 + 4xTesla P100: 673
• Power AC922 2x 20c POWER9 + 4xTesla V100: 351
– POWER9 with NVLink 2.0 unlocks the performance of the GPU-accelerated version of CPMD by enabling lightning fast CPU-GPU data transfers
• IBM Power System AC922 delivers a 2.6X reduction in execution time vs tested x86 systems
2.6X faster running CPMD compared to tested x86 systems
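The effective data rates quoted in the CPMD notes (10 GB/s on the PCIe bus, 50GB/s on NVLink 2.0) translate directly into host to GPU transfer times. The sketch below uses a hypothetical 16 GB payload, chosen only because it matches the V100 memory size listed earlier; it is not a measurement from the source.

```python
PCIE_GB_S = 10.0     # effective measured rate on the PCIe bus (from the slide notes)
NVLINK2_GB_S = 50.0  # effective measured rate on NVLink 2.0 (from the slide notes)


def transfer_seconds(gigabytes: float, rate_gb_s: float) -> float:
    """Time to move a payload at a constant effective rate."""
    return gigabytes / rate_gb_s


payload_gb = 16.0  # hypothetical payload, equal to the 16 GB V100 memory size
pcie_s = transfer_seconds(payload_gb, PCIE_GB_S)
nvlink_s = transfer_seconds(payload_gb, NVLINK2_GB_S)
print(f"PCIe: {pcie_s:.2f} s, NVLink 2.0: {nvlink_s:.2f} s")
```

At these rates the same payload moves five times faster over NVLink 2.0, which matters most for workloads that stream data between host and GPU memory repeatedly.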
System Throughput (Queries/min)
• Power S822LC w/4xTesla P100: 3093
• Power AC922 w/4xTesla V100: 5737
1.8X faster running Accelerated Databases
Improved application performance with Kinetica filtering Twitter Tweets
– 80% more throughput on Power System AC922 than Power System S822LC for HPC
• Throughput results are based on running Kinetica “Filter by geographic area” queries on data set of 280 million simulated Tweets with 80 to 600 concurrent clients each with 0 think time.
• Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU. For more information on
Power Systems performance on Kinetica and other workloads see https://developer.ibm.com/linuxonpower/perfcol/
• Power System AC922; 32 cores (2 x 16c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU.
The POWER9 difference for Deep Learning
© 2017 IBM Corporation
3.7X reduction vs tested x86
systems in runtime of 1000
iterations running on
competing systems to train
on 2k x 2k images
POWER
CPU
DDR4
GPU
NVLink
Graphics
Memory
Differentiated productivity available with AC922
• Faster model training times
• Iterate models faster
• Train on larger / more complex datasets
• NVLink 2.0 enables enhanced Host to GPU
communication
• IBM’s LMS enables seamless use of Host +
GPU memory for improved performance
3.8X reduction vs tested x86
systems in runtime of 1000
iterations running on
competing systems to train
on 2k x 2k images
Caffe: Runtime of 1000 Iterations, Time (secs)
• Xeon x86 2640 v4/4xTesla V100: 11215
• Power AC922 w/4xTesla V100: 2940
• Chainer results are based on IBM Internal Measurements running 1000 iterations of Enlarged GoogleNet model on Enlarged Imagenet Dataset (2560x2560).
• Caffe results are based on IBM Internal Measurements running 1000 iterations of Enlarged GoogleNet model (mini-batch size=5) on Enlarged Imagenet Dataset (2240x2240).
• Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU Pegas 1.0. Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04.
• Software: IBM Caffe with LMS Source code: https://github.ibm.com/TUNG/trlcaffe/tree/1.0-ibm-blc-bm-fix-hang+-p9collateral based on the branch "1.0-ibm-blc-bm-fix-hang+" (base for PowerAI R4) and a PR#5972 from BVLC/Caffe (for supporting cudnn7).
• Software: Chainer v3 /LMS/Out of Core with CUDA 9 / CuDNN7 with patches found at https://github.com/cupy/cupy/pull/694 and https://github.com/chainer/chainer/pull/3762
Chainer: Runtime of 1000 Iterations, Time (secs)
• Xeon x86 2640 v4/4xTesla V100: 9709
• Power AC922 w/4xTesla V100: 2622
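The headline multipliers in this deck can be recomputed from the chart values it reports. The sketch below takes the runtimes and throughputs as printed on the slides and divides them; the helper name `ratio` is illustrative.

```python
def ratio(baseline: float, compared: float) -> float:
    """How many times larger the baseline value is than the compared value."""
    return baseline / compared


# Runtimes in seconds as (tested x86 system, Power AC922); lower is better.
runtimes = {
    "Chainer": (9709, 2622),
    "Caffe": (11215, 2940),
    "CPMD": (917, 351),
}
for name, (x86_secs, ac922_secs) in runtimes.items():
    print(f"{name}: {ratio(x86_secs, ac922_secs):.2f}x")

# Kinetica throughput in queries/min; higher is better, so AC922 over S822LC.
kinetica = ratio(5737, 3093)
print(f"Kinetica: {kinetica:.2f}x")
```

This prints 3.70x, 3.81x, 2.61x and 1.85x, matching the 3.7X, 3.8X and 2.6X claims; the Kinetica ratio of about 1.85 is the figure the slides state as 1.8X, or 80% more throughput.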
Thank you!

Moving to PCI Express based SSD with NVM Express
 
IBM System x3850 X5 Technical Presentation
IBM System x3850 X5 Technical PresentationIBM System x3850 X5 Technical Presentation
IBM System x3850 X5 Technical Presentation
 
Ibm power systems facts and features power 8
Ibm power systems facts and features  power 8 Ibm power systems facts and features  power 8
Ibm power systems facts and features power 8
 
IBM Flex System p24L, p260 and p460 Compute Nodes
IBM Flex System p24L, p260 and p460 Compute NodesIBM Flex System p24L, p260 and p460 Compute Nodes
IBM Flex System p24L, p260 and p460 Compute Nodes
 

POWER9 AC922 Newell System - HPC & AI

  • 1. POWER Systems AC922 Newell System: The AI & HPC Platform Anand Haridass IBM Cognitive Systems anharida@in.ibm.com Client Briefing – Q1 2018 Charts from Chris Mann, Michael Fisher, Dylan Boday & Performance teams
  • 2. IBM Systems: IBM POWER HPC & ML/DL Platform Strategy
      - High-performance computing and high-performance analytics drive a common platform design
      - Servers will be predominantly 2-socket designs
      - Developing deeper relationships with technology partners (ref. OpenPOWER)
      - Majority of floating-point performance will come from GPUs
      - OpenCAPI / Accelerators
      - Utilize industry-standard compliant 19” racks and electronics enclosures
      - Air and water cooling options
      - Platforms will be based on a common enclosure form factor
      - Enclosure provides a working envelope that we will continue to enhance with the latest technology from IBM, NVIDIA, Mellanox and other OpenPOWER partners
      - Enclosure provides a platform with sufficient power and cooling capability to support these enhancements
  • 3. An Acceleration Superhighway: POWER9 is IBM’s Latest Processor
      - POWER7 (1H10, 45 nm, Enterprise): 8 cores; SMT4; eDRAM L3 cache
      - POWER7+ (2H12, 32 nm, Enterprise): 2.5x larger L3 cache; on-die acceleration; zero-power core idle state
      - POWER8 Family (1H14 – 2H16, 22 nm, Enterprise & Big Data Optimized): up to 12 cores; SMT8; CAPI acceleration; high-bandwidth GPU attach
      - POWER9 Family (2H17 – 2H18+, 14 nm, Built for the Cognitive Era): only processor with NVLink, PCIe Gen 4 advanced I/O interfaces and coherence; premier platform for accelerated computing; processor family with scale-up and scale-out optimized silicon
  • 4. POWER9 Processor Family
      - Scale-Out, 2-socket optimized: robust 2-socket SMP system; direct memory attach with up to 8 DDR4 ports, up to 170 GB/s memory bandwidth, commodity packaging form factor
      - Scale-Up, 4+-socket optimized: scalable system topology / capacity for large multi-socket systems; buffered memory attach with 8 buffered channels, up to 230 GB/s memory bandwidth
      - SMT4 core: 24 SMT4 cores per chip; Linux ecosystem optimized
      - SMT8 core: 12 SMT8 cores per chip; PowerVM ecosystem continuity
      - Family matrix axes: core count / size; SMP scalability / memory subsystem
  • 5. An Acceleration Superhighway: POWER9 offers a variety of acceleration options (POWER9 PowerAccel)
      - Extreme processor / accelerator bandwidth and reduced latency
      - Coherent memory and virtual addressing capability for all accelerators
      - OpenPOWER community enablement: robust accelerated compute options
      - State-of-the-art I/O and acceleration attachment signaling
          - PCIe Gen 4 x 48 lanes: 192 GB/s duplex bandwidth
          - 25G Link x 48 lanes: 300 GB/s duplex bandwidth
      - Robust accelerated compute options with open standards
          - On-chip acceleration: Gzip x1, 842 compression x2, AES/SHA x2
          - CAPI 2.0: 4x bandwidth of POWER8 using PCIe Gen 4
          - OpenCAPI 3.0: high bandwidth, low latency and open interface using 25G Link
          - NVLink 2.0: next generation of GPU/CPU bandwidth and integration
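The two duplex bandwidth figures on slide 5 can be reproduced with simple arithmetic. The sketch below is an illustration, not taken from the deck: it assumes the slide quotes raw signaling rates (16 GT/s per PCIe Gen 4 lane, 25 Gbit/s per 25G Link lane) summed over both directions, with no encoding or protocol overhead. The function name `duplex_bandwidth_gbytes` is hypothetical.

```python
# Assumption: the slide's figures are raw signaling rates, both directions
# added together, with no 128b/130b encoding or protocol overhead removed.

def duplex_bandwidth_gbytes(lanes: int, gbits_per_lane: float) -> float:
    """Aggregate bandwidth in GB/s, summed over both directions."""
    per_direction_gbytes = lanes * gbits_per_lane / 8.0  # bits -> bytes
    return 2 * per_direction_gbytes

# PCIe Gen 4 signals at 16 GT/s per lane; the 25G Link at 25 Gbit/s per lane.
pcie_gen4 = duplex_bandwidth_gbytes(lanes=48, gbits_per_lane=16.0)
link_25g = duplex_bandwidth_gbytes(lanes=48, gbits_per_lane=25.0)

print(f"PCIe Gen 4 x48: {pcie_gen4:.0f} GB/s duplex")
print(f"25G Link x48:   {link_25g:.0f} GB/s duplex")
```

Under these assumptions the results match the 192 GB/s and 300 GB/s quoted on the slide; deliverable throughput would be somewhat lower once encoding and protocol overhead are accounted for.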
  • 6. An Acceleration Superhighway: POWER9 Introduces Acceleration Innovations
      - Extreme CPU/accelerator bandwidth (figure labels: "System bottleneck", "Only Available with POWER")
      - Seamless CPU/accelerator interaction: coherent memory sharing; enhanced virtual address translation
      - Broader use of heterogeneous compute: designed for efficient programming models; accelerate complex analytic / cognitive applications
  • 7. IBM POWER GPU Systems Roadmap (timeline labels: 2015, 2016, 2017 - 2018, 2019 - 2020, Future)
      - POWER S822LC: 2 POWER8 processors (190 Turismo module); 2 x16 Gen 3 FHFL PCIe slots (supports 2 NVIDIA K80 GPUs, supports 2 PCIe adapters); 1 x8 Gen 3 HHHL PCIe, CAPI; 1 x16 Gen 3 HHHL PCIe, CAPI; 1 x8 Gen 3 PCIe; 32 DDR3 IS DIMMs (4, 8, 16, 32GB DIMMs; 32 – 1024GB memory capacity); 2 SATA SFF HDD / SSD; 2 1300W power supplies (200VAC input); BMC support structure (IPMI, USB, EN, VGA); air cooled
      - POWER S822LC for HPC: 2 POWER8 w/ NVLink processors (190 module); 1, 2, 4 NVIDIA “Pascal” GPUs (300W, SXM2 form factor, NVLink 1.0); 2 x16 Gen 3 HHHL PCIe, CAPI enabled; 1 x8 Gen3 HHHL PCIe, CAPI enabled; 32 DDR4 IS DIMMs (4, 8, 16, 32GB DIMMs); 2 SATA SFF HDD / SSD; pluggable NVMe storage adapter (1.6, 3.2TB capacity); 2 1300W power supplies (200VAC input); BMC support structure (IPMI, USB, EN, VGA); air and water cooled options
      - POWER AC922: 2 POWER9 processors (190, 250W modules); 4-6 NVIDIA “Volta” GPUs (300W, SXM2 form factor, NVLink 2.0); 6 GPU configuration, water cooled; 4 GPU configuration, air or water cooled; 2 Gen4 x16 HHHL PCIe, CAPI enabled; 1 Gen4 x4 HHHL PCIe; 1 Gen4 shared x8 PCIe adapter; 16 IS DIMMs (8, 16, 32, 64, 128GB DIMMs); 2 SATA SFF HDD / SSD; 2 2200W power supplies (200 VAC, 277VAC, 400VDC input; N+1 redundant); second generation BMC support structure; pluggable NVMe storage adapter option
      - SWIFT (Preliminary): 2 Axone processors (190, 250W modules; OpenCAPI 3.0); 4 NVIDIA “Volta F.O.” GPUs (300W, SXM3 form factor, NVLink 2.0); 2 Gen4 x16 HHHL PCIe, CAPI enabled; 2 Gen4 x8 HHHL PCIe; 1 Gen4 x8 FHHL PCIe adapter; 16 buffered DIMMs (x16 OMI interface; 8, 16, 32, 64, 128GB DIMMs); 4 SATA SFF HDD / SSD; 2 NVMe SSD; 2 2200W power supplies (N+1 redundant); second generation BMC support structure; next generation HPC platform; air and water cooled
      - DEEP EDDY (Preliminary): 2 P10 processors (190, 250W modules; OpenCAPI 4.0); 4 NVIDIA “Future” GPUs (300W, SXM3 form factor, NVLink 3.0); 2 Gen4 x16 HHHL PCIe, CAPI enabled; 2 Gen4 x8 HHHL PCIe; 1 Gen4 x8 FHHL PCIe adapter; 16 buffered DIMMs (x16 OMI interface; 8, 16, 32, 64, 128GB, 256GB DIMMs); 4 SATA SFF HDD / SSD; 2 NVMe SSD; 2 2200W power supplies (N+1 redundant); third generation BMC support structure; next generation HPC platform; air and water cooled
  • 8. High-Level System Overview: AC922 Newell, POWER9 with increased GPU and I/O bandwidth for differentiation
      - 2-socket, 2U packaging
      - 40 P9 processor cores
      - 4 NVIDIA Volta GPUs with NVLink 2.0
      - 1 TB memory (16x 64GB DIMMs)
      - 4 PCIe Gen4 slots
      - 2x SFF (HDD/SSD), SATA, up to 7.7 TB storage
      - Supports 1.6TB and 3.2TB NVMe adapters
      - Redundant hot-swap power supplies and fans
      - Default 3-year 9x5 warranty, 100% CRU
      - Realize unprecedented performance and application gains with POWER9 and NVLink 2.0
          - 2 POWER9 CPUs and up to 4 “Volta” NVLink 2.0 GPUs in a versatile 2U Linux server
          - PCIe Gen4 bus has double the I/O bandwidth of PCIe Gen3
          - CPU (Turbo) / GPU (Boost) enabled so that data center efficiency and performance are maintained at high levels
  • 9. Diagram of the two GPU configurations (NVIDIA V100)
      - 4 GPUs @ 150GB/s CPU-GPU bandwidth: NVLink 150GB/s to each NVIDIA V100; coherent access to system memory (2TB); CPU with 170GB/s DDR4; PCIe Gen 4 and CAPI 2.0 to InfiniBand; air and water cooled options
      - 6 GPUs @ 100GB/s CPU-GPU bandwidth: NVLink 100GB/s to each NVIDIA V100; coherent access to system memory (2TB); CPU with 170GB/s DDR4; PCIe Gen 4 and CAPI 2.0 to InfiniBand; water cooled only
  • 10. POWER AC922 Design – 4 GPU
      - POWER9 processor (2x): 18, 22C water cooled; 16, 20C air cooled
      - PCIe slots (4x): Gen4 PCIe; 2 x16 HHHL adapter; 1 shared slot; 1 x8 HHHL adapter
      - Memory DIMMs (16x): 8 DDR4 IS DIMMs per socket; 8, 16, 32, 64, 128GB DIMMs
      - NVIDIA Volta GPU: 2 per socket; SXM2 form factor; 300W; NVLink 2.0; air/water cooled
      - Power supplies (2x): 2200W; 200VAC, 277VAC, 400VDC input
      - BMC card: IPMI; 1 Gb Ethernet; VGA; 1 USB 3.0
  • 11. Mechanical Overview (note: front bezel removed)
      - Operator interface: 1 USB 3.0; power button; service LEDs
      - Cooling fans (4x): counter-rotating; hot swap; 80mm
      - Memory DIMMs (16x): 8 DDR4 IS DIMMs per socket
      - POWER9 processor (2x): 190W & 250W
      - BMC (service processor card): IPMI; 2x 1 Gb Ethernet; 1 VGA; 1 USB 3.0
      - PCIe slots (4x): Gen4 PCIe; 2 x16 HHHL adapter; 1 x8,x8 shared HHHL adapter; 1 x4 HHHL adapter
      - NVIDIA Volta GPU: 2 per socket; SXM2 form factor; 300W; NVLink 2.0; air cooled
      - Power supplies (2x): 2200W; configuration limits for redundancy; hot swap; 200VAC, 277VAC, 400VDC input
      - Storage: optional 2x SFF SATA disk; optional 2x SFF SATA SSD; disks are tray based for hot swap
  • 12. Front & Rear Details (note: front bezel is removed in this illustration)
      - Labeled parts across the front and rear views: 80mm CR cooling fans (4x); USB 3.0; SFF-4 carrier (2x) for SFF SATA HDD or SSD; service indicators; power button; 1Gb Eth (2x); IPMI; VGA; power supplies (2x); water lines (option)
      - PCIe slot 1: Gen4 x4 (x8 connector); HHHL slot
      - PCIe slot 2: Gen4 shared x8,x8; HHHL slot; CAPI enabled
      - PCIe slots 3 & 4: Gen4 x16; HHHL slot; CAPI enabled
  • 13. Witherspoon block diagram (2 GPUs / socket)
      - Two P9 sockets connected by X Bus 4B, each with 8 DIMMs
      - NV Links (3 bricks each) between each P9 and its 2 GPUs
      - PCIe Gen4 x16 with CAPI on each socket; shared slot x8,x8 for the Mellanox IB EDR NIC
      - PCIe Gen4 x4; PCIe switch with 4 x2 PCIe buses, one per GPU
      - USB, storage controller, BMC
  • 14. NVIDIA Volta GPU Features
      - Peak double precision floating point performance: 7.8 TFLOPS
      - Memory bandwidth: 900 GB/sec
      - GPU memory size: 16 GB
      - NVLink “bricks” (8 lane interface): 6
      - NVLink interconnect, bi-directional: 300GB/s
      - Maximum power: 300W
      - NVIDIA Volta specifications: https://www.nvidia.com/en-us/data-center/tesla-v100/
  • 15. NVIDIA® Volta GPU Accelerator: GPU Details
      - Labeled parts across the top side and bottom side views: power regulation; 2x 400-pin connectors; 2x grounding pads; steel stiffener; multi-chip module; 4x extraction springs
  • 16. POWER AC922 Design – 6 GPU
      - POWER9 processor (2x): 18, 22C water cooled; 16, 20C air cooled
      - PCIe slots (4x): Gen4 PCIe; 2 x16 HHHL adapter; 1 shared slot; 1 x8 HHHL adapter
      - Memory DIMMs (16x): 8 DDR4 IS DIMMs per socket; 8, 16, 32, 64, 128GB DIMMs
      - NVIDIA Volta GPU: 3 per socket; SXM2 form factor; 300W; NVLink 2.0; air/water cooled
      - Power supplies (2x): 2200W; 200VAC, 277VAC, 400VDC input
      - BMC card: IPMI; 1 Gb Ethernet; VGA; 1 USB 3.0
  • 17. Witherspoon block diagram (3 GPUs / socket)
      - Two P9 sockets connected by X Bus 4B, each with 8 DIMMs
      - NV Links (2 bricks each) between each P9 and its 3 GPUs
      - PCIe Gen4 x16 with CAPI on each socket; shared slot x8,x8 for the Mellanox IB EDR NIC
      - PCIe Gen4 x4; PCIe switch with 6 x2 PCIe buses, one per GPU
      - USB, storage controller, BMC
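The CPU-GPU bandwidth figures for the two Witherspoon configurations follow from the brick counts on slides 13, 14 and 17. The sketch below is an illustration under one stated assumption: that the 300GB/s bi-directional figure on slide 14 divides evenly across the 6 NVLink bricks. The names `BRICK_GBYTES_BIDIR`, `cpu_gpu_bandwidth` and `configs` are hypothetical.

```python
# Assumption: slide 14's 300 GB/s bi-directional NVLink figure is spread
# evenly over the 6 bricks, i.e. 50 GB/s bi-directional per brick.
BRICK_GBYTES_BIDIR = 300 / 6

def cpu_gpu_bandwidth(bricks_per_link: int) -> float:
    """Bi-directional CPU-GPU bandwidth in GB/s for one CPU-to-GPU link."""
    return bricks_per_link * BRICK_GBYTES_BIDIR

# Brick counts per CPU-GPU link as given on slides 13 and 17.
configs = {
    "4 GPU (2 per socket)": {"gpus_per_socket": 2, "bricks_per_link": 3},
    "6 GPU (3 per socket)": {"gpus_per_socket": 3, "bricks_per_link": 2},
}

for name, cfg in configs.items():
    per_gpu = cpu_gpu_bandwidth(cfg["bricks_per_link"])
    cpu_bricks = cfg["gpus_per_socket"] * cfg["bricks_per_link"]
    print(f"{name}: {per_gpu:.0f} GB/s per GPU, {cpu_bricks} CPU-side bricks per socket")
```

Both configurations use the same number of CPU-side bricks per socket, which is why adding a third GPU per socket lowers the per-GPU figure from 150GB/s to 100GB/s, as shown on slide 9.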
  • 18. I/O Attachment Evolution in POWER HPC (2016 to 2017-2018)
      - Diagram labels: PCIe Gen4 x16; PCIe Gen4 x8; PCIe Gen4 x8; PCIe Gen4 x16; PCIe Gen4 x4; IB-EDR NIC in the shared slot; CAPI; X-Bus 4B @ 16Gbps
      - Mellanox “Multi-Host Socket Direct”
      - First industry implementation of Gen4 PCIe
      - Multi-host attachment of POWER9 and the Mellanox EDR-IB adapter
  • 19. The I/O Difference: Faster Data Movement
      - P9 with 2nd-generation NVLink enables 5.6x faster CPU-GPU data movement in a 4 GPU system
      - 6 GPUs provide balance between compute and data throughput
      - Chart: GPU attach bandwidth comparison, PCIe Gen3 versus NVLink
      - Chart: comparing PCIe 3.0 vs 4.0 IB dual-port bidirectional bandwidth (average Gbits/s against size in bytes)
      - ~2x faster PCIe Gen 4 interconnect to IB network cards
      - Best server for clusters leveraging networking and other devices as they become PCIe Gen4 ready
      - Results are based on IBM internal measurements running the CUDA H2D Bandwidth Test
      - Hardware: Power AC922; 32 cores (2 x 16c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU; Ubuntu 16.04. S822LC for HPC; 20 cores (2 x 10c chips), POWER8 with NVLink; 2.86 GHz, 512 GB memory, Tesla P100 GPU
      - Competitive HW: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04
  • 20. Evolving from Compute Systems to Cognitive Systems
      - Not just about hardware design: it’s about co-optimization which just works for ML, DL, and AI
      - Diagram labels: P8, P9, P10; Open Frameworks; Partnerships; Industry Alignment; Dev Ecosystem; Accelerator Roadmaps; Open Accelerator Interfaces; IBM Software; hardware + software
  • 21. Enterprise-ready software distribution built on open source; tools for ease of development; performance: faster training times for data scientists
  • 22. Designed for the AI era: Chainer provides a 3.7X reduction in AI model training time vs tested x86 systems
      - Maximize research productivity running training for medical/satellite images with Chainer on the AC922
      - 3.7X reduction vs tested x86 systems in the runtime of 1000 iterations to train on medical/satellite images
      - Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, clustering, etc. operate on more than just the GPU memory
      - Large Model Support: use system memory and GPU memory to support more complex and higher resolution data
      - Chart: Chainer, more accuracy (3.7 iterations vs 1). Xeon 4xV100: one iteration (1 run accuracy). AC922 4xV100: one, two, three iterations + 70% of an iteration (up to 4 run accuracy)
      - Results are based on IBM internal measurements running 1000 iterations of Enlarged GoogleNet model on Enlarged Imagenet Dataset (2560x2560)
      - Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU, Pegas 1.0. Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04
      - Software: Chainer v3 / LMS / Out of Core with CUDA 9 / cuDNN 7 with patches found at https://github.com/cupy/cupy/pull/694 and https://github.com/chainer/chainer/pull/3762
  • 23. Designed for the AI era: Caffe provides a 3.8X reduction in AI model training time vs tested x86 systems
      - Maximize research productivity running training for medical/satellite images with Caffe on the AC922
      - 3.8X reduction vs tested x86 systems in the runtime of 1000 iterations to train on 2k x 2k images
      - Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, clustering, etc. operate on more than just the GPU memory
      - NVLink 2.0 enables enhanced host-to-GPU communication
      - Large Model Support: use system memory and GPU memory to support more complex and higher resolution data
      - Chart: Caffe, more accuracy (3.8 iterations vs 1). Xeon 4xV100: one iteration (1 run accuracy). AC922 4xV100: one, two, three iterations + 80% of an iteration (up to 4 run accuracy)
      - Results are based on IBM internal measurements running 1000 iterations of Enlarged GoogleNet model (mini-batch size=5) on Enlarged Imagenet Dataset (2240x2240)
      - Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU, Pegas 1.0. Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04
      - Software: IBM Caffe with LMS. Source code: https://github.ibm.com/TUNG/trlcaffe/tree/1.0-ibm-blc-bm-fix-hang+-p9collateral based on the branch "1.0-ibm-blc-bm-fix-hang+" (base for PowerAI R4) and PR#5972 from BVLC/Caffe (for supporting cuDNN 7)
  • 24. AC922 Exceptional Performance for accelerated workloads
    • All results are based on running CPMD, a parallelized plane wave / pseudopotential implementation of Density Functional Theory Application. A Hybrid version of CPMD (e.g. MPI + OPENMP + GPU + streams) was implemented, with runs made for 256-Water Box, RANDOM initialization. Results are reported in Execution Time (seconds). Effective measured data rate on PCIe bus of 10 GB/s and on NVLink 2.0 of 50 GB/s.
    • IBM Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU; Pegas 1.0 with ESSL PRPQ; Spectrum MPI: PRPQ release, XLF: 15.16, CUDA 9.1
    • IBM Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla P100 GPU; RHEL 7.4 with ESSL 5.3.2.0; PE2.2; XLF: 15.1, CUDA 8.0
    • 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, Ubuntu 16.04 with OPENBLAS 0.2.18, OpenMPI: 1.10.2, GNU-5.4.0, CUDA-8.0
    [Chart: Molecular Dynamics (CPMD) Runtime (secs), lower is better. Xeon x86 E5-2640 v4 2x10c + 4xTesla P100: 917; Power S822LC 2x10c POWER8 + 4xTesla P100: 673; Power AC922 2x20c POWER9 + 4xTesla V100: 351]
    • POWER9 with NVLink 2.0 unlocks the performance of the GPU-accelerated version of CPMD by enabling lightning fast CPU-GPU data transfers
    • IBM Power System AC922 delivers a 2.6X reduction in execution time vs tested x86 systems
    2.6X faster running CPMD compared to tested x86 systems
    [Chart: System Throughput (Queries/min). Power S822LC w/4xTesla P100: 3093; Power AC922 w/4xTesla V100: 5737]
    1.8X faster running Accelerated Databases
    Improved application performance with Kinetica filtering Twitter Tweets: 80% more throughput on Power System AC922 than Power System S822LC for HPC
    • Throughput results are based on running Kinetica “Filter by geographic area” queries on a data set of 280 million simulated Tweets with 80 to 600 concurrent clients, each with 0 think time.
    • Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU. For more information on Power Systems performance on Kinetica and other workloads see https://developer.ibm.com/linuxonpower/perfcol/
    • Power System AC922; 32 cores (2 x 16c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU.
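The effective data rates quoted on this slide (10 GB/s on PCIe, 50 GB/s on NVLink 2.0) translate directly into host-to-GPU transfer time. The sketch below works that arithmetic for a hypothetical 16 GB payload; the payload size is an assumption chosen for illustration, and the model ignores latency, overlap with compute, and any protocol overhead beyond what the measured rates already include.

```python
# Transfer-time estimate from the effective rates quoted on the slide.
PCIE_GB_PER_S = 10.0    # effective measured PCIe rate (from the slide)
NVLINK_GB_PER_S = 50.0  # effective measured NVLink 2.0 rate (from the slide)


def transfer_seconds(payload_gb, rate_gb_per_s):
    """Time to move payload_gb at a sustained rate, ignoring latency."""
    return payload_gb / rate_gb_per_s


payload_gb = 16.0  # hypothetical payload size, not from the deck
pcie_s = transfer_seconds(payload_gb, PCIE_GB_PER_S)
nvlink_s = transfer_seconds(payload_gb, NVLINK_GB_PER_S)

print(f"PCIe:   {pcie_s:.2f} s")
print(f"NVLink: {nvlink_s:.2f} s")
print(f"ratio:  {pcie_s / nvlink_s:.1f}x")
```

The ratio depends only on the two rates, so any payload size gives the same 5x difference in pure transfer time. End-to-end application speedups, such as the CPMD result, will be smaller because transfers are only part of the runtime.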
  • 25. The POWER9 difference for Deep Learning
    © 2017 IBM Corporation
    Chainer: 3.7X reduction vs tested x86 systems in runtime of 1000 iterations to train on 2k x 2k images
    [Diagram labels: POWER CPU, DDR4, GPU, NVLink, Graphics Memory]
    Differentiated productivity available with AC922
    • Faster model training times
    • Iterate models faster
    • Train on larger / more complex datasets
    • NVLink 2.0 enables enhanced Host to GPU communication
    • IBM’s LMS enables seamless use of Host + GPU memory for improved performance
    Caffe: 3.8X reduction vs tested x86 systems in runtime of 1000 iterations to train on 2k x 2k images
    [Chart: Caffe Runtime of 1000 Iterations, Time (secs). Xeon x86 2640 v4/4xTesla V100: 11215; Power AC922 w/4xTesla V100: 2940]
    • Chainer results are based on IBM Internal Measurements running 1000 iterations of Enlarged GoogleNet model on Enlarged Imagenet Dataset (2560x2560).
    • Caffe results are based on IBM Internal Measurements running 1000 iterations of Enlarged GoogleNet model (mini-batch size=5) on Enlarged Imagenet Dataset (2240x2240).
    • Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU; Pegas 1.0. Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04.
    • Software: IBM Caffe with LMS. Source code: https://github.ibm.com/TUNG/trlcaffe/tree/1.0-ibm-blc-bm-fix-hang+-p9collateral based on the branch "1.0-ibm-blc-bm-fix-hang+" (base for PowerAI R4) and a PR#5972 from BVLC/Caffe (for supporting cudnn7).
    • Software: Chainer v3 / LMS / Out of Core with CUDA 9 / CuDNN7 with patches found at https://github.com/cupy/cupy/pull/694 and https://github.com/chainer/chainer/pull/3762
    [Chart: Chainer Runtime of 1000 Iterations, Time (secs). Xeon x86 2640 v4/4xTesla V100: 9709; Power AC922 w/4xTesla V100: 2622]
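The headline multipliers on slides 24 and 25 can be checked against the chart values. The short sketch below recomputes them; the helper names are illustrative, and the inputs are the runtime and throughput figures shown in the charts. For runtimes the ratio is baseline over candidate (lower is better); for throughput it is candidate over baseline.

```python
# Recompute the headline ratios from the chart values in the deck.
def runtime_speedup(baseline_secs, candidate_secs):
    """How many times faster the candidate finishes (lower runtime is better)."""
    return baseline_secs / candidate_secs


def throughput_gain(baseline_qpm, candidate_qpm):
    """How many times more queries per minute the candidate sustains."""
    return candidate_qpm / baseline_qpm


cpmd = runtime_speedup(917, 351)        # Xeon vs AC922, CPMD seconds
caffe = runtime_speedup(11215, 2940)    # Xeon vs AC922, Caffe seconds
chainer = runtime_speedup(9709, 2622)   # Xeon vs AC922, Chainer seconds
kinetica = throughput_gain(3093, 5737)  # S822LC vs AC922, queries/min

for name, ratio in [("CPMD", cpmd), ("Caffe", caffe),
                    ("Chainer", chainer), ("Kinetica", kinetica)]:
    print(f"{name}: {ratio:.2f}x")
```

The computed values line up with the quoted 2.6X, 3.8X and 3.7X figures; the Kinetica ratio of about 1.85 corresponds to the slide's "1.8X" and "80% more throughput" when truncated rather than rounded.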