SlideShare a Scribd company logo
1 of 29
Download to read offline
IBM POWER9 Processor,
NVIDIA V100 GPU and
IBM AC922 Node Hardware Architecture
IISc Meetup
04.12.2019 Dirk Pleiter, Forschungszentrum J¨ulich/JSC & University of Regensburg
Member of the Helmholtz Association
Overview of this Lecture
Introduction
IBM POWER9 processor
NVIDIA V100 GPU
IBM AC922 platform
Use case analysis
Summary
Member of the Helmholtz Association 04.12.2019 Slide 1
Part I: Introduction
Member of the Helmholtz Association
AC922 Server: Overview
[IBM Redbook, 2018]
processors
POWER9
V100 GPUs
Memory
Member of the Helmholtz Association 04.12.2019 Slide 2
AC922 Server: Building Block for Summit
4,608 nodes ⇒ 201 PFlop/s
9,216 POWER9 CPUs = 202,752 POWER9 cores
27,648 V100 GPUs = 2,211,840 V100 SMs
Dual-rail Infiniband EDR fat-tree network
2× 100Gbps injection bandwidth per node
Member of the Helmholtz Association 04.12.2019 Slide 3
Information Exchange Function
A computation implies that information is transferred from a
storage device x to a storage device y.
Information Exchange Function:
Ik
x,y (W) = data transferred between computer sub-
systems for specific computation k
x ... source storage device (e.g. memory)
y ... destination storage device (e.g. register file)
W ... problem size/work-load
Member of the Helmholtz Association 04.12.2019 Slide 4
Bandwidth-Latency Performance Model
Ansatz to predict latency:
∆tk
x,y λx,y + Ik
x,y /βx,y
Examples:
Throughput of arithmetic operations
x, y = R
βR,R = throughput arithmetic unit
Ik
R,R = number of operations
Load of data
x = M, y = R
βM,R = memory bandwidth
Ik
M,R = amount of data
R register file
arithmetic unit
R
M
memory bus
memory
register file
Member of the Helmholtz Association 04.12.2019 Slide 5
Balanced Architecture
Assume perfect overlap of computation and data transport
Condition for balanced architecture
∆tfp ∆tmem
⇓
Ifp
βfp
Imem
βmem
⇓
βfp
βmem
Ifp
Imem
= AI Arithmetic Intensity
Unbalanced cases
AI < βfp/βmem ⇒ memory bandwidth limited
AI > βfp/βmem ⇒ compute performance limited
Member of the Helmholtz Association 04.12.2019 Slide 6
Roofline Model
Attainable performance
bfp <
Ifp
max(∆tfp, ∆tmem)
min(βfp, AI · βmem)
0.2 1 5
AI [Flop/Byte]
0.001
0.01
0.1
1
10
Attainableperformance[GFlop/s]
Member of the Helmholtz Association 04.12.2019 Slide 7
Part II: IBM POWER9 Processor
Member of the Helmholtz Association
POWER9 Overview
Introduced in 2017 using 14nm
technology
POWER v3.0B ISA
Up to 24 cores, frequency up to 4 GHz
4 threads per core in SMT4
configuration
Core performance figures
Double precision floating-point
arithmetics:
2 · 2 FMA/cycle = 8 Flop/cycle
Load from L1:
2 · 16 Byte/cycle
Store to L1:
1 · 16 Byte/cycle
Member of the Helmholtz Association 04.12.2019 Slide 8
POWER9 Micro-Architecture
[IBM, 2018]
Member of the Helmholtz Association 04.12.2019 Slide 9
POWER9 Memory Subsystem
[IBM, 2018]
Directly attached memory:
8 DDR4 channels
Up to 2.667 GT/s or up to 170 GByte/s/socket
Member of the Helmholtz Association 04.12.2019 Slide 10
POWER9 Memory Hierarchy
Cache line size = 128 Byte
L1 data cache
64 kByte, 8-way set associative
Write update policy: store-through
Core private
L2 data cache
512 kByte, 8-way set associative, dual-banked
Fully inclusive of the L1 cache
Shared by 2 cores
L3 data cache
10 MByte region per 2-core slice, 20-way set associative
Victim for local L2 or other L3 caches, not inclusive of the L2
cache
Accessible from other cores
Member of the Helmholtz Association 04.12.2019 Slide 11
Part III: NVIDIA V100 GPU
Member of the Helmholtz Association
Volta Architecture Overview
Introduced in 2017 using 12 nm
Up to 80 Streaming Multiprocessors (SM)
SM performance figures
32 FP64 CUDA “cores”
64 Flop/cycle
Member of the Helmholtz Association 04.12.2019 Slide 12
V100 Memory Subsystem
In-package HBM2 memory
4 memory stacks,
4-8 GByte/stack
Very wide bus (2 · 512 bit) and
relative low data rate
(1.75 GT/s)
Aggregate bandwidth:
900 GByte/s
[H. Jun (SK Hynix), 2013]
[Wikipedia]
Member of the Helmholtz Association 04.12.2019 Slide 13
V100 Memory Hierarchy
Register file
Size: 256 kiByte/SM or
64 kByte/processing block
L1 data cache and shared
memory
Size 128 kiByte/SM or
10 MiByte/GPU
L1 hit latency 28 cycle
(measured)
L2 cache
6 MiByte/GPU
L2 hit latency 193 cycle
(measured)
900 GByte/s
2155 GByte/s (meas.)
250 Byte/cycle
Register file
L1 cache
L2 cache
Global memory
Device memory /
GPU private
SM private
Processing block private
[Zhe Jia et al., 2018]
Member of the Helmholtz Association 04.12.2019 Slide 14
Part IV: AC922 Nodes
Member of the Helmholtz Association
AC922 Server: Details
Member of the Helmholtz Association 04.12.2019 Slide 15
Comparison POWER9 vs. V100 on AC922
POWER9 V100
Number of cores/SMs ≥ 16 80
Double-precision Flop/cycle ≥ 128 5,120
Clock frequency [GHz] 3.8 1.312
Double-precision TFlop/s ≥ 0.5 6.7
Memory bandwidth read+write [GByte/s] 135 900
Balance [Flop/Byte] ≥ 3.7 7.5
Number of CPU or GPU 2 6
Aggregate double-precision TFlop/s 0.97 40.3
Aggregate memory bandwidth [TByte/s] 0.27 5.4
Aggregate memory capacity [GByte] 512 96
Member of the Helmholtz Association 04.12.2019 Slide 16
Part V: Use Case
Member of the Helmholtz Association
The Poisson Equation
Poisson equation in 2 dimensions:
−
∂2
v(x, y)
∂x2
−
∂2
v(x, y)
∂y2
= f(x, y)
Discretisation of 2nd-order derivative
−
∂2
v(x, y)
∂x2
←
2vi,j − vi−1,j − vi+1,j
h2
Discrete Poisson equaiton in 2 dimensions:
T v = h2
f where T =



4 −1 0 · · ·
−1 4 −1 · · ·
...
...
...



Member of the Helmholtz Association 04.12.2019 Slide 17
Discrete Poisson Equation: Data Locality
Graphical representation of T v:
Observations:
Matrix T acts as a stencil operator
Any element of vector v is reused 4 times
Member of the Helmholtz Association 04.12.2019 Slide 18
Jacobi Algorithm
Algorithm suitable for diagonally dominant systems
Matrix decomposition: T = D + R
D =



t0,0 0 0 · · ·
0 t1,1 0 · · ·
...
...
...


 , R =



0 t0,1 t0,2 · · ·
t1,0 0 t1,2 · · ·
...
...
...



Obtain solution through iterative precedure
v(k+1)
= D−1
h2
f − R v(k)
Member of the Helmholtz Association 04.12.2019 Slide 19
Jacobi Algorithm Implementation
for ( int i y = 1; i y < ny−1; i y ++)
{
for ( int i x = 1; i x < nx−1; i x ++ )
{
Anew[ i y ∗nx+ i x ] = −0.25 ∗ ( rhs [ i y ∗nx+ i x ]
− ( Aref [ i y ∗nx + i x +1] + Aref [ i y ∗nx + ix −1] +
Aref [ ( iy −1)∗nx + i x ] + Aref [ ( i y +1)∗nx + i x ] ) ) ;
error = fmaxr ( error , fabsr (Anew[ i y ∗nx+ i x ]−Aref [ i y ∗nx+ i x ] ) ) ;
}
}
Information exchange analysis assuming small cache:
Load of Anew, rhs, Aref: Ild = nx · ny · 3 · 8 Byte
Store Anew: Ist = nx · ny · 1 · 8 Byte
Floating-point arithmetics: Ifp = nx · ny · 6 Flop
Member of the Helmholtz Association 04.12.2019 Slide 20
Jacobi on POWER9/V100: Roofline Analysis
AI = Ifp/(Ild + Ist) = 0.19 Flop/Byte
0.1 1 10 100
AI [Flop/Byte]
0.01
0.1
1
10
Attainableperformance[TFlop/s]
POWER9
V100
Member of the Helmholtz Association 04.12.2019 Slide 21
Part VI: Summary
Member of the Helmholtz Association
Summary
Approach for systematic performance analysis and
modelling based on Information Exchange introduced
Discussed the AC922 node architecture as a building block
for supercomputers based on
POWER9 processors
V100 GPUs
Introduced use case for today: Solving 2-dimensional
Poisson equation using Jacobi solver
Memory bandwidth limited application
Roofline model defines upper performance limit
Member of the Helmholtz Association 04.12.2019 Slide 22

More Related Content

What's hot

GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用NVIDIA Taiwan
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Ural-PDC
 
Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchRyousei Takano
 
クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術Ryousei Takano
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
JavaScriptCore's DFG JIT (JSConf EU 2012)
JavaScriptCore's DFG JIT (JSConf EU 2012)JavaScriptCore's DFG JIT (JSConf EU 2012)
JavaScriptCore's DFG JIT (JSConf EU 2012)Igalia
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_ProcessingKohei KaiGai
 
Japan Lustre User Group 2014
Japan Lustre User Group 2014Japan Lustre User Group 2014
Japan Lustre User Group 2014Hitoshi Sato
 
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合Hitoshi Sato
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQLKohei KaiGai
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...InfluxData
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisHenry Schreiner
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?Shinnosuke Furuya
 
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)inside-BigData.com
 
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multiKohei KaiGai
 
Automatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPyAutomatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPyPreferred Networks
 

What's hot (20)

GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software research
 
Parallel io
Parallel ioParallel io
Parallel io
 
クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
JavaScriptCore's DFG JIT (JSConf EU 2012)
JavaScriptCore's DFG JIT (JSConf EU 2012)JavaScriptCore's DFG JIT (JSConf EU 2012)
JavaScriptCore's DFG JIT (JSConf EU 2012)
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing
 
Japan Lustre User Group 2014
Japan Lustre User Group 2014Japan Lustre User Group 2014
Japan Lustre User Group 2014
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for Analysis
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?
 
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
 
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
 
Automatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPyAutomatically Fusing Functions on CuPy
Automatically Fusing Functions on CuPy
 

Similar to OpenPOWER Application Optimisation meet up

DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceLEGATO project
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...In-Memory Computing Summit
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wallugur candan
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best PracticesJeff Larkin
 
Arm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based MultiprocessingArm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based MultiprocessingArm
 
PASTE: A Network Programming Interface for Non-Volatile Main Memory
PASTE: A Network Programming Interface for Non-Volatile Main MemoryPASTE: A Network Programming Interface for Non-Volatile Main Memory
PASTE: A Network Programming Interface for Non-Volatile Main Memorymicchie
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Dell NVIDIA AI Powered Transformation Webinar
Dell NVIDIA AI Powered Transformation WebinarDell NVIDIA AI Powered Transformation Webinar
Dell NVIDIA AI Powered Transformation WebinarBill Wong
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Danielle Womboldt
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Community
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdfhellobank1
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Using GZIP Data Compression to Reduce Power Consumption in IoT Devices
Using GZIP Data Compression to Reduce Power Consumption in IoT DevicesUsing GZIP Data Compression to Reduce Power Consumption in IoT Devices
Using GZIP Data Compression to Reduce Power Consumption in IoT DevicesCAST, Inc.
 
Energy Savings Using GZIP IP Within IoT Devices
Energy Savings Using GZIP IP Within IoT DevicesEnergy Savings Using GZIP IP Within IoT Devices
Energy Savings Using GZIP IP Within IoT DevicesCAST, Inc.
 
Xilinx Data Center Strategy and CCIX
Xilinx Data Center Strategy and CCIXXilinx Data Center Strategy and CCIX
Xilinx Data Center Strategy and CCIXYoshihiro Horie
 

Similar to OpenPOWER Application Optimisation meet up (20)

DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best Practices
 
Arm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based MultiprocessingArm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
PASTE: A Network Programming Interface for Non-Volatile Main Memory
PASTE: A Network Programming Interface for Non-Volatile Main MemoryPASTE: A Network Programming Interface for Non-Volatile Main Memory
PASTE: A Network Programming Interface for Non-Volatile Main Memory
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Dell NVIDIA AI Powered Transformation Webinar
Dell NVIDIA AI Powered Transformation WebinarDell NVIDIA AI Powered Transformation Webinar
Dell NVIDIA AI Powered Transformation Webinar
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Using GZIP Data Compression to Reduce Power Consumption in IoT Devices
Using GZIP Data Compression to Reduce Power Consumption in IoT DevicesUsing GZIP Data Compression to Reduce Power Consumption in IoT Devices
Using GZIP Data Compression to Reduce Power Consumption in IoT Devices
 
Energy Savings Using GZIP IP Within IoT Devices
Energy Savings Using GZIP IP Within IoT DevicesEnergy Savings Using GZIP IP Within IoT Devices
Energy Savings Using GZIP IP Within IoT Devices
 
Xilinx Data Center Strategy and CCIX
Xilinx Data Center Strategy and CCIXXilinx Data Center Strategy and CCIX
Xilinx Data Center Strategy and CCIX
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency programGanesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISAGanesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsGanesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsGanesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
 

Recently uploaded

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Recently uploaded (20)

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

OpenPOWER Application Optimisation meet up

  • 1. IBM POWER9 Processor, NVIDIA V100 GPU and IBM AC922 Node Hardware Architecture IISc Meetup 04.12.2019 Dirk Pleiter, Forschungszentrum J¨ulich/JSC & University of Regensburg Member of the Helmholtz Association
  • 2. Overview of this Lecture Introduction IBM POWER9 processor NVIDIA V100 GPU IBM AC922 platform Use case analysis Summary Member of the Helmholtz Association 04.12.2019 Slide 1
  • 3. Part I: Introduction Member of the Helmholtz Association
  • 4. AC922 Server: Overview [IBM Redbook, 2018] processors POWER9 V100 GPUs Memory Member of the Helmholtz Association 04.12.2019 Slide 2
  • 5. AC922 Server: Building Block for Summit 4,608 nodes ⇒ 201 PFlop/s 9,216 POWER9 CPUs = 202,752 POWER9 cores 27,648 V100 GPUs = 2,211,840 V100 SMs Dual-rail Infiniband EDR fat-tree network 2× 100Gbps injection bandwidth per node Member of the Helmholtz Association 04.12.2019 Slide 3
  • 6. Information Exchange Function A computation implies that information is transferred from a storage device x to a storage device y. Information Exchange Function: Ik x,y (W) = data transferred between computer sub- systems for specific computation k x ... source storage device (e.g. memory) y ... destination storage device (e.g. register file) W ... problem size/work-load Member of the Helmholtz Association 04.12.2019 Slide 4
  • 7. Bandwidth-Latency Performance Model Ansatz to predict latency: ∆tk x,y λx,y + Ik x,y /βx,y Examples: Throughput of arithmetic operations x, y = R βR,R = throughput arithmetic unit Ik R,R = number of operations Load of data x = M, y = R βM,R = memory bandwidth Ik M,R = amount of data R register file arithmetic unit R M memory bus memory register file Member of the Helmholtz Association 04.12.2019 Slide 5
  • 8. Balanced Architecture Assume perfect overlap of computation and data transport Condition for balanced architecture ∆tfp ∆tmem ⇓ Ifp βfp Imem βmem ⇓ βfp βmem Ifp Imem = AI Arithmetic Intensity Unbalanced cases AI < βfp/βmem ⇒ memory bandwidth limited AI > βfp/βmem ⇒ compute performance limited Member of the Helmholtz Association 04.12.2019 Slide 6
  • 9. Roofline Model Attainable performance bfp < Ifp max(∆tfp, ∆tmem) min(βfp, AI · βmem) 0.2 1 5 AI [Flop/Byte] 0.001 0.01 0.1 1 10 Attainableperformance[GFlop/s] Member of the Helmholtz Association 04.12.2019 Slide 7
  • 10. Part II: IBM POWER9 Processor Member of the Helmholtz Association
  • 11. POWER9 Overview Introduced in 2017 using 14nm technology POWER v3.0B ISA Up to 24 cores, frequency up to 4 GHz 4 threads per core in SMT4 configuration Core performance figures Double precision floating-point arithmetics: 2 · 2 FMA/cycle = 8 Flop/cycle Load from L1: 2 · 16 Byte/cycle Store to L1: 1 · 16 Byte/cycle Member of the Helmholtz Association 04.12.2019 Slide 8
  • 12. POWER9 Micro-Architecture [IBM, 2018] Member of the Helmholtz Association 04.12.2019 Slide 9
  • 13. POWER9 Memory Subsystem [IBM, 2018] Directly attached memory: 8 DDR4 channels Up to 2.667 GT/s or up to 170 GByte/s/socket Member of the Helmholtz Association 04.12.2019 Slide 10
  • 14. POWER9 Memory Hierarchy Cache line size = 128 Byte L1 data cache 64 kByte, 8-way set associative Write update policy: store-through Core private L2 data cache 512 kByte, 8-way set associative, dual-banked Fully inclusive of the L1 cache Shared by 2 cores L3 data cache 10 MByte region per 2-core slice, 20-way set associative Victim for local L2 or other L3 caches, not inclusive of the L2 cache Accessible from other cores Member of the Helmholtz Association 04.12.2019 Slide 11
  • 15. Part III: NVIDIA V100 GPU Member of the Helmholtz Association
  • 16. Volta Architecture Overview Introduced in 2017 using 12 nm Up to 80 Streaming Multiprocessors (SM) SM performance figures 32 FP64 CUDA “cores” 64 Flop/cycle Member of the Helmholtz Association 04.12.2019 Slide 12
  • 17. V100 Memory Subsystem In-package HBM2 memory 4 memory stacks, 4-8 GByte/stack Very wide bus (2 · 512 bit) and relative low data rate (1.75 GT/s) Aggregate bandwidth: 900 GByte/s [H. Jun (SK Hynix), 2013] [Wikipedia] Member of the Helmholtz Association 04.12.2019 Slide 13
  • 18. V100 Memory Hierarchy Register file Size: 256 kiByte/SM or 64 kByte/processing block L1 data cache and shared memory Size 128 kiByte/SM or 10 MiByte/GPU L1 hit latency 28 cycle (measured) L2 cache 6 MiByte/GPU L2 hit latency 193 cycle (measured) 900 GByte/s 2155 GByte/s (meas.) 250 Byte/cycle Register file L1 cache L2 cache Global memory Device memory / GPU private SM private Processing block private [Zhe Jia et al., 2018] Member of the Helmholtz Association 04.12.2019 Slide 14
  • 19. Part IV: AC922 Nodes Member of the Helmholtz Association
  • 20. AC922 Server: Details Member of the Helmholtz Association 04.12.2019 Slide 15
  • 21. Comparison POWER9 vs. V100 on AC922 POWER9 V100 Number of cores/SMs ≥ 16 80 Double-precision Flop/cycle ≥ 128 5,120 Clock frequency [GHz] 3.8 1.312 Double-precision TFlop/s ≥ 0.5 6.7 Memory bandwidth read+write [GByte/s] 135 900 Balance [Flop/Byte] ≥ 3.7 7.5 Number of CPU or GPU 2 6 Aggregate double-precision TFlop/s 0.97 40.3 Aggregate memory bandwidth [TByte/s] 0.27 5.4 Aggregate memory capacity [GByte] 512 96 Member of the Helmholtz Association 04.12.2019 Slide 16
  • 22. Part V: Use Case Member of the Helmholtz Association
  • 23. The Poisson Equation Poisson equation in 2 dimensions: − ∂2 v(x, y) ∂x2 − ∂2 v(x, y) ∂y2 = f(x, y) Discretisation of 2nd-order derivative − ∂2 v(x, y) ∂x2 ← 2vi,j − vi−1,j − vi+1,j h2 Discrete Poisson equaiton in 2 dimensions: T v = h2 f where T =    4 −1 0 · · · −1 4 −1 · · · ... ... ...    Member of the Helmholtz Association 04.12.2019 Slide 17
  • 24. Discrete Poisson Equation: Data Locality Graphical representation of T v: Observations: Matrix T acts as a stencil operator Any element of vector v is reused 4 times Member of the Helmholtz Association 04.12.2019 Slide 18
  • 25. Jacobi Algorithm Algorithm suitable for diagonally dominant systems Matrix decomposition: T = D + R D =    t0,0 0 0 · · · 0 t1,1 0 · · · ... ... ...    , R =    0 t0,1 t0,2 · · · t1,0 0 t1,2 · · · ... ... ...    Obtain solution through iterative precedure v(k+1) = D−1 h2 f − R v(k) Member of the Helmholtz Association 04.12.2019 Slide 19
  • 26. Jacobi Algorithm Implementation for ( int i y = 1; i y < ny−1; i y ++) { for ( int i x = 1; i x < nx−1; i x ++ ) { Anew[ i y ∗nx+ i x ] = −0.25 ∗ ( rhs [ i y ∗nx+ i x ] − ( Aref [ i y ∗nx + i x +1] + Aref [ i y ∗nx + ix −1] + Aref [ ( iy −1)∗nx + i x ] + Aref [ ( i y +1)∗nx + i x ] ) ) ; error = fmaxr ( error , fabsr (Anew[ i y ∗nx+ i x ]−Aref [ i y ∗nx+ i x ] ) ) ; } } Information exchange analysis assuming small cache: Load of Anew, rhs, Aref: Ild = nx · ny · 3 · 8 Byte Store Anew: Ist = nx · ny · 1 · 8 Byte Floating-point arithmetics: Ifp = nx · ny · 6 Flop Member of the Helmholtz Association 04.12.2019 Slide 20
  • 27. Jacobi on POWER9/V100: Roofline Analysis AI = Ifp/(Ild + Ist) = 0.19 Flop/Byte 0.1 1 10 100 AI [Flop/Byte] 0.01 0.1 1 10 Attainableperformance[TFlop/s] POWER9 V100 Member of the Helmholtz Association 04.12.2019 Slide 21
  • 28. Part VI: Summary Member of the Helmholtz Association
  • 29. Summary Approach for systematic performance analysis and modelling based on Information Exchange introduced Discussed the AC922 node architecture as a building block for supercomputers based on POWER9 processors V100 GPUs Introduced use case for today: Solving 2-dimensional Poisson equation using Jacobi solver Memory bandwidth limited application Roofline model defines upper performance limit Member of the Helmholtz Association 04.12.2019 Slide 22