Presented by
Sandia National Laboratories is a multimission
laboratory managed and operated by National
Technology & Engineering Solutions of Sandia,
LLC, a wholly owned subsidiary of Honeywell
International Inc., for the U.S. Department of
Energy’s National Nuclear Security
Administration under contract DE-NA0003525.
GPGPU-Sim Overview
Mengchi Zhang, Purdue University
Clay Hughes, Gwen Voskuilen, Sandia National
Laboratories
09/22/2019
SAND2019-11321 C
Welcome!
2
Part 1: Introduction to SST 1:00 – 2:30
◦ SST Overview
◦ Demos: Running a simple simulation; Enabling statistics;
Running in parallel
◦ SST Element Libraries: A tour
Part 2: GPGPU-Sim 2:30-3:30
◦ GPGPU-Sim Overview
◦ New Features
Break 3:30 – 4:00
Part 3: The SST/GPGPU-Sim Integration 4:00 - 5:00
PACT 2019 TUTORIAL
Outline
GPGPU-Sim Introduction
◦ GPU and programming model
◦ Functional model
◦ Performance model
◦ GPUWattch: power model
New Features in GPGPU-Sim
◦ Volta model
◦ Run closed source libraries
◦ Tensor Core
◦ Run CUTLASS library
3
Acknowledgements
Some slides are credited to GPGPU-Sim Micro Tutorial 2012
Some data and figures are credited to papers below:
◦ Akshay Jain, Mahmoud Khairy, Timothy G. Rogers, A Quantitative Evaluation of
Contemporary GPU Simulation Methodology. SIGMETRICS 2018
◦ Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G Rogers, Exploring Modern
GPU Memory System Design Challenges through Accurate Modeling,
arXiv:1810.07269, ( and ISPASS 2019 poster)
◦ Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth
Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor
M. Aamodt, Analyzing Machine Learning Workloads Using a Detailed GPU Simulator,
arXiv:1811.08933, ( and ISPASS 2019 poster)
◦ Md Aamir Raihan, Negar Goli, Tor Aamodt, Modeling Deep Learning Accelerator
Enabled GPUs, ISPASS 2019
We thank all contributors to GPGPU-Sim
All new features are based on the dev branch of GPGPU-Sim:
https://github.com/gpgpu-sim/gpgpu-sim_distribution/tree/dev
4
Outline
GPGPU-Sim Introduction
◦ GPU and programming model
◦ Functional model
◦ Performance model
◦ GPUWattch: power model
New Features in GPGPU-Sim
◦ Volta model
◦ Run closed source libraries
◦ Tensor Core
◦ Run CUTLASS library
5
GPU Introduction
GPU = Graphics Processing Unit
◦ Optimized for Highly Parallel Workloads
◦ Highly Programmable
◦ Heterogeneous computing
NVidia Tesla GV100: 80 Streaming Multiprocessors (SMs)
◦ Each SM has 64 INT32 units, 64 FP32 units, 32 FP64 units, and 8 Tensor Cores
6
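To make the SIMT programming model concrete, here is a minimal CUDA sketch (illustrative, not from the tutorial): the host allocates memory visible to the device, launches a grid of blocks of threads, and each thread computes one element.

#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));       // unified memory (CUDA 6+)
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    int threads = 256;                              // threads per block
    int blocks = (n + threads - 1) / threads;       // blocks per grid
    vecAdd<<<blocks, threads>>>(a, b, c, n);        // launch the kernel grid
    cudaDeviceSynchronize();                        // wait for the GPU
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}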
GPGPU Programming model evolution
GPGPU programming model:
◦ CUDA and OpenCL
◦ Newer architectures support more features
7
CUDA:
◦ 2007, CUDA 1.X (Tesla): C compiler
◦ 2010, CUDA 3.X (Fermi): C++ compiler
◦ CUDA 4.X: C++ new/delete, no-copy system memory
◦ 2014, CUDA 6.X (Kepler, Maxwell): unified memory, dynamic parallelism, cuDNN
◦ 2017, CUDA 9.X (Volta): cooperative groups, MPS, CUTLASS, tensor core operations
◦ 2019, CUDA 10.X (Turing): CUDA Graphs
OpenCL:
◦ 2008, OpenCL 1.0: C99 support
◦ 2011, OpenCL 1.2: device partition, built-in kernels
◦ 2013, OpenCL 2.0: shared virtual memory, nested parallelism, pipe
◦ 2015, OpenCL 2.1: C++14, SPIR-V for Vulkan, subgroups
◦ OpenCL 2.2: C++14, SPIR-V 1.1
GPU Microarchitecture1
NVidia GV100:
◦ Hierarchical compute unit: SM=>TPC=>GPC=>GPU
◦ Multi-level memory: L1/shmem=>L2=>HBM
◦ The GPGPU-Sim simulator models the components above
8
[1]Volta Whitepaper. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
GPGPU-Sim Introduction
GPGPU-Sim
◦ Widely used GPU simulator in the research community (1300+ citations)
◦ The third most cited simulator in the computer architecture field (after gem5 (+GEMS) and
SimpleScalar)
Functional model
◦ Virtual ISA (vISA)
◦ Machine ISA (mISA)
Performance model
◦ Model microarchitecture timing relevant to GPU compute
9
GPGPU-Sim Introduction
GPGPU-Sim simulates kernels
◦ Transfer data to GPU memory
◦ GPU kernels run on GPGPU-Sim
◦ Reports statistics for the kernels
◦ Transfer data back to CPU memory
10
[Timeline diagram: with an asynchronous kernel launch, the CPU continues while GPGPU-Sim stands in for the GPU HW and simulates the kernel until done; with a synchronous launch, the CPU blocks until GPGPU-Sim reports the kernel done.]
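A sketch of the host-side calls behind this flow (hypothetical names; GPGPU-Sim intercepts these CUDA runtime calls through its own libcudart.so, described later in the deck):

#include <cuda_runtime.h>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void run(float* h, int n) {
    float* d;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d, bytes);                            // allocate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // transfer data to GPU memory
    scale<<<(n + 255) / 256, 256>>>(d, n);            // async: kernel runs on GPGPU-Sim
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // blocks until done, copies back
    cudaFree(d);
}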
Functional model
Single Instruction, Multiple Thread (SIMT):
◦ SIMD + multithreading
◦ Grid, Block, Warp, Thread
Virtual ISA vs. Machine ISA
◦ vISA: PTX = Parallel Thread eXecution: virtual ISA defined by Nvidia
◦ mISA: SASS = Native ISA for Nvidia GPUs
◦ GPGPU-Sim uses PTXPlus to represent SASS
◦ 1:1 mapping from SASS to PTXPlus
GPGPU-Sim supports:
◦ PTX for new architectures (CUDA 10): new, less accurate, well documented
◦ SASS for architectures before Fermi (SM 1.x): old, accurate, less documented
11
Performance model
GPGPU-Sim models timing
◦ SIMT Core
◦ Caches and texture/constant/shared memory
◦ Interconnection network (Booksim)
◦ DRAM (GDDR5/HBM)
Does NOT model
◦ Graphics-specific hardware
12
[Block diagram, "Single-Instruction, Multiple-Threads" GPU: several SIMT core clusters, each containing multiple SIMT cores, connect through an interconnection network to memory partitions backed by off-chip GDDR5/HBM DRAM.]
GPUWattch: Power model
Estimates power consumed by the GPU according to its timing behavior
Ideal for evaluating fine-grained power management mechanisms
Validated with power measurements from GV100
13
[Framework diagram, "A Comprehensive Framework for Performance and Energy Optimization Research": GPGPU-Sim (modified to add the required performance counters) passes detailed performance stats and performance counters through an interface to a modified McPAT (with GPU-specific microarchitectural components), which returns detailed static and dynamic power stats and enables feedback-driven optimizations.]
GPUWattch: Power model
Correlation for GV100
◦ Coefficient of 0.8
◦ Average relative error of −2.67% ± 5.66% (95% confidence interval)
14
Outline
GPGPU-Sim Introduction
◦ GPU and programming model
◦ Functional model
◦ Performance model
◦ GPUWattch: power model
New Features in GPGPU-Sim
◦ Volta model
◦ Run closed source libraries
◦ Tensor Core
◦ Run CUTLASS library
15
New Feature: Volta Model Motivation
ISA cycle correlation1 before new Volta model2
◦ For Pascal Titan X
16
[1] Akshay Jain, Mahmoud Khairy, Timothy G. Rogers, A Quantitative Evaluation of Contemporary GPU Simulation Methodology. SIGMETRICS 2018
[2] Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G Rogers, Exploring Modern GPU Memory System Design Challenges through Accurate Modeling, arXiv:1810.07269
Benchmark          Mean Abs Error       Correlation
                   vISA     mISA        vISA    mISA
Compute Intensive  43.3%    21.9%       91.1%   99.0%
Cache Sensitive    104.8%   100.4%      81.2%   82.0%
Memory Sensitive   31.6%    29.8%       96.0%   95.1%
Compute Balanced   58.5%    70.7%       96.1%   93.5%
Result
◦ Compute intensive: mISA > vISA
◦ Cache sensitive: both show
inaccurate cache model
◦ Memory sensitive (streaming): not
related to cache model
◦ Compute balanced: vISA > mISA
New features: Volta model1
17
Fermi-based Model
[1] Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G Rogers, Exploring Modern GPU Memory System Design Challenges through Accurate Modeling, arXiv:1810.07269
[Diagram, SIMT core: instruction cache, two warp schedulers, register file (16 banks), exec units (SP, SP, SFU), a Fermi coalescer, and L1D cache/shared memory with a programmer-specified split. Memory hierarchy: L2 cache and GDDR5 model.]
New features: Volta model1
18 Volta-based Model
[1] Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G Rogers, Exploring Modern GPU Memory System Design Challenges through Accurate Modeling, arXiv:1810.07269
[Diagram, SIMT core: instruction cache; a Volta coalescer (8-thread coalescer); and a sub-core model with four sub-cores, each having its own warp scheduler, a disaggregated register file (2 banks), and exec units (SP, DP, SFU, INT, Tensor: +DP, INT, Tensor units). L1D cache/shared memory: +sectored, +adaptive cache (128 KB), +streaming cache, +banked. Memory hierarchy: L2 cache (+sectored, +memory copy engine model, +new Lazy_Fetch_on_Read write policy, +partition-camping-aware hashing) and high-bandwidth memory (+HBM model, +dual-bus interface, +read/write buffers).]
Hardware correlation for Volta model1
Statistic         Mean Abs Error       Correlation
                  Old      New         Old     New
Execution Cycles  68%      27%         71%     96%
L1 Reqs           48%      0.5%        92%     100%
L1 Hit Ratio      41%      18%         89%     93%
L2 Reads          66%      1%          49%     94%
L2 Writes         56%      1%          99%     100%
L2 Read Hits      80%      15%         68%     81%
DRAM Reads        89%      11%         60%     95%
19
SIMT core: 2.5X error reduction in execution time
Memory hierarchy: 0.5% error in L1 reqs (96x request error reduction); 1% error in L2 behavior (66x read error reduction); 7X error reduction in DRAM reads
[1] Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G Rogers, Exploring Modern GPU Memory System Design Challenges through Accurate Modeling, arXiv:1810.07269
Run closed-source libraries1
Run applications with cuDNN/cuBLAS
◦ Statically links the closed-source libraries
◦ LeNet for MNIST using cuDNN/cuBLAS
20
[Diagram: the executable statically links the closed-source cuDNN and cuBLAS libraries with the application; CUDA API calls pass through GPGPU-Sim's CUDA runtime API layer into the functional and performance models.]
[1] Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers,
Tor M. Aamodt, Analyzing Machine Learning Workloads Using a Detailed GPU Simulator, arXiv:1811.08933
Tensor Core in GV100
Accelerates FP operations
8 Tensor Cores/SM
Each performs 64 FP FMA/clock
8 x 64 = 512 FMA/clock/SM
Or 1024 FP ops/clock/SM (each FMA counts as two FP ops)
21
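For reference, tensor cores are exposed to CUDA through the warp-level WMMA API; below is a minimal sketch (the surrounding setup is assumed, and it must be compiled for sm_70 or newer) in which one warp computes a 16x16x16 mixed-precision matrix multiply-accumulate:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma16(const half* a, const half* b, float* c) {
    // Fragments live in the warp's registers; all 32 threads cooperate.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);           // leading dimension = 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);              // D = A*B + C on tensor cores
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}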
Tensor Core in Tesla Titan V1
Standard deviation of less than 5%
99.60% IPC correlation
GPGPU-Sim shows higher performance than the HW as matrix size increases
22
[1] Md Aamir Raihan, Negar Goli, Tor Aamodt, Modeling Deep Learning Accelerator Enabled GPUs, ISPASS 2019
Run CUTLASS library
Combines CUTLASS, the tensor core model1, and the Volta model2
Correlation: measured with DeepBench training/inference tests from real scenarios
23
[1] Md Aamir Raihan, Negar Goli, Tor Aamodt, Modeling Deep Learning Accelerator Enabled GPUs, ISPASS 2019
[2] Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G Rogers, Exploring Modern GPU Memory System Design Challenges through Accurate Modeling, arXiv:1810.07269
[Scatter plot, "CUTLASS cycles [Cor=99.91%, Err=5.22%]": simulated vs. hardware cycle counts on log-log axes from 1E+2 to 1E+10.]
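As a hedged sketch of what such a CUTLASS workload looks like on the host side, in the spirit of the library's basic_gemm example (the exact template defaults here are assumptions):

#include "cutlass/gemm/device/gemm.h"

// Single-precision GEMM with column-major A, B, and C; remaining template
// parameters (accumulator type, op class, etc.) take CUTLASS defaults.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::ColumnMajor>;

cutlass::Status run_gemm(int M, int N, int K, float alpha,
                         float const* A, int lda, float const* B, int ldb,
                         float beta, float* C, int ldc) {
    Gemm gemm_op;
    // One arguments struct: problem size, tensor refs, and epilogue scalars.
    return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc},
                    {alpha, beta}});
}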
Summary
GPGPU-Sim:
◦ Simulates GPU compute units and memory hierarchies
◦ An evolving simulator that tracks new features in CUDA and GPUs
◦ Actively incorporating new features
GPGPU-Sim with SST:
◦ GPGPU-Sim brings a promising GPU model to SST
◦ GPGPU-Sim benefits from SST's parallel simulation architecture
24
25
BACKUP SLIDES
Thread Hierarchy Revisited
27
SIMT Core
• Recall, kernel = grid of blocks of
warps of threads
• Threads are grouped into warps in
hardware
Each block is dispatched
to a SIMT core as a unit
of work: All of its warps
run in the core’s pipeline
until they are all done.
Source: NVIDIA
[Diagram: three thread blocks (CTAs), each made up of warps of 32 threads, dispatched to SIMT cores.]
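The warp grouping is implicit in hardware, but a thread can recover its warp and lane from its index; a minimal CUDA sketch for a 1-D block (names are illustrative):

__device__ void warp_coords(int* warp_id, int* lane_id) {
    int tid = threadIdx.x;   // assumes a 1-D thread block
    *warp_id = tid / 32;     // which warp within the block
    *lane_id = tid % 32;     // which lane within the warp
}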
Inside a SIMT Core
28
• Fine-grained multithreading
• Interleave warp execution to hide latency
• Register values of all threads stay in the core
[Pipeline diagram: a SIMT front end (fetch, decode, schedule, branch) drives a SIMD datapath with a register file; warps signal done by warp ID; the memory subsystem (shared memory, L1 D$, texture $, constant $) connects to the interconnection network.]
SIMT Stack
29
A: v = foo[tid.x];
B: if (v < 10)
C: v = 0;
else
D: v = 10;
E: w = bar[tid.x]+v;
Handles branch divergence, with foo[] = {4,8,12,16} (threads T1 and T2 take the if; T3 and T4 the else)

SIMT stack, one per warp. Before the branch, the top of stack is (PC=B, RPC=-, mask=1111); after the branch at B:
PC  RPC  Active Mask
E   -    1111
D   E    0011
C   E    1100   <- top of stack

Execution over time: A (T1 T2 T3 T4), B (T1 T2 T3 T4), C (T1 T2), D (T3 T4), E (T1 T2 T3 T4)
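A minimal host-side sketch of this per-warp reconvergence stack (illustrative data structures, not GPGPU-Sim's actual implementation):

#include <cstdint>
#include <vector>

struct StackEntry {
    uint64_t pc;    // next PC to execute on this path
    uint64_t rpc;   // reconvergence PC (immediate post-dominator)
    uint32_t mask;  // active threads on this path
};

struct SimtStack {
    std::vector<StackEntry> s;

    // On a divergent branch: turn the top entry into the reconvergence
    // entry, then push the fall-through and taken paths (taken on top).
    void diverge(uint64_t taken_pc, uint32_t taken_mask,
                 uint64_t fall_pc, uint32_t fall_mask, uint64_t rpc) {
        s.back() = {rpc, s.back().rpc, s.back().mask};
        s.push_back({fall_pc, rpc, fall_mask});
        s.push_back({taken_pc, rpc, taken_mask});
    }

    // Pop when the executing path reaches its reconvergence PC, so the
    // other path (or the full warp) resumes.
    void maybe_reconverge(uint64_t next_pc) {
        if (next_pc == s.back().rpc) s.pop_back();
    }
};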
Constant Cache
A read-only cache for constant memory
GPGPU-Sim simulates one read port
◦ A warp can access one constant cache location in a single memory unit cycle
◦ If more than one location is accessed
◦ reads are serialized, causing pipeline stalls
◦ # of ports is not configurable
30
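A small CUDA sketch of the access pattern in question (illustrative names): a uniform index hits one constant-cache location per warp, while a thread-dependent index would serialize:

#include <cuda_runtime.h>

__constant__ float coeffs[16];                 // constant memory

__global__ void apply(const float* in, float* out, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform index k: all threads in the warp read the same location,
    // so this is a single constant-cache access per warp.
    if (i < n) out[i] = coeffs[k] * in[i];
}

void setup(const float* host_coeffs) {
    // Copy the table into constant memory before launching the kernel.
    cudaMemcpyToSymbol(coeffs, host_coeffs, 16 * sizeof(float));
}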
Coalescing
Combining memory accesses made by threads in a warp into
fewer transactions
◦ E.g. if threads in a warp are accessing consecutive 4-byte sized
locations in memory
◦ Send one 128-byte request to DRAM (coalescing)
◦ Instead of 32 4-byte requests
This reduces the number of transactions between SIMT cores
and DRAM
◦ Less work for Interconnect, Memory Partition
and DRAM
31
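A host-side sketch of the bookkeeping (illustrative, not GPGPU-Sim code): count how many distinct 128-byte segments a warp's 32 addresses touch, which is the number of coalesced transactions:

#include <cstdint>
#include <set>

int coalesced_transactions(const uint64_t addr[32]) {
    std::set<uint64_t> segments;
    for (int lane = 0; lane < 32; ++lane)
        segments.insert(addr[lane] / 128);        // 128-byte aligned segment
    // 32 consecutive 4-byte addresses share one segment -> 1 transaction;
    // fully scattered addresses -> up to 32 transactions.
    return static_cast<int>(segments.size());
}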
Interconnection Network Model
Intersim (Booksim), a flit-level simulator
◦ Topologies (Mesh, Torus, Butterfly, …)
◦ Routing (Dimension Order, Adaptive, etc. )
◦ Flow Control (Virtual Channels, Credits)
We simulate two separate networks
◦ From SIMT cores to memory partitions
◦ Read Requests, Write Requests
◦ From memory partitions to SIMT cores
◦ Read Replies, Write Acks
32
PTX
Low-level data-parallel virtual ISA
◦ Instruction level
◦ Unlimited registers
◦ Parallel threads running in blocks; barrier synchronization instruction
Scalar ISA
◦ SIMT execution model
Intermediate representation in the CUDA toolchain
33
[Toolchain flow: .cu files compile through NVCC (and .cl files through the OpenCL driver) to PTX; ptxas then lowers the PTX to native code for the target architecture (Fermi, Kepler, Maxwell, Pascal, Volta).]
SASS
Native ISA for Nvidia GPUs
◦ Better correlation with HW but less documented
◦ Tied to the Streaming Multiprocessor (SM) version
Scalar ISA
GPGPU-Sim uses PTXPlus to represent both SASS and PTX
SASS is mapped 1:1 onto PTXPlus instructions
34
[Flow: cuobjdump extracts SASS from the CUDA executable, which is then converted into PTXPlus.]
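To see both ISAs for one kernel, the standard CUDA toolkit tools suffice; a sketch (the kernel and flags are illustrative):

// kernel.cu: one source, two ISAs.
//   vISA:  nvcc -ptx kernel.cu                 (emits kernel.ptx)
//   mISA:  nvcc -arch=sm_70 kernel.cu -o a.out
//          cuobjdump -sass a.out               (dumps the embedded SASS)
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}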
PTX vs. SASS
35
SASS (PTXPlus)
l0x00000060:
add.half.u32 $r7, $r4, 0x00000400;
ld.global.u32 $r8, [$r4];
ld.global.u32 $r7, [$r7];
add.half.u32 $r0, $r5, $r0;
add.half.u32 $r6, $r8, $r6;
set.gt.u32.u32 $p0/$o127, s[0x0020], $r0;
add.half.u32 $r6, $r7, $r6;
add.half.u32 $r4, $r4, $r3;
@$p0.ne bra l0x00000060;
...
set.gt.u32.u32 $p0/$o127, $r2, const [0x0000];
@$p0.equ add.u32 $ofs2, $ofs1, 0x00000230;
@$p0.equ add.u32 $r6, s[$ofs2+0x0000], $r6;
@$p0.equ mov.u32 s[$ofs1+0x0030], $r6;
bar.sync 0x00000000;
PTX
$Lt_25_13570:
ld.global.s32 %r9, [%rd5+0];
add.s32 %r10, %r9, %r8;
ld.global.s32 %r11, [%rd5+1024];
add.s32 %r8, %r11, %r10;
add.u32 %r5, %r7, %r5;
add.u64 %rd5, %rd5, %rd6;
ld.param.u32 %r6, [size];
setp.lt.u32 %p2, %r5, %r6;
@%p2 bra $Lt_25_13570;
...
mov.u32 %r12, 127;
setp.gt.u32 %p3, %r3, %r12;
@%p3 bra $Lt_25_14082;
ld.shared.s32 %r13, [%rd10+512];
add.s32 %r8, %r13, %r8;
st.shared.s32 [%rd10+0], %r8;
$Lt_25_14082:
bar.sync 0;
Interfacing GPGPU-Sim to applications
GPGPU-Sim compiles into a shared runtime library and implements
the API:
◦ libcudart.so → CUDA runtime API
◦ libOpenCL.so → OpenCL API
36
[Stack diagram, native execution: Application → libcudart.so (CUDA runtime API) → CUDA driver at user level, with the operating system and GPU hardware below.]
Interfacing GPGPU-Sim to applications
GPGPU-Sim compiles into a shared runtime library and implements
the API:
◦ libcudart.so → CUDA runtime API
◦ libOpenCL.so → OpenCL API
37
[Stack diagram, simulated execution: the Application links against GPGPU-Sim's libcudart.so, whose CUDA runtime API layer drives the functional and performance models instead of real hardware.]
GPGPU-Sim Runtime Flow
38
Editor's Notes (three distinct notes, repeated across the slides)
◦ Description provided in ASC CSSE Milestone 6812. Prior to this milestone, SST lacked the capability to model any external accelerator. We are now able to model a GPU accelerator in detail and analyze application and system behavior in fine detail.
◦ Completion criteria for ASC CSSE Milestone 6812. Please see the last slide on how we have met each of these criteria.
◦ Impact statement from ASC CSSE Milestone 6812. This is a stepping stone to achieve these goals; we still have some work to do. SST+GPU is the state-of-the-art in simulation capabilities. A literature review will reveal that, with this addition, SST provides more than any other node-level simulation tool. As we continue working on the GPU, memory, and interconnect models, we will get closer and closer to each of these impact statements.