Renaldas Zioma
Unity Labs
Trip down the GPU lane
with Machine Learning
@__ReJ__
Despite architectural differences between CPU & GPU
what dominates the speed of training Convolutional Neural Net
is the raw TFLOPs of a given chip!
CPU -vs- GPU
Single Instruction Multiple Data Single Instruction Multiple Threads
SIMD SIMT
Lane
Instruction
applied to all
lanes
Data
Multiple lanes
aka vector
4 lanes (SSE)
8 lanes (AVX)
32 lanes
16 lanes (Mobile GPU)
64 lanes (AMD)
SIMD SIMT
SIMT is almost the same as SIMD, but much wider
Executes “any” instruction that
has all source operands ready
Warp is 32 threads working in-sync
Threads must share the same program!
1 core* can keep 64** warps in flight
Executes warp that has all source operands ready
Very lightweight switch between warps
Out Of Order Warp or Wavefront
Aim is to hide memory latency
*) By “core” here I mean a Streaming Multiprocessor (SM) on NVIDIA or a Compute Unit (CU) on AMD, not a “CUDA core”. While SM and CU are comparable to a CPU core, “CUDA core” is more of a marketing term than a full-fledged core.
**) Actually hardware dependent: 40 on AMD, 128 on GP100…
SIMT in one picture
Thread = Lane
1 thread is mapped to 1 hardware lane
Warp of threads
Each core
maintains many
warps in flight
Warp
Warp is 32* threads working in a lockstep
All threads execute the same program
Each thread is mapped to a single SIMT lane
Processor will “juggle” warps to hide memory latency
*) The number of threads per warp is actually hardware dependent and ranges from 16 on Mobile to 64 on AMD.
Take Away
Need lots of parallel work - in 1000s* of threads
IF-ELSE block executes both IF and ELSE**
If SIMT lane is not doing work, it is wasted!
Number of cores (SM/CU) ×
number of threads in warp ×
number of warps in flight
*) For example, an optimised matrix multiplication kernel would need at least 10240 threads to saturate a GTX 1080: 20 SMs × 16 warps × 32 threads.
**) Unless of course all threads in the warp agree on the result of the conditional statement. In that case only one path needs to be executed, either IF or ELSE.
Wasted!
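The IF-ELSE cost above can be sketched in plain Python. This is a toy SIMT model (not real hardware): the warp executes both branches for every lane, and a per-lane mask selects which result each lane keeps, so a divergent warp pays for IF plus ELSE.

```python
def simt_if_else(values, cond, if_fn, else_fn):
    """Toy SIMT model: the warp steps through BOTH branches and a
    per-lane mask selects which result each lane keeps."""
    mask = [cond(v) for v in values]         # per-lane predicate
    if_res = [if_fn(v) for v in values]      # IF path runs for every lane
    else_res = [else_fn(v) for v in values]  # ELSE path runs for every lane
    # two full branch executions were paid, no matter how many lanes
    # actually wanted each path
    return [a if m else b for m, a, b in zip(mask, if_res, else_res)]

warp = list(range(32))                       # one 32-thread warp
out = simt_if_else(warp, lambda v: v % 2 == 0,
                   lambda v: v * 10, lambda v: -v)
```

If the predicate were uniform across the warp, real hardware would execute only the taken path.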
L3 + L2 + L1 L2 + L1
Constant + Texture cache
LDS (Local Data Store)
explicitly programmable “cache”
Lots of registers!
Cache Cache + “Scratchpad”
LDS - Local Data Share
Piece of small, but ultra fast memory - up to 8TB/s!
Explicitly programmable “cache”
Fast data exchange between threads
NVIDIA calls it “Shared Memory”*
*) But “Shared Memory” is such a generic term, let’s be more specific and call it Local Data Share for the rest of the slides.
Memory and execution is
sequential
Some instructions see
memory as 2D / 3D blocks
Work scheduled in groups
One dimensional Multidimensional view
Memory Bandwidth
320 GB/s
38* GB/s
PCIe 3.0 x16 — 16 GB/s
Desktop
GTX 1080
VRAM
GTX 1080
i7-7700
*) Numbers provided here and onward are peak bandwidth. Practically achievable bandwidth is usually ~75% of peak and can be even lower for CPU.
VRAM
38 GB/s
16 GB/s
Desktop
GTX 1080
4.1 TB/s
320 GB/s
LDS
PCIe 3.0 x16
GTX 1080
i7-7700
81 GB/s

1.7 TB/s
24 GB/s
PCIe 3.0 x8 — 8 GB/s
Laptop
MacBookPro2016
VRAM
Radeon PRO 460
LDS
i7-6920HQ
Take Away
MYTH: PCIe is very slow
CPU↔GPU (PCIe) bandwidth is not too terrible compared
with typical CPU memory bandwidth: roughly a 1:3 ratio
PCIe speed is mostly an issue when training with multi-GPU setup
Take Away
Need to access the same data several times on the GPU
to make the transfer worth it
FLOPs per byte metric
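The FLOPs-per-byte metric is just peak math rate divided by bandwidth. A back-of-the-envelope sketch (the ~8.9 TFLOPs peak FP32 figure for a GTX 1080 is an assumption here; bandwidths are from the slides):

```python
peak_flops = 8.9e12   # GTX 1080 peak FP32, approximate (assumption)
vram_bw = 320e9       # bytes/s, VRAM
pcie_bw = 16e9        # bytes/s, PCIe 3.0 x16

# how many FLOPs each byte must feed to keep the chip compute-bound
flops_per_vram_byte = peak_flops / vram_bw   # ~28
flops_per_pcie_byte = peak_flops / pcie_bw   # ~556
```

So data arriving over PCIe must be reused an order of magnitude more than data already in VRAM.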
Take Away
Getting results back from GPU can be slow,
but NOT because of PCIe speed
Rather due to latency - CPU & GPU work async!
PS4 Pro — 218 GB/s
iPad Pro — 51 GB/s
Mobile/Console
Unified Memory Architecture
CPU & GPU share same memory
Take Away
CPU & GPU exchange data much faster
But overall bandwidth is limited
RAM
Multi-GPU
Tesla V100
VRAM
8..16* GB/s
7.8 TB/s
900 GB/s
V100
LDS
8..16* GB/s
V100
RAM
…
*) The motherboard might not have enough PCIe lanes to provide 16 dedicated lanes per GPU.
Cloud
P3 instance on AWS
VRAM
68 GB/s
7.8 TB/s
900 GB/s
V100
LDS
V100
NVLink 2.0
25 GB/s
RAM
…
Xeon E5 v4
8 GB/s 8 GB/s
Cloud
P3.16xlarge on AWS
68 GB/s
Xeon E5 v4
[Diagram: 2× Xeon E5 v4 CPUs and 8× V100 GPUs; each GPU connects over PCIe at 8 GB/s, GPUs interconnect via NVLink 2.0 at 25 GB/s per link]
NVLink2.0 provides up to 6 links
2 GPUs out of 8 are not interlinked
Take Away
PCIe speed is the bottleneck, if you need to synchronise multiple GPU
NVLink is essential for fast multi-GPU setup
Programming Model
Programming model is scalar!
Compiler maps 32 scalar “threads” to 1 SIMT instruction
Memory accesses across 32 “threads” are grouped as well…
32 threads are executed
in lock-step, each step is
1 SIMT instruction
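The scalar programming model above can be mimicked in Python (a conceptual sketch, not CUDA): each "thread" runs the same scalar program on its own index, and the hardware view is one SIMT instruction advancing all 32 lanes.

```python
WARP = 32

def saxpy_thread(tid, a, x, y):
    # the scalar program a single "thread" runs: y = a*x + y
    y[tid] = a * x[tid] + y[tid]

x = [float(i) for i in range(WARP)]
y = [1.0] * WARP
# hardware view: one SIMT instruction advances all 32 lanes in lock-step
for lane in range(WARP):
    saxpy_thread(lane, 2.0, x, y)
```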
Memory accesses are grouped… wait, what?
Accesses are grouped into large transactions to
max memory bandwidth
Single memory transaction 256 bit
imagine a single
memory read
worth of 32 floats
imagine a single
memory write
worth of 32 floats
Memory accesses are grouped
Called “Coalesced Access to Memory”
Will automatically map to as few
cache line accesses as possible!
[Diagram: a row of 4-byte word addresses, #0 through #80, grouped into cache lines]
Naive “sequential” memory access
Good access pattern on CPU…
…but turns BAD on GPU!
GPU runs many threads in parallel,
and each thread accesses a different cache line.
Warp-aware memory access
Good! Now nearby threads share cache lines.
Other warp-aware access patterns are equally good.
Programming Model part 2
CUDA pipeline
CUDA C PTX cubin / SASS
CUDA C — that is what you usually write
cubin / SASS — that is what actually runs on GPU
CUDA
CUDA C PTX cubin / SASS
PTX — bytecode for GPU
Intermediate step
NVIDIA only, but architecture* independent
cubin / SASS — binary for GPU
NVIDIA only and architecture* specific
*) By saying “architecture” here, I mean: Volta, Pascal, Maxwell, Kepler, etc.
CUDA
CUDA C PTX
nvcc — nvidia open source LLVM based compiler
PTX cubin / SASS
ptxas — nvidia closed source assembler
OpenCL
Doesn’t integrate well with a cross-platform engine
DirectX + OpenCL = ?
PS4 + OpenCL = ?
Mobile + OpenCL = ?
Code is compiled by OpenCL run-time (“driver”)
Result might be hit-or-miss in terms of performance
Performance is not portable
Platform specific Zoo
DirectCompute — Windows, XBOX
Metal Compute — iOS, MacOS
GLSL Compute — PS4, Linux, Android ES3.1
Vulkan Compute is cross-platform, but not widely implemented yet!
All Integrate well with the rendering engine
Asynchronous compute: run graphics and compute workloads simultaneously
Unity Compute ☛ bit.ly/unity-compute-docs
Cross compiles to platform specific Compute:
Integrated well with the rendering engine
Performance is not portable
DirectCompute — Windows, XBOX
Metal Compute — iOS, MacOS
GLSL Compute — PS4, Linux, Android ES3.1
Vulkan Compute — Android, …
Practical Performance
Large Matrix Multiplication
Workhorse of Machine Learning
• Fully Connected layer is matrix multiplication
• Convolutional layer has a lot in common with matrix multiplication
SGEMM in BLAS
Matrix multiplication
Math ops = O(N³)
Memory ops = O(N²)
Classical solution
Work in blocks aka tiles!
Load source tiles from the memory to the cache (LDS)
Accumulate multiplication result in the cache
Store accumulator tile back to the memory
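The classical tiled scheme can be sketched in pure Python (tile loads stand in for the VRAM→LDS copies; `TILE = 4` is an assumed size for readability, and n must be a multiple of it):

```python
def matmul_tiled(A, B, n, TILE=4):
    """n x n matmul in TILE x TILE blocks: load source tiles,
    accumulate the product tile locally, store it back."""
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, TILE):
        for bj in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]  # "LDS" accumulator tile
            for bk in range(0, n, TILE):
                # multiply source tiles A[bi:, bk:] and B[bk:, bj:]
                for i in range(TILE):
                    for j in range(TILE):
                        acc[i][j] += sum(A[bi + i][bk + k] * B[bk + k][bj + j]
                                         for k in range(TILE))
            for i in range(TILE):                      # store tile to "VRAM"
                for j in range(TILE):
                    C[bi + i][bj + j] = acc[i][j]
    return C
```

Each source element is loaded once per tile pass instead of once per output element, which is exactly how tiling trades the O(N³) memory traffic of the naive loop for O(N³/TILE).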
Matrix multiplication
Not too fast!
Still far from the maximum TFLOPs
Why?
More ALU than LoaD/STore
GPU can issue more MultiplyAdd instructions than
memory reads per cycle
GPU packs more arithmetic (ALU)
than memory access units (LD/ST)
4:1 is a common ratio
GTX 1080
320 GB/s — VRAM
4.1 TB/s — LDS
30+ TB/s — Registers
GTX 1080
File of 16K scalar registers shared for up to 16 warps
Up to 256 scalar (FP32) registers per thread
Matrix multiplication #2
Cache big tiles in LDS
to minimize VRAM bandwidth
Cache small tiles (4x4) in registers
to minimize LD/ST issue
Arithmetic operations on 4x4 blocks
[Diagram: VRAM → LDS → registers → ALU]
Take away
Reaching the best performance requires data dimensions to be
a multiple of the tile size
Minibatch size is especially crucial for Fully Connected layers
Take away
Sweet spot for Convolutional layers is between 16 and 32
Preferably all dimensions divisible by 4 or 8
[Diagram: small tile loaded into registers]
Volta TensorCore
Speed of Convolutional Nets
due to minibatch size
Sweet spot:
VGG — 16+
ResNet — 32+
Despite architectural differences between CPU & GPU
what dominates the speed of training Convolutional Neural Net
is the raw TFLOPs of a given chip!*
Xeon E5 v4 @ 2.2Ghz × 20 cores
0.7 TFLOPs**
i7-7700 @ 3.6Ghz × 4 cores
0.36 TFLOPs
iPad Pro, A9Xm @ 2.26Ghz × 2 cores
0.08 TFLOPs
Tesla V100 @ 1.4Ghz × 80 SM cores
14.9 TFLOPs***
GTX 1080 Ti @ 1.5Ghz × 28 SM cores
11.34 TFLOPs
iPad Pro, A9X-PVR-7XT @ 0.45Ghz × 12
0.35 TFLOPs
*) Given that reasonably optimised code is used, like the cuDNN lib for GPU and Intel MKL-DNN for CPU.
**) CPU numbers here are measured and do not completely agree with theoretical; some errors might have crept in ;)
***) Numbers for both CPU & GPU are specified at full FP32 precision.
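Peak FP32 numbers like these follow from clock × cores × FP32 lanes per core × 2 (an FMA counts as two FLOPs). A sketch for the GTX 1080 Ti figure; the 28 SMs × 128 FP32 lanes and ~1.58 GHz boost clock are assumptions:

```python
def peak_fp32_tflops(clock_ghz, cores, fp32_lanes_per_core, flops_per_lane=2):
    # flops_per_lane = 2 because a fused multiply-add is two FLOPs
    return clock_ghz * cores * fp32_lanes_per_core * flops_per_lane / 1000.0

# GTX 1080 Ti: 28 SMs x 128 FP32 lanes at ~1.58 GHz boost (assumptions)
gtx_1080_ti = peak_fp32_tflops(1.58, 28, 128)  # ~11.3 TFLOPs
```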
Hiring ML + Graphics experts

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

Trip down the GPU lane with Machine Learning

  • 1. Renaldas Zioma Unity Labs Trip down the GPU lane with Machine Learning @__ReJ__
  • 2. Despite architectural differences between CPU & GPU, what dominates the speed of training a Convolutional Neural Net is the raw TFLOPs of a given chip!
  • 4. Single Instruction Multiple Data Single Instruction Multiple Threads SIMD SIMT + +
  • 5. Single Instruction Multiple Data Single Instruction Multiple Threads SIMD SIMT + Lane Instruction applied to all lanes Data Multiple lanes aka vector +
  • 6. 4 lanes (SSE) 8 lanes (AVX) 32 lanes 16 lanes (Mobile GPU) 64 lanes (AMD) SIMD SIMT SIMT is almost the same as SIMD, but much wider + +
  • 7. Out Of Order -vs- Warp or Wavefront Out Of Order (CPU): executes “any” instruction that has all source operands ready Warp or Wavefront (GPU): a warp is 32 threads working in sync - threads must share the same program! 1 core* can keep 64** warps in flight Executes the warp that has all source operands ready Very lightweight switch between warps Aim is to hide memory latency *) By Core here I mean Streaming Multiprocessor (SM) in Nvidia or Compute Unit (CU) in AMD GPU, but not “CUDA core”. While SM and CU are comparable to a CPU core, “CUDA core” is more of a marketing term than a full-fledged core. **) Depends actually - 40 on AMD, 128 on GP100…
  • 8. SIMT in one picture + + Thread = Lane 1 thread is mapped to 1 hardware lane Warp of threads Each core maintains many warps in flight
  • 9. Warp Warp is 32* threads working in lockstep All threads execute the same program Each thread is mapped to a single SIMT lane Processor will “juggle” warps to hide memory latency *) The number of threads per warp is actually hardware dependent and ranges from 16 on Mobile to 64 on AMD
  • 10. Take Away Need lots of parallel work - in 1000s* of threads An IF-ELSE block executes both IF and ELSE** If a SIMT lane is not doing work, it is wasted! *) Number of cores (SM/CU) × number of warps in flight × number of threads per warp. For example an optimised matrix multiplication kernel would need at least 10240 threads to saturate GTX 1080 = 20 SMs × 16 warps × 32 threads **) Unless of course all threads in the warp agree on the result of the conditional statement. In such case only one path needs to be executed, either IF or ELSE
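The saturation estimate above can be sketched as a quick back-of-the-envelope calculation. The SM count and warp width below are public GTX 1080 figures; the warps-in-flight number is the slide's working assumption, not a hard hardware limit:

```python
# Rough estimate of how many threads a GPU needs in flight to hide
# memory latency, using the slide's GTX 1080 numbers.
sm_count = 20          # Streaming Multiprocessors on GTX 1080
warps_per_sm = 16      # warps kept in flight per SM (kernel dependent)
threads_per_warp = 32  # NVIDIA warp width

threads_needed = sm_count * warps_per_sm * threads_per_warp
print(threads_needed)  # 10240
```

Anything fewer than this leaves some lanes idle some of the time, which on a SIMT machine is simply wasted throughput.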
  • 12. Cache -vs- Cache + “Scratchpad” CPU: L3 + L2 + L1 caches GPU: L2 + L1, Constant + Texture cache, LDS (Local Data Share) - an explicitly programmable “cache” - and lots of registers!
  • 13. LDS - Local Data Share A small piece of ultra-fast memory - up to 8 TB/s! Explicitly programmable “cache” Fast data exchange between threads NVIDIA calls it “Shared Memory”* *) But “Shared Memory” is such a generic term, let’s be more specific and call it Local Data Share for the rest of the slides.
  • 14. One-dimensional -vs- Multidimensional view CPU: memory and execution are sequential GPU: some instructions see memory as 2D / 3D blocks, work is scheduled in groups
  • 16. Desktop GTX 1080: VRAM - 320 GB/s, i7-7700 RAM - 38* GB/s, PCIe 3.0 x16 - 16 GB/s *) Numbers provided here and onward are for peak bandwidth. Practical achievable bandwidth is usually 75% of peak and can be even lower for CPU.
  • 17. Desktop GTX 1080: LDS - 4.1 TB/s, VRAM - 320 GB/s, i7-7700 RAM - 38 GB/s, PCIe 3.0 x16 - 16 GB/s
  • 18. Laptop MacBook Pro 2016: Radeon PRO 460 LDS - 1.7 TB/s, VRAM - 81 GB/s, i7-6920HQ RAM - 24 GB/s, PCIe 3.0 x8 - 8 GB/s
  • 19. Take Away MYTH: PCIe is very slow. CPU ↔ GPU (PCIe) speed is not too terrible compared with common CPU memory - roughly a 1:3 ratio. PCIe speed is mostly an issue when training with a multi-GPU setup
  • 20. Take Away Need to access the same data several times on the GPU to make the transfer worthwhile - the FLOPs-per-byte metric
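The FLOPs-per-byte idea can be made concrete with a rough calculation. The GTX 1080 peak of about 8.9 TFLOPs is an assumption pulled from public spec sheets; the bandwidth figures match the deck:

```python
# How many arithmetic operations must each transferred byte "pay for"
# before a computation is compute-bound rather than transfer-bound?
peak_flops = 8.9e12       # GTX 1080 peak FP32 FLOP/s (approximate)
pcie_bytes_per_s = 16e9   # PCIe 3.0 x16 peak bandwidth
vram_bytes_per_s = 320e9  # GTX 1080 VRAM peak bandwidth

# FLOPs per byte needed to keep the ALUs busy while streaming each link
print(peak_flops / pcie_bytes_per_s)  # ~556 FLOPs per byte over PCIe
print(peak_flops / vram_bytes_per_s)  # ~28 FLOPs per byte from VRAM
```

In other words, data shipped over PCIe must be reused hundreds of times on the GPU before the transfer pays for itself.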
  • 21. Take Away Getting results back from GPU can be slow, but NOT because of PCIe speed Rather due to latency - CPU & GPU work async!
  • 22. Mobile/Console Unified Memory Architecture: CPU & GPU share the same RAM. PS4 Pro — 218 GB/s, iPad Pro — 51 GB/s. Take Away CPU & GPU exchange data much faster, but overall bandwidth is limited
  • 23. Multi-GPU Tesla V100: LDS - 7.8 TB/s, VRAM - 900 GB/s, PCIe per V100 - 8..16* GB/s *) Motherboard might not have enough PCIe lanes to provide 16 dedicated lanes per GPU
  • 24. Cloud P3 instance on AWS: V100 LDS - 7.8 TB/s, VRAM - 900 GB/s, NVLink 2.0 between GPUs - 25 GB/s, Xeon E5 v4 RAM - 68 GB/s, PCIe per GPU - 8 GB/s
  • 25. Cloud P3.16xlarge on AWS: 8× V100 attached to a Xeon E5 v4 (RAM - 68 GB/s) over PCIe at 8 GB/s each, interconnected via NVLink 2.0 at 25 GB/s per link. NVLink 2.0 provides up to 6 links, so 2 GPUs out of 8 are not interlinked
  • 26. Take Away PCIe speed is the bottleneck if you need to synchronise multiple GPUs. NVLink is essential for a fast multi-GPU setup
  • 28. Programming model is scalar! Compiler maps 32 scalar “threads” to 1 SIMT instruction Memory accesses across 32 “threads” are grouped as well… 32 threads are executed in lock-step, each step is 1 SIMT instruction
  • 29. Memory accesses are grouped… wait, what? Accesses are grouped into large transactions to max memory bandwidth: imagine a single memory read worth of 32 floats, or a single memory write worth of 32 floats, issued as one wide transaction
  • 30. Memory accesses are grouped Called “Coalesced Access to Memory” Will automatically map to as few cache line accesses as possible! (diagram: float addresses #0, #4, #8 … #80)
  • 32-35. Naive “sequential” memory access - a good access pattern on CPU
  • 36-40. Naive “sequential” memory access turns BAD on GPU! The GPU runs many threads in parallel, and each thread accesses a different cache line
  • 42-43. Warp-aware memory access - good! Now nearby threads share cache lines
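The difference between the two access patterns can be modelled in a few lines. The 64-byte cache line and per-thread chunk size are illustrative assumptions; the warp width matches the slides:

```python
# Toy model: count distinct cache lines touched by one warp in one step.
CACHE_LINE = 64  # bytes per cache line (a typical size, assumed here)
FLOAT = 4        # bytes per float
WARP = 32        # threads per warp
CHUNK = 1024     # elements each thread owns in the "CPU-style" layout

def lines_touched(addresses):
    """Number of distinct cache lines covering the given byte addresses."""
    return len({a // CACHE_LINE for a in addresses})

# CPU-style: thread i reads the start of its own big chunk -> far apart
naive = [t * CHUNK * FLOAT for t in range(WARP)]
# Warp-aware: thread i reads element i -> consecutive addresses
coalesced = [t * FLOAT for t in range(WARP)]

print(lines_touched(naive))      # 32 cache lines for one warp-step
print(lines_touched(coalesced))  # 2 cache lines for one warp-step
```

Same amount of data per step, but the naive layout costs 16× more cache-line traffic, which is exactly why the "good CPU pattern" collapses on a GPU.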
  • 48. CUDA pipeline: CUDA C → PTX → cubin / SASS. CUDA C — that is what you usually write; cubin / SASS — that is what actually runs on the GPU
  • 49. CUDA: CUDA C → PTX → cubin / SASS. PTX — bytecode for the GPU: an intermediate step, NVIDIA only, but architecture* independent. cubin / SASS — binary for the GPU: NVIDIA only and architecture* specific *) By saying ‘architecture’ here, I mean: Volta, Pascal, Maxwell, Kepler, etc
  • 50. CUDA: CUDA C → PTX via nvcc — nvidia’s open source LLVM based compiler; PTX → cubin / SASS via ptxas — nvidia’s closed source assembler
  • 51. OpenCL Doesn’t integrate well with a cross-platform engine: DirectX + OpenCL = ? PS4 + OpenCL = ? Mobile + OpenCL = ? Code is compiled by the OpenCL run-time (“driver”), so results might be hit-or-miss in terms of performance. Performance is not portable
  • 52. Platform specific Zoo DirectCompute — Windows, XBOX Metal Compute — iOS, MacOS GLSL Compute — PS4, Linux, Android ES3.1 Vulkan Compute is cross-platform, but not widely implemented yet! All integrate well with the rendering engine Asynchronous compute - graphics and compute workloads simultaneously
  • 53. Unity Compute ☛ bit.ly/unity-compute-docs Cross compiles to platform specific Compute: Integrated well with the rendering engine Performance is not portable DirectCompute — Windows, XBOX Metal Compute — iOS, MacOS GLSL Compute — PS4, Linux, Android ES3.1 Vulkan Compute — Android, …
  • 55. Large Matrix Multiplication Workhorse of Machine Learning • Fully Connected layer is matrix multiplication • Convolutional layer has a lot in common with matrix multiplication SGEMM in BLAS
  • 56. Matrix multiplication Math ops = O(N³) Memory ops = O(N² + N²)
  • 57-60. Matrix multiplication - classical solution: work in blocks aka tiles! Load source tiles from memory into the cache (LDS), accumulate the multiplication result in the cache, store the accumulator tile back to memory. Math ops = O(N³), Memory ops = O(N² + N²)
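The tiling scheme above can be sketched in plain Python. The tile and matrix sizes are arbitrary illustration values; on a real GPU the inner tiles would live in LDS and each tile element would belong to one thread:

```python
# Blocked (tiled) matrix multiplication: the same O(N^3) math, but each
# TILE x TILE block of A and B is reused TILE times once "cached".
N, TILE = 8, 4

A = [[i * N + j for j in range(N)] for i in range(N)]
B = [[float(i == j) for j in range(N)] for i in range(N)]  # identity
C = [[0.0] * N for _ in range(N)]

for bi in range(0, N, TILE):          # tile row of C
    for bj in range(0, N, TILE):      # tile column of C
        for bk in range(0, N, TILE):  # walk tiles along the shared dim
            # on a GPU, the A and B tiles would be loaded into LDS here
            for i in range(bi, bi + TILE):
                for j in range(bj, bj + TILE):
                    acc = 0.0
                    for k in range(bk, bk + TILE):
                        acc += A[i][k] * B[k][j]
                    C[i][j] += acc    # accumulator tile stays "in cache"

print(C == [[float(x) for x in row] for row in A])  # True: A @ I == A
```

The arithmetic is unchanged; what tiling buys is that every loaded value participates in TILE multiply-adds instead of one, cutting slow-memory traffic by the tile size.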
  • 61. Not too fast! Still far from the maximum TFLOPs Why?
  • 63. More ALU than LoaD/STore GPU can issue more MultiplyAdd instructions than memory reads per cycle GPU packs more arithmetic (ALU) than memory access units (LD/ST) 4:1 is a common ratio
  • 64. GTX 1080: 320 GB/s — VRAM, 4.1 TB/s — LDS, 30+ TB/s — Registers
  • 65. GTX 1080: 320 GB/s — VRAM, 4.1 TB/s — LDS, 30+ TB/s — Registers. A file of 16K scalar registers is shared by up to 16 warps, with up to 256 scalar (FP32) registers per thread
  • 66. Matrix multiplication #2 Cache big tiles in LDS to minimize VRAM bandwidth Cache small tiles (4x4) in registers to minimize LD/ST issue Arithmetic operations on 4x4 blocks VRAM → LDS → registers → ALU
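A rough model of why the 4x4 register tile matters for the 4:1 ALU-to-LD/ST ratio mentioned earlier (pure arithmetic, no real hardware counters; the load/FMA counts are the idealised per-step costs of a register-tiled matmul):

```python
# Per inner-loop step of a register-tiled matmul: load one T-sized sliver
# of A and of B (2*T values total), then do T*T multiply-adds in registers.
def fma_per_load(T):
    loads = 2 * T  # values fetched from LDS into registers
    fmas = T * T   # multiply-add instructions issued on the T x T tile
    return fmas / loads

print(fma_per_load(1))  # 0.5 - scalar code: bound by LD/ST issue
print(fma_per_load(4))  # 2.0 - 4x4 tile: 2 FMAs per load
print(fma_per_load(8))  # 4.0 - 8x8 tile matches a 4:1 ALU:LD/ST ratio
```

Bigger register tiles raise the FMA-to-load ratio quadratically over linearly, until register pressure (256 FP32 registers per thread) caps the tile size.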
  • 67. Take away Reaching the best performance requires data dimensions to be multiple of tile size Minibatch size is especially crucial for Fully Connected layers
  • 68. Take away Sweet spot for Convolutional layers is between 16 and 32 Preferably all dimensions divisible by 4 (small tile loaded into registers) or 8 (Volta TensorCore)
  • 69. Speed of Convolutional Nets due to minibatch size Sweet spot: VGG — 16+ ResNet — 32+
  • 70. Despite architectural differences between CPU & GPU, what dominates the speed of training a Convolutional Neural Net is the raw TFLOPs of a given chip!* Xeon E5 v4 @ 2.2Ghz × 20 cores 0.7 TFLOPs** i7-7700 @ 3.6Ghz × 4 cores 0.36 TFLOPs iPad Pro, A9Xm @ 2.26Ghz × 2 cores 0.08 TFLOPs Tesla V100 @ 1.4Ghz × 80 SM cores 14.9 TFLOPs*** GTX 1080 Ti @ 1.5Ghz × 28 SM cores 11.34 TFLOPs iPad Pro, A9X-PVR-7XT @ 0.45Ghz × 12 0.35 TFLOPs *) Given that reasonably optimised code is used - like cuDNN lib for GPU and Intel-MKL-DNN for CPU **) CPU numbers here are measured and do not completely agree with theoretical - some errors might have crept in ;) ***) Numbers for both CPU & GPU are specified at full FP32 precision
  • 71. Hiring ML + Graphics experts